CN116974805A

CN116974805A - Root cause determination method, apparatus and storage medium

Info

Publication number: CN116974805A
Application number: CN202310938899.0A
Authority: CN
Inventors: 李也; 谭剑; 吴斌; 何晓; 李飞飞
Original assignee: Alibaba Damo Institute Hangzhou Technology Co Ltd
Current assignee: Hangzhou Alibaba Cloud Feitian Information Technology Co ltd
Priority date: 2023-07-27
Filing date: 2023-07-27
Publication date: 2023-10-31

Abstract

The application provides a root cause determining method, equipment and a storage medium, wherein the method comprises the following steps: according to various indexes associated with a plurality of service interfaces called in an application system by a target user access request with abnormal target indexes, determining a plurality of original factor nodes related to the target index abnormality of the target user access request, and decomposing the plurality of original factor nodes according to influence functions corresponding to the original factor nodes to obtain a plurality of first sub-factor nodes which correspond to the original factor nodes and are influenced by other factor nodes and a plurality of second sub-factor nodes which are not influenced by other factor nodes. And determining contribution values of the plurality of second sub-factor nodes to the target index abnormality which leads to the target user access request, and accordingly determining root factor nodes which lead to the target index abnormality of the target user access request from the plurality of original factor nodes, so as to realize accurate diagnosis of root factors in a micro-service system.

Description

Root cause determination method, apparatus and storage medium

Technical Field

The present application relates to the field of data processing technologies, and in particular, to a root cause determining method, apparatus, and storage medium.

Background

The micro-service system, also called micro-service architecture, is a software architecture approach that constructs applications into a series of modular, small autonomous services. In a micro-service system, each service is self-contained and implements a single service function. In short, an application system is divided into a plurality of sub-service systems according to functions, each sub-service system is complete and can operate independently, and the sub-service systems can communicate through a plurality of communication protocols, message queues and the like. Therefore, different sub-service systems may be implemented using different programming languages and may use different storage technologies.

Although micro-service systems are widely adopted because of their modular development, independent deployment, compatibility with multiple languages, etc. However, the operation and maintenance of the micro service system brings many challenges: such as the intricacies of dependency between different sub-service systems; the types of faults existing in hardware, software, networks and the like are various; a long-time anomaly of a certain sub-service system may cause the unavailability of the whole application system in an avalanche manner; as the micro-service system adopts a multi-language multi-frame distributed development mode, operation and maintenance personnel are required to be familiar with the design of the whole system, and the operation and maintenance threshold is high.

When a user accesses the micro-service system for an unfavorable user experience, such as an excessively long response time to an access request, it is necessary to find the root cause of the abnormality. The root cause is found in a plurality of sub-service systems by means of a large amount of manpower, and the cause of the abnormality of the root factor service system is further located, so that the method is time-consuming and labor-consuming, and the root cause can not be found accurately.

Disclosure of Invention

The embodiment of the invention provides a root cause determining method, equipment and a storage medium, which are used for rapidly and accurately diagnosing root causes of target index abnormality which leads to user access requests in an application system.

In a first aspect, an embodiment of the present invention provides a root cause determining method, where the method includes:

determining a plurality of original factor nodes related to target index abnormality of a target user access request according to a plurality of indexes associated with a plurality of service interfaces called in an application system by the target user access request with the target index abnormality, wherein the plurality of indexes comprise the target index;

determining influence functions corresponding to the plurality of original factor nodes respectively, wherein the influence functions are used for describing the dependency relationship among different factor nodes;

Decomposing the plurality of original factor nodes according to the influence functions corresponding to the plurality of original factor nodes to obtain a plurality of first sub-factor nodes which correspond to the plurality of original factor nodes and are influenced by other factor nodes and a plurality of second sub-factor nodes which are not influenced by other factor nodes;

determining contribution values of the plurality of second sub-factor nodes, wherein the contribution values reflect contribution degrees of the corresponding second sub-factor nodes to target index abnormality causing the target user access request;

and determining a root factor node which causes the abnormality of the target index value of the target user access request from the plurality of original factor nodes according to the contribution values of the plurality of second sub-factor nodes.

In a second aspect, an embodiment of the present invention provides a root cause determining apparatus, the apparatus including:

the factor establishing module is used for determining a plurality of original factor nodes related to the target index abnormality of the target user access request according to a plurality of indexes associated with a plurality of service interfaces called in an application system by the target user access request with the abnormal target index, wherein the plurality of indexes comprise the target index;

The factor decomposition module is used for determining influence functions corresponding to the plurality of original factor nodes respectively, and the influence functions are used for describing the dependency relationship among different factor nodes; decomposing the plurality of original factor nodes according to the influence functions corresponding to the plurality of original factor nodes to obtain a plurality of first sub-factor nodes which correspond to the plurality of original factor nodes and are influenced by other factor nodes and a plurality of second sub-factor nodes which are not influenced by other factor nodes;

the root cause determining module is used for determining contribution values of the plurality of second sub-factor nodes, and the contribution values reflect contribution degrees of the corresponding second sub-factor nodes to target index abnormality causing the target user access request; and determining a factor node causing the abnormality of the target index value of the target user access request from the plurality of original factor nodes according to the contribution values of the plurality of second sub-factor nodes.

In a third aspect, an embodiment of the present invention provides an electronic device, including: a memory, a processor, a communication interface; wherein the memory has executable code stored thereon, which when executed by the processor, causes the processor to at least implement the root cause determination method as described in the first aspect.

In a fourth aspect, embodiments of the present invention provide a non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to at least implement a root cause determination method as described in the first aspect.

In the embodiment of the invention, firstly, a certain target user access request of which the root cause is determined, wherein the target object is a certain target index abnormality, and the target index abnormality refers to that RT is overlarge. And then, determining a plurality of original factor nodes related to the abnormality of the target index of the target user access request according to a plurality of indexes (including the target index) related to a plurality of service interfaces called in the application system by the target user access request, so as to determine the root cause of the abnormality of the target index of the target user access request in the original factor nodes, wherein the service interfaces are interfaces of the service called in the execution process of the target user access request, each service interface can be respectively related with different indexes, and one service interface related with one index is used as one original factor node. And determining influence functions corresponding to the original factor nodes respectively, wherein the influence functions are used for describing the dependency relationship among different factor nodes, so that the original factor nodes are decomposed according to the influence functions corresponding to the original factor nodes respectively, and a plurality of first sub-factor nodes which are influenced by other factor nodes and a plurality of second sub-factor nodes which are not influenced by other factor nodes and correspond to the original factor nodes are obtained. The decomposition of the factor nodes actually considers the causal relationship, namely the dependency relationship, among the factor nodes, and the root factor node is the factor node which is not affected by other factor nodes, so that only the part which is not affected by other factor nodes, namely a plurality of second sub-factor nodes, is reserved by decomposing the original factor nodes. And determining contribution values of the plurality of second sub-factor nodes to the target index abnormality causing the target user access request, so as to determine root factor nodes causing the target index abnormality of the target user access request from the plurality of original factor nodes according to the contribution values of the plurality of second sub-factor nodes.

The method for automatically determining the root cause can find the service interface of the root cause in a plurality of service interfaces, further locate the cause of the abnormality of the service interface of the root cause, and can complete the root cause determination more quickly. By taking the influence between factor nodes into consideration and separating the factor nodes into a part influenced by other factor nodes and a part related to the own state only, locating the root cause based on the part related to the own state only, a more accurate root cause determination result can be obtained.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flowchart of a root cause determining method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a causal graph provided by an embodiment of the present invention;

FIG. 3 is a schematic diagram of a factor node decomposition scenario according to an embodiment of the present invention;

FIG. 4 is a flowchart of a root cause determining method according to an embodiment of the present invention;

FIG. 5 is a flowchart of a method for determining a contribution value of a sub-factor node according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of a determination process of a contribution value of a sub-factor node corresponding to the embodiment shown in FIG. 5;

FIG. 7 is a flowchart of a method for determining a contribution value of a sub-factor node according to an embodiment of the present invention;

FIG. 8 is a schematic diagram illustrating an application of a root cause determining method according to an embodiment of the present invention;

FIG. 9 is a schematic structural diagram of a root cause determining apparatus according to an embodiment of the present invention;

fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the embodiments of the present invention are information and data authorized by the user or fully authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related country and region, and provide corresponding operation entries for the user to select authorization or rejection.

Some embodiments of the present invention are described in detail below with reference to the accompanying drawings. In the case where there is no conflict between the embodiments, the following embodiments and features in the embodiments may be combined with each other. In addition, the sequence of steps in the method embodiments described below is only an example and is not strictly limited.

Terms or concepts related to the embodiments of the present invention will be explained first:

xia Puli value (shape value): from an economic concept, is a way to systematically distribute contributions and benefits among multiple participants.

Value function: the cost function used by Shapley value.

Counterfactual reasoning (counterfactual evaluation): refers to the act of negating and re-characterizing facts that have occurred in the past to construct a probabilistic assumption of thought. The output results in the hypothetical scenario are inferred simply by a given functional model. For example, the input may be a fictitious system state (e.g., a state in which the response time of a certain service interface in a fictitious system is high), and the output may be a value of a target index of a diagnostic object in the fictitious system state, such as what the response time of a certain user access request is in the state in which the response time of the service interface is high.

The root cause determining method provided by the embodiment of the invention can be executed by an electronic device, and in practical application, the electronic device can be a server, a user terminal such as a PC (personal computer) and the like, and the server can be a cloud server or a server of a user terminal.

The scheme provided by the embodiment of the invention can be applied to the scene of diagnosing the abnormal root cause of the application system formed by a plurality of services. The application system may be a micro-service system, and accordingly, the plurality of services are a plurality of services which are contained in the micro-service system and may be distributed for deployment and development. The micro-service system can be deployed in the cloud, and different services can be deployed in virtual machines built in different physical servers in the cloud. Each service is externally provided with a service interface which can be called by a user or other services.

Fig. 1 is a flowchart of a root cause determining method according to an embodiment of the present invention, as shown in fig. 1, where the method includes the following steps:

101. and determining a plurality of original factor nodes related to the target index abnormality of the target user access request according to a plurality of indexes associated with a plurality of service interfaces called by the target user access request with the target index abnormality in the application system, wherein the plurality of indexes comprise the target index.

102. And determining influence functions corresponding to the plurality of original factor nodes respectively, wherein the influence functions are used for describing the dependency relationship among different factor nodes.

103. And decomposing the plurality of original factor nodes according to the influence functions corresponding to the plurality of original factor nodes to obtain a plurality of first sub-factor nodes which are influenced by other factor nodes and a plurality of second sub-factor nodes which are not influenced by other factor nodes and correspond to the plurality of original factor nodes.

104. And determining contribution values of the plurality of second sub-factor nodes, wherein the contribution values reflect contribution degrees of the corresponding second sub-factor nodes to target index abnormality causing the target user access request.

105. And determining a root factor node which causes the abnormality of the target index value of the target user access request from the plurality of original factor nodes according to the contribution values of the plurality of second sub-factor nodes.

Root cause determination first requires determination of a target object. In this embodiment, the target object is: a target user access request with abnormal target index. The target metrics may be predefined metrics that evaluate performance of the user access request, such as RT being greater than a set threshold (e.g., 5 seconds).

In practice, in an application system such as a micro-service system, the application system may actually include a plurality of services (or called sub-services) deployed in a distributed manner, and different services may be deployed in different virtual machines in the cloud. The user access request in the embodiment of the invention is the access request to the application system. Because different services provide service interfaces to the outside, although a user access request is initially called a service interface of a certain service in an application system, in the execution process of the user access request, a calling relationship between different services exists, namely, the user access request needs to be called to different service interfaces so as to finally finish the execution of the user access request, and a corresponding response result is fed back to the user. Taking the index of RT as an example, the time from the triggering of the user access request to the receiving of the response result is RT.

Based on the composition and working process of the application system, optionally, in practical application, if the target index of a user access request is found to be abnormal, the user access request can be determined as a target object needing root cause determination. Alternatively, if it is found that the above target metrics for a plurality of user access requests are all abnormal within a certain period of time, it may be determined that the target object includes all or part of the plurality of user access requests.

Or optionally, if it is determined that the target index of the target service interface in the application system is abnormal in the target time period, acquiring a target user access request for calling the target service interface in the target time period, where a plurality of service interfaces called by the target user access request include the target service interface.

In the above alternative manner, statistics of target indexes may be performed for each service interface in the application system according to a set time slice (for example, 1 minute), for example, statistics of average RT values of each service interface per minute may be performed, the target time period may include multiple time slices, for example, if the target time period is the last 1 hour, if it is determined by statistical analysis that RT of the target service interface within the 1 hour presents an abnormal phenomenon, a user access request that invokes the target service interface within the 1 hour is obtained, and all or part of the obtained user access request may be used as a target user access request to perform root cause determination. In practice, the statistical process needs to be implemented using log data generated in the application system, and index data and call relationship data extracted from the log data.

In the embodiment of the invention, the determined target object needing root cause determination is called: a target user access request with abnormal target index.

After determining a target user access request with abnormal target indexes, firstly, determining multiple indexes associated with multiple service interfaces called by the target user access request with abnormal target indexes in an application system, and further determining multiple original factor nodes related to the target index abnormality of the target user access request according to the multiple indexes associated with the multiple service interfaces, wherein the multiple indexes comprise the target indexes.

In short, in determining what causes the target index of the target user access request to be abnormal, index statistics of a plurality of service interfaces invoked in the execution of the target user access request need to be integrated to determine. Optionally, assuming that the access trigger time of the target user access request is T1, the time period corresponding to the index statistics of the plurality of service interfaces is corresponding to T1, and the correspondence may be expressed as: if the index statistics is performed on each service interface in the application system in a preset time slice (for example, 1 minute), the index statistics results of the plurality of service interfaces in the time slice corresponding to T1 can be obtained. For example, the counted metrics include RT, then an average RT value for each service interface in the time slice is determined. That is, the statistics herein mean that the average value, the maximum value, the minimum value, or the like of a certain index in the time slot is calculated.

In addition, in practical application, the indexes that can be counted for each service interface may be multiple, but the multiple indexes need to include the target index that the target user access request is abnormal in performance, because the root judgment is currently performed for the condition that the target index is abnormal.

In an alternative embodiment, to more fully consider factors in the application system that may cause abnormality of the target index of the target user access request, determining the multiple indexes associated with the multiple service interfaces may include:

determining at least one performance index corresponding to each service interface in the plurality of service interfaces, wherein the at least one performance index comprises the target index;

at least one performance index of a machine corresponding to each of the plurality of service interfaces is determined.

The at least one performance index corresponding to each service interface may include, for example, an index describing performance of the service interface, such as RT, a request per second (Queries Per Second, abbreviated as QPS), an Error Rate (ER), and the like.

In practical application, different services in an application system can be deployed in different machines (machines), and the machines can be physical machines or virtual machines built in the physical machines, and performance of the machines can also influence performance of corresponding services, so that at least one performance index of the machines where the corresponding services are located can also be considered in indexes associated with service interfaces. The at least one performance indicator of the machine includes, for example, CPU utilization, memory utilization, service load conditions (e.g., number of service threads).

In fact, the service may further use some middleware, such as a database, a message queue, etc., during the execution, so the various indexes associated with the service interfaces may further include performance indexes of the middleware, such as RT corresponding to querying the database. Since these middleware are also deployed on a machine-by-machine basis, they can also be considered as part of at least one performance index of the machine to which the service interface corresponds.

After obtaining multiple indexes associated with multiple service interfaces called in the application system by the target user access request with abnormal target indexes, multiple original factor nodes related to the abnormal target indexes of the target user access request can be determined according to the multiple indexes associated with the service interfaces. Each original factor node may be considered to be composed of an index object and an index, where the index object is a statistical object corresponding to each performance index, for example, each service interface and each machine, and the index is at least one performance index. In addition, based on the current statistical result of the values of the performance indexes, an index value of each performance index can be obtained. Since the target index of the target user access request is abnormal at this time and the original factor nodes are all factors that may cause the target index to be abnormal, the current index values of the performance indexes are regarded as abnormal index values.

Based on the above-mentioned determination of at least one performance index corresponding to each service interface in the plurality of service interfaces and at least one performance index of a machine corresponding to each service interface, a target performance index including a target service interface and a target performance index of a machine corresponding to the target service interface in the plurality of source factor nodes may be determined, where the target service interface is any one of the plurality of service interfaces, the target performance index of the target service interface is any one of the at least one performance indexes corresponding to the target service interface, and the target performance index of the machine corresponding to the target service interface is any one of the at least one performance indexes of the machine corresponding to the target service interface. For example, the primitive node includes RT corresponding to the service interface x, QPS corresponding to the service interface x, ER corresponding to the service interface x, the number of service threads corresponding to the service interface x, CPU utilization/memory utilization of the machine corresponding to the service interface x, and so on. The service interface x is one of a plurality of service interfaces.

And then, determining influence functions corresponding to the plurality of original factor nodes, wherein the influence functions are used for describing the dependency relationship among different factor nodes.

Specifically, after the plurality of primitive factor nodes are determined, the plurality of primitive factor nodes may be output, and a call relationship between the plurality of service interfaces may also be output, where the call relationship may be extracted by a specific call relationship extraction tool, for example, from log data generated during the execution of the target user access request. Then, the influence functions corresponding to the different original factor nodes which are manually input can be received.

In fact, for the plurality of primitive factor nodes, the possible dependency relationships among some of the primitive factor nodes are easy to be clarified, for example, a physical model (a physical rule for reflecting the mutual influence among the primitive factor nodes) between the primitive factor nodes can be easily constructed, and an influence function for describing the dependency relationships can be manually determined based on the physical model, and then the influence function can be manually given directly. The influence functions which cannot be determined manually can be learned in advance through a machine learning model. Or, the corresponding influence function can be obtained by directly adopting a machine learning model aiming at all the original factor nodes.

For the mode of obtaining the influence functions corresponding to different original factor nodes through the machine learning model, a causal graph corresponding to the original factor nodes can be determined first, and the causal graph and the historical index data are input into the machine learning model to train the machine learning model so as to enable the machine learning model to learn and output the influence functions corresponding to the original factor nodes. The historical index data refers to index values corresponding to a plurality of original factor nodes with causal relation which have been actually collected, for example, when the index value of one original factor node is y1, the index value of another original factor node with causal connection edge is y2. In this way, the machine learning model is trained with the ability to predict the impact function between different factor nodes with dependencies.

The causal graph can be determined manually, wherein nodes included in the causal graph are the plurality of original factor nodes, the causal graph is a directed acyclic graph, and directed connecting edges between two original factor nodes represent the dependency relationship between the two. For example, the direction of the primitive node a to the primitive node b means that the primitive node b depends on the primitive node a, or that the primitive node b is affected by the primitive node a.

In general terms, the impact function corresponding to an original factor node describes whether or not the value of the original factor node (actually, the value of the index corresponding to the original factor node) is affected by other factor nodes, and how it is specifically affected.

For ease of understanding, a causal graph of a plurality of primitive nodes and their composition is illustrated in connection with FIG. 2.

In fig. 2, corresponding to the access situation that the target user access request calls the service interface 1 (denoted as O1) and the service interface 1 calls the service interface 3 (denoted as O3), a plurality of primitive nodes include RT, ER, QPS indexes (denoted as O1 in the figure) corresponding to the two service interfaces respectively illustrated in the figure ^RT 、O1 ^ER 、O1 ^QPS 、O3 ^RT 、O3 ^ER 、O3 ^QPS The execution time and the waiting time of the service interface 1, and performance indexes (shown as S1 in the figure) of the service corresponding to the service interface 1 ^xx For example, the number of service threads is included), the performance index of the machine where the service interface 1 is located (denoted as M1 in the figure) ^xx Such as including CPU utilization, etc.), RT (shown as DB in the figure) where the service interface 1 queries the database ^RT ) Some feature attributes corresponding to the target user access request, such as a service version number, a location area to which the target user access request belongs, and the like, which are illustrated in the figure. The service version number refers to a software version number of a service corresponding to the service interface 1. Directed arrows in the graph represent the dependency relationships between the nodes of different primitive factors.

After the influence functions corresponding to the original factor nodes are obtained, decomposing the original factor nodes according to the influence functions corresponding to the original factor nodes to obtain a plurality of first sub-factor nodes which are affected by other factor nodes and a plurality of second sub-factor nodes which are not affected by other factor nodes and correspond to the original factor nodes.

Factor node decomposition is to decompose an original factor node into two sub-factor nodes, one of which is a sub-factor node affected by other factor nodes and the other is a sub-factor node unaffected by other factors. The factor nodes are decomposed because factors which do not cause the abnormality of the target index of the target user access request are removed, and if the value of one original factor node is completely influenced by other original factor nodes, the original factor node is not always the root factor; if the value of an original factor node depends only on itself or partly on itself, this part which depends on itself may be the root cause.

Because the influence function corresponding to each original factor node gives out how the corresponding original factor node is influenced by the other original factor nodes, the original factor node can be decomposed according to the influence function corresponding to each original factor node to obtain a first sub-factor node influenced by the other factor nodes and a second sub-factor node not influenced by the other factor nodes.

It will be appreciated that if one primordial factor node is affected by itself only and is not affected by other factor nodes, then the first sub-factor node is absent or set to 0, and conversely, if one primordial factor node is affected by other factor nodes only, then the second sub-factor node is absent or set to 0.

For ease of understanding, the decomposition of the factor node described above is illustrated below in connection with the access scenario shown in FIG. 3.

In fig. 3, it is assumed that the call relationship corresponding to a certain user access request is: the user access request is sent to the service interface A, the service interface A asynchronously calls the service interface B and the service interface C, and the service interface C synchronously calls the service interface D twice. Assuming that the RT of the user access request is normal (less than the set threshold), the RT of the four service interfaces is 3/1/2/1 second(s), and the four values are actually normal index values corresponding to the service interfaces. And assuming that the RT of the user access request is abnormal (greater than a set threshold), the RT of the four service interfaces is doubled to 6s/2s/4s/2s, and the four values are actually the abnormality index values corresponding to the service interfaces. In addition, for service interface a, it is assumed that it needs to perform two operations: operation A1 is performed first based on the returned result of service interface B and service interface C, and then operation A2 is performed. In the above normal case, RT of the operation A2 is 1s, and the abnormal time becomes 2s.

In the above example scenario, it will be appreciated that the RT of service interface C is in fact fully affected by the two synchronous calls of service interface D, and that the RT of service interface a is affected by service interface B and service interface C as well as by itself in service interface a. The influence of the service interface B and the service interface C on the RT of the service interface a is shown as follows: service interface a needs to wait for service interface B and the service interface with the largest RT among service interfaces C to return a response before proceeding with subsequent operation A2.

In the above example, the RT of the service interface a corresponds to the primitive node a ^RT The impact function thereof can be expressed as:

g_A ^RT ＝A1 ^RT +A2 ^RT ＝MAX(B ^RT ,C ^RT )+A2 ^RT wherein A1 ^RT 、A2 ^RT RT, MAX () representing the above operations A1 and A2, respectively, is the maximum operator, B ^RT And C ^RT The RT of the service interfaces B and C are shown respectively, that is, two original factor nodes corresponding to the RT of the service interfaces B and C.

Original factor node B corresponding to RT of service interface B ^RT The impact function thereof can be expressed as: g_B ^RT ＝0+B ^RT 。

Original factor node C corresponding to RT of service interface C ^RT The impact function thereof can be expressed as: g_C ^RT ＝2*D ^RT +0。

Original factor node D corresponding to RT of service interface D ^RT The impact function thereof can be expressed as: g_D ^RT ＝0+D ^RT 。

Based on the above-exemplified influence functions of the original factor nodes, each influence function can be expressed as a form of a first term and a second term, wherein the first term is a part influenced by other factor nodes, and the second term is a part not influenced by other factor nodes, namely a part only related to the second term.

Accordingly, the corresponding original factor nodes can be decomposed according to the influence function corresponding to each original factor node, and only the part corresponding to the second item, namely the second sub factor node, is reserved. In the above example, the primitive factorNode A ^RT The corresponding second sub-factor node is A2 ^RT Original factor node B ^RT The corresponding second sub-factor node is B ^RT Original factor node C ^RT The corresponding second sub-factor node is 0, the original factor node D ^RT The corresponding second sub-factor node is D ^RT 。

It should be noted that, because the service interface C in the above example synchronously calls the service interface D twice, in order to distinguish the two calls, the two calls may be respectively represented as the call service interfaces D1 and D2, thereby the above 2*D ^RT This term can be expressed as D1 ^RT And D2 ^RT The sum of these two sub-items. While the original factor node D ^RT In practice it may be replaced by: original factor node D1 ^RT Original factor node D2 ^RT The respective influence functions can be expressed as: g_D1 ^RT ＝0+D1 ^RT ，g_D2 ^RT ＝0+D2 ^RT 。

After a plurality of second sub-factor nodes corresponding to the plurality of original factor nodes are obtained, determining contribution values of the plurality of second sub-factor nodes, wherein the contribution values reflect contribution degrees of the corresponding second sub-factor nodes to target index abnormality causing the target user access request, and further determining root factor nodes causing the target index abnormality of the target user access request from the plurality of original factor nodes according to the contribution values of the plurality of second sub-factor nodes.

Specifically, first, determining contribution values of a plurality of second sub-factor nodes; then, according to the contribution values of the plurality of second sub-factor nodes, determining the contribution values corresponding to the original factor nodes corresponding to the plurality of second sub-factor nodes; and finally, determining the root factor node which causes the abnormality of the target index value of the target user access request from the plurality of original factor nodes according to the contribution values corresponding to the plurality of original factor nodes.

In practical applications, optionally, the contribution value may be a Xia Puli value (Shapley value), which may be calculated by adopting a conventional calculation manner of a sharp value, and may refer to the prior art specifically, and will not be described herein.

According to the contribution values of the plurality of second sub-factor nodes, determining the contribution values corresponding to the original factor nodes corresponding to the plurality of second sub-factor nodes respectively may be: and mapping Xia Puli values corresponding to the second sub-factor nodes into Xia Puli values of the corresponding original factor nodes according to the corresponding relations between the second sub-factor nodes and the original factor nodes. Assuming that one source factor node corresponds to more than one second sub-factor node, the sum of Xia Puli values of the several second sub-factor nodes is taken as the eplerian value of the source factor node.

After obtaining the eplery values corresponding to the plurality of original factor nodes, it may optionally be determined that the original factor node whose Xia Puli value is a non-zero value is a root factor node that causes an abnormality in the target index value of the target user access request. Alternatively, the normalization processing may be performed on the eplerian values corresponding to the plurality of original factor nodes, so as to obtain a duty ratio of Xia Puli values of each original factor node, filter the original factor nodes with a duty ratio lower than a set threshold (for example, 10%), and sequentially output the remaining original factor nodes according to the descending order of the eplerian values, as the root factor nodes that cause the abnormality of the target index value of the target user access request.

For example, the SUM of Xia Puli values of a plurality of primitive nodes is SUM, and if the ratio of Xia Puli value of a primitive node to SUM is less than 10%, the primitive node is filtered.

The related personnel know the target index value abnormality of the target user access request caused by the abnormality of certain performance indexes such as certain service interfaces, certain machines and the like according to the original factor node serving as the root factor node, and can perform targeted optimization processing.

Fig. 4 is a flowchart of a root cause determining method according to an embodiment of the present invention, as shown in fig. 4, where the method includes the following steps:

401. and if the target index of the target service interface in the application system is abnormal in the target time period, acquiring a plurality of user access requests for calling the target service interface in the target time period.

402. And counting access characteristic modes corresponding to the plurality of user access requests, and screening target user access requests conforming to the counted access characteristic modes from the plurality of user access requests, wherein the access characteristic modes comprise attribute values of a plurality of characteristic attributes.

403. And respectively determining a plurality of original factor nodes related to the target index abnormality of each target user access request, wherein the plurality of original factor nodes comprise the plurality of characteristic attributes.

404. And decomposing the plurality of original factor nodes according to the influence functions corresponding to the plurality of original factor nodes of each target user access request to obtain a plurality of first sub-factor nodes influenced by other factor nodes and a plurality of second sub-factor nodes not influenced by other factor nodes.

405. Determining contribution values of a plurality of second sub-factor nodes corresponding to the plurality of target user access requests, and determining an average contribution value of the target second sub-factor nodes according to the contribution values of the plurality of target user access requests corresponding to the target second sub-factor nodes, wherein the target second sub-factor nodes are any one of the plurality of second sub-factor nodes corresponding to the plurality of target user access requests.

406. And determining the factor node causing the abnormality of the target index values of the target user access requests from the original factor nodes corresponding to the target user access requests according to the average contribution values of the second different factor nodes.

In the present embodiment, another determination process of determining a target user access request as a target object of root cause determination is provided. In this determination, two concepts of temporal granularity are introduced: time windows and time slices, wherein a time slice is a small statistical granularity, such as 1 minute as exemplified above. The time window is a large time granularity, such as 1 hour as exemplified above, referring to the last hour of the current time.

For convenience of description, in this embodiment, the target index RT is taken as an example, and other indexes are the same.

The application system comprises a plurality of services, each service is provided with a corresponding service interface, and the corresponding RT statistic value (taking average RT value as an example) of each service interface in continuous respective time slices can be counted based on the generated log data. In practice, the execution of each user access request generates relevant log data, in which the execution of the user access request is recorded in detail, including, but not limited to, which service interfaces are invoked, the corresponding performance index when each service interface processes the user access request, and some attributes or parameters carried in the user access request, such as a user identifier, a geographic area to which the user belongs, a service version number, and so on. Based on the log data, the service interface calling relation and the performance index of the service interface in the execution process of each user access request can be extracted through the set index and calling relation extracting tool.

Assuming that the current one target time period, determined based on the length of the time window, is T1-T2, the length is 1 hour, and assuming that the time slice is 1 minute, there will be 60 RT values (i.e., 60 average RT values) per service interface within the target time period.

Taking any target service interface in the application system as an example, after 60 RT values of the target service interface in the target time period are obtained, an abnormality detection process may be performed to determine whether the target service interface is abnormal. That is, it is necessary to determine whether the target indicator, which is RT of the target service interface in the target period, is abnormal according to the 60 RT values of the target service interface in the target period.

In an alternative embodiment, the above 60 RT values may be respectively subjected to a periodic trend decomposition process, so as to obtain a periodic component, a trend component, and a residual component corresponding to each RT value. The sum of the trend component and the residual component of each RT value is calculated, assuming that the result of the addition of the trend component and the residual component of the RT value is called a fluctuation component. Then, for example, the average and variance of the fluctuation components of 60 RT values are calculated, and if the value of the fluctuation component of a certain RT value is greater than the average of the fluctuation components by at least n by variance (n is greater than 1, such as n=3), then it is determined that the RT value is abnormal. If the duty ratio of the abnormal RT value in the 60 RT values is larger than a set ratio (such as 50%), determining that the target service interface is abnormal in the target time period.

And if the target index of the target service interface in the application system is abnormal in the target time period, acquiring a plurality of user access requests for calling the target service interface in the target time period. A target user access request as a target object is then determined from the plurality of user access requests.

Specifically, access feature patterns corresponding to the plurality of user access requests may be extracted first, where each access feature pattern includes attribute values of a plurality of feature attributes.

In practice, the plurality of user access requests may correspond to one or more access characteristic patterns. A plurality of feature attributes, such as a user identifier, a location area to which the user belongs, a service version number, and the like, may be preset, and attribute values of the plurality of feature attributes corresponding to the plurality of user access requests may be extracted, so as to mine a combination of attribute values of the feature attributes having commonality. For example, if the analysis finds that N1 user access requests all correspond to the combination of the attribute values of the feature attribute of user identifier=user1, location area=l1 and service version number=v1, then "user identifier=user1, location area=l1 and service version number=v1" is determined as an access feature mode; when N2 user access requests are found to correspond to the combination of the attribute values of the feature attribute of user identifier=user1 and location area=l2, it is determined that "user identifier=user2 and location area=l2" is another access feature mode. It should be noted that if the number of user access requests corresponding to a certain access feature pattern is too small, for example, is lower than a set threshold, the access feature pattern may be filtered out.

And then, screening target user access requests conforming to the access characteristic modes from the plurality of user access requests according to each access characteristic mode finally reserved. Alternatively, if the number of user access requests that are screened out for a certain access characteristic pattern is large (greater than a set threshold), the set threshold number of user access requests may be sampled therefrom as target user access requests that correspond to such access characteristic pattern.

In summary, the extraction of the access feature modes corresponds to grouping the plurality of user access requests according to the access feature modes, and the access feature modes corresponding to the same group of user access requests are the same.

In this embodiment, for convenience of description, it is assumed that only one access feature mode is used, after obtaining a plurality of target user access requests corresponding to the access feature mode, a plurality of source factor nodes related to abnormality of a target index (such as RT) of each target user access request are respectively determined, and a plurality of source factor nodes corresponding to each target user access request include a plurality of feature attributes in the access feature mode. And then decomposing the plurality of original factor nodes according to the influence functions corresponding to the plurality of original factor nodes of each target user access request to obtain a plurality of first sub-factor nodes influenced by other factor nodes and a plurality of second sub-factor nodes not influenced by other factor nodes. And then, determining contribution values of a plurality of second sub-factor nodes corresponding to each target user access request according to different target user access requests. The second sub-factor nodes corresponding to different target user access requests may or may not overlap. For example, a certain sub-factor node is only included in a plurality of second sub-factor nodes corresponding to the first target user access request, and another sub-factor node is simultaneously included in a plurality of second sub-factor nodes corresponding to the first target user access request and a plurality of second sub-factor nodes corresponding to the second target user access request.

Assuming that the plurality of target user access requests correspond to N second sub-factor nodes altogether, for any target second sub-factor node, calculating at least one contribution value by the target second sub-factor node under at least one target user access request, and calculating an average value of the at least one contribution value according to the number of the target user access requests to obtain an average contribution value of the target second sub-factor node. Wherein, the average calculation of the at least one contribution value according to the number of the target user access requests means that: assuming that the number of the access requests of the target user is P1 and the number of the at least one contribution value corresponding to the target second sub-factor node is P2, the sum of the P2 contribution values divided by P1 is taken as the average contribution value of the target second sub-factor node.

And then, determining the contribution value of each original factor node according to the mapping relation between the different second sub-factor nodes and the original factor nodes. For example, the average contribution value of a certain second sub-factor node is q1, and the contribution value of the corresponding original factor node is also determined to be q1. Finally, determining the factor node causing the abnormality of the target index values of the multiple target user access requests from the multiple original factor nodes corresponding to the multiple target user access requests. Assuming that the plurality of target user access requests correspond to K original factor nodes in total, determining the root factor nodes from the K original factor nodes according to the contribution values of the K original factor nodes, and referring to the related description in the previous embodiment for the determination method.

The steps not described in the foregoing embodiments may refer to the relevant descriptions in the foregoing embodiments, which are not repeated herein.

In the above-described embodiments, for convenience of description, it is assumed only that the target service interface, which is a service interface, is abnormal for a target period of time, and that only one access characteristic pattern is summarized. The following is a brief description of a more general practical application.

Based on the above procedure for detecting the abnormality of the target service interface, similarly, assuming that, in the target period, the RTs of other service interfaces in addition to the target service interface are also abnormal, and assuming that the RTs of a total of M service interfaces in the target period are abnormal, the set C1 of user access requests for accessing at least one service interface of the M service interfaces in the target period may be acquired first. And then, carrying out access characteristic pattern extraction processing on all user access requests contained in the user access request set C1 to obtain K access characteristic patterns. And determining a user access request set C2 corresponding to each of the K access characteristic modes, namely, one user access request set C2 corresponding to each access characteristic mode.

And sampling the user access request set C2 corresponding to any access characteristic mode Ki to obtain a user access request set C3 formed by no more than a set number of user access requests. The processes of original factor node construction, factor node decomposition and contribution value determination are respectively performed on the plurality of user access requests (corresponding to the plurality of target user access requests in the above embodiment) contained in the user access request set C3, so as to finally obtain the factor node of the RT abnormality of the user access request in the user access request set C3 in the target time period under the access characteristic mode Ki.

It follows that the root cause determination process in each access characteristic mode is independent from each other.

In the above embodiment, it should be noted that, the anomaly detection processing for whether the target index of each service interface is anomalous in the target period is merely a trigger condition for triggering the subsequent root cause determination process, and the root cause determination process does not need to rely on the result of anomaly detection as an input, that is, it does not need to detect in advance whether each original factor node is truly anomalous. In addition, in the above embodiment, the root factor determining process fully considers the influence factors of different dimensions in the application system, including the access feature mode on the user side, the multiple service interfaces on the service side, the performance index of the machine, and the performance index of the middleware (such as a database), so that the root factor can be more accurately located.

In the foregoing embodiment, because the causal influence relationship, that is, the dependency relationship, between the different original factor nodes is considered, each original factor node is decomposed to obtain a plurality of second sub-factor nodes corresponding to the plurality of original factor nodes, and then the contribution values (for example, xia Puli values) of the plurality of second sub-factor nodes are calculated, which may be a conventional calculation method for the eplery value or a calculation method provided by the embodiment shown in fig. 5. The number of second sub-factor nodes is typically less than the number of original factor nodes,

fig. 5 is a flowchart of a method for determining a contribution value of a sub-factor node according to an embodiment of the present invention, where, as shown in fig. 5, the method may include the following steps:

501. and generating the directed acyclic graph corresponding to the plurality of second sub-factor nodes according to the execution sequence of the plurality of second sub-factor nodes.

502. And carrying out merging processing on the second sub-factor nodes which can be merged in the directed acyclic graph to obtain a plurality of third sub-factor nodes, wherein the plurality of third sub-factor nodes comprise second sub-factor nodes which are not merged and new factor nodes obtained after merging.

503. Contribution values of a plurality of third sub-factor nodes are determined.

504. And determining the contribution values of the plurality of second sub-factor nodes according to the contribution values of the plurality of third sub-factor nodes.

The execution order of the plurality of second sub-factor nodes may be manually input, and a directed acyclic graph corresponding to the plurality of second sub-factor nodes may be generated based on the input result. In the directed acyclic graph, one second sub-factor node points to an edge of another second sub-factor node, and the execution sequence of the two second sub-factor nodes is shown as follows: from the earlier performed to the later performed. If the execution sequence of the two second sub-factor nodes is not limited, that is, the two second sub-factor nodes can be executed in parallel, a bifurcation is formed.

The second sub-factor nodes which can be combined in the directed acyclic graph are characterized by reflecting the characteristics in the directed acyclic graph: if two second sub-factor nodes are executed successively, only one in-edge and one out-edge exist, the two second sub-factor nodes can be combined into one factor node, which is called a new factor node. In this embodiment, the sub-factor nodes obtained by combining the plurality of second sub-factor nodes are referred to as a plurality of third sub-factor nodes. Thereafter, the decumbent values of the plurality of third sub-factor nodes may be calculated as the contribution values according to a conventional decumbent value calculation method. And determining contribution values of a plurality of second sub-factor nodes according to the corresponding relation between the third sub-factor nodes and the second sub-factor nodes.

For ease of understanding, the calculation process of Xia Puli values of the above-described plurality of second sub-factor nodes is illustrated below in conjunction with fig. 6.

The plurality of second sub-factor nodes obtained by decomposing the factor nodes, which are accepted in the actual access scenario illustrated in fig. 3, include A2 ^RT 、B ^RT 、D1 ^RT And D2 ^RT For convenience of description, the symbols A2, B, D1 and D2 are simplified. The actual execution order of the several second sub-factor nodes is as follows, according to the actual access procedure in the embodiment shown in fig. 3: b and D1 can be executed in parallel, A2 can be executed after B is executed, D2 can be executed after D1 is executed, and A2 can be executed after D2 is executed.

Based on the above execution order, the directed acyclic graph illustrated in fig. 6 is generated. The left-most square block without any word added represents a virtual starting node, and because the node to be executed at the beginning is the second sub-factor nodes B and D1 executed in parallel, the virtual starting node is added, and similarly, if the node to be executed at the end is also several second sub-factor nodes executed in parallel, a virtual ending node can be added. The directed acyclic graph thus constructed is in effect a graph structure reflecting the workflow of these second sub-factor nodes.

Thereafter, a second sub-factor node capable of merging is identified from the directed acyclic graph for a merging process. Since in the directed acyclic graph illustrated in fig. 6, the two sequentially executed nodes D1 and D2 have only one incoming side and one outgoing side, respectively, the two factor nodes are combined into a new factor node, denoted as < D1, D2>.

In practice, if the influence of two factor nodes i, j can be replaced by one factor node k, it can be said that this factor node k can be decomposed into two factor nodes i and j, and that the two factor nodes i and j can also be combined into a factor node k.

By the above-described merging process, the number of factor nodes involved in the calculation of the eplerian value can be reduced.

In the case shown in fig. 6, the factor node set fm= { A2, B, < D1, D2> } obtained after the merging process includes the three third sub-factor nodes therein.

Thereafter, xia Puli values of the plurality of third sub-factor nodes are determined. The Xia Puli value calculation process comprises two links, namely a counterfactual reasoning process and a calculation process of Xia Puli value based on the counterfactual reasoning result.

Specifically, inverse fact reasoning is performed according to the influence functions corresponding to the third sub-factor nodes respectively, so as to determine the corresponding cost function values under different abnormal factor node sets. And then, according to the corresponding value of the different abnormal factor nodes, determining Xia Puli values of a plurality of third sub-factor nodes.

The cost function is used for describing the influence of abnormal conditions of the plurality of third sub-factor nodes on the target index value of the target user access request. The different sets of outlier nodes correspond to different combinations of third sub-factor nodes of the plurality of third sub-factor nodes that are assumed to be outliers. The influence functions corresponding to the third sub-factor nodes are determined by the influence functions corresponding to the original factor nodes.

Assuming that the anomaly factor node set is denoted as S, the cost function is denoted as v (), for any third sub-factor node f in the factor node set Fm, its corresponding Xia Puli valueThe calculation can be performed according to the following formula:

wherein Fm\ { f } represents the factor-removing node f in the set Fm, the symbol-! The representation of the factorial operator is made, sign IThe symbol U represents the union set, i.e. the element nodes are added into the abnormal element node set S. v _Fm Represented is a cost function for the set of factor nodes Fm.

Wherein v is _Fm And (S) represents the current abnormal factor node set S, namely the target index of the target object is the RT value of the target user access request in the embodiment. Similarly, v _Fm (S U f) represents the value of the target index of the target object determined by the root cause after adding the factor node f into the abnormal factor node set S aiming at the current abnormal factor node set S.

As can be seen from the above formula, for any third sub-factor node f in the factor node set Fm, calculating the Xia Puli value of the third sub-factor node f requires knowing the value of the corresponding value of the third sub-factor node f for each possible abnormal factor node set S: v _Fm (S) and v _Fm (S∪f)。

For fm= { A2, B,<D1,D2>for example, assuming that the third sub-factor node f=a2 to which the Xia Puli value is currently to be calculated, the abnormal-factor node set S includes the following cases: s= { B }, s= {<D1,D2>}，S＝{B，<D1,D2>}，Wherein (1)>Representing an empty set. Calculating v under each of the above _Fm (S) and v _Fm And (S U f) and substituting the value into the formula to obtain the Xia Puli value of the third subfactor node A2. And calculate v under each S _Fm (S) and v _Fm The process of the value of (S.u.f) is the inverse fact reasoning process. It follows that the inverse facts inference process is a process of assuming a value of a function under each of the abnormal-factor node sets formed by the plurality of third sub-factor nodes.

In practice, the number of the cells to be processed,wherein (1)>And the target index value of the target user access request when S is the empty set is represented, namely, the target index value of the target user access request under the condition that all the third sub-factor nodes in Fm are normal (namely, the corresponding index values are normal). G (S) represents that only the third sub-factor node in S in Fm is abnormal (namely, the corresponding index value is abnormal), and other third sub-factor nodes are normal and target index values of target user access requests are obtained. Based on this, v _Fm (S∪f)-v _Fm (S) =g (S.

In the process of the inverse fact reasoning, influence functions corresponding to a plurality of third sub-factor nodes are needed to be used. In practice, the influence functions corresponding to the plurality of third sub-factor nodes are determined based on the influence functions of the respective corresponding original factor nodes, for example, the influence functions corresponding to the plurality of third sub-factor nodes are manually updated and determined based on the influence functions of the original factor nodes.

However, it should be noted that, although the influence functions corresponding to the multiple third sub-factor nodes describe the mutual influence relationships among the index values of the different third sub-factor nodes in a generalized manner, since the multiple third sub-factor nodes are from the multiple second sub-factor nodes, and the multiple second sub-factor nodes are only portions that remain unaffected by other factor nodes after the decomposition of the original factor node, in fact, there is no influence portion of the influence functions of one third sub-factor node that has any other third sub-factor node, only the index values corresponding to the third sub-factor nodes are described in general under the current assumption that the index values corresponding to the third sub-factor nodes are normal and abnormal, respectively.

A plurality of first examples received in FIG. 3The set of three-child nodes fm= { A2, B,<D1,D2>known (e.g. manually entered) G (S) =max (RT) _B ,RT _<D1,D2> )+RT _A2 . Wherein RT _i The RT (target indicator) value representing the third sub-factor node i. The influence functions of the third sub-factor nodes give the values of the three third sub-factor nodes when RT is normal and abnormal.

In the actual access situation shown in FIG. 3, if the third sub-factor node B is located in the set S, i.e., if the RT of the third sub-factor node B is abnormal, the RT _B Otherwise, if the third subfraction node B is not located in the set S, i.e., if the RT of the third subfraction node B is normal, RT _B =1s. Similarly, if the third subfraction node A2 is located within the set S, RT _A2 Otherwise, if the third subfraction node A2 is not located in the set S, RT _A2 =1s. If the third sub-factor node<D1,D2>Within set S, RT _<D1,D2> =4s, otherwise, if the third sub-factor node<D1,D2>Not located in set S, RT _<D1,D2> ＝2s。

Based on the influence functions corresponding to the third sub-factor nodes, the corresponding cost function values under different abnormal factor node sets S can be determined through the calculation formula, and Xia Puli values corresponding to the third sub-factor nodes are calculated.

After obtaining Xia Puli values for the plurality of third sub-factor nodes, determining Xia Puli values for the plurality of second sub-factor nodes based on the eplerian values for the plurality of third sub-factor nodes, comprising:

for a new factor node obtained after merging in the plurality of third sub-factor nodes, determining at least two second sub-factor nodes in the plurality of second sub-factor nodes, wherein the at least two second sub-factor nodes merge the new factor node; determining index value variation corresponding to each of the at least two second sub-factor nodes, wherein the index value variation of any one of the second sub-factor nodes is determined according to the known abnormal index value and the known normal index value of any one of the second sub-factor nodes; determining the distribution proportion of the contribution value of the new factor node according to the ratio of the index value variation amounts corresponding to the at least two second factor nodes; and distributing the eplerian value of the new factor node to the at least two second sub-factor nodes according to the distribution proportion.

In the above example, the plurality of third sub-factor nodes includes: a, B, < D1, D2> and a plurality of second sub-factor nodes comprising: four factor nodes a, B, D1, D2, where < D1, D2> corresponds to D1 and D2.

Based on this, after obtaining Xia Puli values of the three factor nodes a, B, < D1, D2>, there is no need to further process the eplerian values corresponding to the factor nodes a and B, and for the factor nodes < D1, D2>, there is a need to determine the index value change amounts of the factor node D1 and the factor node D2, assuming that the index value change amounts are denoted by Δ1 and Δ2, respectively, and further determine the ratio of the index value change amounts of the two: and (2) splitting the Xia Puli value of the factor node of < D1, D2> into two parts according to the proportion, and correspondingly assigning values to the factor node D1 and the factor node D2.

The index value change Δ1 of the factor node D1 is determined in the following manner: the abnormality index value of the factor node D1-the normal index value of the factor node D1. In the above example scenario, the index corresponding to the factor node D1 is RT, and when the abnormality index value of the factor node D1 is RT abnormality (greater than the set threshold value) of the target user access request, the actually known RT value corresponding to the factor node D1, that is, the RT index value corresponding to the factor node D, is 2s in the above example. And the normal index value of the factor node D1 may pass through the average normal index value of the factor node D1 over the history period. The historical time period can be set in a self-defined way, namely, the time period before the target user access request is triggered. For example, in a certain period of time before the triggering of the target user access request, if the RT of the user access requests of the plurality of service interfaces corresponding to the factor node D are all lower than the set threshold, the average RT value of the service interfaces corresponding to the factor node D of the user access requests can be counted as the normal RT value of the factor node D1 under the target user access request, and the assumption is 1s. Δ1=abnormal index value-normal index value=2 s-1 s=1 s. If the normal index value is greater than the abnormal index value, the normal index value is corrected to be equal to the abnormal index value so as to avoid the occurrence of negative numbers.

In the above example, since the two factor nodes D1 and D2 are actually the same, Δ1=Δ2=1. Both the < D1, D2> corresponding eplerian values bisect.

After obtaining the Xia Puli values of the plurality of second sub-factor nodes, the root factor node causing the abnormal RT value of the target user access request may be determined based on the steps in the other embodiments.

In the above-described embodiment, in calculating the contribution values (Xia Puli values) of the plurality of third sub-factor nodes, it is necessary to traverse the various abnormal-factor node sets S, and when the number of the plurality of third sub-factor nodes is large, the calculation amount is large. In an embodiment of the present invention, for a specific form of the cost function, a method for accelerating the calculation of the contribution values (Xia Puli values) of the plurality of third sub-factor nodes is provided, as shown in fig. 7.

Fig. 7 is a flowchart of a method for determining a contribution value of a sub-factor node according to an embodiment of the present invention, where, as shown in fig. 7, the method may include the following steps:

701. and determining expressions of the cost functions corresponding to the third sub-factor nodes, wherein the calculation items related in the expressions of the cost functions consist of maximum value calculation items and addition calculation items.

702. And determining index value increment of at least two third sub-factor nodes corresponding to the maximum value calculation item, and adding the index value increment of the third sub-factor nodes corresponding to the calculation item.

703. And determining the index value increment of the third sub-factor node corresponding to the addition calculation item as the Xia Puli value of the third sub-factor node corresponding to the addition calculation item.

704. Ascending sort is carried out on index value increment of at least two third sub-factor nodes corresponding to the maximum value calculation item, and the index value increment of the at least two third sub-factor nodes is distributed to the at least two third sub-factor nodes according to the following rule to determine Xia Puli values of the at least two third sub-factor nodes: the difference between the ith index value increment and the (i-1) th index value increment is averagely distributed to the ith to nth third sub-factor nodes.

Wherein n is the number of the at least two third sub-factor nodes, i e [1, n ], and the 0 th anomaly index value is set to zero.

In the present embodiment, a way of accelerating the calculation of Xia Puli values of the third subfactor nodes is provided for the cost function composed of the maximum value calculation term and the addition calculation term. In the application scenario of the micro-service system, under the condition of determining the root cause of RT abnormality which leads to the user access request, the corresponding cost functions are embodied as the expression: is composed of a maximum value calculation term and a summation calculation term, such as G (S) =max (RT in the previous example _B ,RT _<D1,D2> )+RT _A2 Is composed of maximum value calculation term and addition calculation term, due to cost functionSo that the form of v (S) is consistent with the form of G (S), and is also composed of a maximum value calculation term and a sum calculation term, it is possible to apply the acceleration calculation method provided in the present embodiment.

In the above-described examples of the present invention,in fact, the expression of (a) is the same as G (S), and is composed of the two items, except that since it is assumed that the index values of the plurality of third sub-factor nodes are all normal at this time, the index value of each third sub-factor node takes the normal index value, in the example shown in FIG. 3, in the case that the RT value of the target user access request is normal, the RT of the third sub-factor node B _B =1s, third subfactor node<D1,D2>RT of (2) _<D1,D2> =2s, i.e. two calls to RT of factor node D. RT of third subfactor node A2 _A2 =1s. Therefore->The two values respectively correspond to the calculated result of the maximum value item and the sum item pairThe corresponding values.

Based on this, v (S) =max (RT _B -2，RT _<D1,D2> -2)+[RT _A2 -1]Consists of a maximum value calculation term and a summation calculation term. Wherein in case of an abnormal RT value of the target user access request, as can be seen from the example in FIG. 3, RT _B ＝2s，RT _<D1,D2> ＝4s，RT _A2 =2s, thus v (S) =max (2-2, 4-2) + [2-1]MAX (0, 2) +1. After the cost function expression is obtained, a corresponding third sub-factor node Xia Puli value may be calculated according to steps 703 and 704. In fact, as exemplified above, one maximum value calculation term corresponds to at least two third sub-factor nodes, and one or more maximum value calculation terms may be included in the cost function, and different maximum value calculation terms perform calculation processing respectively. The sum term can be regarded as a special maximum value term: the maximum value calculation term corresponding to only one third sub-factor node.

In the calculation process, firstly, when the value function is determined to be formed by a maximum value calculation item and a summation calculation item based on the expressions of the value functions corresponding to the third sub-factor nodes, the index value increment of at least two third sub-factor nodes corresponding to the maximum value calculation item and the index value increment of the third sub-factor nodes corresponding to the summation calculation item can be determined, so that the expression of the value function is substituted, namely the value of each variable in the expression of the value function is determined. The index value increment refers to an increment between an abnormal index value and a normal index value, for example, the normal index value of the third sub-factor node A2 in the above example is 1s, and the abnormal index value is 2s. In the foregoing, in the process of determining the root cause of the target user access request, the abnormal index value is the value of the index corresponding to each third sub-factor node when the target user access request with the abnormal target index is executed, in fact, each third sub-factor node often corresponds to different indexes of different service interfaces or different indexes of different machines, and the abnormal index value of the third sub-factor node is the index value of these service interfaces and machines. Correspondingly, the normal index value is the value of the index corresponding to each third sub-factor node when the user access request with the normal target index is executed.

Then, for the addition calculation item, directly determining the index value increment of the third sub-factor node corresponding to the addition calculation item as the Xia Puli value of the third sub-factor node corresponding to the addition calculation item.

For the maximum value calculation item, ascending order is carried out on index value increment of at least two third sub-factor nodes corresponding to the maximum value calculation item, and the index value increment of the at least two third sub-factor nodes is distributed to the at least two third sub-factor nodes according to the following rule to determine Xia Puli values of the at least two third sub-factor nodes: the difference between the ith index value increment and the (i-1) th index value increment is averagely distributed to the ith to nth third sub-factor nodes. Wherein n is the number of the at least two third sub-factor nodes, i e [1, n ], and the 0 th anomaly index value is set to zero.

In the example of v (S) =max (0, 2) +1 described above, the Xia Puli value of the third sub-factor node A2 is directly determined to be 1. The execution procedure according to the allocation rule described above for MAX (0, 2) is: 2-0=2, assigned to the third subfactor node < D1, D2>; 0-0=0, and are equally distributed to the third sub-factor nodes B and < D1, D2>, and finally, the third sub-factor nodes B and < D1, D2> respectively accumulate the distribution results to obtain Xia Puli values of the two third sub-factor nodes respectively: 0,2.

Thereafter, respective Xia Puli values of the corresponding plurality of second sub-factor nodes are determined based on the respective Xia Puli values of the plurality of third sub-factor nodes. In the above example, the second sub-factor nodes are A2, B, D1, D2, xia Puli values are: 1. 0, 1. The determination of the Xia Puli values corresponding to D1 and D2 are described with reference to the other embodiments.

Thereafter, the Xia Puli values of the plurality of original factor nodes are determined further based on the Xia Puli values of the plurality of second sub-factor nodes. The plurality of original factor nodes are A, B, C, D, and the universe values are respectively: 1. 0, 2. Wherein, since the original factor node C has been culled when the factor node is decomposed, its Xia Puli value defaults to 0. Where the primordial factor node D actually includes D1 and D2, so its Xia Puli value is the sum of the Xia Puli values corresponding to D1 and D2, respectively. And the original factor node with the eplery value of 0 is not the root factor node, and the root factor node is output in the original factor node with the non-zero value according to the sequence.

The description above is made of the eplerian value calculation process of the third sub-factor node corresponding to the maximum value calculation item taking the example case shown in fig. 3 as an example, and the following briefly describes the generic calculation process thereof:

Assume that a certain maximum value calculation term is: max (x_1, x_2,) x_n, where x_1, x_2,) x_n represents the abnormality index values of the n third sub-factor nodes. And, x_0=0 is set. Firstly, the abnormality index values may be sorted in ascending order, and the sorting result is assumed to be: x_1, x_2,..x_n. If the two values are the same, the order of the sorting positions of the two values is not limited, and any one of the two values may be the former one. Thereafter, the values of (x_n-x_ { n-1 }) are assigned to the nth third sub-factor node in reverse order, the values of (x_ { n-1} -x_ { n-2 }) are assigned on average to the nth-1 third sub-factor node and the nth third sub-factor node, and the values of (x_i-x_ { i-1 }) are assigned on average to all third sub-factor nodes within the set { i, i+1. The time complexity of this calculation method is O (n log n), and it is seen that the effect of accelerating the calculation can be achieved.

Fig. 8 is an application schematic diagram of a root cause determining method according to an embodiment of the present invention, where in fig. 8, the method may be performed in a cloud, and a cluster formed by a plurality of computing nodes may be deployed in the cloud, and each computing node has processing resources such as computation and storage. In this embodiment, the cluster provides a root cause analysis service, and a service interface can be provided for calling. The service interfaces include a software development kit (Software Development Kit, abbreviated as SDK), an application program interface (Application Programming Interface, abbreviated as API), and the like.

Taking the micro service system as an example, a plurality of services contained in the micro service system can also be deployed in the cloud, and the micro service system is assumed to be deployed in another cluster in the cloud. The cluster for providing the root cause analysis service can be regularly called to upload log data, so that the root cause analysis service cluster can timely find out a target user access request with abnormal target indexes and perform root cause determination processing on the target user access request. In order to simplify the description, as shown in fig. 8, computing nodes in the root cause service cluster may sequentially perform processing of the processes of service interface performance index statistics, abnormal service interface detection, access feature pattern extraction, factor node contribution value calculation, root cause factor node output, and the like. The factor node contribution value calculation process involves the processes of factor node decomposition, workflow construction of a directed acyclic graph, factor node merging, inverse fact reasoning, contribution value calculation and the like. These processes may be referred to the description of the related embodiments, and are not described herein. Finally, the root cause of the abnormality of the target index of the target user access request can be obtained.

The root cause determination device of one or more embodiments of the present invention will be described in detail below. Those skilled in the art will appreciate that these means may be configured by the steps taught by the present solution using commercially available hardware components.

Fig. 9 is a schematic structural diagram of a root cause determining apparatus according to an embodiment of the present invention, as shown in fig. 9, where the apparatus includes: a factor establishing module 11, a factor decomposing module 12 and a root cause determining module 13.

The factor establishing module 11 is configured to determine, according to multiple indexes associated with multiple service interfaces invoked in an application system by a target user access request with abnormal target indexes, multiple original factor nodes related to the target index abnormality of the target user access request, where the multiple indexes include the target index.

A factor decomposition module 12, configured to determine an influence function corresponding to each of the plurality of original factor nodes, where the influence function is used to describe a dependency relationship between different factor nodes; and decomposing the plurality of original factor nodes according to the influence functions corresponding to the plurality of original factor nodes to obtain a plurality of first sub-factor nodes which correspond to the plurality of original factor nodes and are influenced by other factor nodes and a plurality of second sub-factor nodes which are not influenced by other factor nodes.

A root cause determining module 13, configured to determine contribution values of the plurality of second sub-factor nodes, where the contribution values reflect contribution degrees of the corresponding second sub-factor nodes to the target index anomaly that causes the target user access request; and determining a factor node causing the abnormality of the target index value of the target user access request from the plurality of original factor nodes according to the contribution values of the plurality of second sub-factor nodes.

The apparatus shown in fig. 9 may perform the steps provided in the foregoing embodiments, and detailed execution and technical effects are referred to the descriptions in the foregoing embodiments, which are not repeated herein.

In one possible design, the root cause determining device shown in fig. 9 described above may be implemented as an electronic device. As shown in fig. 10, the electronic device may include: a processor 21, a memory 22, a communication interface 23. Wherein the memory 22 has stored thereon executable code which, when executed by the processor 21, causes the processor 21 to at least implement the root cause determination method as provided in the previous embodiments.

Additionally, embodiments of the present invention provide a non-transitory machine-readable storage medium having stored thereon executable code that, when executed by a processor of an electronic device, causes the processor to at least implement a root cause determination method as provided in the previous embodiments.

The above described apparatus embodiments are merely illustrative, wherein the network elements illustrated as separate components may or may not be physically separate. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.

From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by adding necessary general purpose hardware platforms, or may be implemented by a combination of hardware and software. Based on such understanding, the foregoing aspects, in essence and portions contributing to the art, may be embodied in the form of a computer program product, which may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A root cause determination method, comprising:

2. The method of claim 1, wherein the determining, from the plurality of source factor nodes, a root factor node that causes a target index value abnormality of the target user access request according to the contribution values of the plurality of second sub-factor nodes, comprises:

determining contribution values corresponding to the original factor nodes respectively corresponding to the second sub-factor nodes according to the contribution values of the second sub-factor nodes;

and determining a root factor node which causes the abnormality of the target index value of the target user access request from the plurality of original factor nodes according to the contribution values corresponding to the plurality of original factor nodes.

3. The method according to claim 1, wherein the method further comprises:

and if the target index of the target service interface in the application system is abnormal in the target time period, acquiring the target user access request for calling the target service interface in the target time period, wherein the plurality of service interfaces called by the target user access request comprise the target service interface.

4. The method of claim 3, wherein the determining the contribution values of the plurality of second sub-factor nodes comprises:

If the target user access requests for calling the target index abnormality of the target service interface in the target time period are multiple, determining contribution values of a plurality of second sub-factor nodes corresponding to the target user access requests;

the determining, from the plurality of original factor nodes, a factor node that causes abnormality of a target index value of the target user access request according to the contribution values of the plurality of second sub-factor nodes, including:

determining an average contribution value of a target second sub-factor node according to contribution values corresponding to the target user access requests on the target second sub-factor node, wherein the target second sub-factor node is any one of a plurality of second sub-factor nodes corresponding to the target user access requests;

and determining the factor node causing the abnormality of the target index values of the target user access requests from the original factor nodes corresponding to the target user access requests according to the average contribution values of the second different factor nodes.

5. A method according to claim 3, characterized in that the method further comprises:

if the number of the user access requests for calling the target index abnormality of the target service interface is multiple in the target time period, counting access characteristic modes corresponding to the multiple user access requests, wherein the access characteristic modes comprise attribute values of multiple characteristic attributes;

Screening user access requests conforming to the access characteristic mode from the plurality of user access requests;

the determining a plurality of primitive factor nodes related to target index anomalies of the target user access request includes:

and if the screened user access request comprises the target user access request, determining a plurality of original factor nodes related to target index abnormality of the target user access request, wherein the plurality of original factor nodes comprise the plurality of characteristic attributes.

6. The method of claim 1, wherein the determining the contribution values of the plurality of second sub-factor nodes comprises:

generating a directed acyclic graph corresponding to the plurality of second sub-factor nodes according to the execution sequence of the plurality of second sub-factor nodes;

combining the second sub-factor nodes which can be combined in the directed acyclic graph to obtain a plurality of third sub-factor nodes, wherein the plurality of third sub-factor nodes comprise second sub-factor nodes which are not combined and new factor nodes obtained after combination;

determining contribution values of the plurality of third sub-factor nodes;

and determining the contribution values of the plurality of second sub-factor nodes according to the contribution values of the plurality of third sub-factor nodes.

7. The method of claim 6, wherein determining the contribution values of the plurality of second sub-factor nodes from the contribution values of the plurality of third sub-factor nodes comprises:

for the new factor node obtained after the combination in the plurality of third sub-factor nodes, determining at least two second sub-factor nodes in the plurality of second sub-factor nodes, wherein the new factor node is combined with the at least two second sub-factor nodes;

determining index value variation corresponding to each of the at least two second sub-factor nodes, wherein the index value variation of any one of the second sub-factor nodes is determined according to the known abnormal index value and the known normal index value of any one of the second sub-factor nodes;

determining the distribution proportion of the contribution value of the new factor node according to the ratio of the index value variation amounts corresponding to the at least two second factor nodes;

and distributing the contribution value of the new factor node to the at least two second sub-factor nodes according to the distribution proportion.

8. The method of claim 6, wherein the determining the contribution values of the plurality of third sub-factor nodes comprises:

performing inverse fact reasoning according to influence functions corresponding to the third sub-factor nodes respectively to determine corresponding cost function values under different abnormal factor node sets, wherein the cost function is used for describing the influence of abnormal conditions of the third sub-factor nodes on target index values of the target user access request, the different abnormal factor node sets correspond to different combinations of the third sub-factor nodes which are supposed to be abnormal in the third sub-factor nodes, and the influence functions corresponding to the third sub-factor nodes are determined by the influence functions corresponding to the original factor nodes respectively;

And determining the eplerian values of the plurality of third sub-factor nodes as the contribution values of the plurality of third sub-factor nodes according to the corresponding cost function values under the different abnormal factor node sets.

9. The method of claim 6, wherein the determining the contribution values of the plurality of third sub-factor nodes comprises:

determining expressions of cost functions corresponding to the plurality of third sub-factor nodes, wherein the cost functions are used for describing the influence of abnormal conditions of the plurality of third sub-factor nodes on target index values of the target user access requests;

if the calculation term involved in the expression of the cost function is composed of a maximum calculation term and a summation calculation term, the Xia Puli values of the plurality of third sub-factor nodes are determined as follows:

determining index value increment of at least two third sub-factor nodes corresponding to the maximum value calculation item and index value increment of the third sub-factor nodes corresponding to the addition calculation item; wherein the index value increment refers to an increment between an abnormal index value and a normal index value;

determining the index value increment of the third sub-factor node corresponding to the addition calculation item as the Xia Puli value of the third sub-factor node corresponding to the addition calculation item;

Ascending sort is performed on index value increments of at least two third sub-factor nodes corresponding to the maximum value calculation item, and the index value increments of the at least two third sub-factor nodes are distributed to the at least two third sub-factor nodes according to the following rule to determine Xia Puli values of the at least two third sub-factor nodes:

the difference value between the ith index value increment and the (i-1) th index value increment is averagely distributed to the ith to nth third sub-factor nodes; wherein n is the number of the at least two third sub-factor nodes, i e [1, n ], and the 0 th index value increment is set to zero.

10. The method according to any one of claims 1 to 9, wherein the determining a plurality of primitive nodes related to target-index anomalies of the target-user access request according to a plurality of indexes associated with a plurality of service interfaces invoked in an application by the target-user access request, comprises:

determining at least one performance index of a machine corresponding to each service interface in the plurality of service interfaces;

Determining target performance indexes of a target service interface and target performance indexes of a machine corresponding to the target service interface in the plurality of original factor nodes, wherein the target service interface is any one of the plurality of service interfaces, the target performance index of the target service interface is any one of at least one performance index corresponding to the target service interface, and the target performance index of the machine corresponding to the target service interface is any one of at least one performance index of the machine corresponding to the target service interface.

11. An electronic device, comprising: a memory, a processor, a communication interface; wherein the memory has stored thereon executable code which, when executed by the processor, causes the processor to perform the root cause determination method of any of claims 1 to 10.

12. A non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to perform the root cause determination method of any of claims 1 to 10.