CN117061330A

CN117061330A - Root cause positioning model training method, root cause positioning method and root cause positioning device

Info

Publication number: CN117061330A
Application number: CN202311093384.1A
Authority: CN
Inventors: 陶洪元; 余航; 李建国
Original assignee: Alipay Hangzhou Information Technology Co Ltd
Current assignee: Alipay Hangzhou Information Technology Co Ltd
Priority date: 2023-08-28
Filing date: 2023-08-28
Publication date: 2023-11-14

Abstract

One or more embodiments of the present disclosure provide a root cause positioning model training method, a root cause positioning method and a root cause positioning device, which relate to the technical field of computers. The method includes: constructing a causal graph of a target system, the causal graph including a plurality of nodes and an adjacency matrix of the plurality of nodes; inputting the node data of the plurality of nodes, in combination with the adjacency matrix, into a root cause positioning model to obtain prediction data of the plurality of nodes; and adjusting, according to the loss between the prediction data and the node data, parameters to be learned in the root cause positioning model to complete training of the root cause positioning model, wherein the root cause positioning model is used for locating a failed container in the target system. According to the scheme provided by the specification, the operation data need not be labeled in advance for training the root cause positioning model, and the causal relationships among the containers can be fully learned by combining the causal graph of the target system, so that a root cause positioning model capable of accurate positioning is trained.

Description

Root cause positioning model training method, root cause positioning method and root cause positioning device

Technical Field

One or more embodiments of the present disclosure relate to the field of computer technology, and in particular, to a root cause positioning model training method, a root cause positioning method, and a root cause positioning device.

Background

With the advent of micro-services and cloud computing, when an application service or a key performance indicator in a micro-service system becomes abnormal, operation and maintenance personnel need to locate the root cause accurately and recover the failed container quickly, so as to ensure the stability and reliability of the micro-service system while minimizing downtime and performance degradation.

In the related art, a neural network model is generally used for root cause localization. However, such models need to be trained on manually labeled data, the labeling process incurs a high labor cost, and the models cannot learn the complete causal structure of the system, so their accuracy is poor.

Disclosure of Invention

In view of this, one or more embodiments of the present disclosure provide a root cause positioning model training method, root cause positioning method and apparatus.

In order to achieve the above object, one or more embodiments of the present disclosure provide the following technical solutions:

according to a first aspect of one or more embodiments of the present specification, a root cause positioning model training method is provided, including:

constructing a causal graph of the target system, wherein the causal graph comprises a plurality of nodes and an adjacency matrix of the plurality of nodes; wherein the plurality of nodes are for indicating containers in the target system and the adjacency matrix is for indicating causal relationships between the plurality of containers;

inputting the node data of the plurality of nodes, in combination with the adjacency matrix, into a root cause positioning model to obtain the prediction data of the plurality of nodes; wherein the node data of each node in the plurality of nodes comprises the operation data of the container indicated by the node;

and according to the loss between the predicted data and the node data, adjusting parameters to be learned in the root cause positioning model to complete training of the root cause positioning model, wherein the root cause positioning model is used for positioning a container with faults in a target system.

According to a second aspect of one or more embodiments of the present specification, there is provided a root cause positioning method, including:

under the condition that a target system fails, acquiring operation data of each of a plurality of containers in the target system;

constructing a causal graph of the target system, wherein the causal graph comprises a plurality of nodes and an adjacency matrix of the plurality of nodes; wherein the plurality of nodes are for indicating a plurality of containers and the adjacency matrix is for indicating causal relationships between the plurality of containers;

inputting node data of the plurality of nodes, in combination with the adjacency matrix, into an encoder of a root cause positioning model to obtain respective intervention probability distributions of the plurality of nodes; wherein the node data of each node in the plurality of nodes comprises the operation data of the container indicated by the node;

and locating the failed container in the target system according to the intervention probability distribution of each of the plurality of nodes.

According to a third aspect of one or more embodiments of the present specification, there is provided a root cause positioning model training apparatus, comprising:

a building module for building a causal graph of the target system, the causal graph comprising a plurality of nodes and an adjacency matrix of the plurality of nodes; wherein the plurality of nodes are for indicating containers in the target system and the adjacency matrix is for indicating causal relationships between the plurality of containers;

the input module is used for inputting the node data of the plurality of nodes, in combination with the adjacency matrix, into the root cause positioning model to obtain the prediction data of the plurality of nodes; wherein the node data of each node in the plurality of nodes comprises the operation data of the container indicated by the node;

and the adjusting module is used for adjusting parameters to be learned in the root cause positioning model according to the loss between the prediction data and the node data so as to complete the training of the root cause positioning model, wherein the root cause positioning model is used for positioning a container with a fault in the target system.

According to a fourth aspect of one or more embodiments of the present specification, there is provided a root cause positioning device comprising:

the acquisition module is used for acquiring the operation data of each of a plurality of containers in the target system under the condition that the target system fails;

a building module for building a causal graph of the target system, the causal graph comprising a plurality of nodes and an adjacency matrix of the plurality of nodes; wherein the plurality of nodes are for indicating a plurality of containers and the adjacency matrix is for indicating causal relationships between the plurality of containers;

the input module is used for inputting the node data of the plurality of nodes, in combination with the adjacency matrix, into the encoder of the root cause positioning model to obtain the respective intervention probability distributions of the plurality of nodes; wherein the node data of each node in the plurality of nodes comprises the operation data of the container indicated by the node;

and the positioning module is used for positioning the container with the fault in the target system according to the intervention probability distribution of each of the plurality of nodes.

According to a fifth aspect of one or more embodiments of the present specification, there is provided an electronic device, comprising:

a processor;

a memory for storing processor-executable instructions;

wherein the processor implements a method as in the first aspect and/or a method as in the second aspect by executing executable instructions.

According to a sixth aspect of one or more embodiments of the present description, a computer-readable storage medium is presented, on which computer instructions are stored, which instructions, when executed by a processor, implement steps as the method of the first aspect and/or steps as the method of the second aspect.

The root cause positioning model training method provided by the specification constructs a causal graph of a target system, wherein the causal graph comprises a plurality of nodes and an adjacency matrix of the plurality of nodes. The prediction data of the plurality of nodes is obtained by inputting the node data of the plurality of nodes, in combination with the adjacency matrix, into the root cause positioning model. Parameters to be learned in the root cause positioning model are then adjusted according to the loss between the prediction data and the node data, thereby completing the training of the root cause positioning model. According to the scheme provided by the specification, the operation data need not be labeled in advance for training the root cause positioning model, and the causal relationships among the containers can be fully learned by combining the causal graph of the target system, so that a root cause positioning model capable of accurate positioning is trained.

Drawings

FIG. 1 is a flow chart of a root cause positioning model training method according to an exemplary embodiment.

FIG. 2 is a flowchart of a method for generating prediction data according to an exemplary embodiment.

FIG. 3 is a flow chart of a root cause positioning method according to an exemplary embodiment.

FIG. 4 is a schematic diagram of an apparatus according to an exemplary embodiment.

FIG. 5 is a schematic diagram of a root cause positioning model training device according to an exemplary embodiment.

FIG. 6 is a schematic diagram of another root cause positioning device according to an exemplary embodiment.

Detailed Description

Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with one or more embodiments of the present specification. Rather, they are merely examples of apparatus and methods consistent with aspects of one or more embodiments of the present description as detailed in the accompanying claims.

It should be noted that: in other embodiments, the steps of the corresponding method are not necessarily performed in the order shown and described in this specification. In some other embodiments, the method may include more or fewer steps than described in this specification. Furthermore, individual steps described in this specification, in other embodiments, may be described as being split into multiple steps; while various steps described in this specification may be combined into a single step in other embodiments.

The root cause positioning methods in the related art mainly have the following two problems:

First, the root cause localization models in the related art are limited in causal inference. The related art primarily learns causal structures from data and then relies on graph traversal or anomaly score ranking to identify root cause nodes, so it does not accomplish root cause localization directly through causal inference. Furthermore, graph traversal and anomaly score ranking can be computationally intensive and time consuming in a large-scale system containing a large number of containers.

Second, training the root cause positioning models in the related art generally requires manually labeled data. In practice, however, manually analyzing root causes takes a large amount of time and resources, so labeled data tends to account for only a small proportion of the total data.

In view of this, the present description provides a solution that constructs a causal graph of a target system, the causal graph comprising a plurality of nodes and an adjacency matrix of the plurality of nodes. The prediction data of the plurality of nodes is obtained by inputting the node data of the plurality of nodes, in combination with the adjacency matrix, into the root cause positioning model. Parameters to be learned in the root cause positioning model are adjusted according to the loss between the prediction data and the node data, thereby completing the training of the root cause positioning model. The plurality of nodes in the causal graph are used for indicating containers in the target system, the node data of each node comprises the operation data of the container indicated by the node, and the adjacency matrix is used for indicating causal relationships among the plurality of containers. According to the scheme provided by the specification, the operation data need not be labeled in advance for training the root cause positioning model, and the causal relationships among the containers can be fully learned by combining the causal graph of the target system, so that a root cause positioning model capable of accurate positioning is trained.

In addition, the embodiments of the specification also provide a root cause positioning method: when a target system fails, root cause positioning can be completed by inputting the node data, in combination with the adjacency matrix, into the encoder of the root cause positioning model. Because only the encoder takes part in the computation, the amount of computation is small, the response is fast and the positioning efficiency is high, while positioning accuracy is maintained.

The present exemplary embodiment will be described in detail below with reference to the accompanying drawings and examples.

First, the embodiments of the present specification provide a root cause positioning model training method, which can be performed by any electronic device.

FIG. 1 is a flow chart of a root cause positioning model training method according to an exemplary embodiment. As shown in fig. 1, the root cause positioning model training method provided in the embodiment of the present disclosure includes the following steps.

S101, constructing a causal graph of a target system, wherein the causal graph comprises a plurality of nodes and an adjacency matrix of the plurality of nodes.

It should be noted that the target system may be a system based on any architecture, for example, a distributed system, a micro-service system, or the like. The target system includes a plurality of containers, each container for packaging a component of the system that is capable of independent operation. There may be some causal relationship between these containers. For example, the throughput of container 1 may affect the memory usage of container 2.

It should be noted that the causal graph is a directed acyclic graph, and includes a plurality of nodes and directed edges, where the directed edges are used to indicate causal relationships between the nodes. For example, in the causal graph, there is a directed edge between node 1 and node 2 that points from node 1 to node 2, then node 1 is considered to be the parent node of node 2, and node 1 can have a direct impact on node 2, with causal relationships between node 1 and node 2.
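Because the causal graph is required to be a directed acyclic graph, a candidate set of directed edges can be checked before use. The following is a minimal sketch, assuming edges are given as (parent, child) pairs with 0-based node indices; the function name `is_dag` and this input format are illustrative assumptions and do not come from the specification.

```python
from collections import deque


def is_dag(num_nodes, edges):
    """Return True if the directed graph has no cycle (Kahn's algorithm).

    edges is a list of (parent, child) pairs with 0-based node indices.
    This representation is an assumption made for illustration.
    """
    children = [[] for _ in range(num_nodes)]
    in_degree = [0] * num_nodes
    for parent, child in edges:
        children[parent].append(child)
        in_degree[child] += 1

    # Start from nodes without parents and remove them layer by layer.
    queue = deque(i for i in range(num_nodes) if in_degree[i] == 0)
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for child in children[node]:
            in_degree[child] -= 1
            if in_degree[child] == 0:
                queue.append(child)

    # Every node gets visited exactly when the graph contains no cycle.
    return visited == num_nodes
```

A chain such as node 0 → node 1 → node 2 passes this check, whereas a pair of nodes pointing at each other does not.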

In the embodiment of the present specification, a plurality of nodes in the causal graph are used to indicate containers in the target system, and accordingly, directed edges in the causal graph are used to indicate causal relationships between the plurality of containers.

The directed edges may be represented by an adjacency matrix. The adjacency matrix $A$ may be a $d \times d$ binary matrix, where $d$ is the number of nodes in the causal graph. Each element $a_{ij}$ of the adjacency matrix $A$ represents the causal relationship between node i and node j in the causal graph: if node j is the parent node of node i, $a_{ij} = 1$; otherwise, $a_{ij} = 0$.
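The convention above (row index = child, column index = parent) can be sketched as follows. This is a minimal illustration that assumes 0-based node indices and a list of (parent, child) edge pairs; the helper name `build_adjacency` is hypothetical.

```python
def build_adjacency(num_nodes, edges):
    """Build a d x d binary matrix with a[i][j] = 1 if node j is a parent of node i."""
    a = [[0] * num_nodes for _ in range(num_nodes)]
    for parent, child in edges:
        # Row = child (node i), column = parent (node j).
        a[child][parent] = 1
    return a


# Chain: node 0 -> node 1 -> node 2 (0-based indices, illustrative only).
A = build_adjacency(3, [(0, 1), (1, 2)])
```

With this convention, row i of the matrix lists the parents of node i, which is the form used when the matrix is later multiplied element-wise by per-node intervention indicators.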

In the present embodiment, the causal relationship between containers is a one-way transfer, for example, a reduced throughput of container 1 would cause a reduced memory usage of container 2, while a reduced memory usage of container 2 would not cause a change in throughput of container 1.

Furthermore, the causal relationships between containers in the embodiments of the present description satisfy the Markov property. For example, for container 1, container 2 and container 3, if container 1 directly affects container 2 and container 2 directly affects container 3, it is assumed that there is no direct effect between container 1 and container 3. That is, in the causal graph, each node is conditionally independent of its non-descendant nodes given its parent nodes, namely $p(V_i \mid pa(V_i), nd(V_i)) = p(V_i \mid pa(V_i))$, wherein $V_i$ represents any node in the causal graph, i represents the number of the node, p represents the conditional probability, $pa(V_i)$ represents the set of parent nodes of node i, and $nd(V_i)$ represents the set of non-descendant nodes of node i.
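The two node sets that appear in the Markov condition, the parent set and the non-descendant set, can be read off the adjacency matrix. The sketch below assumes the convention that entry [i][j] equals 1 when node j is a parent of node i, uses 0-based indices, and excludes the node itself from its non-descendants; the function names are hypothetical.

```python
def parents(a, i):
    """Parent set of node i: columns j with a[i][j] == 1."""
    return {j for j, flag in enumerate(a[i]) if flag == 1}


def descendants(a, i):
    """All nodes reachable from node i along directed edges."""
    found = set()
    stack = [i]
    while stack:
        node = stack.pop()
        for child in range(len(a)):
            if a[child][node] == 1 and child not in found:
                found.add(child)
                stack.append(child)
    return found


def non_descendants(a, i):
    """Every node other than i that is not a descendant of i."""
    return set(range(len(a))) - descendants(a, i) - {i}


# Chain: node 0 -> node 1 -> node 2.
A = [[0, 0, 0],
     [1, 0, 0],
     [0, 1, 0]]
```

For the chain above, node 2 has parent set {1} and non-descendant set {0, 1}, so the condition states that node 2 is independent of node 0 once node 1 is known.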

In the present embodiment, a failure of a container may be regarded as a change in the container's operational data which, reflected in the causal graph, severs the causal relationship between the node corresponding to the failed container and its parent node. The operation data of the container may be metrics of the container, for example, a memory usage metric, a container throughput metric, a container network delay metric, a container request response success rate metric, and the like.

Illustratively, assume that under normal conditions a reduction in the throughput of container 1 causes the memory usage of container 2 to decrease. When container 2 fails, a reduction in the throughput of container 1 may no longer cause the memory usage of container 2 to decrease, or may cause the memory usage of container 2 to decrease to an unreasonable extent.

In fact, regardless of the actual failure, the container data of each normal container always conforms to its observed probability distribution, whereas the container data of the failed container will no longer conform to its observed probability distribution.

S102, inputting the node data of the plurality of nodes into a root cause positioning model by combining the adjacent matrix to obtain the prediction data of the plurality of nodes.

The node data of each node includes operational data of the container indicated by the node. For example, the node data of node $V_1$ comprises the operational data of container 1.

In some embodiments, the operational data of the container may be historical operational data of the container in the target system, where the operational data may include operational data of both a container that is operating normally and a container that is malfunctioning, and the embodiments of the present disclosure are not limited in this regard.

It should be appreciated that to promote generalization of the root cause positioning model, the operational data of the container may also be collected in other systems having the same or similar causal graphs as the target system, which is not limited by the disclosed embodiments.

In some embodiments, the root cause localization model in embodiments of the present disclosure may be a graph neural network (Graph Neural Network, GNN) model. The node data of the plurality of nodes, in combination with the adjacency matrix, is input into the graph neural network model, so that the prediction data of the plurality of nodes can be obtained.

S103, according to the loss between the prediction data and the node data, parameters to be learned in the root cause positioning model are adjusted so as to complete training of the root cause positioning model, and the root cause positioning model is used for positioning a container with faults in a target system.

Each of a plurality of sets of different container data may be input into the root cause positioning model in combination with the adjacency matrix, so that the losses between a plurality of sets of prediction data and the corresponding node data are obtained. The parameters to be learned in the root cause positioning model can therefore be adjusted by batch gradient descent to complete the training of the root cause positioning model.
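The idea of adjusting parameters by batch gradient descent on a loss between predicted data and observed node data, without any fault labels, can be illustrated with a deliberately tiny stand-in. The sketch below is not the graph neural network of the specification: it assumes a single learnable weight `w` that predicts a child node's metric from its parent's metric, and a mean squared error loss; all names and values are hypothetical.

```python
def train_weight(samples, lr=0.05, epochs=200):
    """Fit child_x ~ w * parent_x by batch gradient descent on mean squared error.

    samples is a list of (parent_x, child_x) observations; no fault labels are used.
    """
    w = 0.0
    for _ in range(epochs):
        grad = 0.0
        for parent_x, child_x in samples:
            pred = w * parent_x
            # Derivative of (pred - child_x)^2 with respect to w.
            grad += 2.0 * (pred - child_x) * parent_x
        # One update per pass over the whole batch.
        w -= lr * grad / len(samples)
    return w


# Synthetic observations in which the child metric is twice the parent metric.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w = train_weight(data)
```

The only supervision signal here is the observed data itself, which mirrors the reconstruction-style loss between prediction data and node data described above.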

According to the scheme provided by the specification, the running data do not need to be marked in advance in the training process of the root cause positioning model, and the causal relationship among all containers can be fully learned by combining the causal graph of the target system, so that the root cause positioning model capable of accurately positioning is trained.

The basic concept of the root cause positioning model training method in the embodiment of the present specification is introduced above, and the adjustment process of the parameter to be learned will be specifically described below with reference to the model structure of the root cause positioning model.

The root cause positioning model in this specification may include an encoder and a decoder. The encoder may fit a probability distribution of the parameters to be learned, and may be implemented by a message-passing neural network (message-passing GNN) model. The decoder may generate prediction data from the parameters to be learned, and may be implemented by an intervention-GNN model. Since both the encoder and the decoder in the present disclosure may be implemented using graph neural network models, the embedded representation of each node in the encoder and the decoder may follow the representation used in graph neural network models in the related art, and is not described in detail in the embodiments of the present disclosure for brevity.

FIG. 2 is a flowchart of a method for generating prediction data according to an exemplary embodiment. The method is used for calculating the prediction data in S102. As shown in FIG. 2, the prediction data generating method provided in the embodiment of the present specification includes the following steps.

S201, inputting node data of the plurality of nodes, in combination with the adjacency matrix, into the encoder to obtain the probability distribution of the parameters to be learned.

It should be noted that the parameters to be learned in the embodiments of the present disclosure include a plurality of first parameters and a plurality of second parameters. Wherein the number of first parameters and the number of second parameters are the same as the number of nodes in the causal graph.

The plurality of first parameters respectively represent the exogenous variables of the plurality of nodes. The exogenous variable of a node is an external factor that can cause the node data to change; it is not influenced by the data of the node itself and cannot be observed directly. Therefore, in the embodiments of the specification, the encoder is set to fit the first parameter, and the probability distribution of the first parameter is predicted from the node data and the adjacency matrix.

Illustratively, the prior distribution of each first parameter may be configured as a normal distribution or a uniform distribution for the encoder to fit. Taking a normal distribution as an example, the first parameter $U_i$ of each node may be configured as $U_i \sim N(0, 1)$, where i represents the sequence number of the node and $N(0, 1)$ represents the standard normal distribution.

Furthermore, it is understood that the first parameters $U_i$ of the nodes in the embodiments of the present disclosure are independent of each other, and there is no causal or dependency relationship among them.

The plurality of second parameters respectively indicate whether an intervention exists on each of the plurality of nodes, wherein the container corresponding to a node with an intervention is the failed container.

Illustratively, the prior distribution of each second parameter may be configured as a Bernoulli distribution for the encoder to fit. That is, the second parameter $E_i$ of each node may be configured as $E_i \sim B(p)$, wherein i represents the sequence number of the node, $B(p)$ represents the Bernoulli distribution, p represents the probability of failure of node i, and p can be set according to the actual application scenario.

Ideally, if there is an intervention on node i, $E_i$ takes the value 0; otherwise, $E_i$ takes the value 1. As described for the adjacency matrix of the causal graph, if node j is a parent node of node i, i.e., there is a direct causal relationship between node j and node i, the element $a_{ij} = 1$ in the adjacency matrix $A$; otherwise, $a_{ij} = 0$. By calculating the product of $E_i$ and the corresponding $a_{ij}$, the causal relationships between the nodes recorded in the adjacency matrix can be corrected. As noted above, a container failure can be understood as an interruption of the causal relationship between containers.

Illustratively, assume that there is a causal relationship between node j and node i, i.e., the element $a_{ij} = 1$ in the adjacency matrix. When there is an intervention on node i, let $E_i = 0$. By calculating $a_{ij} \cdot E_i$, the value of the element $a_{ij}$ in the adjacency matrix is corrected to 0, thereby cutting off the causal relationship between node j and node i.

Further, it is understood that the second parameters $E_i$ of the nodes in the embodiments of the present disclosure are independent of each other, and there is no causal or dependency relationship among them.
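The correction of the adjacency matrix by the intervention indicators can be sketched as a row-wise product. This is a minimal illustration assuming the convention that entry [i][j] equals 1 when node j is a parent of node i and that an indicator of 0 marks an intervened node; the function name `apply_interventions` is hypothetical.

```python
def apply_interventions(a, e):
    """Multiply every entry of row i by e[i], cutting all incoming edges of intervened nodes."""
    return [[a_ij * e[i] for a_ij in row] for i, row in enumerate(a)]


# Chain: node 0 -> node 1 -> node 2.
A = [[0, 0, 0],
     [1, 0, 0],
     [0, 1, 0]]

# Node 1 is intervened (indicator 0); nodes 0 and 2 are not (indicator 1).
E = [1, 0, 1]
A_corrected = apply_interventions(A, E)
```

In the corrected matrix the edge from node 0 into node 1 is removed, while the edge from node 1 into node 2 is kept, because only the incoming edges of the intervened node are cut.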

Thus, the encoder can fit probability distributions of the first parameter and the second parameter corresponding to each node i, respectively, in a manner shown in the following equation (1).

$$X_i = f(\mathrm{apa}_U(V_i), A \cdot E) \tag{1}$$

Wherein $V_i$ represents node i; $X_i$ is the actual operation data of the container indicated by node i; $f$ represents the function fitted by the encoder; $A$ is the adjacency matrix; $E$ is the vector composed of the second parameters $E_i$ of the nodes; and $\mathrm{apa}_U(V_i)$ represents the set of first parameters (i.e., exogenous variables) of all ancestor nodes of node i.

Illustratively, assume that node 1 is the parent of node 2 and node 2 is the parent of node 3, i.e., the causal graph contains $V_1 \to V_2 \to V_3$; then $\mathrm{apa}_U(V_3) = \{U_1, U_2, U_3\}$.
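The set of exogenous variables that feeds a node can be collected by walking the adjacency matrix backwards from that node. The sketch below follows the chain example above, in which the result for the last node also contains that node's own exogenous variable; it assumes 0-based indices and the convention that entry [i][j] equals 1 when node j is a parent of node i, and the function name `apa_u` is hypothetical.

```python
def apa_u(a, i):
    """Indices k whose exogenous variable U_k feeds node i: all ancestors of i plus i itself."""
    result = {i}
    stack = [i]
    while stack:
        node = stack.pop()
        for parent, flag in enumerate(a[node]):
            if flag == 1 and parent not in result:
                result.add(parent)
                stack.append(parent)
    return result


# Chain: node 0 -> node 1 -> node 2 (0-based counterpart of V1 -> V2 -> V3).
A = [[0, 0, 0],
     [1, 0, 0],
     [0, 1, 0]]
```

For this chain, `apa_u(A, 2)` contains the indices 0, 1 and 2, the 0-based counterpart of the three exogenous variables listed in the example.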

In some embodiments, the parameters to be learned may further include a third parameter, which is a generalization parameter of the encoder. Since the third parameter is a global generalization parameter of the encoder, the number of third parameters is independent of the number of nodes, and there may be only one third parameter.

Illustratively, the prior distribution of the third parameter $\varepsilon$ may be configured as a normal distribution, for example $\varepsilon \sim N(0, \zeta^{-1})$, wherein N represents a normal distribution and $\zeta$ is a parameter of the normal distribution, which may be assigned a Jeffreys prior, i.e., $p(\zeta) \propto 1/\zeta$.

To improve the generalization capability of the model, so that the model is applicable to systems with the same or similar causal graphs in different application scenarios, the generalization parameter of the model is added to the function f fitted by the encoder. That is, in the case where the parameters to be learned include the third parameter, the above formula (1) may be updated to the following formula (2).

$$X_i = f(\mathrm{apa}_U(V_i), A \cdot E) + \varepsilon \tag{2}$$

S202, sampling the parameter to be learned according to the probability distribution of the parameter to be learned to obtain a sample parameter.

The process of sampling the parameter to be learned may be random sampling in any manner, which is not limited in the embodiment of the present specification.

For example, the probability distribution of the first parameter may be sampled to obtain a first sample parameter, the probability distribution of the second parameter may be sampled to obtain a second sample parameter, and the probability distribution of the third parameter may be sampled to obtain a third sample parameter. This is not described in further detail in the embodiments of the present disclosure.
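The sampling step can be sketched with the Python standard library. The sketch assumes that the fitted distribution of each first parameter is summarized by a Gaussian mean and standard deviation, and that the fitted distribution of each second parameter is summarized by the probability that the parameter equals 1; this parameterization, the function name `sample_parameters` and all numeric values are assumptions made for illustration.

```python
import random


def sample_parameters(u_mean, u_std, e_prob, rng):
    """Draw one sample of the first parameters (Gaussian) and second parameters (Bernoulli)."""
    u = [rng.gauss(m, s) for m, s in zip(u_mean, u_std)]
    # e_prob[i] is taken to be the probability that the second parameter of node i equals 1.
    e = [1 if rng.random() < q else 0 for q in e_prob]
    return u, e


rng = random.Random(0)  # fixed seed so that the draw is repeatable
u_sample, e_sample = sample_parameters(
    u_mean=[0.0, 0.0, 0.0],
    u_std=[1.0, 1.0, 1.0],
    e_prob=[0.9, 0.1, 0.9],
    rng=rng,
)
```

Each call yields one value per node for both parameter vectors, which can then be handed to the decoder.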

S203, inputting the sample parameters into a decoder to obtain prediction data of a plurality of nodes.

The decoder may calculate the prediction data by the following equation (3).

Wherein the quantities in equation (3) are, respectively: the prediction data of the plurality of nodes; a vector composed of the first sample parameters; a vector composed of the second sample parameters; the identity matrix I; and $f_1$ and $f_2$, which are both operation functions in the decoder.

In some embodiments, the prediction data obtained in the above formula (3) may also be combined with the third sample parameter, i.e. the third sample parameter and each obtained prediction data are summed separately as the final prediction data.

The method for calculating the prediction data in the embodiment of the present specification is described above with reference to FIG. 2, and the manner of calculating the loss of the root cause positioning model using the prediction data is described below.

In some embodiments, the encoder is configured to fit probability distributions of the plurality of first parameters and the plurality of second parameters, and the plurality of first parameters and the plurality of second parameters may be adjusted according to the loss between the prediction data and the node data. That is, in this case the root cause positioning model has a loss function for each node in the causal graph, in which: U indicates the first parameter; E indicates the second parameter; ELBO is the evidence lower bound loss function; $X_i$ is the real operation data of the container indicated by node i; and the remaining quantity is the prediction data.

In some embodiments, to enhance the training effect of the root cause positioning model, the node data of each node may also include a failure tag. The failure tag indicates, by a value of 0 or 1, whether the container indicated by the node has failed. When the failure tag is 0, the container is considered to have failed; when the failure tag is 1, the container is considered to be operating normally.

In the case where the node data includes a failure tag, the plurality of first parameters may be adjusted according to the loss between the prediction data and the node data, and the plurality of second parameters may be adjusted according to the loss between the plurality of second sample parameters and the failure tags, wherein the plurality of second sample parameters are obtained by sampling the probability distributions of the plurality of second parameters.

That is, in this case the root cause positioning model has a loss function for each node in the causal graph that involves a log maximum likelihood estimation (Log Maximum Likelihood Estimation) loss function, evaluated on the second sample parameter obtained by sampling the second parameter probability distribution corresponding to node i and on $Y_i$, the failure tag of the container indicated by node i.

It should be appreciated that during the training of the root cause positioning model, each node in the causal graph corresponds to a first parameter and a second parameter, and each node has associated container operation data and prediction data output by the decoder. In the loss calculation, the loss between the operation data and the prediction data on the same node is calculated node by node. Likewise, the loss between the second sample parameter corresponding to a node and the failure tag of the container corresponding to that node is calculated node by node. The complete loss of the root cause positioning model is the sum of the losses of all nodes, namely $\sum_i \mathrm{Loss}_i$.
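The node-by-node accumulation can be sketched as follows. The sketch uses simple stand-ins rather than the loss functions of the specification: a squared error replaces the evidence lower bound term, and a binary cross-entropy between a fitted probability and the failure tag replaces the log maximum likelihood term. Nodes without a tag contribute only the first term, which corresponds to the semi-supervised case. All names are hypothetical.

```python
import math


def node_loss(x, x_hat, e_prob=None, tag=None):
    """Loss of one node: reconstruction term, plus a tag term when a failure tag exists."""
    recon = (x - x_hat) ** 2  # stand-in for the evidence lower bound term
    if tag is None:
        return recon
    # Clamp the probability to keep the logarithms finite.
    q = min(max(e_prob, 1e-7), 1.0 - 1e-7)
    # Stand-in for the log-likelihood term; tag 0 = failed, tag 1 = normal.
    tag_term = -(tag * math.log(q) + (1 - tag) * math.log(1.0 - q))
    return recon + tag_term


def total_loss(xs, x_hats, e_probs, tags):
    """Complete loss: the sum of the per-node losses."""
    return sum(node_loss(x, xh, q, y)
               for x, xh, q, y in zip(xs, x_hats, e_probs, tags))
```

Passing `None` as the tag of every node reduces the total to the sum of reconstruction terms, which is the fully unlabeled setting.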

In some embodiments, when the parameter to be learned further includes a third parameter, the third parameter is adjusted according to the loss between the prediction data and the node data, and the specific loss calculation mode of the third parameter may refer to the calculation mode of the first parameter, which is not described in detail in the embodiments of the present disclosure.

According to the method provided by the embodiments of the disclosure, the running data is not required to be labeled. Unlabeled data, or unlabeled data combined with some labeled data, can be used during training, so that an unsupervised or semi-supervised model training process can be realized, which ensures the training effect while greatly reducing labor cost and time cost. Meanwhile, the encoder and decoder framework provided by the embodiments of the disclosure has a high convergence rate and a high generalization capability in the training process, so that the root cause positioning model obtained after training can be applied to different systems.

Based on the same inventive concept, a root cause positioning method is also provided in the embodiments of the present disclosure, as follows. Since the principle by which this method embodiment solves the problem is similar to that of the above method embodiment, the implementation of this method embodiment may refer to the implementation of the above method embodiment, and repeated details are omitted.

Fig. 3 shows a flowchart of a root cause positioning method in an embodiment of the disclosure, where the method may be performed by any electronic device, and as shown in fig. 3, the root cause positioning method provided in the embodiment of the disclosure includes the following steps.

S301, under the condition that a target system fails, operation data of each of a plurality of containers in the target system is acquired.

S302, constructing a causal graph of the target system, wherein the causal graph comprises a plurality of nodes and an adjacency matrix of the plurality of nodes.

Wherein the plurality of nodes are for indicating a plurality of containers and the adjacency matrix is for indicating causal relationships between the plurality of containers.
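As an illustration of the causal graph described in S302, the sketch below builds an adjacency matrix from a list of containers and a list of cause-effect pairs. The container names, the edge list, the convention that entry [i, j] equal to 1 means container i causally affects container j, and the helper name `build_adjacency` are all assumptions made for the example; the disclosure does not fix these details here.

```python
import numpy as np

def build_adjacency(containers, causal_edges):
    """Return an n x n matrix A with A[i, j] = 1 when container i causally affects container j."""
    index = {name: i for i, name in enumerate(containers)}
    adjacency = np.zeros((len(containers), len(containers)), dtype=int)
    for cause, effect in causal_edges:
        adjacency[index[cause], index[effect]] = 1
    return adjacency

# Hypothetical containers and causal edges, for illustration only.
containers = ["gateway", "order", "database"]
causal_edges = [("database", "order"), ("order", "gateway")]
adjacency = build_adjacency(containers, causal_edges)
```

Each node of the causal graph corresponds to one row and one column of `adjacency`, so the node order in `containers` must match the row order of the node data.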

S303, inputting the node data of the plurality of nodes, in combination with the adjacency matrix, into the encoder of the root cause positioning model to obtain the intervention probability distribution of each of the plurality of nodes.

Wherein the node data for each of the plurality of nodes includes operational data for the container indicated by the node.

It should be noted that, the root cause positioning model in the embodiment of the present disclosure may be trained according to the method illustrated in the embodiment of fig. 1, and the structure of the root cause positioning model may refer to the description in the embodiment of fig. 2. Similarly, since the plurality of second parameters in the embodiment of fig. 2 are respectively used to show whether each of the plurality of nodes has an intervention, a container corresponding to the node having the intervention is a failed container. Thus, the intervention probability distribution in the disclosed embodiments may be substantially identical to the probability distribution of the second parameter in the FIG. 2 embodiment, and may be predicted by the trained encoder based on the container operation data and the adjacency matrix.

S304, positioning a container with faults in the target system according to the intervention probability distribution of each of the plurality of nodes.

It should be noted that the root cause positioning model shown in the embodiment of fig. 2 includes an encoder and a decoder, and in practical application, root cause positioning can be completed by sampling the intervention probability distribution output by the encoder (i.e., the probability distribution of the second parameter in the embodiment of fig. 2). Thus, the decoder in the root cause positioning model does not actually participate in the operations during the reasoning process of the model.

Illustratively, the intervention value of each node in the plurality of nodes (i.e., the second parameter in the embodiment of fig. 2) may be sampled according to the respective intervention probability distribution of the plurality of nodes, to obtain the intervention value of each node. And taking the node with the minimum intervention value of the plurality of nodes as the root node, wherein the container indicated by the root node is a container with faults.

Of course, the intervention value of each node may be sampled multiple times according to the intervention probability distribution, and the result of the multiple sampling is synthesized to eliminate the error of the single sampling, which is not limited by the embodiment of the disclosure.
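The sampling and selection step can be sketched as follows. The sketch assumes, purely for illustration, that the intervention probability distribution of each node is a Gaussian described by a mean and a standard deviation, and that the results of multiple samplings are combined by averaging; the disclosure does not specify either choice. The function name `locate_root` and the example values are hypothetical.

```python
import numpy as np

def locate_root(means, stds, containers, num_samples=100, seed=0):
    """Sample each node's intervention value several times and pick the node with the minimum average."""
    rng = np.random.default_rng(seed)
    # One row per sampling round, one column per node (assumed Gaussian distributions).
    samples = rng.normal(loc=means, scale=stds, size=(num_samples, len(means)))
    averaged = samples.mean(axis=0)  # combine the multiple samplings
    root = int(np.argmin(averaged))  # node with the minimum intervention value
    return containers[root], averaged

# Illustrative encoder outputs for three hypothetical containers.
containers = ["gateway", "order", "database"]
means = np.array([0.9, 0.1, 0.8])
stds = np.array([0.05, 0.05, 0.05])
root_container, averaged = locate_root(means, stds, containers)
```

Setting `num_samples=1` corresponds to a single sampling, while larger values reduce the effect of the error of any single sampling.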

The root cause positioning method provided by the embodiments of the disclosure has high positioning efficiency, because only the encoder part of the root cause positioning model needs to be called to predict the probability distribution of one parameter. Moreover, the root cause positioning model invoked by the embodiments of the disclosure has fully learned the causal relationships among all containers, so the method has higher accuracy.

Fig. 4 is a schematic diagram of an apparatus according to an exemplary embodiment. Referring to fig. 4, at the hardware level, the device includes a processor 402, an internal bus 404, a network interface 406, a memory 408, and a nonvolatile memory 410, although other hardware required by other services is possible. One or more embodiments of the present description may be implemented in a software-based manner, such as by the processor 402 reading a corresponding computer program from the non-volatile memory 410 into the memory 408 and then running. Of course, in addition to software implementation, one or more embodiments of the present disclosure do not exclude other implementation manners, such as a logic device or a combination of software and hardware, etc., that is, the execution subject of the following processing flow is not limited to each logic unit, but may also be hardware or a logic device.

Referring to fig. 5, fig. 5 provides a root cause positioning model training apparatus 500, which can be applied to the device shown in fig. 4 to implement the technical solution of the present specification. The root cause positioning model training device 500 may include:

a building module 501 for building a causal graph of a target system, the causal graph comprising a plurality of nodes and an adjacency matrix of the plurality of nodes; wherein the plurality of nodes are for indicating containers in the target system and the adjacency matrix is for indicating causal relationships between the plurality of containers.

The input module 502 is configured to input node data of the plurality of nodes, in combination with the adjacency matrix, into a root cause positioning model, so as to obtain predicted data of the plurality of nodes; wherein the node data for each of the plurality of nodes includes operational data for the container indicated by the node.

And the adjusting module 503 is configured to adjust parameters to be learned in the root cause positioning model according to the loss between the prediction data and the node data, so as to complete training of the root cause positioning model, where the root cause positioning model is used to position a container with a fault in the target system.

In some embodiments, the root cause positioning model includes an encoder and a decoder; the encoder is used for fitting probability distribution of parameters to be learned; the decoder is used for generating prediction data according to the parameters to be learned.

In some embodiments, the input module 502 is further configured to input the node data of the plurality of nodes, in combination with the adjacency matrix, into the encoder, so as to obtain a probability distribution of the parameter to be learned; sample the parameter to be learned according to the probability distribution of the parameter to be learned to obtain a sample parameter; and input the sample parameter into the decoder to obtain prediction data of the plurality of nodes.
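The encode, sample, decode data flow handled by the input module can be sketched with toy components. Everything concrete in this sketch is an assumption made for illustration and is not the architecture of the disclosure: the encoder is a single linear map over node data aggregated with causal parents, the probability distribution is Gaussian (mean and log-variance), sampling uses the reparameterization form, and the decoder is a linear propagation along causal edges with random, untrained weights. The names `encode`, `sample` and `decode` are hypothetical.

```python
import numpy as np

def encode(node_data, adjacency, weight):
    """Toy encoder: aggregate each node with its causal parents, then map to mean and log-variance."""
    # adjacency[i, j] = 1 means node i affects node j, so adjacency.T @ X sums parent features.
    aggregated = node_data + adjacency.T @ node_data
    stats = aggregated @ weight            # shape (n, 2)
    return stats[:, 0], stats[:, 1]        # mean, log-variance per node

def sample(mean, log_var, rng):
    """Draw one sample parameter per node from the assumed Gaussian distribution."""
    return mean + np.exp(0.5 * log_var) * rng.standard_normal(mean.shape)

def decode(sample_params, adjacency, out_weight):
    """Toy decoder: propagate sample parameters along causal edges and map back to the data dimension."""
    propagated = sample_params + adjacency.T @ sample_params   # shape (n,)
    return np.outer(propagated, out_weight)                    # shape (n, d)

rng = np.random.default_rng(0)
adjacency = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0]])  # illustrative 3-node graph
node_data = rng.normal(size=(3, 4))                      # 3 nodes, 4 metrics each
weight = 0.1 * rng.normal(size=(4, 2))                   # untrained encoder weights
out_weight = 0.1 * rng.normal(size=4)                    # untrained decoder weights

mean, log_var = encode(node_data, adjacency, weight)
sample_params = sample(mean, log_var, rng)
predicted = decode(sample_params, adjacency, out_weight)
```

In training, the loss between `predicted` and `node_data` would drive the adjustment of the parameters to be learned; the sketch only shows the forward data flow.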

In some embodiments, the parameters to be learned include a plurality of first parameters and a plurality of second parameters; the first parameters are respectively used for showing exogenous variables of each of the nodes; the plurality of second parameters are respectively used for showing whether each of the plurality of nodes has an intervention, wherein the container corresponding to the node having the intervention is a failed container.

In some embodiments, the adjusting module 503 is further configured to adjust the plurality of first parameters and the plurality of second parameters according to a loss between the prediction data and the node data.

In some embodiments, the node data for each of the plurality of nodes further includes a failure tag for showing whether the container indicated by the node has failed. The adjusting module 503 is further configured to adjust the plurality of first parameters according to the loss between the prediction data and the node data, and to adjust the plurality of second parameters according to the loss between a plurality of second sample parameters and the failure tags, wherein the plurality of second sample parameters are obtained by sampling the probability distributions of the plurality of second parameters.

In some embodiments, the parameters to be learned further comprise a third parameter, the third parameter being a generalization parameter of the encoder.

In some embodiments, the third parameter is adjusted according to a loss between the predicted data and the node data.

Referring to fig. 6, fig. 6 provides a root cause positioning device 600 that can be applied to the apparatus shown in fig. 4 to implement the technical solution of the present disclosure. The root cause positioning device 600 may include:

the acquiring module 601 is configured to acquire operation data of each of a plurality of containers in the target system in case of a failure of the target system.

A building module 602 for building a causal graph of a target system, the causal graph comprising a plurality of nodes and an adjacency matrix of the plurality of nodes; wherein the plurality of nodes are for indicating a plurality of containers and the adjacency matrix is for indicating causal relationships between the plurality of containers.

An input module 603, configured to input node data of the plurality of nodes, in combination with the adjacency matrix, into an encoder of a root cause positioning model to obtain respective intervention probability distributions of the plurality of nodes; wherein the node data for each of the plurality of nodes includes operational data for the container indicated by the node.

A positioning module 604, configured to position the failed container in the target system according to the intervention probability distribution of each of the plurality of nodes.

In some embodiments, the positioning module 604 is further configured to sample the intervention value of each node in the plurality of nodes according to the respective intervention probability distribution of the plurality of nodes, to obtain the intervention value of each node; the node with the minimum intervention value among the plurality of nodes is taken as a root node, and the container indicated by the root node is a container with faults.

The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. A typical implementation device is a computer, which may be in the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email device, game console, tablet computer, wearable device, or a combination of any of these devices.

In a typical configuration, a computer includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, Random Access Memory (RAM) and/or nonvolatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.

Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic disk storage, quantum memory, graphene-based storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by the computing device. Computer-readable media, as defined herein, does not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article or apparatus that comprises the element.

The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

The terminology used in the one or more embodiments of the specification is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the specification. As used in this specification, one or more embodiments and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.

It should be understood that although the terms first, second, third, etc. may be used in one or more embodiments of the present description to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of one or more embodiments of the present description. The word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining", depending on the context.

The foregoing describes merely preferred embodiments of the present disclosure and is not intended to limit the embodiments of the present disclosure to the particular embodiments described.

Claims

1. A root cause positioning model training method comprises the following steps:

constructing a causal graph of a target system, the causal graph comprising a plurality of nodes and an adjacency matrix of the plurality of nodes; wherein the plurality of nodes are for indicating containers in the target system, and the adjacency matrix is for indicating causal relationships between the plurality of containers;

inputting the node data of the plurality of nodes into the root cause positioning model by combining the adjacency matrix to obtain the prediction data of the plurality of nodes; wherein the node data of each node in the plurality of nodes comprises the operation data of the container indicated by the node;

and according to the loss between the prediction data and the node data, adjusting parameters to be learned in the root cause positioning model to complete training of the root cause positioning model, wherein the root cause positioning model is used for positioning a container with faults in the target system.

2. The method of claim 1, the root cause positioning model comprising an encoder and a decoder;

the encoder is used for fitting probability distribution of the parameter to be learned;

the decoder is used for generating the prediction data according to the parameter to be learned.

3. The method of claim 2, the inputting the node data of the plurality of nodes into the root cause positioning model in combination with the adjacency matrix, obtaining the predicted data of the plurality of nodes, comprising:

inputting the node data of the plurality of nodes into the encoder by combining the adjacency matrix to obtain probability distribution of the parameter to be learned;

sampling the parameter to be learned according to the probability distribution of the parameter to be learned to obtain a sample parameter;

and inputting the sample parameters into the decoder to obtain the prediction data of the plurality of nodes.

4. A method according to claim 2 or 3, the parameters to be learned comprising a plurality of first parameters and a plurality of second parameters;

the first parameters are respectively used for showing exogenous variables of the nodes;

and the plurality of second parameters are respectively used for showing whether each of the plurality of nodes has an intervention, wherein the container corresponding to the node with the intervention is a container with a fault.

5. The method of claim 4, the adjusting parameters to be learned in the root cause positioning model according to a loss between the prediction data and the node data, comprising:

and adjusting the first parameters and the second parameters according to the loss between the predicted data and the node data.

6. The method of claim 4, the node data of each node of the plurality of nodes further comprising a failure tag for showing whether a container indicated by the node is failed;

and adjusting parameters to be learned in the root cause positioning model according to the loss between the prediction data and the node data, wherein the parameters to be learned comprise:

adjusting the plurality of first parameters according to the loss between the predicted data and the node data;

and adjusting a plurality of second sample parameters according to the loss between the plurality of second sample parameters and the fault labels, wherein the plurality of second sample parameters are obtained by sampling probability distribution of the plurality of second parameters.

7. The method of claim 4, the parameters to be learned further comprising a third parameter, the third parameter being a generalization parameter of the encoder.

8. The method of claim 7, the third parameter being adjusted according to a loss between the prediction data and the node data.

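9. A root cause positioning method, comprising:

in the case where a target system fails, acquiring operation data of each of a plurality of containers in the target system;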

constructing a causal graph of the target system, the causal graph comprising a plurality of nodes and an adjacency matrix of the plurality of nodes; wherein the plurality of nodes are for indicating the plurality of containers and the adjacency matrix is for indicating causal relationships between the plurality of containers;

inputting the node data of the plurality of nodes, in combination with the adjacency matrix, into an encoder of a root cause positioning model to obtain the respective intervention probability distribution of the plurality of nodes; wherein the node data of each node in the plurality of nodes comprises the operation data of the container indicated by the node;

locating a failed container in the target system according to the intervention probability distribution of each of the plurality of nodes.

10. The method of claim 9, the locating a failed container in the target system according to the respective intervention probability distribution of the plurality of nodes, comprising:

sampling, according to the intervention probability distribution of each of the plurality of nodes, the intervention value of each node in the plurality of nodes to obtain the intervention value of each node;

and taking the node with the minimum intervention value of the plurality of nodes as a root node, wherein the container indicated by the root node is a container with faults.

11. A root cause positioning model training device, comprising:

a building module for building a causal graph of a target system, the causal graph comprising a plurality of nodes and an adjacency matrix of the plurality of nodes; wherein the plurality of nodes are for indicating containers in the target system, and the adjacency matrix is for indicating causal relationships between the plurality of containers;

the input module is used for inputting the node data of the plurality of nodes into the root cause positioning model by combining the adjacency matrix to obtain the prediction data of the plurality of nodes; wherein the node data of each node in the plurality of nodes comprises the operation data of the container indicated by the node;

and the adjustment module is used for adjusting parameters to be learned in the root cause positioning model according to the loss between the prediction data and the node data so as to complete training of the root cause positioning model, wherein the root cause positioning model is used for positioning a container with a fault in the target system.

12. A root cause positioning device, comprising:

an acquisition module for acquiring operation data of each of a plurality of containers in a target system in the case where the target system fails;

a building module for building a causal graph of the target system, the causal graph comprising a plurality of nodes and an adjacency matrix of the plurality of nodes; wherein the plurality of nodes are for indicating the plurality of containers and the adjacency matrix is for indicating causal relationships between the plurality of containers;

an input module for inputting the node data of the plurality of nodes, in combination with the adjacency matrix, into an encoder of a root cause positioning model to obtain the respective intervention probability distribution of the plurality of nodes; wherein the node data of each node in the plurality of nodes comprises the operation data of the container indicated by the node;

and the positioning module is used for positioning the container with faults in the target system according to the intervention probability distribution of each of the plurality of nodes.

13. An electronic device, comprising:

a processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to implement the method of any one of claims 1 to 8 and/or the method of any one of claims 9 to 10 by executing the executable instructions.

14. A computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the steps of the method of any of claims 1 to 8 and/or the steps of the method of any of claims 9 to 10.