CN114490157A - Fault detection method, device, equipment and storage medium

Info

Publication number: CN114490157A
Application number: CN202210091905.9A
Authority: CN
Inventors: 蔡方龙; 华石榴; 钟彬; 雒武超; 裘愉锋; 蒋群华; 施跃跃
Original assignee: China Unionpay Co Ltd
Current assignee: China Unionpay Co Ltd
Priority date: 2022-01-26
Filing date: 2022-01-26
Publication date: 2022-05-13

Abstract

The embodiment of the application provides a fault detection method, a fault detection device, equipment and a storage medium, which relate to the technical field of artificial intelligence. The method comprises the following steps: an original fault node chain is obtained, wherein the original fault node chain comprises a plurality of original fault nodes. Because the original fault node chain contains many original fault nodes, main fault nodes are selected from the original fault nodes and a main fault node chain is determined based on the selected main fault nodes, which can effectively improve the efficiency of fault detection. Finally, a target root fault node is determined from the main fault node chain according to the causal probability among the main fault nodes in the main fault node chain, thereby effectively improving the accuracy of fault detection.

Description

Fault detection method, device, equipment and storage medium

Technical Field

The embodiment of the invention relates to the technical field of artificial intelligence, in particular to a fault detection method, a fault detection device, fault detection equipment and a storage medium.

Background

With the rapid development of computer technology, internet service systems contain more and more task nodes, and their technical architectures are increasingly complex. When a service system fails, operation and maintenance personnel need to quickly and accurately determine the root fault node causing the failure, and perform subsequent analysis and fault handling based on monitoring data of the root fault node, such as CPU (Central Processing Unit) utilization, memory occupancy, disk utilization, QPS (Queries Per Second), and TPS (Transactions Per Second).

However, when a service system fails, the monitoring data of each task node is generally analyzed based on a relationship table of the task nodes to determine a fault node chain. This approach cannot locate the root source fault node within the fault node chain, so the accuracy of fault detection is affected.

Disclosure of Invention

The embodiment of the application provides a fault detection method, a fault detection device, equipment and a storage medium, which are used for improving the accuracy of fault detection.

In one aspect, an embodiment of the present application provides a fault detection method, where the method includes:

acquiring an original fault node chain, wherein the original fault node chain comprises a plurality of original fault nodes;

sequentially selecting main fault nodes from the original fault nodes along the extending direction of the original fault node chain, and determining a main fault node chain based on each selected main fault node;

and determining a target root source fault node from the main fault node chain based on the causal probability among all the main fault nodes in the main fault node chain.

Optionally, the acquiring a chain of original failed nodes includes:

acquiring an initial reference fault node in a task node chain;

determining an original fault node chain from the task node chain by iteration based on the reference fault node and the fault similarity between two adjacent task nodes in the task node chain; wherein each iteration process comprises the following steps:

determining at least one candidate node adjacent to the reference fault node from the task node chain;

determining fault similarities between the reference faulty node and the at least one candidate node, respectively;

and selecting at least one original fault node from the at least one candidate node based on the obtained similarity of the faults, and taking the at least one original fault node as a reference fault node.

Optionally, the determining the similarity of the faults between the reference faulty node and the at least one candidate node respectively includes:

for the at least one candidate node, respectively performing the following steps:

acquiring reference node resource attribute information corresponding to the reference fault node in a preset time period and candidate node resource attribute information corresponding to a candidate node in the preset time period;

and determining the fault similarity between the reference fault node and the candidate node based on the reference node resource attribute information and the candidate node resource attribute information.

Optionally, the determining the fault similarity between the reference faulty node and the candidate node based on the reference node resource attribute information and the candidate node resource attribute information includes:

determining a reference node resource attribute image based on the reference node resource attribute information;

determining a candidate node resource attribute image based on the candidate node resource attribute information;

determining the image similarity of the reference node resource attribute image and the candidate node resource attribute image by adopting a similarity network model;

and taking the image similarity as the fault similarity of the reference fault node and the candidate node.

Optionally, the sequentially selecting main fault nodes from the plurality of original fault nodes along the extending direction of the original fault node chain includes:

acquiring an initial reference main fault node from the original fault node chain;

iteratively selecting a main fault node from the plurality of original fault nodes along the extending direction of the original fault node chain based on the reference main fault node, wherein each iteration process comprises the following steps:

if the reference main fault node is not a bifurcation fault node, taking the reference main fault node as a main fault node, and taking an original fault node adjacent to the main fault node in the extending direction of the original fault node chain as the reference main fault node;

and if the reference main fault node is a bifurcation fault node, taking the reference main fault node as the main fault node, and selecting one sub fault node from the plurality of sub fault nodes as the reference main fault node based on the causal probability between the bifurcation fault node and the corresponding plurality of sub fault nodes.

Optionally, the selecting, based on the causal probabilities between the bifurcation fault node and the corresponding plurality of sub-fault nodes, one sub-fault node from the plurality of sub-fault nodes as the reference main fault node includes:

for the plurality of sub-fault nodes, respectively executing the following steps: determining a causal probability between the bifurcation fault node and one sub-fault node based on the target resource abnormal information corresponding to the bifurcation fault node and to the sub-fault node within a preset time period;

and determining the maximum causal probability in the obtained multiple causal probabilities, and taking the sub-fault node corresponding to the maximum causal probability in the multiple sub-fault nodes as the reference main fault node.
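The selection of main fault nodes described above can be sketched as follows. This is a minimal illustration rather than the claimed implementation: the function name `select_main_chain`, the `children` mapping, the example topology and the probability values are all hypothetical, and the causal probability is passed in as a callable because its computation is not specified at this point. The sketch also assumes the original fault node chain contains no cycles.

```python
def select_main_chain(children, start, causal_probability):
    """Walk the original fault node chain, keeping one branch at each fork.

    children: dict mapping a node to its successors along the extending
              direction of the chain (hypothetical representation).
    causal_probability: callable (parent, child) -> float, assumed given.
    """
    main_chain = []
    reference = start
    while reference is not None:
        # Every reference main fault node becomes a main fault node.
        main_chain.append(reference)
        successors = children.get(reference, [])
        if not successors:
            reference = None           # end of the chain
        elif len(successors) == 1:
            reference = successors[0]  # not a bifurcation: take the adjacent node
        else:
            # Bifurcation: follow the sub-fault node with the largest causal probability.
            reference = max(successors, key=lambda s: causal_probability(reference, s))
    return main_chain


# Hypothetical topology and probabilities, chosen only for illustration.
children = {"B": ["D"], "D": ["E", "F"], "F": ["G"]}
probabilities = {("D", "E"): 0.8, ("D", "F"): 0.3}
main_chain = select_main_chain(children, "B", lambda a, b: probabilities[(a, b)])
print(main_chain)  # ['B', 'D', 'E']
```

With these assumed values, the branch through F and G is dropped at the bifurcation D, so only three of the five original fault nodes remain for the later root fault analysis.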

Optionally, the determining a target root cause fault node from the main fault node chain based on the causal probability between the main fault nodes in the main fault node chain includes:

acquiring an initial reference root fault node from the main fault node chain;

and iteratively updating the reference root cause fault node based on the causal probability between the reference root cause fault node and other main fault nodes in the main fault node chain until iteration is finished, and taking the reference root cause fault node as a target root cause fault node.

Optionally, each iteration process comprises the following steps:

acquiring a main fault node from the other main fault nodes;

determining a causal probability between the reference root fault node and the one main fault node based on the target resource abnormal information corresponding to the reference root fault node and the target resource abnormal information corresponding to the one main fault node;

if the causal probability is larger than a preset causal threshold value, the reference root fault node is kept unchanged;

otherwise, the main fault node is used as the reference root fault node.

Optionally, the target resource exception information includes a target resource exception time point, a target resource exception amplitude, and a target resource exception duration.
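The iterative update of the reference root fault node can be sketched as follows. This is an illustration under stated assumptions: the function name `find_root_cause` is hypothetical, the first node of the main fault node chain is assumed to be the initial reference root fault node, and the causal probability is supplied as a callable. Deriving that probability from the abnormal time point, amplitude and duration is not shown, because no formula is given at this point. The probability values and the threshold in the example are invented.

```python
def find_root_cause(main_chain, causal_probability, threshold):
    """Iteratively update the reference root fault node over the main chain.

    causal_probability: callable (reference, node) -> float, assumed given.
    threshold: the preset causal threshold.
    """
    reference = main_chain[0]  # assumption: start from the first main fault node
    for node in main_chain[1:]:
        if causal_probability(reference, node) > threshold:
            continue           # the reference root fault node is kept unchanged
        reference = node       # otherwise this main fault node becomes the reference
    return reference


# Hypothetical causal probabilities for a chain B -> D -> E.
probabilities = {("B", "D"): 0.2, ("D", "E"): 0.9}
root = find_root_cause(["B", "D", "E"], lambda a, b: probabilities[(a, b)], 0.5)
print(root)  # D
```

In this example the probability for the pair B and D is below the threshold, so D replaces B as the reference. The probability for D and E is above the threshold, so D is kept and returned as the target root fault node.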

In one aspect, an embodiment of the present application provides a fault detection apparatus, including:

an original fault node chain obtaining module, configured to obtain an original fault node chain, where the original fault node chain includes a plurality of original fault nodes;

a main fault node chain obtaining module, configured to sequentially select a main fault node from the multiple original fault nodes along an extending direction of the original fault node chain, and determine a main fault node chain based on each selected main fault node;

and the target determining module is used for determining a target root fault node from the main fault node chain based on the causal probability among all the main fault nodes in the main fault node chain.

Optionally, the original failed node chain obtaining module is specifically configured to:

acquiring an initial reference fault node in a task node chain;

Optionally, the main failure node chain obtaining module is specifically configured to:

Optionally, the target determination module is specifically configured to:

acquiring an initial reference root fault node from the main fault node chain;

Optionally, the target determination module is specifically configured to:

each iteration process comprises the following steps:

acquiring a main fault node from the other main fault nodes;

otherwise, the main fault node is used as the reference root fault node.

In one aspect, an embodiment of the present application provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the fault detection method when executing the program.

In one aspect, embodiments of the present application provide a computer-readable storage medium, which stores a computer program executable by a computer device, and when the program runs on the computer device, the computer device is caused to execute the steps of the above fault detection method.

In the embodiment of the application, because the original fault node chain contains more original fault nodes, the efficiency of fault detection can be effectively improved by selecting the main fault node from the original fault nodes and determining the main fault node chain based on each selected main fault node. And finally, determining a target root fault node from the main fault node chain according to the causal probability among all the main fault nodes in the main fault node chain, thereby effectively improving the accuracy of fault detection.

Drawings

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings based on these drawings without inventive effort.

Fig. 1 is a schematic diagram of a system architecture according to an embodiment of the present application;

Fig. 2 is a schematic flowchart of a fault detection method according to an embodiment of the present application;

Fig. 3 is a schematic diagram of a task node chain structure provided in an embodiment of the present application;

Fig. 4 is a schematic flowchart of a method for acquiring an original fault node chain according to an embodiment of the present application;

Fig. 5 is a schematic diagram of a task node chain structure provided in an embodiment of the present application;

Fig. 6 is a schematic flowchart of a method for determining similarity of faults according to an embodiment of the present application;

Fig. 7 is a schematic flowchart of a method for determining similarity of faults according to an embodiment of the present application;

Fig. 8 is a schematic diagram of a reference node resource attribute image according to an embodiment of the present application;

Fig. 9 is a schematic diagram of a candidate node resource attribute image according to an embodiment of the present application;

Fig. 10 is a schematic structural diagram of a similarity network model according to an embodiment of the present application;

Fig. 11 is a schematic diagram of CPU utilization according to an embodiment of the present application;

Fig. 12 is a schematic structural diagram of an original fault node chain according to an embodiment of the present application;

Fig. 13 is a schematic flowchart of a method for determining a reference root fault node according to an embodiment of the present application;

Fig. 14 is a schematic structural diagram of a main fault node chain according to an embodiment of the present application;

Fig. 15 is a schematic flowchart of a fault detection method according to an embodiment of the present application;

Fig. 16 is a block diagram of a system architecture according to an embodiment of the present application;

Fig. 17 is a schematic structural diagram of a fault detection apparatus according to an embodiment of the present application;

Fig. 18 is a schematic structural diagram of a computer device according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more clearly apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

Fig. 1 is a system architecture diagram applicable to the embodiment of the present application. The system architecture includes at least a terminal device 101, a fault detection system 102, and a service system 103, where the fault detection system 102 may be independent of the service system 103 or may be built into the service system 103.

A target application for fault detection is installed on the terminal device 101. The target application may be a pre-installed client, a web application, an applet embedded in another application, or the like. The terminal device 101 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, and the like.

The fault detection system 102 serves as the background server of the target application. The fault detection system 102 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, Content Delivery Network (CDN), big data and artificial intelligence platforms.

The service system 103 includes a plurality of task nodes, and when a task node fails, the task node is an original failed node. The service system 103 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud service, a cloud database, cloud computing, a cloud function, cloud storage, Network service, cloud communication, middleware service, domain name service, security service, Content Delivery Network (CDN), big data and an artificial intelligence platform.

The terminal device 101 and the fault detection system 102 may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein. The fault detection system 102 and the service system 103 may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein.

In response to a fault detection operation by the user, the terminal device 101 generates a fault detection instruction and sends it to the fault detection system 102. The fault detection system 102 receives the fault detection instruction and acquires an original fault node chain from the service system 103, where the original fault node chain includes a plurality of original fault nodes. The fault detection system 102 sequentially selects main fault nodes from the plurality of original fault nodes along the extending direction of the original fault node chain, and determines the main fault node chain based on each selected main fault node. Finally, the fault detection system 102 determines a target root source fault node from the main fault node chain based on the causal probability among the main fault nodes in the main fault node chain.

Based on the system architecture diagram shown in fig. 1, the present application provides a flow of a fault detection method, as shown in fig. 2, where the flow of the method is executed by a computer device, which may be the fault detection system 102 shown in fig. 1, and includes the following steps:

Step S201, an original fault node chain is acquired.

Specifically, the original fault node chain includes a plurality of original fault nodes.

Based on the resource attribute information corresponding to each task node, whether the task node is an original fault node can be judged. The resource attribute information includes CPU utilization, memory occupancy, disk utilization, QPS, and TPS.

For example, as shown in fig. 3, the service system includes 8 task nodes, which are task node a, task node B, task node C, task node D, task node E, task node F, task node G, and task node H. The resource attribute information corresponding to the 8 task nodes is judged, and the task node B, the task node D, the task node E, the task node F and the task node G are determined to have faults, so that the task node B, the task node D, the task node E, the task node F and the task node G are original fault nodes, and an original fault node chain is formed by the 5 original fault nodes.

Step S202, main fault nodes are sequentially selected from a plurality of original fault nodes along the extending direction of the original fault node chain, and the main fault node chain is determined based on each selected main fault node.

Specifically, the original fault node chain may include one main fault node chain and at least one secondary fault node chain, or may include only one main fault node chain.

The extending direction of the original fault node chain can be from left to right, or from right to left, or from top to bottom, or from bottom to top, or any other direction.

For example, suppose the extending direction of the original fault node chain is from left to right. For the original fault node chain in fig. 3, suppose the main fault nodes sequentially selected from the 5 original fault nodes are task node B, task node D, and task node E. These 3 main fault nodes constitute the main fault node chain.

Step S203, determining a target root fault node from the main fault node chain based on the causal probability among all the main fault nodes in the main fault node chain.

For example, as shown in fig. 3, the primary failure nodes are task node B, task node D, and task node E. According to a possible implementation mode, the causal probability between the task node B and the task node D, the causal probability between the task node B and the task node E and the causal probability between the task node D and the task node E are calculated, the obtained 3 causal probabilities are compared, and finally the target root source fault node is determined from the 3 main fault nodes.

Optionally, in step S201, acquiring the original failed node chain includes the following steps in fig. 4:

Step S401, an initial reference fault node in a task node chain is obtained.

According to a possible implementation mode, when a task node alarm occurs in a service system, the task node with the task node alarm can be directly used as a reference fault node.

For example, as shown in fig. 5, the service system includes 5 task nodes, which are task node A, task node B, task node C, task node D, and task node E. When a task node alarm occurs in the service system, suppose the alarm is raised by task node B; task node B is then used as the reference fault node.

In another possible implementation manner, when a task monitoring alarm occurs in the service system, the task currently being executed by the service system is determined, and the task node that executes this task first in the service system is used as the reference fault node.

For example, as shown in fig. 5, the service system includes 5 task nodes, which are task node a, task node B, task node C, task node D, and task node E. When a task monitoring alarm occurs in the service system, determining that a task currently executed by the service system is a transaction data storage task, determining that a task node which executes the transaction data storage task first in the service system is a task node A, and taking the task node A as a reference fault node.

Step S402, based on the reference fault node and the fault similarity between two adjacent task nodes in the task node chain, the original fault node chain is determined from the task node chain in an iteration mode.

Each iteration process comprises the following steps:

At least one candidate node adjacent to the reference fault node is determined from the task node chain. The fault similarity between the reference fault node and each candidate node is then determined, at least one original fault node is selected from the at least one candidate node based on the obtained fault similarities, and each selected original fault node is used as a reference fault node.

Specifically, a fault similarity threshold is set, and the following judgments are respectively performed for each obtained fault similarity:

If the fault similarity between the reference fault node and the candidate node is smaller than the fault similarity threshold, the candidate node is not an original fault node; otherwise, the candidate node is an original fault node.

The iteration stops under either of the following two conditions: first, no original fault node is selected from the at least one candidate node; second, the number of determined original fault nodes exceeds a preset number threshold.

For example, as shown in fig. 5, the service system includes 5 task nodes, which are respectively a task node a, a task node B, a task node C, a task node D, and a task node E, and the above 5 task nodes form a task node chain. And setting the task node B as a reference fault node.

Three candidate nodes adjacent to the reference fault node are determined, namely task node A, task node C and task node D. Suppose the fault similarity between the reference fault node and task node A is 0.3, that between the reference fault node and task node C is 0.6, and that between the reference fault node and task node D is 0.7, and the fault similarity threshold is 0.4.

Since 0.3 is less than the fault similarity threshold 0.4, task node A is not an original fault node. Since 0.6 and 0.7 are both greater than the fault similarity threshold 0.4, task node C and task node D are both original fault nodes.

Task node C is then used as the reference fault node. Because this reference fault node has no further adjacent candidate node, the iteration for task node C ends.

Task node D is then used as the reference fault node, and the candidate node adjacent to it is determined to be task node E. Suppose the fault similarity between the reference fault node and task node E is 0.2; because 0.2 is smaller than the fault similarity threshold 0.4, task node E is not an original fault node.

And finally, the determined original fault nodes are a task node B, a task node C and a task node D, and an original fault node chain is formed by the 3 original fault nodes.
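The iteration in this example can be sketched as a breadth-first expansion from the initial reference fault node. This is an illustration only: the function name `find_original_fault_nodes`, the adjacency mapping (inferred from the example above, since fig. 5 is not reproduced here) and the choice to judge each candidate only once are assumptions. The similarity values are the ones given in the example.

```python
from collections import deque


def find_original_fault_nodes(adjacency, similarity, start, threshold, max_nodes=None):
    """Expand from the initial reference fault node to sufficiently similar neighbours."""
    fault_nodes = [start]
    seen = {start}          # assumption: each task node is judged at most once
    queue = deque([start])  # reference fault nodes still to be expanded
    while queue:
        reference = queue.popleft()
        for candidate in adjacency.get(reference, []):
            if candidate in seen:
                continue
            seen.add(candidate)
            # Below the threshold: the candidate is not an original fault node.
            if similarity(reference, candidate) < threshold:
                continue
            fault_nodes.append(candidate)
            queue.append(candidate)  # the new original fault node becomes a reference
            # Second stop condition: number of fault nodes exceeds the preset limit.
            if max_nodes is not None and len(fault_nodes) > max_nodes:
                return fault_nodes
    # First stop condition: no further original fault node was selected.
    return fault_nodes


# Adjacency inferred from the example; similarities taken from the example.
adjacency = {"A": ["B"], "B": ["A", "C", "D"], "C": ["B"], "D": ["B", "E"], "E": ["D"]}
scores = {
    frozenset(("B", "A")): 0.3,
    frozenset(("B", "C")): 0.6,
    frozenset(("B", "D")): 0.7,
    frozenset(("D", "E")): 0.2,
}
chain = find_original_fault_nodes(
    adjacency, lambda a, b: scores[frozenset((a, b))], start="B", threshold=0.4
)
print(chain)  # ['B', 'C', 'D']
```

The result matches the example: task nodes A and E fall below the threshold of 0.4, so the original fault node chain consists of task nodes B, C and D.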

In the embodiment of the application, at least one original fault node is determined from at least one candidate node by comparing the fault similarity between the reference fault node and at least one adjacent candidate node, so that the original fault node can be effectively prevented from being omitted, and the accuracy of fault detection can be improved. Meanwhile, the fault similarity judgment is carried out on the reference fault node and the adjacent candidate nodes every time, so that the judgment complexity is simplified, and the accuracy of the fault similarity judgment is improved.

Optionally, in step S402, to determine the fault similarity between the reference faulty node and at least one candidate node, the following steps in fig. 6 are performed for the at least one candidate node:

Step S601, reference node resource attribute information corresponding to a reference fault node in a preset time period and candidate node resource attribute information corresponding to a candidate node in the preset time period are obtained.

Specifically, the preset time period may be determined according to a failure time point and a preset duration.

For example, if the failure time point is 10:05:00 and the preset duration is 2 minutes, the preset time period may be 10:05:00-10:07:00, 10:03:00-10:05:00, or 10:04:00-10:06:00.
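The three window placements in this example can be reproduced with a short sketch. The function name `preset_periods`, the labels `after`, `before` and `centered`, and the calendar date are hypothetical. Only the time of day and the 2-minute duration come from the example.

```python
from datetime import datetime, timedelta


def preset_periods(failure_time, duration):
    """Return three possible placements of the preset time period."""
    half = duration / 2
    return {
        "after": (failure_time, failure_time + duration),
        "before": (failure_time - duration, failure_time),
        "centered": (failure_time - half, failure_time + half),
    }


# The date is arbitrary; the example only specifies the time of day.
failure = datetime(2022, 1, 26, 10, 5, 0)
periods = preset_periods(failure, timedelta(minutes=2))
for name, (start, end) in periods.items():
    print(name, start.time(), end.time())
# after 10:05:00 10:07:00
# before 10:03:00 10:05:00
# centered 10:04:00 10:06:00
```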

The resource attribute information comprises CPU utilization rate, memory occupancy rate, disk utilization rate, QPS and TPS.

The reference node resource attribute information is resource attribute information corresponding to a reference fault node in a preset time period, and the candidate node resource attribute information is resource attribute information corresponding to a candidate node in the preset time period.

Step S602, based on the resource attribute information of the reference node and the resource attribute information of the candidate node, determining the fault similarity between the reference fault node and one candidate node.

Specifically, as shown in fig. 7, determining the fault similarity between the reference faulty node and a candidate node includes the following steps:

Step S701, determining a reference node resource attribute image based on the reference node resource attribute information.

Step S702, determining a candidate node resource attribute image based on the candidate node resource attribute information.

For example, the reference fault node is a task node B, the candidate node is a task node C, the reference node resource attribute information corresponding to the reference fault node in the preset time period of 10:00:00-10:02:00 is obtained, and the candidate node resource attribute information corresponding to the candidate node in the preset time period of 10:00:00-10:02:00 is obtained.

Based on the reference node resource attribute information, a reference node resource attribute image is determined, as shown in fig. 8. Based on the candidate node resource attribute information, a candidate node resource attribute image is determined, as shown in fig. 9.
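The text does not specify how resource attribute information is rendered as an image. One possible way, shown here purely as an assumption, is to rasterise a single metric series (for example CPU utilization sampled over the preset time period) into a small binary grid, one column per sample. The function name `series_to_image`, the grid height, the value range and the sample values are all hypothetical.

```python
def series_to_image(values, height=8, lo=0.0, hi=100.0):
    """Rasterise one metric series into a height x len(values) binary grid."""
    image = [[0] * len(values) for _ in range(height)]
    for col, value in enumerate(values):
        clipped = min(max(value, lo), hi)
        # Scale the value to a row index; row 0 is the top of the image,
        # so larger values are drawn in rows closer to the top.
        level = round((clipped - lo) / (hi - lo) * (height - 1))
        image[height - 1 - level][col] = 1
    return image


# Hypothetical CPU utilization samples (percent) within the preset time period.
cpu = [10.0, 12.0, 55.0, 90.0, 100.0]
image = series_to_image(cpu)
for row in image:
    print("".join("#" if pixel else "." for pixel in row))
```

Rendering the reference node and the candidate node with the same grid size and value range keeps the two images directly comparable.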

Step S703, determining the image similarity of the reference node resource attribute image and the candidate node resource attribute image by adopting a similarity network model.

Specifically, the similarity network model comprises two feature extraction modules and a similarity judgment module. The two feature extraction modules are respectively a first feature extraction module and a second feature extraction module, and the first feature extraction module and the second feature extraction module are completely the same.

The reference node resource attribute image is input into the first feature extraction module to obtain the reference image feature. At the same time, the candidate node resource attribute image is input into the second feature extraction module to obtain the candidate image feature. The reference image feature and the candidate image feature are then input into the similarity judgment module to obtain the image similarity of the reference node resource attribute image and the candidate node resource attribute image.

The first feature extraction module comprises a plurality of different convolution modules and a data flattening layer, and each convolution module comprises at least one convolution layer and at least one down-sampling layer. The second feature extraction module includes a plurality of different convolution modules, each convolution module including at least one convolution layer and at least one downsampling layer, and a data flattening layer. The similarity judging module comprises a characteristic difference layer, at least one full connection layer and a normalization layer. The output of the normalization layer is a value between 0 and 1.

For example, the similarity network model is shown in fig. 10, and the first feature extraction module includes 3 different convolution modules and a data flattening layer. The second feature extraction module includes 3 convolution modules and a data flattening layer. The similarity judging module comprises a characteristic difference layer, a full connection layer and a normalization layer.

Inputting the reference node resource attribute image to a first convolution module C1 in the first feature extraction module to obtain a reference image feature f11; inputting the reference image feature f11 into a second convolution module C2 in the first feature extraction module to obtain a reference image feature f12; and inputting the reference image feature f12 to a third convolution module C3 in the first feature extraction module to obtain a reference image feature f13, and finally inputting the reference image feature f13 to a data flattening layer in the first feature extraction module to obtain a reference image feature f14.

Meanwhile, inputting the candidate node resource attribute image to a first convolution module C1 in a second feature extraction module to obtain a candidate image feature f21; inputting the candidate image feature f21 into a second convolution module C2 in the second feature extraction module to obtain a candidate image feature f22; and inputting the candidate image feature f22 to a third convolution module C3 in the second feature extraction module to obtain a candidate image feature f23, and finally inputting the candidate image feature f23 to a data flattening layer in the second feature extraction module to obtain a candidate image feature f24.

Inputting the reference image feature f14 and the candidate image feature f24 into a feature difference layer in a similarity judgment module to obtain an image difference feature f3; inputting the image difference feature f3 into the full connection layer to obtain an image difference feature f4; and inputting the image difference feature f4 into a normalization layer, and finally obtaining the image similarity of the reference node resource attribute image and the candidate node resource attribute image.
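The data flow through the similarity judgment module can be sketched in plain Python. This covers only the last stage (feature difference layer, full connection layer, normalization layer) and leaves out the convolution modules. Several choices are assumptions: the feature difference is taken as an element-wise absolute difference, the full connection layer is reduced to a single output unit, and the normalization layer is modelled as a sigmoid so that the output lies between 0 and 1. The weights, bias and feature vectors are hypothetical, untrained values.

```python
import math


def similarity_head(ref_features, cand_features, weights, bias):
    """Feature difference -> one fully connected unit -> sigmoid."""
    # Feature difference layer (assumed to be an element-wise absolute difference): f3.
    diff = [abs(r - c) for r, c in zip(ref_features, cand_features)]
    # Full connection layer, reduced to a single output unit: f4.
    logit = sum(w * d for w, d in zip(weights, diff)) + bias
    # Normalization layer (assumed sigmoid): a value between 0 and 1.
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical flattened features f14 and f24, and hypothetical parameters.
f14 = [0.2, 0.5, 0.9]
f24 = [0.9, 0.1, 0.1]
weights = [-1.0, -1.0, -1.0]  # negative weights: larger differences lower the score
bias = 2.0

same = similarity_head(f14, f14, weights, bias)
different = similarity_head(f14, f24, weights, bias)
print(same > different)  # True
```

In a trained model the weights would be learned from labelled image pairs. The values above are chosen only so that identical features score higher than differing ones.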

Step S704, the image similarity is used as the fault similarity between the reference fault node and a candidate node.

In the embodiment of the application, after the reference node resource attribute information corresponding to the reference fault node in a preset time period and the candidate node resource attribute information corresponding to one candidate node in the preset time period are acquired, a reference node resource attribute image and a candidate node resource attribute image are determined, the image similarity of the two images is judged through the similarity network model, and the image similarity is taken as the fault similarity between the reference fault node and the candidate node. Judging the fault similarity directly from the reference node resource attribute information and the candidate node resource attribute information relies to a great extent on the prior knowledge of operation and maintenance personnel, which can lead to low accuracy and low efficiency. Converting the two kinds of information into images and then judging the fault similarity through the similarity network model therefore improves both the accuracy and the practicability of the fault similarity judgment.

Because the fault similarity judgment is carried out based on the similarity network model, no limitation is imposed on the resource attribute information, so the application range is wider. Adding further kinds of resource attribute information does not increase the cost, so the fault detection system can be rolled out quickly at low cost.

Optionally, in step S202, sequentially selecting main fault nodes from the plurality of original fault nodes along the extending direction of the original fault node chain includes the following steps:

An initial reference main fault node is acquired from the original fault node chain. Based on the reference main fault node, main fault nodes are selected iteratively from the plurality of original fault nodes along the extending direction of the original fault node chain.

Each iteration process comprises the following steps:

If the reference main fault node is not a bifurcation fault node, the reference main fault node is taken as a main fault node, and the original fault node adjacent to that main fault node in the extending direction of the original fault node chain is taken as the new reference main fault node.

If the reference main fault node is a bifurcation fault node, the reference main fault node is taken as a main fault node, and one sub fault node is selected from the plurality of sub fault nodes as the new reference main fault node based on the causal probabilities between the bifurcation fault node and its corresponding sub fault nodes.

Specifically, the following step is performed for each of the plurality of sub fault nodes: the causal probability between the bifurcation fault node and the sub fault node is determined based on the target resource abnormal information corresponding to the bifurcation fault node and to the sub fault node, respectively, in a preset time period.

Finally, the maximum causal probability among the obtained causal probabilities is determined, and the sub fault node corresponding to the maximum causal probability is taken as the reference main fault node.

The target resource abnormal information comprises a target resource abnormal time point, a target resource abnormal amplitude and a target resource abnormal duration. The resource attribute information comprises CPU utilization rate, memory occupancy rate, disk utilization rate, QPS and TPS.

The target resource abnormal information may be determined in either of the following two implementation manners:

In a first possible implementation manner, one of the resource attribute information is selected as the target resource attribute information, and then the target resource abnormal information is determined according to the target resource attribute information.

In a second possible implementation manner, for each resource attribute information, a resource abnormal amplitude corresponding to each resource attribute information is determined, the resource attribute information corresponding to the maximum resource abnormal amplitude is determined as target resource attribute information, and then the target resource abnormal information is determined according to the target resource attribute information.
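The second implementation manner amounts to taking the attribute with the largest abnormal amplitude. A minimal Python sketch follows, assuming the amplitudes of the different attributes have already been expressed on a common scale (the text does not say how amplitudes of, for example, QPS and CPU utilization are made comparable); the function name and the sample values are hypothetical.

```python
def pick_target_attribute(amplitudes):
    """Return the name of the attribute with the largest abnormal amplitude.

    amplitudes maps an attribute name to its abnormal amplitude, assumed to be
    normalized to a common scale beforehand.
    """
    if not amplitudes:
        raise ValueError("no resource attribute information supplied")
    return max(amplitudes, key=amplitudes.get)


# Hypothetical amplitudes for one node in the preset time period.
sample = {
    "cpu_utilization": 0.60,
    "memory_occupancy": 0.15,
    "disk_utilization": 0.05,
}
target_attribute = pick_target_attribute(sample)
```

Here `target_attribute` is `"cpu_utilization"`, and the target resource abnormal information would then be derived from that attribute alone.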

Specifically, the causal probability between a forked fault node and one sub-fault node is expressed by the following formula 1:

In formula (1), Prob represents the causal probability between the bifurcation fault node and the sub fault node, t1 represents the target resource abnormal time point corresponding to the bifurcation fault node, A1 represents the target resource abnormal amplitude corresponding to the bifurcation fault node, D1 represents the target resource abnormal duration corresponding to the bifurcation fault node, t2 represents the target resource abnormal time point corresponding to the sub fault node, A2 represents the target resource abnormal amplitude corresponding to the sub fault node, and D2 represents the target resource abnormal duration corresponding to the sub fault node.

For example, the target resource attribute information is set to the CPU utilization rate, and the preset time period is 10:00:00-10:02:00. Fig. 11 shows the CPU utilization rate of the bifurcation fault node, of sub fault node 1 and of sub fault node 2 in the preset time period.

For the CPU utilization rate of the bifurcation fault node in the preset time period, the abnormal time point is 10:00:12, the abnormal amplitude is 100% minus 40%, namely 60%, and the abnormal duration is 10:02:00 minus 10:00:12, namely 108 s.

For the CPU utilization rate of sub fault node 1 in the preset time period, the abnormal time point is 10:00:15, the abnormal amplitude is 85% minus 25%, namely 60%, and the abnormal duration is 10:02:00 minus 10:00:15, namely 105 s.

For the CPU utilization rate of sub fault node 2 in the preset time period, the abnormal time point is 10:00:15, the abnormal amplitude is 50% minus 30%, namely 20%, and the abnormal duration is 10:02:00 minus 10:00:15, namely 105 s.
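The arithmetic in the three cases above can be reproduced with a short Python sketch. It assumes that the abnormal time point and the CPU levels before and during the anomaly are already known, and that the anomaly lasts until the end of the preset time period, as in this example; how the onset is detected is not described in the text, and the function and parameter names are hypothetical.

```python
from datetime import datetime

TIME_FORMAT = "%H:%M:%S"


def anomaly_info(onset, window_end, level_before, level_during):
    """Return (abnormal time point, abnormal amplitude, abnormal duration in s).

    Levels are CPU utilization in percentage points. The anomaly is assumed to
    run from its onset to the end of the preset time period.
    """
    start = datetime.strptime(onset, TIME_FORMAT)
    end = datetime.strptime(window_end, TIME_FORMAT)
    duration_s = int((end - start).total_seconds())
    amplitude = level_during - level_before
    return onset, amplitude, duration_s


bifurcation = anomaly_info("10:00:12", "10:02:00", 40, 100)
sub_node_1 = anomaly_info("10:00:15", "10:02:00", 25, 85)
sub_node_2 = anomaly_info("10:00:15", "10:02:00", 30, 50)
```

The three tuples give amplitudes of 60, 60 and 20 percentage points and durations of 108 s, 105 s and 105 s, matching the values stated above.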

According to formula (1), the causal probability between the bifurcation fault node and sub fault node 1 is determined to be 0.0162.

According to formula (1), the causal probability between the bifurcation fault node and sub fault node 2 is determined to be 0.0054.

Since the causal probability 0.0162 between the bifurcation fault node and sub fault node 1 is greater than the causal probability 0.0054 between the bifurcation fault node and sub fault node 2, sub fault node 1 is taken as the reference main fault node.

For example, as shown in fig. 12, the original fault node chain includes 5 original fault nodes, which are original fault node 1, original fault node 2, original fault node 3, original fault node 4, and original fault node 5, respectively.

The extending direction of the original fault node chain is set as left to right. Original fault node 1 is acquired from the original fault node chain as the initial reference main fault node. Because the reference main fault node is not a bifurcation fault node, it is taken as a main fault node, and original fault node 2 becomes the reference main fault node.

When original fault node 2 is the reference main fault node, it is a bifurcation fault node, so it is taken as a main fault node, and its 2 sub fault nodes are determined to be original fault node 3 and original fault node 4. The causal probability between the reference main fault node and original fault node 3 is determined as prob23, and the causal probability between the reference main fault node and original fault node 4 is determined as prob24. Assuming that prob23 is smaller than prob24, original fault node 4 becomes the reference main fault node.

When original fault node 4 is the reference main fault node, it is not a bifurcation fault node, so it is taken as a main fault node, and original fault node 5 becomes the reference main fault node.

When original fault node 5 is the reference main fault node, it is not a bifurcation fault node, so it is taken as a main fault node, and the process ends.

Finally, the original fault node 1, the original fault node 2, the original fault node 4 and the original fault node 5 are all main fault nodes, and a main fault node chain is formed.
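The walk through fig. 12 can be sketched in Python as follows. This is an illustrative sketch only: the adjacency map, the numeric values chosen for prob23 and prob24 (any values with prob23 smaller than prob24 would do), and the function name `select_main_chain` are assumptions, and the causal probabilities are taken as given rather than computed from formula (1).

```python
def select_main_chain(start, children, causal_prob):
    """Walk the original fault node chain and collect the main fault nodes.

    children maps a node to the nodes that follow it in the extending
    direction; a node with more than one successor is a bifurcation fault node.
    causal_prob maps (bifurcation node, sub fault node) to a causal probability.
    """
    chain = []
    ref = start
    while ref is not None:
        chain.append(ref)  # the reference node always becomes a main fault node
        subs = children.get(ref, [])
        if not subs:
            ref = None  # end of the original fault node chain
        elif len(subs) == 1:
            ref = subs[0]  # not a bifurcation: move to the adjacent node
        else:
            # Bifurcation: keep the sub fault node with the largest probability.
            ref = max(subs, key=lambda s: causal_prob[(ref, s)])
    return chain


# Fig. 12 example: node 2 bifurcates into nodes 3 and 4; node 5 follows node 4.
children = {1: [2], 2: [3, 4], 4: [5]}
causal_prob = {(2, 3): 0.2, (2, 4): 0.5}  # hypothetical prob23 < prob24
main_chain = select_main_chain(1, children, causal_prob)
```

The resulting `main_chain` is `[1, 2, 4, 5]`, which is the main fault node chain obtained in the example.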

In the embodiment of the application, because the original fault node chain contains more original fault nodes, the efficiency of fault detection can be effectively improved by selecting the main fault node from the original fault nodes and determining the main fault node chain based on each selected main fault node.

When the reference main fault node is a bifurcation fault node, the causal probabilities between the bifurcation fault node and each of its sub fault nodes are determined. The greater the causal probability, the stronger the causality between the bifurcation fault node and the sub fault node. Selecting the sub fault node with the maximum causal probability as the reference main fault node therefore filters out sub fault nodes with weak causality and ensures the accuracy of the selected main fault nodes, and hence of the main fault node chain.

Optionally, in the step S203, determining the target root fault node from the primary fault node chain based on the causal probability between the primary fault nodes in the primary fault node chain includes the following steps:

acquiring an initial reference root fault node from a main fault node chain; and iteratively updating the reference root fault node based on the causal probability between the reference root fault node and other main fault nodes in the main fault node chain until iteration is finished, and taking the reference root fault node as a target root fault node.

As shown in fig. 13, each iteration process includes the following steps:

Step S1301, one main fault node is obtained from the other main fault nodes.

Step S1302, the causal probability between the reference root fault node and the one main fault node is determined based on the target resource abnormal information corresponding to the reference root fault node and the target resource abnormal information corresponding to the one main fault node.

Specifically, the target resource abnormality information includes a target resource abnormality time point, a target resource abnormality amplitude, and a target resource abnormality duration. The determination method of the target resource abnormal information is the same as the above.

The causal probability between the reference root fault node and the one main fault node is given by formula (1), where Prob represents the causal probability between the reference root fault node and the main fault node, t1 represents the target resource abnormal time point corresponding to the reference root fault node, A1 represents the target resource abnormal amplitude corresponding to the reference root fault node, D1 represents the target resource abnormal duration corresponding to the reference root fault node, t2 represents the target resource abnormal time point corresponding to the main fault node, A2 represents the target resource abnormal amplitude corresponding to the main fault node, and D2 represents the target resource abnormal duration corresponding to the main fault node.

Step S1303, if the causal probability is greater than a preset causal threshold, executing step S1304; otherwise, step S1305 is executed.

Step S1304, the reference root fault node remains unchanged.

Step S1305, the one main fault node is taken as the reference root fault node.

For example, as shown in fig. 14, the main failure node chain includes a main failure node 1, a main failure node 2, a main failure node 3, and a main failure node 4. And acquiring a main fault node 1 from the main fault node chain as an initial reference root fault node, wherein other main fault nodes comprise a main fault node 2, a main fault node 3 and a main fault node 4. The causal probability threshold is set to 0.45.

Main fault node 1 is the reference root fault node, and one main fault node, namely main fault node 2, is obtained from the other main fault nodes. The causal probability between the reference root fault node and main fault node 2 is assumed to be 0.3; because 0.3 is smaller than the causal probability threshold of 0.45, main fault node 2 is taken as the reference root fault node.

Main fault node 2 is now the reference root fault node, and one main fault node, namely main fault node 3, is obtained from the other main fault nodes. The causal probability between the reference root fault node and main fault node 3 is assumed to be 0.7; because 0.7 is greater than the causal probability threshold of 0.45, the reference root fault node remains unchanged.

Main fault node 2 is still the reference root fault node, and one main fault node, namely main fault node 4, is obtained from the other main fault nodes. The causal probability between the reference root fault node and main fault node 4 is assumed to be 0.6; because 0.6 is greater than the causal probability threshold of 0.45, the reference root fault node remains unchanged.

Finally, the reference root fault node, namely main fault node 2, is determined as the target root fault node.
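The iteration in this example can be written as a brief Python sketch. It assumes that the other main fault nodes are visited in chain order and that the causal probabilities are supplied as a lookup table rather than computed from formula (1); the function name `find_root_cause` and the dictionary layout are hypothetical.

```python
def find_root_cause(chain, causal_prob, threshold):
    """Iteratively update the reference root fault node along the main chain.

    causal_prob maps (reference root fault node, main fault node) to a causal
    probability. The reference is kept when the probability exceeds the
    threshold; otherwise the compared node becomes the new reference.
    """
    ref = chain[0]  # initial reference root fault node
    for node in chain[1:]:  # the other main fault nodes, in chain order
        if causal_prob[(ref, node)] > threshold:
            continue  # reference root fault node remains unchanged
        ref = node  # the compared main fault node becomes the reference
    return ref


# Values from the fig. 14 example.
main_chain = [1, 2, 3, 4]
causal_prob = {(1, 2): 0.3, (2, 3): 0.7, (2, 4): 0.6}
root = find_root_cause(main_chain, causal_prob, threshold=0.45)
```

With the probabilities of the example, `root` is main fault node 2, the same target root fault node reached above.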

In the embodiment of the application, one main fault node is selected from other main fault nodes, and the causal probability between the reference root fault node and the main fault node is determined each time, so that the fault detection efficiency can be effectively improved. And finally, after the reference root fault node and each main fault node in other main fault nodes are compared, the target root fault node is determined, and the effect of improving the fault detection accuracy is realized.

In order to better explain the embodiment of the present application, a fault detection method provided by the embodiment of the present application is described below with reference to specific implementation scenarios, as shown in fig. 15, including the following steps:

Step S1501, an initial reference fault node in the task node chain is obtained.

Step S1502, an original fault node chain is determined from the task node chain by iteration, based on the reference fault node and the fault similarity between two adjacent task nodes in the task node chain.

Step S1503, an initial reference main fault node is obtained from the original fault node chain.

Step S1504, based on the reference main fault node, main fault nodes are iteratively selected from the plurality of original fault nodes along the extending direction of the original fault node chain.

Step S1505, a main fault node chain is determined based on the selected main fault nodes.

Step S1506, obtain the initial reference root fault node from the main fault node chain, and determine other main fault nodes.

Step S1507, judging whether other main fault nodes are empty, if so, executing step S1513; otherwise, step S1508 is executed.

Step S1508, one main fault node is obtained from the other main fault nodes.

Step S1509, the causal probability between the reference root fault node and the one main fault node is determined based on the target resource abnormal information corresponding to the reference root fault node and the target resource abnormal information corresponding to the one main fault node.

Step S1510, judging whether the causal probability is greater than a preset causal threshold; if so, executing step S1511; otherwise, executing step S1512.

Step S1511, the reference root fault node remains unchanged.

Step S1512, the one main fault node is taken as the reference root fault node.

Step S1513, the reference root fault node is taken as the target root fault node.

In the embodiment of the application, because the original fault node chain contains more original fault nodes, the efficiency of fault detection can be effectively improved by selecting the main fault node from the original fault nodes and determining the main fault node chain based on each selected main fault node. And a target root fault node is determined based on the causal probability between the reference root fault node and other main fault nodes, so that the fault detection efficiency is improved, and the fault detection accuracy is ensured.

To better explain the embodiments of the present application, a fault detection method provided by the embodiments of the present application is described below with reference to a specific implementation scenario, where the fault detection method is executed by the fault detection system 102 in fig. 1. As shown in fig. 16, the fault detection system includes a task node chain obtaining module 1601, a monitoring module 1602, an algorithm module 1603, a fault handling module 1604, and a feedback module 1605.

The task node chain obtaining module 1601 obtains a task node chain in the task system 103.

The monitoring module 1602 obtains resource attribute information corresponding to each task node in the task node chain, including CPU utilization, memory occupancy, disk utilization, QPS, and TPS.

Algorithm module 1603 receives the task node chain sent by task node chain obtaining module 1601 and the resource attribute information corresponding to each task node sent by monitoring module 1602, and determines a target root fault node based on the task node chain and the resource attribute information corresponding to each task node.

The fault handling module 1604 performs operations such as restarting processes, restarting applications, or restarting virtual machines according to the target root fault node determined by the algorithm module 1603.

The feedback module 1605 records and analyzes the target root cause fault node determined by the algorithm module 1603, and updates the similarity network model in the algorithm module periodically.

In this embodiment of the present application, the task node chain obtaining module 1601 is capable of obtaining and updating a task node chain in real time without human intervention. The feedback module 1605 records and analyzes each fault detection result, and periodically updates the similarity network model, thereby ensuring the accuracy of fault detection.

Based on the same technical concept, the present embodiment provides a fault detection apparatus, as shown in fig. 17, the fault detection apparatus 1700 includes:

an original failure node chain obtaining module 1701, configured to obtain an original failure node chain, where the original failure node chain includes multiple original failure nodes;

a main failure node chain obtaining module 1702, configured to sequentially select a main failure node from the multiple original failure nodes along an extending direction of the original failure node chain, and determine a main failure node chain based on each selected main failure node;

a target determining module 1703, configured to determine a target root fault node from the main fault node chain based on a causal probability between the main fault nodes in the main fault node chain.

Optionally, the original failure node chain obtaining module 1701 is specifically configured to:

acquiring an initial reference fault node in a task node chain;

Optionally, the original failure node chain obtaining module 1701 is specifically configured to:

Optionally, the main failure node chain obtaining module 1702 is specifically configured to:

Optionally, the target determining module 1703 is specifically configured to:

acquiring an initial reference root fault node from the main fault node chain;

Optionally, the target determining module 1703 is specifically configured to:

each iteration process comprises the following steps:

acquiring a main fault node from the other main fault nodes;

otherwise, the main fault node is used as the reference root fault node.

Based on the same technical concept, the embodiment of the present application provides a computer device, which may be a terminal or a server, as shown in fig. 18, including at least one processor 1801 and a memory 1802 connected to the at least one processor, where a specific connection medium between the processor 1801 and the memory 1802 is not limited in this embodiment of the present application, and the processor 1801 and the memory 1802 in fig. 18 are connected through a bus as an example. The bus may be divided into an address bus, a data bus, a control bus, etc.

In the embodiment of the present application, the memory 1802 stores instructions executable by the at least one processor 1801, and the at least one processor 1801 may execute the steps included in the fault detection method by executing the instructions stored in the memory 1802.

The processor 1801 is a control center of the computer device, and may connect various portions of the computer device by using various interfaces and lines, and perform fault detection by running or executing instructions stored in the memory 1802 and calling data stored in the memory 1802. Optionally, the processor 1801 may include one or more processing units, and the processor 1801 may integrate an application processor and a modem processor, where the application processor mainly handles an operating system, a user interface, application programs, and the like, and the modem processor mainly handles wireless communication. It is to be appreciated that the modem processor described above may not be integrated into the processor 1801. In some embodiments, the processor 1801 and the memory 1802 may be implemented on the same chip, or in some embodiments, they may be implemented separately on separate chips.

The processor 1801 may be a general-purpose processor, such as a Central Processing Unit (CPU), a digital signal processor, an Application Specific Integrated Circuit (ASIC), a field programmable gate array or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present Application. A general purpose processor may be a microprocessor or any conventional processor or the like. The steps of a method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware processor, or may be implemented by a combination of hardware and software modules in a processor.

Memory 1802, which is a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The Memory 1802 may include at least one type of storage medium, and may include, for example, a flash Memory, a hard disk, a multimedia card, a card-type Memory, a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Programmable Read Only Memory (PROM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a magnetic Memory, a magnetic disk, an optical disk, and so on. The memory 1802 is any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to such. The memory 1802 of the embodiments of the present application may also be circuitry or any other device capable of performing a storage function for storing program instructions and/or data.

Based on the same inventive concept, embodiments of the present application provide a computer-readable storage medium storing a computer program executable by a computer device, which, when the program runs on the computer device, causes the computer device to perform the steps of the above-mentioned fault detection method.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims

1. A method of fault detection, comprising:

2. The method of claim 1, wherein said obtaining an original fault node chain comprises:

acquiring an initial reference fault node in a task node chain;

3. The method of claim 2, wherein said determining a fault similarity between said reference faulty node and said at least one candidate node, respectively, comprises:

4. The method of claim 3, wherein said determining a fault similarity of said reference failed node and said one candidate node based on said reference node resource attribute information and said candidate node resource attribute information comprises:

5. The method of claim 1, wherein said sequentially selecting a main fault node from the plurality of original fault nodes along the extending direction of the original fault node chain comprises:

6. The method of claim 5, wherein the selecting one of the plurality of sub-failed nodes as the reference primary failed node based on causal probabilities between the bifurcated failed node and a corresponding plurality of sub-failed nodes, respectively, comprises:

7. The method of claim 1, wherein the determining a target root cause failure node from the chain of primary failure nodes based on causal probabilities between respective primary failure nodes in the chain of primary failure nodes comprises:

acquiring an initial reference root fault node from the main fault node chain;

8. The method of claim 7, wherein each iterative process comprises the steps of:

acquiring a main fault node from the other main fault nodes;

otherwise, the main fault node is used as the reference root fault node.

9. The method of claim 6 or 8, wherein the target resource exception information includes a target resource exception time point, a target resource exception magnitude, and a target resource exception duration.

10. A fault detection device, comprising:

and the target determining module is used for determining a target root fault node from the main fault node chain based on the causal probability among the main fault nodes in the main fault node chain.

11. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the method of any one of claims 1 to 9 are performed when the program is executed by the processor.

12. A computer-readable storage medium, having stored thereon a computer program executable by a computer device, for causing the computer device to perform the steps of the method of any one of claims 1 to 9, when the program is run on the computer device.