CA3140769A1 - Method and system for positioning fault root cause of service system

Info

Publication number: CA3140769A1
Application number: CA3140769A
Authority: CA
Inventors: Xuepeng Zhai; Yuxue Bao; Zhiliang Geng
Original assignee: 10353744 Canada Ltd
Current assignee: 10353744 Canada Ltd
Priority date: 2020-11-30
Filing date: 2021-11-30
Publication date: 2022-05-30
Also published as: CN112491622B; CN112491622A

Abstract

The present invention discloses a method and system for locating the root cause of a business system failure. The method comprises: a calculation engine obtains call-chain messages, samples the call-chain messages according to a sampling ratio, groups and assembles them to obtain call-chains, obtains the time consumption of each service interface in a call-chain, and filters out failure service interfaces according to a time consumption comparison table; the calculation engine obtains network element indicator messages, calculates a first variation range of each network element indicator, and filters out failure network element indicators according to a first variation range threshold; the calculation engine correlates the failure service interfaces with the failure network element indicators according to time and issues a failure warning, the failure warning comprising the failure service interface and the failure network element indicator that caused the failure service interface to fail. By combining call-chain logs with network element indicator data in this way, the system quickly locates the root cause of a business system failure.

Description

METHOD AND SYSTEM FOR POSITIONING FAULT ROOT CAUSE OF SERVICE SYSTEM
Field

[0001] The present disclosure relates to the technical field of operation and maintenance monitoring in computer cloud environments, and in particular to a method and system for locating the root cause of a business system failure.
Background

[0002] At present, cloud technology continues to improve, private and public clouds continue to appear, and operation and maintenance monitoring technology is developing along with them. Software and hardware monitoring techniques are relatively rich, covering both the monitoring of software and hardware indicators in the cloud environment and the monitoring of the business running in it. An example of the former is monitoring, through Prometheus, the CPU usage, RAM usage, disk I/O usage, network I/O usage, number of Redis connections and other network element indicators of a network element; an example of the latter is analyzing and marking the range of failure network elements through business call-chain data.

[0003] However, techniques that combine software and hardware indicator monitoring of the cloud environment with monitoring of the services running in it are relatively scarce. When a business system encounters failures such as increased request response time, a decreased request success rate, or a sudden drop or surge in TPM, there is no method that correlates the two kinds of monitoring to quickly locate the root cause of the failure. For example, suppose service interface A in the production environment raises a warning because its request success rate is low. Troubleshooting the root cause by traditional methods requires manually checking the network element indicator log files and call-chain log files and investigating upstream and downstream. Because a plurality of indicators are involved, finding the root cause usually requires the collaboration of people from a plurality of fields, which is time-consuming and labor-intensive.
Invention Content

[0004] The purpose of the present invention is to provide a method and system for locating the root cause of a business system failure, which combine call-chain logs and network element indicator data to quickly locate the root cause of a business system failure.

[0005] To achieve the above purpose, the present invention provides following technical solutions:

[0006] A method for locating root cause of business system failure, comprising:

[0007] Calculation engine obtains call-chain messages, sampling the call-chain messages according to sampling ratio, grouping and assembling to obtain the call-chain, obtaining time consumption of each service interface in the call-chain, and filtering out failure service interface according to time consumption comparison table;

[0008] Calculation engine obtains network element indicator messages, calculating first variation range of the network element indicator, filtering out failure network element indicator according to the first variation range threshold value;

[0009] Calculation engine correlates the failure service interface with the failure network element indicator according to time and issues failure warning, the failure warning comprises failure service interface and failure network element indicator, wherein the failure network element indicator causes the failure service interface to fail.

[0010] Preferably, the method by which the calculation engine obtains call-chain messages and network element indicator messages comprises:

[0011] Network element periodically generates network element indicator messages and stores in indicator log file;

[0012] Generating call-chain messages when service interface deployed by network element is called and storing in call-chain unit log file;

[0013] Network element uses log file collection module to push newly added network element indicator messages of the indicator log file and newly added call-chain messages of the call-chain unit log file to distributed publish and subscribe message system;

[0014] Calculation engine reads call-chain messages and network element indicator messages from the distributed publish and subscribe message system.

[0015] Preferably, method of calculation engine filters out the failure service interface comprises:

[0016] Obtaining and parsing the call-chain messages to get call-chain ID, service interface name identification and event corresponding to call-chain message, meanwhile, obtaining call-chain entry message;

[0017] Configuring sampling ratio of the call-chain messages, obtaining call-chain entry message sample according to the sampling ratio;

[0018] According to call-chain ID and event time to filter out call-chain messages belonging to the same group as the call-chain entry message sample, assembling all call-chain message of the same group to get call-chain information;

[0019] Obtaining time consumption of each service interface in call-chain and filtering out failure service interface according to time consumption comparison table.

[0020] Specifically, the method of parsing the call-chain messages to get call-chain entry message comprises:

[0021] Setting expiration time of the call-chain messages, filtering and parsing unexpired call chain messages according to the expiration time, obtaining call-chain ID, service interface name identification and event time of the call-chain messages;

[0022] Filtering out call-chain entry message in the call-chain messages according to call-chain ID and service interface name identification.

[0023] Furthermore, the method of filtering out failure service interface according to time consumption comparison table comprises:

[0024] Pre-calculating average time consumption corresponding to service interface, storing service interface name identification of the service interface and the average time in the time consumption comparison table with one-to-one correspondence;

[0025] Obtaining service interface in call-chain, wherein the service interface takes more time than the average time in the time consumption comparison table, storing the service interface as alternative failure service interface;

[0026] Calculating failure credibility of alternative failure service interface according to actual time and correspondingly average time of the alternative failure service interface, filtering out alternative failure service interface with greater failure credibility than pre-set credibility threshold and storing as failure service interface.

[0027] Preferably, the method of calculation engine filters out failure network element comprises:

[0028] Pre-calculating sample mean and stability of network element, and correspondingly storing indicator name identification, stability and sample mean of network element in stability comparison table;

[0029] Obtaining and parsing network element indicator messages, obtaining indicator name identification and current value of the network element indicator;

[0030] Searching for correspondingly sample mean and stability from the stability comparison table according to the indicator name identification;

[0031] According to the current value of network indicator and the correspondingly sample mean and stability, calculating first variant range of network element indicator, storing the network element indicator with greater first variant range than pre-set first variant threshold as failure network element indicator.

[0032] Specifically, the method of pre-calculating sample mean and stability of network element comprises:

[0033] Collecting x network element indicator samples;

[0034] Calculating mean value of the network element indicator samples and second variant range of each network element indicator sample, counting quantity n of failure network element indicator samples with greater second variant range than pre-set second variant threshold;

[0035] Calculating stability m of network element indicator, wherein $m = (x - n)/x$.

[0036] Preferably, the method of calculation engine filters out failure network element indicator also comprises:

[0037] Obtaining pre-set quantity of network element indicator value before current network element indicator value as recent network element indicator value;

[0038] Calculating recent mean value according to the recent network element indicator value;

[0039] Calculating recent variant range of network element indicator according to the recent mean value and the current value of network element indicator;

[0040] Calculating sample variant range of network element indicator according to the sample mean and the current value of network element indicator;

[0041] After weighted calculation based on the recent variant range and the sample variant range of network element indicator, combining with stability of network element indicator to obtain the first variant range of network element indicator.

[0042] Preferably, indicator name identification and indicator value of network element indicator are stored in a sliding matrix based on time, wherein, each column of the sliding matrix corresponds to a minute span, each row corresponds to an indicator value identified by an indicator name identification; or each row corresponds to a minute span, each column corresponds to an indicator value identified by an indicator name identification.

[0043] A system for locating root cause of business system failure, comprising a calculation engine, a distributed publish and subscribe message system and a network element, wherein the calculation engine comprises a message reading module, a call-chain processing module, a network element indicator processing module, and a failure warning module, wherein,

[0044] The distributed publish and subscribe message system is configured to store network element indicators and generated call-chain messages when service interface deployed by the network element is called;

[0045] The message reading module is configured to read call-chain messages and network element indicator messages from the distributed publish and subscribe message system;

[0046] The call-chain processing module is configured to sample the call-chain messages according to sampling ratio, group and assemble them to obtain the call-chain, obtain time consumption of each service interface in the call-chain, and filter out failure service interface according to time consumption comparison table;

[0047] The network element indicator processing module is configured to calculate first variation range of the network element indicator and filter out failure network element indicator according to the first variation range threshold value;

[0048] The failure warning module is configured to correlate the failure service interface with the failure network element indicator according to time and issue failure warning, the failure warning comprises failure service interface and failure network element indicator, wherein the failure network element indicator causes the failure service interface to fail.

[0049] Comparing with the prior art, the method and system for locating root cause of business system failure provided by the present invention has the following beneficial effects:

[0050] In the method for locating root cause of business system failure provided by the present invention, after the calculation engine obtains call-chain messages and network element indicators, it performs failure analysis on each separately to filter out failure service interfaces and failure network element indicators, and then correlates the failure service interfaces with the failure network element indicators according to time to identify and warn of the root cause of the service failure, that is, the failure service interface and the failure network element indicator that caused it to fail. This reduces the complexity of troubleshooting for operation and maintenance personnel and saves the time and labor cost of troubleshooting.

[0051] The system for locating root cause of business system failure provided by the present invention adopts the above-mentioned method for locating root cause of business system failure, by combining call-chain log and network element indicator data to quickly locate the root cause of business system failure.
Drawing Description

[0052] The drawings described here are used to provide further understanding of the present invention and constitute a part of the present invention. The illustrated exemplary implementations and descriptions are used to explain the present invention, and do not constitute an improper limitation of the present invention. In the attached figures:

[0053] Figure 1 is a process diagram of a method for locating root cause of business system failure in the implementation of the present invention;

[0054] Figure 2 is a task logic diagram of calculation engine in the implementation of the present invention;

[0055] Figure 3 is a system architecture diagram of locating root cause of business system failure in the implementation of the present invention.
Specific implementation methods

[0056] In order to make the purpose, technical solutions and benefits of the present invention clearer, the technical solutions of the implementations in the present application are described clearly and completely below with reference to the accompanying drawings. Obviously, the described implementations are only a part of the implementations in the present application. Based on the implementations in the present application, all other implementations obtained by those of ordinary skill in the art fall within the protection scope of the present application.

[0057] Implementation one

[0058] Please refer to Figure 1, a method for locating root cause of business system failure provided by the implementation of the present invention, comprising:

[0059] Calculation engine obtains call-chain messages, sampling the call-chain messages according to sampling ratio, grouping and assembling to obtain the call-chain, obtaining time consumption of each service interface in the call-chain, and filtering out failure service interface according to time consumption comparison table;

[0060] Calculation engine obtains network element indicator messages, calculating first variation range of the network element indicator, filtering out failure network element indicator according to the first variation range threshold value;

[0061] Calculation engine correlates the failure service interface with the failure network element indicator according to time and issues failure warning, the failure warning comprises failure service interface and failure network element indicator, wherein the failure network element indicator causes the failure service interface to fail.

[0062] In the method for locating root cause of business system failure provided by the implementation of the present invention, after the calculation engine obtains call-chain messages and network element indicators, it performs failure analysis on each separately to filter out failure service interfaces and failure network element indicators, and then correlates the failure service interfaces with the failure network element indicators according to time to identify and warn of the root cause of the service failure, that is, the failure service interface and the failure network element indicator that caused it to fail. This reduces the complexity of troubleshooting for operation and maintenance personnel and saves the time and labor cost of troubleshooting.
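The time-based correlation step can be illustrated with a minimal sketch. The text states only that correlation is performed "according to time"; the fixed 60-second window, the tuple layouts and the function name `correlate_by_time` below are assumptions made for illustration, not details of the described method.

```python
from typing import Dict, List, Tuple

def correlate_by_time(
    failed_interfaces: List[Tuple[str, float]],
    failed_indicators: List[Tuple[str, float]],
    window_seconds: float = 60.0,  # assumed window; not specified in the text
) -> List[Dict[str, object]]:
    """Pair each failure service interface with the failure indicators
    observed within `window_seconds` of it.

    Both inputs are lists of (name, event_time_in_seconds).
    """
    warnings = []
    for interface_name, interface_time in failed_interfaces:
        suspects = [
            indicator_name
            for indicator_name, indicator_time in failed_indicators
            if abs(indicator_time - interface_time) <= window_seconds
        ]
        if suspects:
            warnings.append(
                {"failure_interface": interface_name, "suspect_indicators": suspects}
            )
    return warnings
```

A real deployment would presumably also restrict candidates to network elements on the failing call-chain; that restriction is omitted here to keep the sketch short.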

[0063] Please refer to Figure 2 or Figure 3, the method for locating root cause of business system failure provided by the implementation of the present invention, the method of calculation engine obtains call-chain messages and network element indicator messages comprises:

[0064] Network element periodically generates network element indicator messages and stores in indicator log file;

[0065] Generating call-chain messages when service interface deployed by network element is called and storing in call-chain unit log file;

[0066] Network element uses log file collection module to push newly added network element indicator messages of the indicator log file and newly added call-chain messages of the call-chain unit log file to distributed publish and subscribe message system;

[0067] Calculation engine reads call-chain messages and network element indicator messages from the distributed publish and subscribe message system.

[0068] Those skilled in the art should know that a network element refers to a basic service unit in the cloud environment that has a globally unique identification name, IP and other information, for example a docker container with identification name docker 001. A network element indicator refers to a specific indicator monitored in the network element, which has an indicator name and an indicator value at a given time point, such as RAM usage, CPU usage or Redis connections. A call-chain refers to the call systems, network elements and key embedding methods through which a business request passes; the main dimensions involved include: network element unique identification, service interface name, call-chain ID, service interface ID (unique in the call-chain), caller service ID, start time, service interface execution time, success and failure identification, etc.

[0069] In a specific implementation, the network element periodically generates monitored network element indicator information and stores it in the indicator log file (the period, e.g. 1 second, 30 seconds, 60 seconds or 300 seconds, can be customized). The log collection module (such as Flume) integrated in the network element monitors changes in the indicator log file and pushes the newly added network element indicator messages to a distributed publish and subscribe message system such as Kafka. The indicator log file is generated on a rolling basis, and historical indicator log files are cleared periodically. The network element also has service interfaces deployed on it; a service interface generates call-chain messages when it is called by the business and stores them in the call-chain unit log, which the log collection module likewise pushes to the distributed publish and subscribe message system. The calculation engine then reads call-chain messages and network element indicator messages from the distributed publish and subscribe message system.

[0070] Please refer to Figure 2, the method for locating root cause of business system failure provided by the implementation of the present invention, the method of calculation engine filters out the failure service interface comprises:

[0071] Obtaining and parsing the call-chain messages to get call-chain ID, service interface name identification and event corresponding to call-chain message, meanwhile, obtaining call-chain entry message;

[0072] Configuring sampling ratio of the call-chain messages, obtaining call-chain entry message sample according to the sampling ratio;

[0073] According to call-chain ID and event time to filter out call-chain messages belonging to the same group as the call-chain entry message sample, assembling all call-chain message of the same group to get call-chain information;

[0074] Obtaining time consumption of each service interface in call-chain and filtering out failure service interface according to time consumption comparison table.

[0075] Wherein, the method of parsing the call-chain messages to get call-chain entry message comprises:

[0076] Setting expiration time of the call-chain messages, filtering and parsing unexpired call chain messages according to the expiration time, obtaining call-chain ID, service interface name identification and event time of the call-chain messages;

[0077] Filtering out call-chain entry message in the call-chain messages according to call-chain ID and service interface name identification.

[0078] In a specific implementation, after the calculation engine receives a call-chain message, it first judges whether the message has expired according to the event time in the call-chain message. If the message exceeds the set expiration time, it is discarded directly. Otherwise, the call-chain ID and service interface name identification are parsed, and the call-chain entry message is filtered out of the call-chain messages according to the call-chain ID and service interface name identification. The data format of a call-chain entry message can be (service interface name identification, call-chain ID, event time). The TPS (Transactions Per Second, that is, the number of transactions processed by the server per second) of the service interface identified in the call-chain entry message is calculated in real time; the sampling ratio of the call-chain messages is then configured, and a sample of the call-chain entry messages is obtained according to the sampling ratio.
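The expiration check, entry-message extraction and per-second counting described above might look like the following sketch. The 60-second expiration value, the `CallChainMessage` fields, and the use of a configured set of entry interface names to recognize entry messages are all assumptions for illustration; the text does not specify them.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional, Set, Tuple

EXPIRATION_SECONDS = 60.0  # assumed; the text only says an expiration time is set

@dataclass(frozen=True)
class CallChainMessage:
    chain_id: str
    interface_name: str
    event_time: float  # seconds since the Unix epoch

EntryMessage = Tuple[str, str, float]  # (interface name, call-chain ID, event time)

def parse_unexpired(
    msg: CallChainMessage, now: float, expiration: float = EXPIRATION_SECONDS
) -> Optional[EntryMessage]:
    """Return the parsed triple, or None if the message has expired."""
    if now - msg.event_time > expiration:
        return None
    return (msg.interface_name, msg.chain_id, msg.event_time)

def is_entry(msg: CallChainMessage, entry_interfaces: Set[str]) -> bool:
    """Hypothetical rule: an entry message is one whose interface is a known entry point."""
    return msg.interface_name in entry_interfaces

def tps_by_interface(entries: Iterable[EntryMessage]) -> Counter:
    """Count entry messages per (interface name, whole second of event time)."""
    counts: Counter = Counter()
    for name, _chain_id, event_time in entries:
        counts[(name, int(event_time))] += 1
    return counts
```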

[0079] In the sampling process, the sampling gradient corresponding to the configured TPS is loaded first. Each level of the sampling gradient maps a different TPS range to a sampling ratio, and each sampling ratio is mapped to a rotation range. For example, if the sampling ratio is 10%, the rotation range is 10: the first message is sampled, the next 9 are not, and the 11th message starts the next cycle, where it is treated like the first message and is sampled. After each sampling, the sampling gradient is obtained in real time according to the TPS value, and the corresponding sampling ratio and cycle are obtained from the sampling gradient. Call-chain entry messages are retained whether or not they are sampled, but each is marked to indicate whether it is a sample. The data format of a marked call-chain entry message is (service interface name identification, call-chain ID, event time, whether it is a sample). Marked messages are temporarily stored in the sampling comparison table, where a call-chain entry message can be set to expire after 60 seconds.
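The rotation-range sampling above can be sketched as follows. The gradient values in `SAMPLING_GRADIENT`, the class name `RotationSampler` and the choice to keep one rotation counter per service interface are illustrative assumptions; only the 10% ratio mapping to a rotation range of 10 comes from the text.

```python
from typing import Dict, List, Tuple

# Hypothetical gradient: (upper TPS bound, sampling ratio). Values are illustrative.
SAMPLING_GRADIENT: List[Tuple[float, float]] = [
    (100.0, 1.0),
    (1000.0, 0.1),
    (float("inf"), 0.01),
]

def ratio_for_tps(tps: float, gradient: List[Tuple[float, float]] = SAMPLING_GRADIENT) -> float:
    """Return the sampling ratio of the first gradient level whose TPS bound covers `tps`."""
    for upper_bound, ratio in gradient:
        if tps <= upper_bound:
            return ratio
    return gradient[-1][1]

class RotationSampler:
    """Marks the first entry message of every rotation cycle as a sample."""

    def __init__(self) -> None:
        self._position: Dict[str, int] = {}  # interface name -> position in current cycle

    def mark(self, interface_name: str, tps: float) -> bool:
        # A ratio of 10% maps to a rotation range of 10, as in the text's example.
        rotation = max(1, round(1 / ratio_for_tps(tps)))
        position = self._position.get(interface_name, 0)
        if position >= rotation:
            position = 0  # the rotation range shrank after a gradient change
        is_sample = position == 0
        self._position[interface_name] = (position + 1) % rotation
        return is_sample
```

With a TPS of 500 the assumed gradient yields a ratio of 10%, so calls 1, 11 and 21 for the same interface are marked as samples and the rest are not.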

[0080] All unexpired call-chain messages are then sampled based on the sampling comparison table. That is, after a call-chain message is received, the sampling comparison table is checked according to the call-chain ID and event time. If the call-chain entry message corresponding to the call-chain message is marked as a sample in the sampling comparison table, the call-chain message and the corresponding call-chain entry message are stored together as the same group of data; if it is marked as a non-sample, the call-chain message is discarded; if the sampling comparison table has no call-chain entry message corresponding to the call-chain message, the call-chain message is cached for 60 seconds, and the sampling comparison table is checked again for a corresponding call-chain entry message as it is updated within those 60 seconds. If the corresponding call-chain entry message is still not found after 60 seconds, the call-chain message is discarded. This sampling method sets multi-level sampling gradients, so the sampling ratio can be adjusted according to the amount of data, which helps cope flexibly with different data volumes and avoids system instability caused by a surge in the number of messages. It also performs balanced sampling according to the TPS of each business line (call link), to ensure that data can be sampled for every business line.
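The three-way decision against the sampling comparison table (keep, discard, or cache for up to 60 seconds) can be sketched as below. The class and method names are hypothetical, and for brevity the lookup uses the call-chain ID alone rather than call-chain ID plus event time.

```python
from typing import Any, Dict, List, Tuple

KEEP, DISCARD, CACHE = "keep", "discard", "cache"

class SamplingTable:
    """Simplified sampling comparison table keyed by call-chain ID."""

    def __init__(self, ttl_seconds: float = 60.0) -> None:
        self.ttl = ttl_seconds
        self._entries: Dict[str, Tuple[float, bool]] = {}  # chain_id -> (stored_at, is_sample)
        self._pending: List[Tuple[float, str, Any]] = []   # (cached_at, chain_id, message)

    def add_entry(self, chain_id: str, stored_at: float, is_sample: bool) -> None:
        self._entries[chain_id] = (stored_at, is_sample)

    def expire_entries(self, now: float) -> None:
        """Drop entry messages older than the table's time to live."""
        self._entries = {
            cid: e for cid, e in self._entries.items() if now - e[0] < self.ttl
        }

    def route(self, chain_id: str, message: Any, now: float) -> str:
        """Decide what to do with a newly received call-chain message."""
        entry = self._entries.get(chain_id)
        if entry is None:
            self._pending.append((now, chain_id, message))
            return CACHE
        return KEEP if entry[1] else DISCARD

    def retry_pending(self, now: float) -> List[Any]:
        """Re-check cached messages; return those whose entry is now marked as a sample."""
        kept: List[Any] = []
        still_waiting: List[Tuple[float, str, Any]] = []
        for cached_at, chain_id, message in self._pending:
            entry = self._entries.get(chain_id)
            if entry is not None:
                if entry[1]:
                    kept.append(message)
            elif now - cached_at < self.ttl:
                still_waiting.append((cached_at, chain_id, message))
            # otherwise the message waited the full time to live and is dropped
        self._pending = still_waiting
        return kept
```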

[0081] In actual sampling, sampled call-chain messages can be grouped and cached according to "minutes" (the absolute number of minutes converted from the event time, counted from 0:00 on January 1, 1970) and "call-chain ID". When each call-chain message arrives, it is judged whether all the messages of the current call-chain have arrived. If they have, the call-chain is assembled and the actual time consumption of each node in the call-chain, that is, the time consumption of each service interface in the call-chain, is obtained at the same time. Finally, the failure service interface is filtered out according to the time consumption comparison table. The method for filtering out the failure service interface according to the time consumption comparison table includes:

[0082] Pre-calculating average time consumption corresponding to service interface, storing service interface name identification of the service interface and the average time in the time consumption comparison table with one-to-one correspondence;

[0083] Obtaining service interface in call-chain, wherein the service interface takes more time than the average time in the time consumption comparison table, storing the service interface as alternative failure service interface;

[0084] Calculating failure credibility of alternative failure service interface according to actual time and correspondingly average time of the alternative failure service interface, filtering out alternative failure service interface with greater failure credibility than pre-set credibility threshold and storing as failure service interface.

[0085] In a specific implementation, the time consumption comparison table can be generated each time the system for locating the root cause of the failure is started, or a locally existing time consumption comparison table can be imported directly. For example, after the system is started and call-chain information is received, the average time consumption corresponding to each service interface is calculated and the time consumption comparison table is updated in real time. When the time consumption data collected for a service interface in the comparison table reaches a credible sample size, that service interface is marked in the table as usable for failure judgement. The updated, completed time consumption comparison table can be stored locally and imported directly when the system is restarted.
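A time consumption comparison table that is updated in real time and only used once enough samples exist might be sketched as follows. The class name, the incremental-mean update and the default credible sample size of 360 are assumptions; the text does not say how the average is maintained or what size counts as credible.

```python
from typing import Dict, Optional, Tuple

class TimeConsumptionTable:
    """Running average of time consumption per service interface."""

    def __init__(self, credible_size: int = 360) -> None:  # assumed default
        self.credible_size = credible_size
        self._stats: Dict[str, Tuple[int, float]] = {}  # name -> (sample count, mean)

    def update(self, interface_name: str, elapsed_ms: float) -> None:
        """Fold one observed time consumption into the running mean."""
        count, mean = self._stats.get(interface_name, (0, 0.0))
        count += 1
        mean += (elapsed_ms - mean) / count
        self._stats[interface_name] = (count, mean)

    def average(self, interface_name: str) -> Optional[float]:
        """Return the average only once the credible sample size is reached."""
        count, mean = self._stats.get(interface_name, (0, 0.0))
        return mean if count >= self.credible_size else None
```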

[0086] For the assembled call-chain, the nodes (service interfaces) that have failed in the call-chain are analyzed according to the time consumption comparison table. That is, a service interface in the call-chain whose actual time consumption exceeds the average time in the time consumption comparison table is stored as an alternative failure service interface. At the same time, the failure credibility of the alternative failure service interface is calculated from its actual time and its corresponding average time, and an alternative failure service interface whose failure credibility is greater than the pre-set credibility threshold is stored as a failure service interface. The failure service interface can be stored together with the entire call-chain in which it is located, so that network element indicators can be investigated upstream and downstream. Here, failure credibility = actual time consumption / recent average time consumption. The recent average time consumption can be calculated over a specific sample size, for example 360 time consumption samples. The larger the failure credibility value, the higher the credibility. The credibility threshold can be set to 1.8 and adjusted according to the actual business.
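The failure credibility filter can be written directly from the formula above (failure credibility = actual time consumption / recent average time consumption, threshold 1.8). The function name and the input layout, a list of (interface name, actual milliseconds) pairs plus a name-to-average mapping, are assumptions for illustration.

```python
from typing import Dict, List, Tuple

CREDIBILITY_THRESHOLD = 1.8  # example value given in the text; adjustable per business

def find_failure_interfaces(
    chain: List[Tuple[str, float]],
    average_table: Dict[str, float],
    threshold: float = CREDIBILITY_THRESHOLD,
) -> List[Tuple[str, float]]:
    """Return (interface name, failure credibility) for each failure service interface."""
    failures = []
    for name, actual in chain:
        average = average_table.get(name)
        if average is None or average <= 0:
            continue  # no usable baseline for this interface
        if actual <= average:
            continue  # not even an alternative failure service interface
        credibility = actual / average
        if credibility > threshold:
            failures.append((name, credibility))
    return failures
```

For a chain `[("a", 100), ("b", 300), ("c", 120)]` with every average equal to 100, only `b` is reported, with credibility 3.0; `c` exceeds its average but its credibility of 1.2 stays below the threshold.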

[0087] Please refer to Figure 2. In an implementation of the present invention, a method for locating root cause of business system failure is provided. Since network element indicators are collected at a fixed frequency and there is no sudden surge, the full amount of data is analyzed directly without sampling. The method of filtering out the failure network element indicator comprises:

[0088] Pre-calculating sample mean and stability of network element, and correspondingly storing indicator name identification, stability and sample mean of network element in stability comparison table;

[0089] Obtaining and parsing network element indicator messages, obtaining indicator name identification and current value of the network element indicator;

[0090] Searching for correspondingly sample mean and stability from the stability comparison table according to the indicator name identification;

[0091] According to the current value of network indicator and the correspondingly sample mean and stability, calculating first variant range of network element indicator, storing the network element indicator with greater first variant range than pre-set first variant threshold as failure network element indicator.

[0092] In a specific implementation, the indicator name identification and indicator value of each network element indicator are stored in a sliding matrix. Each column of the sliding matrix corresponds to a minute span and each row corresponds to the indicator values identified by one indicator name identification; or each row corresponds to a minute span and each column corresponds to the indicator values identified by one indicator name identification. When new network element indicator information arrives, it is added to the sliding matrix. The default width of the sliding matrix is 128. Each row corresponds to the combination of the network element's indicator name identification and its indicator values, and each column corresponds to a 1-minute span by default; in other words, a row of data stores the indicator values of a certain network element indicator name over 128 minutes. The minute span can be adjusted for different sampling frequencies to generate different sliding matrices, and network element indicators with the same sampling frequency are placed in the same sliding matrix. The sliding matrix periodically clears expired data and initializes new data. If the width of the sliding matrix is 128 and each column corresponds to a 1-minute span by default, the data is valid for 128 minutes.
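One way to realize such a sliding matrix is a fixed-width ring buffer per indicator row, sketched below. The ring-buffer layout, the tagging of each cell with its time bucket (so stale cells are ignored instead of being cleared by a periodic task) and the method names are assumptions; the text specifies only the width of 128 and the 1-minute column span.

```python
from typing import Dict, List, Optional, Tuple

class SlidingMatrix:
    """Rows are indicator names; columns are consecutive time spans."""

    def __init__(self, width: int = 128, span_seconds: int = 60) -> None:
        self.width = width
        self.span = span_seconds
        # Each cell holds (bucket number, value) so overwritten or stale data is detectable.
        self._rows: Dict[str, List[Optional[Tuple[int, float]]]] = {}

    def _bucket(self, event_time: float) -> int:
        return int(event_time // self.span)

    def put(self, indicator_name: str, event_time: float, value: float) -> None:
        bucket = self._bucket(event_time)
        row = self._rows.setdefault(indicator_name, [None] * self.width)
        row[bucket % self.width] = (bucket, value)

    def recent(self, indicator_name: str, now: float, count: int) -> List[float]:
        """Values from up to `count` spans before the current one, oldest first."""
        row = self._rows.get(indicator_name)
        if row is None:
            return []
        current = self._bucket(now)
        values = []
        for bucket in range(current - count, current):
            cell = row[bucket % self.width]
            if cell is not None and cell[0] == bucket:
                values.append(cell[1])
        return values
```

A separate `SlidingMatrix` instance with a different `span_seconds` would hold indicators collected at a different sampling frequency.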

[0093] Specifically, when the calculation engine filters out the failure network element indicators, the method of pre-calculating the sample mean and stability of a network element indicator comprises:

[0094] Collecting x network element indicator samples;

[0095] Calculating the mean value of the network element indicator samples and the second variant range of each network element indicator sample, and counting the quantity n of failure network element indicator samples, namely those whose second variant range is greater than the pre-set second variant threshold;

[0096] Calculating the stability m of the network element indicator, wherein m = (x − n)/x.

[0097] For example, the network element indicator s has 360 continuous sample indicator values si, where 1 ≤ i ≤ 360, and the average of the sample indicator values is v. The second variant range is set as ai = |si − v|/v and the second variant range threshold is set to 0.25. When ai > 0.25, the network element indicator s is considered to have actually changed. If the number of actual changes is n, then the stability m of the network element indicator s is m = (360 − n)/360.
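
The stability calculation in the example above can be sketched as follows. The function name `sample_mean_and_stability` is hypothetical, and the sketch assumes a non-empty sample list with a non-zero mean.

```python
def sample_mean_and_stability(samples, threshold=0.25):
    """Return (sample mean v, stability m) for one network element indicator.

    A sample counts as an actual change when its relative deviation from
    the mean, |s_i - v| / v, exceeds `threshold`.
    """
    x = len(samples)
    v = sum(samples) / x
    n = sum(1 for s in samples if abs(s - v) / v > threshold)
    return v, (x - n) / x


# Nine steady samples and one spike: only the spike deviates by more than 25%.
v, m = sample_mean_and_stability([10] * 9 + [20])
```

With these ten samples the mean is 11, one sample exceeds the threshold, and the stability is 9/10.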

[0098] In addition, the method by which the calculation engine filters out failure network element indicators further comprises:

[0099] Obtaining a pre-set quantity of network element indicator values preceding the current network element indicator value as the recent network element indicator values;

[0100] Calculating recent mean value according to the recent network element indicator value;

[0101] Calculating recent variant range of network element indicator according to the recent mean value and the current value of network element indicator;

[0102] Calculating sample variant range of network element indicator according to the sample mean and the current value of network element indicator; and

[0103] Performing a weighted calculation on the recent variant range and the sample variant range of the network element indicator, and combining the result with the stability of the network element indicator to obtain the first variant range of the network element indicator.

[0104] For example, let the first variant range of the network element indicator be q, the current value be s1, the sample mean be v, the recent mean be k, and the stability be m; then

[0105] q = (|s1 − k|/k × 0.4 + |s1 − v|/v × 0.6) × m

[0106] In specific implementation, 5 to 10 recent samples can be taken to calculate the recent sample mean k. The values 0.4 and 0.6 are respectively the weights of the recent variant range and the sample variant range in the calculation of the first variant range, and they can be adjusted according to the specific situation. Finally, whether the network element indicator is a failure network element indicator is determined by whether the first variant range exceeds the first variant range threshold. The first variant range threshold can be set to 0.45, and this value can also be adjusted according to specific conditions.
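
The weighted calculation can be sketched as below. This assumes that the recent and sample variant ranges take the same relative-deviation form as the second variant range ai = |si − v|/v used in the stability example; the function names and parameter names are hypothetical.

```python
def first_variant_range(current, sample_mean, recent_values, stability,
                        w_recent=0.4, w_sample=0.6):
    """Weighted first variant range q of one network element indicator."""
    k = sum(recent_values) / len(recent_values)             # recent mean
    recent_range = abs(current - k) / k                     # assumed form
    sample_range = abs(current - sample_mean) / sample_mean # assumed form
    return (recent_range * w_recent + sample_range * w_sample) * stability


def is_failure_indicator(q, threshold=0.45):
    """An indicator is a failure indicator when q exceeds the threshold."""
    return q > threshold


# A value that doubles against both means, on a fully stable indicator.
q = first_variant_range(current=200, sample_mean=100,
                        recent_values=[100] * 5, stability=1.0)
```

Multiplying by the stability m scales q down for indicators that fluctuate often in normal operation, so the same absolute deviation is less likely to cross the threshold for an unstable indicator.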

[0107] Finally, the calculation engine correlates the failure service interface with the failure network element indicator according to time and issues a failure warning. The failure warning comprises the failure service interface and the failure network element indicator, wherein the failure network element indicator causes the failure service interface to fail. The failure service interfaces and the failure network element indicators can be stored in cache media such as Redis; for example, Redis stores the failure service interfaces and failure network element indicators that occurred within the last 15 minutes, for the failure correlation analysis to call.
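
A minimal sketch of time-based correlation follows. It assumes that failures are bucketed by minute and that an interface failure is paired with the indicator failures recorded in the same minute. An in-memory dictionary stands in for the Redis cache, and the class name `FailureCache` and its methods are hypothetical.

```python
from collections import defaultdict

RETENTION_MINUTES = 15  # retention window taken from the description


class FailureCache:
    """In-memory stand-in for the cache of recent failures."""

    def __init__(self):
        self.interfaces = defaultdict(set)  # minute -> failed interface names
        self.indicators = defaultdict(set)  # minute -> failed indicator names

    def add_interface(self, timestamp, name):
        self.interfaces[int(timestamp // 60)].add(name)

    def add_indicator(self, timestamp, name):
        self.indicators[int(timestamp // 60)].add(name)

    def expire(self, now):
        """Drop every minute bucket older than the retention window."""
        cutoff = int(now // 60) - RETENTION_MINUTES
        for table in (self.interfaces, self.indicators):
            for minute in [m for m in table if m <= cutoff]:
                del table[minute]

    def correlate(self):
        """Map (minute, interface) to the indicators that failed that minute."""
        return {
            (minute, iface): sorted(self.indicators.get(minute, ()))
            for minute, ifaces in self.interfaces.items()
            for iface in ifaces
        }
```

Ranking the candidate indicators by their probability of causing the interface failure is left out of this sketch, since the ranking criterion is not detailed here.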

[0108] In specific implementation, the service interface failures of the call-chain are associated with the network element indicator failures according to the failure "minute" and the correlation of network elements, and the network element indicator failures are ranked by their probability of having caused the service interface failure. When the number of occurrences of a service interface failure within a minute exceeds the threshold, a failure warning is issued. If the number of service interface failures increases in the subsequent time, the service interface failure is determined to be continuing and a failure warning is issued; meanwhile, the determined service interface failure and the network element indicator failure that caused it are stored together for later analysis. If the number of service interface failures does not change in the subsequent time, or the failure disappears in the next minute, the failure warning is cancelled. Examples of failure warnings:
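
One possible reading of this warning lifecycle is sketched below. It assumes one failure count per minute for a single service interface, and it treats "increases" as a count strictly greater than that of the previous minute. The function name `warning_events` and the event labels are hypothetical.

```python
def warning_events(counts_per_minute, threshold):
    """Return one event per minute: 'warning', 'dismiss', or None."""
    events = []
    active = False   # whether a warning is currently outstanding
    previous = 0     # failure count of the previous minute
    for count in counts_per_minute:
        if not active:
            if count > threshold:
                active, event = True, "warning"    # initial announcement
            else:
                event = None
        elif count > previous:
            event = "warning"                      # the failure continues
        else:
            active, event = False, "dismiss"       # unchanged or gone
        events.append(event)
        previous = count
    return events


events = warning_events([0, 5, 8, 8, 0], threshold=3)
```

For the counts above, the sequence is no event, an initial warning, a continued warning, a dismissal when the count stops growing, and then no event.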

[0109] 1. Content of the initial failure announcement.

[0110] Failure[warning], No.AAAA-BBBB-CCCC-NNNNNNN: the service interface i on the network element A fails, and the service time exceeds the standard value by 80%; the standard value is Xa and the current average time consumption is xa. The possible causes are ranked: 1. The indicator i on the network element B is abnormal; 2. The indicator j on the network element A is abnormal.

[0111] 2. The failure continues after 1 minute.

[0112] Failure[warning], No.AAAA-BBBB-CCCC-NNNNNNN: the service interface i on the network element A fails, and the service time exceeds the standard value by 80%; the standard value is Xa and the current average time consumption is xa. The possible causes are ranked: 1. The indicator i on the network element B is abnormal; 2. The indicator j on the network element A is abnormal.

[0113] 3. The failure disappears after 1 minute.

[0114] Failure[dismiss], No.AAAA-BBBB-CCCC-NNNNNNN: the service interface i on the network element A fails, and the service time exceeds the standard value by 80%; the standard value is Xa and the current average time consumption is xa. The possible causes are ranked: 1. The indicator i on the network element B is abnormal; 2. The indicator j on the network element A is abnormal.

[0115] In addition, when a service interface deployed by the network element is called, it may also generate an exception stack message, which is stored in the exception stack log. The exception stack log is likewise pushed through the log collection module to a distributed publish and subscribe message system such as Kafka. The calculation engine obtains the exception stack message from the distributed publish and subscribe message system, parses it to obtain the call-chain ID, service ID, and exception stack data, and stores them in a distributed document database, for example ES. When a user investigates a failure according to the failure warning, the corresponding index in ES is looked up according to the date information, and the exception stack can be queried and displayed through the call-chain ID, service ID, and so on. The specific type of the exception and the line of code involved can be located in the exception stack.
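
The parse, store, and query flow can be sketched as follows. The exception stack message is assumed to be a JSON object with the hypothetical fields `date`, `chain_id`, `service_id`, and `stack`; the actual message format is not specified here. An in-memory dictionary keyed by date stands in for the distributed document database.

```python
import json
from collections import defaultdict


class ExceptionStore:
    """In-memory stand-in for a date-partitioned document store."""

    def __init__(self):
        self.by_date = defaultdict(list)  # "YYYY-MM-DD" -> list of records

    def ingest(self, raw_message):
        """Parse one exception stack message and store it under its date."""
        doc = json.loads(raw_message)
        record = {key: doc[key] for key in ("chain_id", "service_id", "stack")}
        self.by_date[doc["date"]].append(record)

    def query(self, date, chain_id, service_id=None):
        """Return the stacks recorded on `date` for the given call-chain."""
        return [
            r["stack"] for r in self.by_date.get(date, [])
            if r["chain_id"] == chain_id
            and (service_id is None or r["service_id"] == service_id)
        ]


store = ExceptionStore()
store.ingest(json.dumps({
    "date": "2021-11-30", "chain_id": "c1", "service_id": "s1",
    "stack": "ValueError at handler.py:42",
}))
stacks = store.query("2021-11-30", "c1")
```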

[0116] The method for locating root cause of business system failure provided in this implementation can adapt to new business scenarios without human intervention: after self-learning is completed, reference values such as the time consumption comparison table are obtained, and failures are then located. In addition, in the process of locating the root cause of a business system failure, the failure sensitivity can be adjusted, either to provide early warning or, by reducing the sensitivity, to improve the accuracy of the early warning. This also effectively reduces the complexity of troubleshooting for the operation and maintenance personnel and the person in charge of the system, saving time and labor costs.

[0117] Implementation two

[0118] As shown in Figure 3, this implementation of the present invention provides a system for locating root cause of business system failure. The system comprises a calculation engine, a distributed publish and subscribe message system, and a network element. The calculation engine adopts Flink and comprises a message reading module, a call-chain processing module, a network element indicator processing module, and a failure warning module. The distributed publish and subscribe message system adopts Kafka and is configured to store network element indicators and the call-chain messages generated when a service interface deployed by the network element is called. The message reading module is configured to read call-chain messages and network element indicator messages from the distributed publish and subscribe message system. The call-chain processing module is configured to sample the call-chain messages according to the sampling ratio, group and assemble them to obtain the call-chain, obtain the time consumption of each service interface in the call-chain, and filter out the failure service interface according to the time consumption comparison table. The network element indicator processing module is configured to calculate the first variation range of the network element indicator and filter out the failure network element indicator according to the first variation range threshold value. The failure warning module is configured to correlate the failure service interface with the failure network element indicator according to time and issue a failure warning; the failure warning comprises the failure service interface and the failure network element indicator, wherein the failure network element indicator causes the failure service interface to fail.

[0119] The system for locating root cause of business system failure provided by the present invention uses the method of locating root cause of business system failure in the above-mentioned first implementation. By combining the call-chain log and network element indicator data, the root cause of a business system failure can be quickly located and maintenance costs are reduced. Compared with the prior art, the beneficial effects of the system provided by this implementation are the same as those of the method provided in the first implementation, and the other technical features of the system are the same as the features disclosed for the method in the previous implementation, so they are not repeated here.

[0120] In the above-mentioned descriptions of implementation methods, specific features, structures, materials or characteristics can be combined in any one or more implementations or examples in a suitable way.

[0121] The above are only specific implementations of the present invention, and the protection scope of the present invention is not limited thereto. Any changes or substitutions that a person skilled in the art can easily think of within the technical scope disclosed by the present invention should be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.

Claims

1. A method for locating root cause of business system failure, comprising:
a calculation engine obtaining call-chain messages, sampling the call-chain messages according to a sampling ratio, grouping and assembling them to obtain the call-chain, obtaining the time consumption of each service interface in the call-chain, and filtering out the failure service interface according to a time consumption comparison table;
the calculation engine obtaining network element indicator messages, calculating a first variation range of the network element indicator, and filtering out the failure network element indicator according to a first variation range threshold value; and
the calculation engine correlating the failure service interface with the failure network element indicator according to time and issuing a failure warning, the failure warning comprising the failure service interface and the failure network element indicator, wherein the failure network element indicator causes the failure service interface to fail.

2. The method for locating root cause of business system failure according to claim 1, wherein the method by which the calculation engine obtains call-chain messages and network element indicator messages comprises:
the network element periodically generating network element indicator messages and storing them in an indicator log file;
generating call-chain messages when a service interface deployed by the network element is called, and storing them in a call-chain unit log file;
the network element using a log file collection module to push newly added network element indicator messages of the indicator log file and newly added call-chain messages of the call-chain unit log file to a distributed publish and subscribe message system; and
the calculation engine reading call-chain messages and network element indicator messages from the distributed publish and subscribe message system.

3. The method for locating root cause of business system failure according to claim 1, wherein the method by which the calculation engine filters out the failure service interface comprises:
obtaining and parsing the call-chain messages to get the call-chain ID, service interface name identification and event time corresponding to each call-chain message, and meanwhile obtaining the call-chain entry message;
configuring a sampling ratio of the call-chain messages, and obtaining call-chain entry message samples according to the sampling ratio;
filtering out, according to call-chain ID and event time, the call-chain messages belonging to the same group as the call-chain entry message sample, and assembling all call-chain messages of the same group to get call-chain information; and
obtaining the time consumption of each service interface in the call-chain and filtering out the failure service interface according to the time consumption comparison table.

4. The method for locating root cause of business system failure according to claim 3, wherein the method of parsing the call-chain messages to get the call-chain entry message comprises:
setting an expiration time of the call-chain messages, filtering and parsing unexpired call-chain messages according to the expiration time, and obtaining the call-chain ID, service interface name identification and event time of the call-chain messages; and
filtering out the call-chain entry message in the call-chain messages according to call-chain ID and service interface name identification.

5. The method for locating root cause of business system failure according to claim 3, wherein the method of filtering out the failure service interface according to the time consumption comparison table comprises:
pre-calculating the average time consumption corresponding to each service interface, and storing the service interface name identification of the service interface and the average time in the time consumption comparison table in one-to-one correspondence;
obtaining each service interface in the call-chain that takes more time than the average time in the time consumption comparison table, and storing the service interface as an alternative failure service interface; and
calculating the failure credibility of each alternative failure service interface according to its actual time and corresponding average time, filtering out the alternative failure service interfaces whose failure credibility is greater than a pre-set credibility threshold, and storing them as failure service interfaces.

6. The method for locating root cause of business system failure according to claim 1, wherein the method by which the calculation engine filters out the failure network element indicator comprises:
pre-calculating the sample mean and stability of the network element indicator, and correspondingly storing the indicator name identification, stability and sample mean in a stability comparison table;
obtaining and parsing network element indicator messages, and obtaining the indicator name identification and current value of the network element indicator;
searching the stability comparison table for the corresponding sample mean and stability according to the indicator name identification; and
calculating the first variant range of the network element indicator according to the current value of the network element indicator and the corresponding sample mean and stability, and storing any network element indicator whose first variant range is greater than a pre-set first variant threshold as a failure network element indicator.

7. The method for locating root cause of business system failure according to claim 6, wherein the method of pre-calculating the sample mean and stability of the network element indicator comprises:
collecting x network element indicator samples;
calculating the mean value of the network element indicator samples and the second variant range of each network element indicator sample, and counting the quantity n of failure network element indicator samples whose second variant range is greater than a pre-set second variant threshold; and
calculating the stability m of the network element indicator, wherein m = (x − n)/x.

8. The method for locating root cause of business system failure according to claim 6 or 7, wherein the method by which the calculation engine filters out the failure network element indicator further comprises:
obtaining a pre-set quantity of network element indicator values preceding the current network element indicator value as the recent network element indicator values;
calculating a recent mean value according to the recent network element indicator values;
calculating a recent variant range of the network element indicator according to the recent mean value and the current value of the network element indicator;
calculating a sample variant range of the network element indicator according to the sample mean and the current value of the network element indicator; and
performing a weighted calculation on the recent variant range and the sample variant range of the network element indicator, and combining the result with the stability of the network element indicator to obtain the first variant range of the network element indicator.

9. The method for locating root cause of business system failure according to claim 6, wherein, indicator name identification and indicator value of network element indicator are stored in a sliding matrix based on time, each column of the sliding matrix corresponds to a minute span, each row corresponds to an indicator value identified by an indicator name identification; or each row corresponds to a minute span, each column corresponds to an indicator value identified by an indicator name identification.

10. A system for locating root cause of business system failure, comprising a calculation engine, a distributed publish and subscribe message system and a network element, the calculation engine comprising a message reading module, a call-chain processing module, a network element indicator processing module, and a failure warning module, wherein:
the distributed publish and subscribe message system is configured to store network element indicators and the call-chain messages generated when a service interface deployed by the network element is called;
the message reading module is configured to read call-chain messages and network element indicator messages from the distributed publish and subscribe message system;
the call-chain processing module is configured to sample the call-chain messages according to a sampling ratio, group and assemble them to obtain the call-chain, obtain the time consumption of each service interface in the call-chain, and filter out the failure service interface according to a time consumption comparison table;
the network element indicator processing module is configured to calculate a first variation range of the network element indicator and filter out the failure network element indicator according to a first variation range threshold value; and
the failure warning module is configured to correlate the failure service interface with the failure network element indicator according to time and issue a failure warning, the failure warning comprising the failure service interface and the failure network element indicator, wherein the failure network element indicator causes the failure service interface to fail.