CN117170914A

CN117170914A - Fault locating method, device, computer system and readable storage medium

Info

Publication number: CN117170914A
Application number: CN202210595963.5A
Authority: CN
Inventors: 宋佳慧; 庄晓天; 吴盛楠
Original assignee: Beijing Jingdong Zhenshi Information Technology Co Ltd
Current assignee: Beijing Jingdong Zhenshi Information Technology Co Ltd
Priority date: 2022-05-26
Filing date: 2022-05-26
Publication date: 2023-12-05

Abstract

The present disclosure provides a fault localization method, comprising: determining a fault chain when a service abnormality is detected, wherein the fault chain comprises a plurality of applications having call relationships, and each application has a corresponding service device; determining, from the plurality of applications, a fault application in the fault chain that causes the service abnormality; acquiring service index feature data of candidate service devices associated with the fault application; acquiring fault feature data of historical fault service devices associated with historical fault applications; and performing density cluster analysis on the service index feature data of the candidate service devices and the fault feature data of the historical fault service devices through a density clustering model, and determining, from the candidate service devices, the fault service device causing the abnormal service. The disclosure also provides a fault locating device, a computer system and a readable storage medium.

Description

Fault locating method, device, computer system and readable storage medium

Technical Field

The present disclosure relates to the fields of computer technology, internet technology, and electronic technology, and more particularly, to a fault locating method, apparatus, computer system, and readable storage medium.

Background

With the rapid development of artificial intelligence, automatic control, communication, and computer technology, maintaining the safety of network production has become an important task of network operation and maintenance. As service traffic grows, service logic becomes more complex, and a system accesses more and more dependent services, a system failure sharply increases the volume of error logs and the number of applications reporting errors, and may temporarily interrupt the network, thereby affecting the normal operation of applications and service devices.

In the process of realizing the concept of the present disclosure, the inventors found that, in the related art, operation and maintenance personnel need to rely on historical experience to quickly locate which application and which service device has failed, and then troubleshoot the fault to ensure the normal operation of the application and the service device.

However, for services and applications with complex call relationships, multiple applications often report errors at the same time. Faced with a large number of error reports, operation and maintenance personnel cannot locate the fault application in time based on their historical experience alone, which increases both the difficulty of troubleshooting and the pressure on the personnel.

Disclosure of Invention

In view of this, the present disclosure provides a fault localization method, apparatus, computer system, readable storage medium, and program product.

One aspect of the present disclosure provides a fault localization method, including: determining a fault chain when a service abnormality is detected, wherein the fault chain comprises a plurality of applications having call relationships, and each application has a corresponding service device; determining, from the plurality of applications, a fault application in the fault chain that causes the service abnormality; acquiring service index feature data of candidate service devices associated with the fault application; acquiring fault feature data of historical fault service devices associated with historical fault applications; and performing density cluster analysis on the service index feature data of the candidate service devices and the fault feature data of the historical fault service devices through a density clustering model, and determining, from the candidate service devices, the fault service device causing the abnormal service.

According to an embodiment of the present disclosure, acquiring fault feature data of a historical fault service device associated with a historical fault application includes: for a service index of the historical fault service device, analyzing the data corresponding to the service index in time sequence, and determining the start time of the fault; acquiring, according to the start time of the fault, data corresponding to the normal service index in a first time period and data corresponding to the abnormal service index in a second time period, wherein the first time period includes a period before the start time of the fault in which the historical fault service device operates normally, and the second time period includes a period after the start time of the fault in which the historical fault service device operates abnormally; determining, according to the data set corresponding to the normal service index and the data corresponding to the abnormal service index, a probability value of the data corresponding to the abnormal service index relative to the data set collected while the historical fault service device operated normally; and determining the probability value of the data corresponding to the abnormal service index as the fault feature data of the historical fault service device.

According to an embodiment of the present disclosure, analyzing the data corresponding to the service index of the historical fault service device in time sequence and determining the start time of the fault includes: performing a first-order difference operation on the data corresponding to the service index to obtain a first-order difference result; and performing time-series anomaly detection on the first-order difference result to determine the start time of the fault.
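The first-order difference step can be illustrated with a minimal sketch. The disclosure does not name a specific time-series anomaly detector, so the k-sigma rule used here, the function names, the `baseline_len` parameter, and the sample CPU readings are all assumptions made for illustration only.

```python
from statistics import mean, pstdev


def first_order_difference(values):
    """Return the first-order difference d[t] = x[t + 1] - x[t]."""
    return [b - a for a, b in zip(values, values[1:])]


def fault_start_index(values, baseline_len, k=3.0):
    """Return the index in `values` where the fault is taken to start.

    Assumption: the first `baseline_len` differences describe normal
    behaviour, and a difference further than k standard deviations from
    their mean marks the start of the fault. Returns None if no
    difference exceeds that bound.
    """
    diffs = first_order_difference(values)
    baseline = diffs[:baseline_len]
    mu = mean(baseline)
    sigma = pstdev(baseline) or 1e-9  # avoid a zero-width bound on flat data
    for t in range(baseline_len, len(diffs)):
        if abs(diffs[t] - mu) > k * sigma:
            return t + 1  # diffs[t] is the jump into values[t + 1]
    return None


# Hypothetical CPU-usage samples with a sudden jump at index 8.
cpu_usage = [30, 31, 30, 29, 30, 31, 30, 30, 75, 78, 80, 79]
print(fault_start_index(cpu_usage, baseline_len=6))  # 8
```

Differencing removes the slowly varying level of the indicator, so a simple bound on the differences is enough to expose an abrupt change; a production detector would likely be more robust than this rule.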

According to an embodiment of the present disclosure, performing density cluster analysis on the service index feature data of the candidate service devices and the fault feature data of the historical fault service devices through the density clustering model, and determining, from the candidate service devices, the fault service device causing the abnormal service includes: inputting the service index feature data of the candidate service devices and the fault feature data of the historical fault service devices into the density clustering model for density clustering to obtain a plurality of clusters; and determining the fault service device causing the abnormal service from the candidate service devices according to the proportions of the service index feature data of the candidate service devices and the fault feature data of the historical fault service devices in each cluster.

According to an embodiment of the present disclosure, determining the fault service device causing the abnormal service from the candidate service devices according to the proportions of the service index feature data of the candidate service devices and the fault feature data of the historical fault service devices in each cluster includes: determining, for each cluster, the feature data with the largest proportion in the cluster according to the proportions of the service index feature data of the candidate service devices and the fault feature data of the historical fault service devices; and, when the feature data with the largest proportion in the cluster is the fault feature data of the historical fault service devices, determining the service device corresponding to the service index feature data of a candidate service device in the same cluster as that fault feature data as the fault service device causing the service abnormality.

According to an embodiment of the present disclosure, determining, from a plurality of applications, a fault application in the fault chain that causes the service abnormality includes: determining, for each of the plurality of applications, an initial contribution value of the application to causing the service abnormality; constructing an initial stationary distribution vector for the plurality of applications in the fault chain according to the initial contribution value of each application; iteratively processing the initial stationary distribution vector to obtain a stationary distribution vector satisfying a predetermined condition; and determining each application whose contribution value in the stationary distribution vector satisfies a threshold as a fault application in the fault chain that causes the service abnormality.
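The proportion rule described above can be sketched as follows. The input layout (parallel lists of cluster labels, sample kinds, and device identifiers), the use of -1 for unclustered samples, and the treatment of ties as "not a historical majority" are assumptions for illustration; the disclosure does not specify them.

```python
from collections import Counter, defaultdict


def locate_fault_devices(labels, kinds, device_ids):
    """Pick candidate devices that sit in clusters dominated by historical faults.

    labels[i]     -- cluster id of sample i (-1 means no cluster; assumption)
    kinds[i]      -- "candidate" or "historical"
    device_ids[i] -- identifier of the device the sample came from
    """
    clusters = defaultdict(list)
    for i, label in enumerate(labels):
        if label != -1:
            clusters[label].append(i)

    fault_devices = set()
    for members in clusters.values():
        counts = Counter(kinds[i] for i in members)
        # Historical fault feature data has the largest proportion in the cluster.
        if counts["historical"] > counts["candidate"]:
            fault_devices.update(
                device_ids[i] for i in members if kinds[i] == "candidate"
            )
    return sorted(fault_devices)


# Hypothetical clustering result: cluster 0 is dominated by historical faults.
labels = [0, 0, 0, 0, 1, 1, 1, -1]
kinds = ["historical", "historical", "historical", "candidate",
         "candidate", "candidate", "historical", "candidate"]
device_ids = ["h1", "h2", "h3", "dev-a", "dev-b", "dev-c", "h4", "dev-d"]
print(locate_fault_devices(labels, kinds, device_ids))  # ['dev-a']
```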

Another aspect of the present disclosure provides a fault locating device, comprising: the first determining module is used for determining a fault chain under the condition that the service abnormality is detected, wherein the fault chain comprises a plurality of applications with calling relations, and each application is provided with corresponding service equipment; the second determining module is used for determining a fault application which causes abnormal service in a fault chain from a plurality of applications; the first acquisition module is used for acquiring service index characteristic data of candidate service equipment associated with the fault application; the second acquisition module is used for acquiring fault characteristic data of the historical fault service equipment associated with the historical fault application; and the analysis module is used for carrying out density cluster analysis on the service index characteristic data of the candidate service equipment and the fault characteristic data of the historical fault service equipment through the density cluster model, and determining the fault service equipment causing abnormal service from the candidate service equipment.

Another aspect of the present disclosure provides a computer system comprising: one or more processors; and a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the methods described above by embodiments of the present disclosure.

Another aspect of the present disclosure provides a computer-readable storage medium storing computer-executable instructions that, when executed, are configured to implement the methods described above in embodiments of the present disclosure.

Another aspect of the present disclosure provides a computer program product comprising computer executable instructions which, when executed, are adapted to carry out the method described above for embodiments of the present disclosure.

According to the embodiments of the present disclosure, a fault application causing a service abnormality is first determined from the plurality of applications in the fault chain. After the fault application is located, density cluster analysis is performed on the service index feature data of the candidate service devices associated with the fault application and the fault feature data of the historical fault service devices associated with historical fault applications, and the fault service device causing the abnormal service is determined from the candidate service devices. Because the application that actually causes the service abnormality is located automatically in the fault chain when the service is abnormal, and the fault service device is then located automatically among the candidate devices associated with that application, the root cause of the fault is ultimately located. This at least partially overcomes the technical problems that arise when the fault application and the fault service device are located by relying on manually accumulated experience, namely inaccurate fault localization, increased troubleshooting difficulty, and delayed fault response, and achieves the technical effects of improved localization accuracy, fast response, and shortened fault localization time.

Drawings

The above and other objects, features and advantages of the present disclosure will become more apparent from the following description of embodiments thereof with reference to the accompanying drawings in which:

FIG. 1 schematically illustrates an exemplary system architecture of fault localization methods and apparatus according to embodiments of the present disclosure;

FIG. 2 schematically illustrates a flow chart of a fault localization method according to an embodiment of the present disclosure;

FIG. 3 schematically illustrates a flowchart of a method of obtaining fault signature data of a historical fault service device associated with a historical fault application according to an embodiment of the disclosure;

FIG. 4 schematically illustrates a block diagram of a fault locating device according to an embodiment of the present disclosure; and

FIG. 5 schematically illustrates a block diagram of an electronic device adapted to implement a fault localization method according to an embodiment of the disclosure.

Detailed Description

Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is only exemplary and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the present disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the concepts of the present disclosure.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.

All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.

Where expressions like "at least one of A, B and C, etc." are used, the expressions should generally be interpreted in accordance with the meaning as commonly understood by those skilled in the art (e.g., "a system having at least one of A, B and C" shall include, but not be limited to, a system having A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.).

The inventors found that when service traffic is small and logic complexity is low, operation and maintenance personnel generally need to locate a problem from the error log when a service abnormality is reported. However, as the business logic iterates, the dependent services accessed by the system keep increasing. When the system fails (for example, a bug is triggered or a dependent task times out), the volume of error logs grows rapidly and more applications report errors, so that operation and maintenance personnel facing a large number of error reports cannot identify and solve the core problem right away. When multiple applications report errors at the same time, the fault chain may be quite complex, and the fault application cannot be located in time among the many error-reporting applications, so the response is delayed and the normal operation of the applications and service devices is affected.

In view of this, in order to improve fault localization accuracy, speed up the response, shorten the fault localization time, and ensure the normal operation of applications and service devices, the scheme of the present disclosure needs to locate the fault application automatically and then, according to the service indexes of the candidate service devices corresponding to the fault application, locate the fault service device automatically, thereby locating the root cause of the fault.

The embodiments of the disclosure provide a fault locating method. The method includes determining a fault chain when a service abnormality is detected, wherein the fault chain comprises a plurality of applications having call relationships, and each application has a corresponding service device. A fault application in the fault chain that causes the service abnormality is determined from the plurality of applications. Service index feature data of candidate service devices associated with the fault application is acquired. Fault feature data of historical fault service devices associated with historical fault applications is acquired. Density cluster analysis is performed on the service index feature data of the candidate service devices and the fault feature data of the historical fault service devices through a density clustering model, and the fault service device causing the abnormal service is determined from the candidate service devices.

Fig. 1 schematically illustrates an exemplary system architecture of fault localization methods and apparatus according to embodiments of the present disclosure.

It should be noted that fig. 1 is only an example of a system architecture to which embodiments of the present disclosure may be applied to assist those skilled in the art in understanding the technical content of the present disclosure, but does not mean that embodiments of the present disclosure may not be used in other devices, systems, environments, or scenarios.

As shown in fig. 1, a system architecture 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired and/or wireless communication links, and the like.

The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications may be installed on the terminal devices 101, 102, 103, such as shopping class applications, web browser applications, search class applications, instant messaging tools, mailbox clients and/or social platform software, to name a few.

The terminal devices 101, 102, 103 may be a variety of electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.

The server 105 may be a server providing various services, such as a background management server (by way of example only) providing support for websites browsed by users using the terminal devices 101, 102, 103. The background management server may analyze and process the received data such as the user request, and feed back the processing result (e.g., the web page, information, or data obtained or generated according to the user request) to the terminal device.

It should be noted that the fault locating method provided by the embodiments of the present disclosure may be generally performed by the server 105. Accordingly, the fault locating device provided by the embodiments of the present disclosure may be generally disposed in the server 105. The fault localization method provided by the embodiments of the present disclosure may also be performed by a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Accordingly, the fault locating device provided by the embodiments of the present disclosure may also be provided in a server or a server cluster different from the server 105 and capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Alternatively, the fault locating method provided by the embodiment of the present disclosure may be performed by the terminal device 101, 102, or 103, or may be performed by another terminal device different from the terminal device 101, 102, or 103. Accordingly, the fault locating device provided by the embodiments of the present disclosure may also be provided in the terminal device 101, 102, or 103, or in another terminal device different from the terminal device 101, 102, or 103.

For example, the failure information about the service abnormality may be originally stored in any one of the terminal devices 101, 102, or 103 (for example, but not limited to, the terminal device 101), or stored on an external storage device and may be imported into the terminal device 101. Then, the terminal device 101 may locally perform the fault locating method provided by the embodiment of the present disclosure, or transmit fault information of a service abnormality to other terminal devices, servers, or server clusters, and perform the fault locating method provided by the embodiment of the present disclosure by the other terminal devices, servers, or server clusters that receive the fault information of the service abnormality.

It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

Fig. 2 schematically illustrates a flow chart of a fault localization method according to an embodiment of the present disclosure.

As shown in fig. 2, the fault location method 200 may include operations S210 to S250.

In operation S210, when a service abnormality is detected, a fault chain is determined, wherein the fault chain includes a plurality of applications having call relationships, and each application has a corresponding service device.

In operation S220, a fault application in the fault chain that causes the service abnormality is determined from among the plurality of applications.

In operation S230, service index feature data of a candidate service device associated with the faulty application is acquired.

In operation S240, fault signature data of a historical fault service device associated with the historical fault application is acquired.

In operation S250, density cluster analysis is performed on the service index feature data of the candidate service devices and the fault feature data of the historical fault service devices through the density clustering model, and the fault service device causing the abnormal service is determined from the candidate service devices.

According to embodiments of the present disclosure, in a large-scale system operation scenario, multiple applications need to interact with one another to accomplish a given task. When a service abnormality is detected, multiple applications may raise alarms at the same time because they affect one another.

According to embodiments of the present disclosure, a fault chain may include a plurality of applications having an inter-call relationship. The plurality of applications having the inter-call relationship may include applications that do not cause abnormal services when an alarm occurs, and may also include applications that actually cause abnormal services.

According to the embodiment of the disclosure, a plurality of applications in the fault chain can be analyzed through a deep learning model so as to determine the fault application causing abnormal service in the fault chain from the plurality of applications, and therefore application level root cause positioning is achieved.

According to the embodiment of the disclosure, according to the interface relation between each application and the service device, the service device corresponding to each application can be determined, wherein each application can correspond to one service device or can correspond to a plurality of service devices. The candidate service device may be a service device associated with the failed application that may cause an abnormal service.

According to an embodiment of the present disclosure, the candidate service device may include a plurality of service indicators, and the service index feature data may be a set of probabilities that each of the plurality of service indicators is abnormal when the candidate service device fails. An abnormality of a service indicator may include a rise or fall of the indicator value.

According to the embodiment of the disclosure, since a failure of a service device generally has a root cause, fault feature data of the fault service devices associated with historical fault applications can be acquired in advance so that the root cause can be located when a service device raises an alarm.

According to an embodiment of the present disclosure, the historical fault service device may also include a plurality of historical service indicators, and the fault feature data may include a set of probabilities, determined from the historical fault service device, that each of the plurality of historical service indicators is faulty. A fault of a historical service indicator may include a rise or fall of the historical service indicator value.
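As a rough illustration of how such a set of probabilities might be built, the sketch below scores each indicator by how probable its abnormal-period data is under the data observed in the normal period. The Gaussian model, the two-sided tail probability, the function names, and the sample readings are all assumptions; the disclosure does not state which probability model is used.

```python
from statistics import NormalDist, mean, pstdev


def indicator_probability(normal_values, abnormal_values):
    """Two-sided tail probability of the abnormal-period mean under a
    Gaussian fitted to the normal-period data (an assumed model)."""
    mu = mean(normal_values)
    sigma = pstdev(normal_values) or 1e-9  # guard against constant data
    deviation = abs(mean(abnormal_values) - mu)
    return 2.0 * (1.0 - NormalDist(mu, sigma).cdf(mu + deviation))


def feature_vector(normal_by_indicator, abnormal_by_indicator):
    """One probability per indicator, in sorted indicator-name order."""
    return [
        indicator_probability(normal_by_indicator[name], abnormal_by_indicator[name])
        for name in sorted(normal_by_indicator)
    ]


# Hypothetical readings for two indicators of one device.
normal = {"cpu": [30, 31, 29, 30, 30, 31, 29, 30], "mem": [60, 61, 59, 60]}
abnormal = {"cpu": [75, 78, 80], "mem": [60, 60, 61]}
print(feature_vector(normal, abnormal))
```

A value close to 0 indicates that the indicator moved far from its normal range after the fault began, while a value close to 1 indicates that it barely changed. The same routine could be applied to candidate service devices, since the text describes both kinds of feature data as sets of per-indicator probabilities.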

According to the embodiment of the disclosure, the distance between the service index feature data of the candidate service devices and the fault feature data of the historical fault service devices can be calculated through a distance formula, and the two sets of feature data are density-clustered to obtain a clustering result. The fault service device causing the abnormal service is then determined from the candidate service devices according to the clustering result, thereby locating the root cause at the service device level.

According to an embodiment of the present disclosure, determining, from a plurality of applications, a fault application in the fault chain that causes the service abnormality may include: for each of the plurality of applications, determining an initial contribution value of the application to causing the service abnormality. An initial stationary distribution vector for the plurality of applications in the fault chain is constructed from the initial contribution value of each application. The initial stationary distribution vector is processed iteratively to obtain a stationary distribution vector satisfying a predetermined condition. Each application whose contribution value in the stationary distribution vector satisfies a threshold is determined as a fault application in the fault chain that causes the service abnormality.
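The clustering step can be sketched with a small pure-Python DBSCAN. The disclosure refers only to a density clustering model and an unspecified distance formula, so the choice of DBSCAN, the Euclidean distance, the `eps` and `min_pts` values, and the sample feature vectors are assumptions for illustration.

```python
import math


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (Euclidean distance)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; -1 marks noise."""
    unvisited, noise = None, -1
    labels = [unvisited] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not unvisited:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = noise
            continue
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not unvisited:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # core point: keep expanding
                queue.extend(j_neighbors)
        cluster += 1
    return labels


# Hypothetical per-indicator probability vectors.
historical = [(0.01, 0.02), (0.02, 0.01), (0.015, 0.03)]
candidates = [(0.02, 0.02), (0.90, 0.85), (0.88, 0.90)]
labels = dbscan(historical + candidates, eps=0.1, min_pts=3)
print(labels)  # [0, 0, 0, 0, -1, -1]
```

In this example the first candidate falls into the same dense region as the historical fault samples, while the other two candidates are left as noise and would not be attributed to a known fault pattern.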

According to the embodiment of the disclosure, the fault application causing the service abnormality in the fault chain can also be determined from a plurality of applications through a preset algorithm. For example, the preset algorithm may be a Page-Rank algorithm, and the fault application causing the service abnormality in the fault chain may be determined by improving the conventional Page-Rank algorithm in the operation and maintenance scenario.

According to embodiments of the present disclosure, a quantity assumption and a quality assumption may be set based on the improved Page-Rank algorithm. The quantity assumption may be characterized as follows: among the multiple applications in the fault chain, the more an application is called by other applications, the more likely it is to be the fault application that causes the abnormal service. The quality assumption may be characterized as follows: among the multiple applications of the fault chain, the greater the number of times an application is called, the greater its contribution to causing the abnormal service.

According to embodiments of the present disclosure, the initial contribution value may be a value characterizing the initially estimated likelihood that each application causes the service abnormality.

According to embodiments of the present disclosure, it is assumed that the fault chain may contain N applications that call one another, N ∈ {1, 2, …, i, …}. For each of the plurality of applications of the fault chain, for example the i-th application p_i in the fault chain, the initial contribution value PR(p_i) can be obtained by formula (1). Namely, formula (1) is as follows:

$$PR(p_i) = \frac{1-d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)} \tag{1}$$

where p_i is the i-th application in the fault chain; N is the number of all applications in the fault chain; d is the probability that application p_i calls other applications after reporting an error; 1 − d is the probability that application p_i does not call other applications after reporting an error; M(p_i) is the set of applications called by application p_i; p_j ∈ M(p_i) denotes an application p_j that belongs to this set, i.e., an application called by p_i; L(p_j) is the number of times application p_j is called by other applications; and PR(p_j) is the initial contribution value of application p_j.

According to an embodiment of the present disclosure, the initial contribution value of each application in the fault chain, i.e., PR(p_1), PR(p_2), …, PR(p_i), …, PR(p_N), can be obtained in this way.

According to embodiments of the present disclosure, an initial stationary distribution vector for a plurality of applications in a fault chain may be constructed using an initial contribution value of each node on the fault chain. That is, each component in the initial stationary distribution vector is the initial contribution value for each application.

According to embodiments of the present disclosure, the initial stationary distribution vector may characterize the probability that each of the interrelated applications on the fault chain is called. The initial stationary distribution vector R can be expressed as formula (2):

$$R = \begin{bmatrix} PR(p_1) & PR(p_2) & \cdots & PR(p_N) \end{bmatrix}^{T} \tag{2}$$

According to an embodiment of the present disclosure, assuming that the initial call probabilities of the N applications in the fault chain are equal, the call probability of each of the N applications is 1/N. The initial stationary distribution vector is then processed iteratively until a predetermined condition is met. The iterative processing may include: assigning an initial value to the initial stationary distribution vector R (with equal initial call probabilities, each component is 1/N); and iterating the resulting vector with the initial stationary distribution vector to obtain an updated vector, repeating until the predetermined condition is satisfied, so that a stationary distribution vector R' satisfying the predetermined condition is obtained. The stationary distribution vector R' satisfying the predetermined condition can be expressed as formula (3):

According to an embodiment of the present disclosure, the predetermined condition may be that, after the iterative processing of the initial stationary distribution vector, the contribution value of each application becomes stable, yielding the stationary distribution vector. Meeting the predetermined condition may mean meeting a preset convergence condition or reaching a predetermined number of iterations.

According to an embodiment of the present disclosure, the component in the stationary distribution vector R' satisfying the predetermined condition is the final contribution value of each application after iteration. And comparing the final contribution value of each application in the stable distribution vector R' with a preset threshold value, and determining the application corresponding to the final contribution value of the application which is larger than or equal to the preset threshold value as the fault application causing the abnormal service in the fault chain.

According to the embodiment of the disclosure, the contribution of each application in the fault chain to cause service abnormality is determined by utilizing the Page-Rank algorithm, so that the possibility of causing abnormal service is determined by the contribution value of each application, and the automatic positioning of the application-level fault root cause is realized.
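The iteration and threshold comparison described above can be illustrated with a minimal sketch. It assumes the standard PageRank update, $PR(p_i) = (1-d)/N + d \sum_{p_j \in M(p_i)} PR(p_j)/L(p_j)$, reads $M(p_i)$ as the set of applications called by $p_i$ and $L(p_j)$ as the number of applications calling $p_j$, and starts every component at $1/N$. The function names, the default $d = 0.85$, the tolerance, and the iteration cap are hypothetical choices for illustration, not values taken from the disclosure.

```python
def contribution_values(calls, d=0.85, tol=1e-9, max_iter=100):
    """Iterate contribution values until they stabilize.

    calls maps each application to the list of applications it calls;
    every application in the fault chain must appear as a key (assumption).
    """
    apps = sorted(calls)
    n = len(apps)
    # L(p_j): number of applications that call p_j.
    callers = {a: 0 for a in apps}
    for a in apps:
        for b in calls[a]:
            callers[b] += 1
    # Equal initial call probability 1/N for every application.
    pr = {a: 1.0 / n for a in apps}
    for _ in range(max_iter):  # predetermined number of iterations
        new = {
            a: (1 - d) / n + d * sum(pr[b] / callers[b] for b in calls[a])
            for a in apps
        }
        delta = max(abs(new[a] - pr[a]) for a in apps)
        pr = new
        if delta < tol:  # preset convergence condition
            break
    return pr


def faulty_applications(pr, threshold):
    """Applications whose final contribution value reaches the threshold."""
    return sorted(a for a, value in pr.items() if value >= threshold)


# Hypothetical fault chain: A calls B, and B calls C.
pr = contribution_values({"A": ["B"], "B": ["C"], "C": []})
print(faulty_applications(pr, threshold=0.1))
```

If the call graph is oriented the other way in a given deployment, only the construction of `calls` and `callers` changes; the iteration itself stays the same.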

Fig. 3 schematically illustrates a flowchart of a method of obtaining fault signature data of a historical fault service device associated with a historical fault application according to an embodiment of the disclosure.

It should be noted that the method for acquiring the service index feature data of the candidate service device associated with the fault application is the same as the method for acquiring the fault feature data of the historical fault service device associated with the historical fault application. The latter is described in detail below, and the former is therefore not described separately.

As shown in fig. 3, the method 300 may include operations S310-S340.

In operation S310, for the service indexes of the historical fault service device, the data corresponding to the service indexes are analyzed in time sequence to determine the start time of the fault.

In operation S320, data corresponding to the normal service index in a first time period and data corresponding to the abnormal service index in a second time period are acquired according to the start time of the fault, wherein the first time period includes a normal time period in which the historical fault service device operates before the start time of the fault, and the second time period includes an abnormal time period in which the historical fault service device operates after the start time of the fault.

In operation S330, according to the data corresponding to the normal service index and the data corresponding to the abnormal service index, a probability value of the data corresponding to the abnormal service index is determined under the condition that the historical fault service device operates with the data corresponding to the normal service index.

In operation S340, the probability value of the data corresponding to the abnormal service index is determined as the fault feature data of the historical fault service device.

According to embodiments of the present disclosure, the service indexes of the historical fault service device may include, for example, CPU, memory, and input/output. When a service device fails, its service indexes may change. Each service index may be represented as a set of data, with one value for each time instant. The data of each service index are analyzed in time sequence; when the data of a service index change markedly (rise or fall) at a certain time point, that time point is the time point at which the service index becomes abnormal.

According to the embodiment of the disclosure, when the data corresponding to a single service index change markedly at a time point, that time point is the start time of the fault of the historical fault service device. When the data of several service indexes change markedly at different time points, each of those service indexes has its own time point of abnormality, and the earliest of these time points may be determined as the start time of the fault.

According to the embodiment of the disclosure, after the start time of the fault is located, the degree of change and the degree of abnormality of the large number of service indexes of the historical fault service device need to be measured. Taking the start time of the fault as the dividing point, the data of the historical fault service device within a certain period may be divided into data corresponding to the normal service index in the first time period and data corresponding to the abnormal service index in the second time period. Both may be time-series data.

According to an embodiment of the present disclosure, taking the service index CPU of a historical fault service device as an example, within a preset time period and with the start time of the fault as the dividing point, the normal service index data of the service index CPU arranged in time sequence in the normal time period may be $\{x_m\}$, and the abnormal service index data of the service index CPU arranged in time sequence in the abnormal time period may be $\{x_n\}$, where $m$ and $n$ are the numbers of data corresponding to the service index in the normal time period and the abnormal time period, respectively; each is greater than or equal to 1 and less than the number of data corresponding to the service index in the preset time period.

According to the embodiment of the present disclosure, the data corresponding to the normal time period and the abnormal time period of other service indexes (e.g., memory, input and output) of the historical fault service device may be obtained by using the above method, which is not described herein.
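The division of one service index into a normal series $\{x_m\}$ and an abnormal series $\{x_n\}$ can be sketched as follows. The function name, the use of a list index as the start time, and the symmetric `window` length are assumptions made for illustration only.

```python
def split_at_fault_start(series, start, window):
    """Split one service index series around the fault start index.

    series: values ordered by time; start: index of the fault start time;
    window: maximum number of points kept on each side (hypothetical).
    """
    normal = series[max(0, start - window):start]   # {x_m}, before the fault
    abnormal = series[start:start + window]         # {x_n}, from the fault on
    return normal, abnormal


# Hypothetical CPU readings with the fault starting at index 6.
cpu = [10, 11, 9, 10, 12, 11, 50, 55, 53, 54]
normal, abnormal = split_at_fault_start(cpu, start=6, window=3)
print(normal, abnormal)
```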

According to the embodiments of the present disclosure, the probability value of the data corresponding to the abnormal service index, under the condition that the historical fault service device operates with the data corresponding to the normal service index, may be determined by conditional probability calculation. Specifically, given that the historical fault service device operates with the data corresponding to the normal service index, the probability that the data corresponding to the abnormal service index occur in the abnormal time period is calculated, thereby obtaining the probability that each service index of the historical fault service device becomes abnormal.

According to the embodiment of the present disclosure, assuming that the data corresponding to the service indexes of the historical fault service device are independently distributed, the probability that the data corresponding to the abnormal service index occur in the abnormal time period, given that the historical fault service device operates with the data corresponding to the normal service index, may be obtained by formula (4), that is:

where $k$ is the serial number of the service index; $P^k(\{x_n\} \mid \{x_m\})$ is the probability that the $k$-th service index of the historical fault service device produces the data corresponding to the abnormal service index, given that the historical fault service device operates with the data corresponding to the normal service index; $X$ is the data corresponding to the service index under test; $x_q$ is the $q$-th data among the data corresponding to the abnormal service index; and $n$ is the number of data corresponding to the abnormal service index.

According to embodiments of the present disclosure, an abnormality of a service index may be a rise or a fall of the service index. The probabilities of the service index rising and falling can be calculated by formulas (5) and (6):

where the quantities on the left-hand sides of formulas (5) and (6) are the probability that the $k$-th service index rises and the probability that the $k$-th service index falls, respectively; $P(X \ge x_q \mid \{x_m\})$ is the conditional probability that, given operation with the data corresponding to the normal service index, the data $X$ corresponding to the service index under test is greater than or equal to the $q$-th data $x_q$ corresponding to the abnormal service index; and $P(X \le x_q \mid \{x_m\})$ is the conditional probability that, given operation with the data corresponding to the normal service index, the data $X$ corresponding to the service index under test is less than or equal to the $q$-th data $x_q$ corresponding to the abnormal service index.

According to the embodiment of the present disclosure, the rise and fall probability values of each service index of the historical fault service device can be obtained by formulas (5) and (6) above, and together these values form the fault feature data of the historical fault service device. That is, the fault feature data may be represented as the set of rise and fall probability values of each service index of the historical fault service device.

According to the embodiment of the present disclosure, for a candidate service device, it may be assumed that the candidate service device has $l$ service indexes. Based on the above method for acquiring the fault feature data of the historical fault service device, the service index feature data of the candidate service device associated with the fault application may be represented as $\{o_1, u_1, o_2, u_2, \ldots, o_l, u_l\}$, where $o_l$ is the probability that the $l$-th service index rises and $u_l$ is the probability that the $l$-th service index falls.
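The following is a minimal sketch of how the conditional tail probabilities $P(X \ge x_q \mid \{x_m\})$ and $P(X \le x_q \mid \{x_m\})$ might be estimated and assembled into a feature vector with two values per service index. The empirical estimate with add-one smoothing and the product over the $n$ abnormal data points (motivated by the independence assumption) are assumptions of this sketch and are not the exact form of formulas (4) to (6); the function names are hypothetical.

```python
def tail_probabilities(normal, abnormal):
    """Return (p_ge, p_le) for one service index.

    p_ge combines P(X >= x_q | {x_m}) over all abnormal points x_q;
    p_le combines P(X <= x_q | {x_m}). A very small p_ge means the
    abnormal data lie above almost all normal data (an upward shift),
    and a very small p_le means a downward shift.
    """
    m = len(normal)
    p_ge = p_le = 1.0
    for x_q in abnormal:
        ge = sum(1 for x in normal if x >= x_q)
        le = sum(1 for x in normal if x <= x_q)
        # Add-one smoothing keeps the estimates strictly positive.
        p_ge *= (ge + 1) / (m + 1)
        p_le *= (le + 1) / (m + 1)
    return p_ge, p_le


def feature_vector(normal_by_index, abnormal_by_index):
    """Concatenate the two values of every service index, ordered by name."""
    features = []
    for name in sorted(normal_by_index):
        features.extend(
            tail_probabilities(normal_by_index[name], abnormal_by_index[name])
        )
    return features


# Hypothetical CPU data: normal readings near 10, abnormal readings near 50.
print(tail_probabilities([10, 11, 9, 10, 12], [50, 55]))
```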

According to an embodiment of the present disclosure, for a service index of a history fault service apparatus, data corresponding to the service index is analyzed in time sequence, and a start time of occurrence of a fault is determined, including: performing first-order difference operation on the data corresponding to the service index to obtain a first-order difference result; and detecting the time sequence abnormality of the first-order difference result, and determining the starting time of the fault occurrence.

According to an embodiment of the present disclosure, performing the first-order difference operation on the data corresponding to the service index may include: arranging the data corresponding to the service index into a data set in time sequence, and subtracting the data at each previous time instant from the data at the following time instant to obtain the first-order difference result.

According to an embodiment of the present disclosure, time-series anomaly detection is performed on the first-order difference result using a standard deviation method: a value in the set of result values that deviates from the mean of the set by more than three times the standard deviation is regarded as an abnormal value, and the time corresponding to such a sudden increase or decrease may be regarded as the start time of the abnormality of the service index. If only one service index of the historical fault service device is abnormal, the time point at which that service index becomes abnormal may be regarded as the start time of the fault of the historical fault service device; if a plurality of service indexes are abnormal, the earliest of the time points at which those service indexes become abnormal is the start time of the fault of the historical fault service device.

The start time of the fault in the candidate service device may be determined by the same method as that used for the historical fault service device, which is not repeated here.
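The first-order difference and three-standard-deviation rule described above can be sketched as follows. The use of the population standard deviation, the returned list index as the "start time", and the function names are assumptions made for illustration.

```python
from statistics import mean, pstdev


def index_fault_start(series):
    """Index in series where one service index first changes abruptly.

    Returns None when no difference deviates from the mean difference
    by more than three standard deviations.
    """
    # First-order difference: later value minus the previous value.
    diffs = [b - a for a, b in zip(series, series[1:])]
    if len(diffs) < 2:
        return None
    mu, sigma = mean(diffs), pstdev(diffs)
    if sigma == 0:
        return None
    for i, diff in enumerate(diffs):
        if abs(diff - mu) > 3 * sigma:
            return i + 1  # diffs[i] describes the jump into series[i + 1]
    return None


def device_fault_start(series_by_index):
    """Earliest abnormal time point over all service indexes of a device."""
    starts = [index_fault_start(s) for s in series_by_index.values()]
    starts = [s for s in starts if s is not None]
    return min(starts) if starts else None


# Hypothetical device: CPU jumps at index 20, memory jumps at index 25.
device = {"cpu": [10] * 20 + [50] * 10, "mem": [5] * 25 + [9] * 5}
print(device_fault_start(device))
```

Because the outlier itself inflates the standard deviation, a single jump only exceeds the three-sigma bound when the series is long enough; short windows may need a different detector.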

According to an embodiment of the present disclosure, performing density cluster analysis on the service index feature data of the candidate service devices and the fault feature data of the historical fault service devices through a density clustering model, and determining the fault service device causing the service abnormality from the candidate service devices, includes: inputting the service index feature data of the candidate service devices and the fault feature data of the historical fault service devices into the density clustering model for density clustering processing to obtain a plurality of clusters; and determining the fault service device causing the service abnormality from the candidate service devices according to the proportions of the service index feature data of the candidate service devices and the fault feature data of the historical fault service devices in each cluster.

According to the embodiment of the disclosure, the fault feature data of the historical fault service devices and the service index feature data of the candidate service devices can be input into the density clustering model, which clusters them by density based on the similarity between the fault feature data and the service index feature data to obtain a plurality of clusters. Within each cluster, the fault feature data and the service index feature data are highly similar and strongly correlated.

According to the embodiments of the present disclosure, whether the candidate service devices in a cluster include a fault service device causing the service abnormality can be determined according to the proportions of the fault feature data and the service index feature data in that cluster.

According to an embodiment of the present disclosure, the similarity between the fault feature data and the service index feature data may be measured by the Pearson correlation coefficient.
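For reference, the Pearson correlation coefficient between two feature vectors $a = (a_1, \ldots, a_t)$ and $b = (b_1, \ldots, b_t)$ of equal length $t$ is given below; the symbols $a$, $b$, $t$, and $r$ are introduced here for illustration only.

$$r(a, b) = \frac{\sum_{s=1}^{t} (a_s - \bar{a})(b_s - \bar{b})}{\sqrt{\sum_{s=1}^{t} (a_s - \bar{a})^2}\,\sqrt{\sum_{s=1}^{t} (b_s - \bar{b})^2}}, \qquad \bar{a} = \frac{1}{t}\sum_{s=1}^{t} a_s, \quad \bar{b} = \frac{1}{t}\sum_{s=1}^{t} b_s$$

A value of $r$ close to 1 indicates that the two devices show nearly the same pattern of rise and fall probabilities across their service indexes, while a value close to $-1$ indicates opposite patterns.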

According to an embodiment of the present disclosure, determining the fault service device causing the service abnormality from the candidate service devices according to the proportions of the service index feature data of the candidate service devices and the fault feature data of the historical fault service devices in each cluster includes: determining the feature data with the largest proportion in the cluster according to those proportions; and, in a case where the feature data with the largest proportion in the cluster is the fault feature data of the historical fault service devices, determining the service devices corresponding to the service index feature data of the candidate service devices located in the same cluster as the fault feature data as the fault service devices causing the service abnormality.

According to the embodiment of the disclosure, each cluster contains fault feature data and service index feature data with a relatively high degree of correlation. By determining the proportions of the fault feature data and the service index feature data in each cluster, it can be determined whether the feature data with the largest proportion in the cluster is the fault feature data or the service index feature data of the candidate service devices.

According to the embodiment of the disclosure, if the feature data with the largest proportion in the cluster is the fault feature data of the historical fault service devices, the service devices corresponding to the service index feature data of the candidate service devices in the same cluster are the fault service devices causing the service abnormality. These fault service devices are identified among the plurality of candidate service devices according to the association between the service index feature data and the candidate service device identifiers.

According to the embodiment of the present disclosure, for example, a cluster contains feature data corresponding to 15 service devices in total, of which 12 are fault feature data of historical fault service devices and 3 are service index feature data of candidate service devices. The service index feature data of the 3 candidate service devices may then be considered highly similar to the fault feature data of the historical fault service devices, and the 3 candidate service devices may be considered fault service devices causing the service abnormality.
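The clustering and proportion rule can be illustrated with a minimal sketch. It assumes DBSCAN as the density clustering model, $1 - r$ (with $r$ the Pearson coefficient) as the distance, and "largest proportion" read as historical fault feature data outnumbering candidate feature data in the cluster. The function names and the parameters `eps` and `min_pts` are hypothetical.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb) if va > 0 and vb > 0 else 0.0


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN; returns one label per point, -1 marks noise."""
    n = len(points)
    near = [
        [j for j in range(n) if 1.0 - pearson(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(near[i]) < min_pts:
            labels[i] = -1  # noise for now; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(near[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(near[j]) >= min_pts:  # core point: expand the cluster
                queue.extend(near[j])
    return labels


def locate_faulty_devices(historical, candidates, eps=0.1, min_pts=3):
    """Candidates that share a cluster dominated by historical fault data."""
    names = list(candidates)
    points = list(historical) + [candidates[name] for name in names]
    labels = dbscan(points, eps, min_pts)
    n_hist = len(historical)
    faulty = []
    for c in set(labels) - {-1}:
        members = [i for i, label in enumerate(labels) if label == c]
        hist_count = sum(1 for i in members if i < n_hist)
        if hist_count > len(members) - hist_count:
            faulty += [names[i - n_hist] for i in members if i >= n_hist]
    return sorted(faulty)


# Hypothetical feature vectors (two values per service index, two indexes).
historical = [
    [0.9, 0.1, 0.8, 0.2],
    [0.85, 0.15, 0.8, 0.1],
    [0.95, 0.05, 0.9, 0.2],
    [0.9, 0.1, 0.7, 0.1],
]
candidates = {"dev-a": [0.9, 0.1, 0.85, 0.15], "dev-b": [0.1, 0.9, 0.2, 0.8]}
print(locate_faulty_devices(historical, candidates))
```

In this hypothetical data, `dev-a` follows the same pattern as the historical faults and joins their cluster, while `dev-b` shows the opposite pattern and is left as noise.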

Fig. 4 schematically illustrates a block diagram of a fault locating device according to an embodiment of the present disclosure.

As shown in fig. 4, the fault locating device 400 may include: the first determination module 410, the second determination module 420, the first acquisition module 430, the second acquisition module 440, and the analysis module 450.

A first determining module 410, configured to determine a failure chain when a service abnormality is detected, where the failure chain includes a plurality of applications having a calling relationship, and each application has a corresponding service device.

A second determining module 420 is configured to determine, from among the plurality of applications, a failed application in the failure chain that causes a service exception.

The first obtaining module 430 is configured to obtain service index feature data of a candidate service device associated with the faulty application.

And a second obtaining module 440, configured to obtain fault characteristic data of the historical fault service device associated with the historical fault application.

And the analysis module 450 is used for performing density cluster analysis on the service index feature data of the candidate service equipment and the fault feature data of the historical fault service equipment through a density cluster model, and determining the fault service equipment causing abnormal service from the candidate service equipment.
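As a rough illustration of how the five modules cooperate, the sketch below wires them together as injected callables. The class name, field names, call signatures, and the stub behavior in the usage example are all hypothetical; the disclosure does not prescribe a programming interface.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class FaultLocator:
    """Hypothetical composition of modules 410 to 450."""
    determine_chain: Callable[[Any], Any]          # first determining module 410
    determine_fault_app: Callable[[Any], Any]      # second determining module 420
    get_candidate_features: Callable[[Any], Any]   # first obtaining module 430
    get_historical_features: Callable[[Any], Any]  # second obtaining module 440
    analyze: Callable[[Any, Any], Any]             # analysis module 450

    def locate(self, anomaly: Any) -> Any:
        chain = self.determine_chain(anomaly)
        fault_app = self.determine_fault_app(chain)
        candidate = self.get_candidate_features(fault_app)
        historical = self.get_historical_features(fault_app)
        return self.analyze(candidate, historical)


# Usage with trivial stand-in functions.
locator = FaultLocator(
    determine_chain=lambda anomaly: ["app-1", "app-2"],
    determine_fault_app=lambda chain: chain[-1],
    get_candidate_features=lambda app: {"dev-a": [0.9, 0.1]},
    get_historical_features=lambda app: [[0.9, 0.1]],
    analyze=lambda candidate, historical: sorted(candidate),
)
print(locator.locate("service abnormality"))
```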

According to an embodiment of the present disclosure, the second determining module 420 may include: the system comprises a first determining sub-module, a constructing sub-module, a first processing sub-module and a second determining sub-module.

A first determination sub-module for determining, for each of a plurality of applications, an initial contribution value of each application to causing a service anomaly.

The construction sub-module is configured to construct an initial stationary distribution vector for the plurality of applications in the fault chain according to the initial contribution value of each application to the service abnormality.

The first processing sub-module is configured to perform iterative processing on the initial stationary distribution vector to obtain a stationary distribution vector satisfying the predetermined condition.

The second determining sub-module is configured to determine, as a fault application causing the service abnormality in the fault chain, each application whose contribution value in the stationary distribution vector satisfies the threshold.

According to an embodiment of the present disclosure, the second acquisition module 440 may include: the system comprises an analysis sub-module, an acquisition sub-module, a third determination sub-module and a fourth determination sub-module.

And the analysis submodule is used for analyzing the data corresponding to the service indexes according to the time sequence aiming at the service indexes of the historical fault service equipment and determining the starting time of fault occurrence.

And the acquisition submodule is used for acquiring data corresponding to the normal service index in a first time period and data corresponding to the abnormal service index in a second time period according to the starting time of the fault, wherein the first time period comprises a normal time period of the operation of the historical fault service equipment before the starting time of the fault, and the second time period comprises an abnormal time period of the operation of the historical fault service equipment after the starting time of the fault.

The third determining sub-module is configured to determine, according to the data corresponding to the normal service index and the data corresponding to the abnormal service index, a probability value of the data corresponding to the abnormal service index under the condition that the historical fault service device operates with the data corresponding to the normal service index.

And a fourth determining sub-module, configured to determine a probability value of data corresponding to the abnormal service index as fault feature data of the historical fault service device.

According to an embodiment of the present disclosure, the analysis sub-module may include: an acquisition unit and a detection unit.

The acquisition unit is used for performing first-order difference operation on the data corresponding to the service index to obtain a first-order difference result.

And the detection unit is used for carrying out time sequence abnormality detection on the first-order difference result and determining the starting time of fault occurrence.

According to an embodiment of the present disclosure, the analysis module may include: the second processing sub-module and the fifth determining sub-module.

And the second processing sub-module is used for inputting the service index characteristic data of the candidate service equipment and the fault characteristic data of the historical fault service equipment into a density clustering model to perform density clustering processing to obtain a plurality of clusters.

The fifth determining sub-module is configured to determine the fault service device causing the service abnormality from the candidate service devices according to the proportions of the service index feature data of the candidate service devices and the fault feature data of the historical fault service devices in each cluster.

According to an embodiment of the present disclosure, the fifth determining sub-module may include: a first determination unit, and a second determination unit.

The first determining unit is configured to determine the feature data with the largest proportion in the cluster according to the proportions of the service index feature data of the candidate service devices and the fault feature data of the historical fault service devices in each cluster.

The second determining unit is configured to determine, in a case where the feature data with the largest proportion in the cluster is the fault feature data of the historical fault service devices, the service devices corresponding to the service index feature data of the candidate service devices located in the same cluster as the fault feature data as the fault service devices causing the service abnormality.

It should be noted that the embodiments of the apparatus portion of the present disclosure are the same as or similar to the embodiments of the method portion of the present disclosure, and are not repeated here.

Any of the first determination module 410, the second determination module 420, the first acquisition module 430, the second acquisition module 440, and the analysis module 450, or at least some of the functionality of any of them, according to embodiments of the present disclosure, may be implemented in one module. Any one or more of the first determination module 410, the second determination module 420, the first acquisition module 430, the second acquisition module 440, and the analysis module 450 according to embodiments of the present disclosure may be implemented as split into a plurality of modules. Any one or more of the first determination module 410, the second determination module 420, the first acquisition module 430, the second acquisition module 440, and the analysis module 450 according to embodiments of the present disclosure may be implemented at least in part as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or any other reasonable manner of hardware or firmware by which the circuit is integrated or packaged, or any one of or a suitable combination of three of software, hardware, and firmware. Alternatively, one or more of the first determination module 410, the second determination module 420, the first acquisition module 430, the second acquisition module 440, and the analysis module 450 according to embodiments of the present disclosure may be at least partially implemented as computer program modules, which when executed, may perform the respective functions.

For example, any of the first determination module 410, the second determination module 420, the first acquisition module 430, the second acquisition module 440, and the analysis module 450 may be combined in one module/unit/sub-unit or any of the modules/units/sub-units may be split into a plurality of modules/units/sub-units. Alternatively, at least some of the functionality of one or more of these modules/units/sub-units may be combined with at least some of the functionality of other modules/units/sub-units and implemented in one module/unit/sub-unit. According to embodiments of the present disclosure, at least one of the first determination module 410, the second determination module 420, the first acquisition module 430, the second acquisition module 440, and the analysis module 450 may be implemented at least in part as hardware circuitry, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or as hardware or firmware in any other reasonable manner of integrating or packaging the circuitry, or as any one of or a suitable combination of any of the three. Alternatively, at least one of the first determination module 410, the second determination module 420, the first acquisition module 430, the second acquisition module 440, and the analysis module 450 may be at least partially implemented as computer program modules, which when executed, may perform the respective functions.

Fig. 5 schematically illustrates a block diagram of an electronic device adapted to implement a fault localization method according to an embodiment of the disclosure. The electronic device shown in fig. 5 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.

As shown in fig. 5, an electronic device 500 according to an embodiment of the present disclosure includes a processor 501 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 502 or a program loaded from a storage section 508 into a Random Access Memory (RAM) 503. The processor 501 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), or the like. The processor 501 may also include on-board memory for caching purposes. The processor 501 may comprise a single processing unit or a plurality of processing units for performing different actions of the method flows according to embodiments of the disclosure.

In the RAM 503, various programs and data required for the operation of the electronic device 500 are stored. The processor 501, ROM 502, and RAM 503 are connected to each other by a bus 504. The processor 501 performs various operations of the method flow according to the embodiments of the present disclosure by executing programs in the ROM 502 and/or the RAM 503. Note that the program may be stored in one or more memories other than the ROM 502 and the RAM 503. The processor 501 may also perform various operations of the method flow according to embodiments of the present disclosure by executing programs stored in one or more memories.

According to an embodiment of the present disclosure, the electronic device 500 may also include an input/output (I/O) interface 505, which is also connected to the bus 504. The electronic device 500 may also include one or more of the following components connected to the I/O interface 505: an input section 506 including a keyboard, a mouse, and the like; an output section 507 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), a speaker, and the like; a storage section 508 including a hard disk and the like; and a communication section 509 including a network interface card such as a LAN card, a modem, or the like. The communication section 509 performs communication processing via a network such as the internet. A drive 510 is also connected to the I/O interface 505 as needed. A removable medium 511, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 510 as needed, so that a computer program read therefrom is installed into the storage section 508 as needed.

According to embodiments of the present disclosure, the method flow according to embodiments of the present disclosure may be implemented as a computer software program. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable storage medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication portion 509, and/or installed from the removable media 511. The above-described functions defined in the system of the embodiments of the present disclosure are performed when the computer program is executed by the processor 501. The systems, devices, apparatus, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the disclosure.

The present disclosure also provides a computer-readable storage medium that may be embodied in the apparatus/device/system described in the above embodiments; or may exist alone without being assembled into the apparatus/device/system. The computer-readable storage medium carries one or more programs which, when executed, implement methods in accordance with embodiments of the present disclosure.

According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium. Examples may include, but are not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

For example, according to embodiments of the present disclosure, the computer-readable storage medium may include ROM 502 and/or RAM 503 and/or one or more memories other than ROM 502 and RAM 503 described above.

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

Those skilled in the art will appreciate that the features recited in the various embodiments of the disclosure and/or in the claims may be combined and/or incorporated in a variety of ways, even if such combinations or incorporations are not explicitly recited in the disclosure. In particular, the features recited in the various embodiments of the present disclosure and/or the claims may be combined and/or incorporated in various ways without departing from the spirit and teachings of the present disclosure. All such combinations and/or incorporations fall within the scope of the present disclosure.

The embodiments of the present disclosure are described above. However, these embodiments are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described separately above, this does not mean that the measures in the different embodiments cannot be used advantageously in combination. The scope of the disclosure is defined by the appended claims and their equivalents. Various alternatives and modifications can be made by those skilled in the art without departing from the scope of the disclosure, and such alternatives and modifications are intended to fall within the scope of the disclosure.

Claims

1. A fault locating method, comprising:

determining a fault chain under the condition that service abnormality is detected, wherein the fault chain comprises a plurality of applications with calling relations, and each application has corresponding service equipment;

determining a fault application causing service abnormality in the fault chain from the plurality of applications;

acquiring service index feature data of candidate service equipment associated with the fault application;

acquiring fault characteristic data of a historical fault service device associated with a historical fault application;

and performing density cluster analysis on the service index feature data of the candidate service equipment and the fault feature data of the historical fault service equipment through a density cluster model, and determining the fault service equipment causing the abnormal service from the candidate service equipment.

2. The method of claim 1, wherein the acquiring fault characteristic data of a historical fault service device associated with a historical fault application comprises:

for a service index of the historical fault service device, analyzing data corresponding to the service index in time order, and determining a start time of fault occurrence;

acquiring data corresponding to a normal service index in a first time period and data corresponding to an abnormal service index in a second time period according to the starting time of the fault, wherein the first time period comprises a normal time period of the operation of the historical fault service equipment before the starting time of the fault, and the second time period comprises an abnormal time period of the operation of the historical fault service equipment after the starting time of the fault;

determining, according to the data corresponding to the normal service index and the data corresponding to the abnormal service index, a probability value of the data corresponding to the abnormal service index occurring in the case that the historical fault service device operates with the data corresponding to the normal service index;

and determining the probability value of the data corresponding to the abnormal service index as the fault characteristic data of the historical fault service equipment.
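
By way of a non-limiting illustration, the probability value of claim 2 could be computed as in the following minimal sketch. The claim does not fix a probability model; the Gaussian fit to the normal-period data, the two-sided tail probability, and the function name `fault_feature` are assumptions made here purely for illustration.

```python
from statistics import NormalDist, fmean, stdev


def fault_feature(normal, abnormal):
    """Probability of the abnormal-period data given normal operation.

    Assumed model (not taken from the disclosure): the normal-period
    samples are fitted with a Gaussian, and the feature is the two-sided
    tail probability of the abnormal-period mean under that Gaussian.
    `normal` needs at least two samples so a standard deviation exists.
    """
    mu = fmean(normal)
    sigma = stdev(normal)
    abnormal_mean = fmean(abnormal)
    if sigma == 0:
        # Degenerate case: the normal period is perfectly constant.
        return 1.0 if abnormal_mean == mu else 0.0
    z = abs(abnormal_mean - mu) / sigma
    return 2.0 * (1.0 - NormalDist().cdf(z))


# Hypothetical CPU-usage samples before and after the fault start time.
normal_period = [10, 11, 9, 10, 10]
abnormal_period = [30, 31, 29]
feature = fault_feature(normal_period, abnormal_period)
```

A value close to 0 indicates that the abnormal-period data would be very unlikely under normal operation; one such value per service index could form the feature vector of a device.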

3. The method of claim 2, wherein the analyzing, for the service index of the historical fault service device, the data corresponding to the service index in time order and determining the start time of fault occurrence comprises:

performing first-order difference operation on the data corresponding to the service index to obtain a first-order difference result;

and performing time-series anomaly detection on the first-order difference result to determine the start time of fault occurrence.
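
As a non-limiting illustration of claim 3, the sketch below takes the first-order difference of a service index series and reports the first point at which the difference is anomalous. The claim does not name a particular anomaly detector; the robust z-score based on the median absolute deviation, the cutoff `k`, and the function names are assumptions chosen for this example.

```python
from statistics import median


def first_difference(values):
    """First-order difference: d[i] = values[i + 1] - values[i]."""
    return [b - a for a, b in zip(values, values[1:])]


def fault_start_index(values, k=5.0):
    """Index in `values` of the first sample reached by an anomalous jump.

    Assumed detector (not specified by the claim): a difference is
    anomalous when its distance from the median difference exceeds
    k times the median absolute deviation (MAD) of the differences.
    Returns None when no difference is anomalous.
    """
    diffs = first_difference(values)
    if not diffs:
        return None
    med = median(diffs)
    mad = median(abs(d - med) for d in diffs)
    scale = mad if mad > 0 else 1e-9  # avoid division by zero
    for i, d in enumerate(diffs):
        if abs(d - med) / scale > k:
            return i + 1  # diffs[i] is the jump from values[i] to values[i + 1]
    return None


# Hypothetical response-time series with a jump at index 6.
series = [50, 51, 50, 52, 51, 50, 90, 91, 92]
start = fault_start_index(series)
```

Samples before the returned index would then belong to the first (normal) time period of claim 2, and samples from that index onward to the second (abnormal) time period.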

4. The method of claim 1, wherein the performing density cluster analysis on the service index feature data of the candidate service equipment and the fault feature data of the historical fault service equipment through the density cluster model, and determining the fault service equipment causing the abnormal service from the candidate service equipment comprises:

inputting the service index feature data of the candidate service equipment and the fault feature data of the historical fault service equipment into the density clustering model for density clustering processing to obtain a plurality of clusters;

and determining the fault service equipment causing the abnormal service from the candidate service equipment according to the proportions of the service index feature data of the candidate service equipment and of the fault feature data of the historical fault service equipment in each cluster.

5. The method of claim 4, wherein the determining, from the candidate service equipment, the fault service equipment causing the abnormal service according to the proportions of the service index feature data of the candidate service equipment and of the fault feature data of the historical fault service equipment in each cluster comprises:

determining the feature data with the largest proportion in each cluster according to the service index feature data of the candidate service equipment and the proportion of the fault feature data of the historical fault service equipment in each cluster;

and determining the service equipment corresponding to the service index characteristic data of the candidate service equipment in the same cluster as the fault service equipment causing the service abnormality under the condition that the characteristic data with the largest proportion in the cluster is determined to be the fault characteristic data of the historical fault service equipment.
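
The clustering and majority rule of claims 4 and 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: DBSCAN is used as one well-known density clustering algorithm, and the parameters `eps` and `min_pts`, the strict-majority tie handling, and the function names are all hypothetical.

```python
from collections import Counter
from math import dist


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN. Returns one label per point; -1 marks noise."""
    labels = [None] * len(points)

    def neighbours(i):
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:  # j is a core point: keep expanding
                queue.extend(reachable)
    return labels


def locate_faulty_devices(candidates, historical, eps=0.1, min_pts=3):
    """candidates: {device name: feature vector}; historical: feature vectors.

    A candidate is reported when historical fault features form the largest
    share of its cluster (assumption: ties do not count as a majority).
    """
    names = list(candidates)
    points = [tuple(candidates[n]) for n in names] + [tuple(h) for h in historical]
    is_hist = [False] * len(names) + [True] * len(historical)
    labels = dbscan(points, eps, min_pts)
    counts = {}
    for label, hist in zip(labels, is_hist):
        if label != -1:
            counts.setdefault(label, Counter())[hist] += 1
    return [
        name
        for name, label in zip(names, labels)
        if label != -1 and counts[label][True] > counts[label][False]
    ]


# Hypothetical two-index feature vectors (probability values per index).
candidates = {"host-a": (0.01, 0.02), "host-b": (0.90, 0.95), "host-c": (0.92, 0.90)}
historical = [(0.00, 0.00), (0.02, 0.01), (0.01, 0.00)]
faulty = locate_faulty_devices(candidates, historical)
```

In this example `host-a` falls in a cluster dominated by historical fault features and is reported, while `host-b` and `host-c` are too few to form a cluster and are treated as noise.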

6. The method of claim 1, wherein the determining, from the plurality of applications, a fault application causing service abnormality in the fault chain comprises:

determining, for each of the plurality of applications, an initial contribution value of the application to causing the service abnormality;

constructing an initial stable distribution vector for the plurality of applications in the fault chain according to the initial contribution value of each application to the service abnormality;

performing iterative processing on the initial stable distribution vector to obtain a stable distribution vector meeting a preset condition;

and determining, as the fault application causing the service abnormality in the fault chain, each application whose contribution value in the stable distribution vector meets a threshold value.
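
One possible reading of the iteration in claim 6 is a random-walk style power iteration over the call relations, sketched below. The claim does not specify the update rule; the transition probabilities, the convergence tolerance used as the preset condition, and the function names are assumptions made for illustration only.

```python
def stable_distribution(initial, transition, tol=1e-9, max_iter=1000):
    """Iterate v <- v * P until the largest change falls below tol.

    initial: {app: initial contribution value}, with a positive sum.
    transition: {app: {next app: probability}}, each row summing to 1.
    Both the matrix form and the stopping rule are assumptions.
    """
    apps = list(initial)
    total = sum(initial.values())
    v = {a: initial[a] / total for a in apps}  # normalise to a distribution
    for _ in range(max_iter):
        nxt = {a: 0.0 for a in apps}
        for src in apps:
            for dst, p in transition[src].items():
                nxt[dst] += v[src] * p
        if max(abs(nxt[a] - v[a]) for a in apps) < tol:
            return nxt
        v = nxt
    return v


def faulty_applications(initial, transition, threshold):
    """Applications whose stable contribution value reaches the threshold."""
    v = stable_distribution(initial, transition)
    return [a for a, score in v.items() if score >= threshold]


# Hypothetical fault chain: gateway calls order, order calls db.
initial = {"gateway": 0.5, "order": 0.3, "db": 0.2}
transition = {
    "gateway": {"gateway": 0.2, "order": 0.8},
    "order": {"order": 0.3, "db": 0.7},
    "db": {"db": 1.0},
}
suspects = faulty_applications(initial, transition, threshold=0.5)
```

Here the walk drifts from callers toward callees, so the contribution mass accumulates on `db`, the deepest application in the chain, which is then the only one to meet the threshold.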

7. A fault locating device comprising:

the first determining module is used for determining a fault chain under the condition that the service abnormality is detected, wherein the fault chain comprises a plurality of applications with calling relations, and each application is provided with corresponding service equipment;

a second determining module, configured to determine, from the plurality of applications, a fault application causing service abnormality in the fault chain;

the first acquisition module is used for acquiring service index characteristic data of candidate service equipment associated with the fault application;

the second acquisition module is used for acquiring fault characteristic data of the historical fault service equipment associated with the historical fault application;

and the analysis module is used for carrying out density cluster analysis on the service index characteristic data of the candidate service equipment and the fault characteristic data of the historical fault service equipment through a density cluster model, and determining the fault service equipment causing the abnormal service from the candidate service equipment.

8. A computer system, comprising:

one or more processors;

a memory for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1 to 6.

9. A computer readable storage medium having stored thereon executable instructions which when executed by a processor cause the processor to implement the method of any of claims 1 to 6.

10. A computer program product comprising computer executable instructions for implementing the method of any one of claims 1 to 6 when executed.