CN111722952A - Fault analysis method, system, equipment and storage medium of business system - Google Patents
- Publication number
- CN111722952A (application number CN202010448356.7A)
- Authority
- CN
- China
- Prior art keywords
- abnormal
- fault
- service
- node
- acquiring
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0751—Error or fault detection not based on redundancy
- G06F11/0754—Error or fault detection not based on redundancy by exceeding limits
- G06F11/076—Error or fault detection not based on redundancy by exceeding limits by exceeding a count or rate limit, e.g. word- or bit count limit
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/079—Root cause analysis, i.e. error or fault diagnosis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Quality & Reliability (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Debugging And Monitoring (AREA)
- Test And Diagnosis Of Digital Computers (AREA)
Abstract
The present disclosure provides a method, system, device and storage medium for failure analysis of a business system. The method comprises the following steps: acquiring an operation index in a service operation process; when the operation index is abnormal, acquiring an error object corresponding to the abnormal operation index; acquiring a service chain set of a service related to the error object; acquiring a plurality of nodes from the service chain set according to a fault node judgment mechanism corresponding to the abnormal operation index; and selecting the nodes with the occurrence frequency exceeding a preset threshold value from the plurality of nodes as fault nodes. According to the fault analysis method provided by the disclosure, when the service operation index is abnormal, the service chain set of the service related to the error object is obtained, and then the fault node is obtained in the service chain set, so that the range of the fault node can be judged based on the service chain information, and the fault generation reason can be positioned more accurately.
Description
Technical Field
The present invention relates to the field of computer application technologies, and in particular, to a method, a system, a device, and a storage medium for analyzing a fault of a business system.
Background
A service system usually operates through service chains, each of which is completed by a plurality of nodes combined in order and participating jointly. When the service system operates abnormally, the one or more nodes on the service chain that cause the abnormality must be found quickly and accurately, so that the fault can be eliminated and normal service functions restored.
In the prior art, when a service system operates abnormally, a service expert usually has to analyze the situation and identify the abnormal service node based on his or her own operation and maintenance experience.
Besides this manual approach, the range of nodes associated with the service system (which are not necessarily on the service chain) can be determined as widely as possible; the operation data of these nodes is then collected, and a specific algorithm analyzes the changes in that data to judge whether each associated node is abnormal.
Further, the root cause of a fault at an abnormal service node is located mainly from the alarm information of the abnormal or failed node, or from abnormal changes in its operation indexes.
However, the above prior art cannot accurately obtain the association relationships between nodes or narrow the range of candidate fault nodes. As a result, the accuracy of locating the fault cause from the abnormal nodes is low, and the heavy computation needed to collect data from all associated nodes prevents fast location of the fault node. Secondly, without established association relationships among configuration objects, the multi-dimensional associated configuration objects of a node, including platform-layer, network-layer and device-layer configuration objects, cannot be obtained automatically, so complete root cause analysis cannot be performed. Meanwhile, the root cause is located only from the alarm information of the abnormal node or the abnormal changes in the operation index, so the root cause of the fault cannot be analyzed comprehensively or located accurately.
Disclosure of Invention
To solve some or all of the above problems in the prior art, the invention provides a fault analysis method, system, device and storage medium for a service system, which can determine the range of fault nodes based on service chain information and thereby locate the cause of a fault more accurately.
According to a first aspect of the present invention, an embodiment of the present invention provides a method for analyzing a fault of a service system, including: acquiring an operation index in a service operation process; when the operation index is abnormal, acquiring an error object corresponding to the abnormal operation index; acquiring a service chain set of a service related to the error object; acquiring a plurality of nodes from the service chain set according to a fault node judgment mechanism corresponding to the abnormal operation index; and selecting the nodes with the occurrence frequency exceeding a preset threshold value from the plurality of nodes as fault nodes.
In the above embodiment of the present invention, by acquiring the service chain information related to the error object when the service operation index is abnormal, and selecting the fault node from the plurality of nodes acquired according to the service chain set, a more accurate fault node can be acquired by combining with the service chain information analysis, thereby providing a basis for the subsequent fault root cause analysis.
In some embodiments of the invention, the operation index comprises: service success rate and response time.
In some embodiments of the present invention, obtaining the set of service chains of the service related to the error object includes: acquiring identification information of a service related to the error object; and acquiring the service chain of the service related to the error object according to the identification information, and summarizing to obtain the service chain set.
In the above embodiment of the present invention, each service identifier related to the error object is obtained, the service chain of each service is found through its identifier, and the chains are aggregated into the service chain set corresponding to the error object. All service chain information and node information associated with the error object can thus be obtained, so that fault analysis of the service system can derive an accurate node range from the associated service chain information and, in turn, a more accurate fault node.
In some embodiments of the present invention, when the operation index is abnormal, acquiring an error object corresponding to the abnormal operation index includes: when the service success rate is abnormal, acquiring, as error objects, error code objects whose count has increased by more than a first threshold.
In some embodiments of the present invention, acquiring a plurality of nodes in the service chain set according to a fault node judgment mechanism corresponding to the abnormal operation index includes: taking the last error-reporting node on each service chain in the service chain set as one of the plurality of nodes.
In some embodiments of the present invention, when the operation index is abnormal, acquiring an error object corresponding to the abnormal operation index further includes: when the response time is abnormal, acquiring, as error objects, service objects whose response time has increased by more than a second threshold.
In some embodiments of the present invention, acquiring the plurality of nodes in the service chain set according to the fault node judgment mechanism corresponding to the abnormal operation index further includes: taking the node with the largest change in processing time on each service chain in the service chain set as one of the plurality of nodes.
In some embodiments of the present invention, the fault analysis method further comprises: acquiring a multi-dimensional associated configuration object of the fault node.
In some embodiments of the invention, the multi-dimensional associated configuration object comprises at least one of: an association configuration object of an application dimension, an association configuration object of a platform dimension, an association configuration object of a network dimension, an association configuration object of a storage dimension, and an association configuration object of a host system dimension.
In some embodiments of the present invention, the fault analysis method further comprises: locating the fault cause according to the characteristic index data corresponding to the multi-dimensional associated configuration object of the fault node.
By acquiring the characteristic index data corresponding to the multi-dimensional associated configuration object of the fault node, the embodiment of the invention can collect more comprehensive clues which may cause the fault, and provides a basis for the accuracy of root cause positioning.
In some embodiments of the present invention, locating the fault cause according to the characteristic index data corresponding to the multi-dimensional associated configuration object of the fault node includes: filtering and summarizing change information, alarm information and user access information of the multi-dimensional associated configuration object; acquiring the abnormal degree and abnormal form of the characteristic index data according to the health check information and the anomaly detection information corresponding to the characteristic index data; assigning a weight to the characteristic index data according to its abnormal degree and the degree of similarity between its abnormal form and the abnormal form of the operation index; determining a tracing relationship between the characteristic index data; and outputting the filtered and summarized change information, alarm information and user access information of the multi-dimensional associated configuration object, and recommending the corresponding configuration object and characteristic index as the fault cause according to the weight of the characteristic index data and the tracing relationship.
In the above embodiment of the present invention, the filtered and summarized change information, alarm information and user access information of the multi-dimensional associated configuration object are output, and the corresponding configuration object and characteristic index are recommended as the fault cause according to two factors: the weight derived from the abnormal degree of the characteristic index data and from the similarity between its abnormal form and that of the operation index, and the tracing relationship between the characteristic index data. Possible fault causes can thus be recommended according to probability and association relationship, which achieves standardized and intelligent fault cause output, reduces the root cause location deviation caused by differences in human experience, and makes root cause location standardized and efficient.
According to a second aspect of the present invention, an embodiment of the present invention provides a fault analysis system for a business system, including: the operation index acquisition module is used for acquiring operation indexes in the service operation process; the error object acquisition module is used for acquiring an error object corresponding to the abnormal operation index when the operation index is abnormal; a service chain set obtaining module, configured to obtain a service chain set of a service related to the error object; the node acquisition module is used for acquiring a plurality of nodes in the service chain set according to a fault node judgment mechanism corresponding to the abnormal operation index; and the fault node acquisition module is used for selecting the node with the occurrence frequency exceeding a preset threshold value from the plurality of nodes as the fault node.
In the above embodiment of the present invention, by acquiring the service chain information related to the error object when the service operation index is abnormal, and selecting the fault node from the plurality of nodes acquired according to the service chain set, a more accurate fault node can be acquired by combining with the service chain information analysis, thereby providing a basis for the subsequent fault root cause analysis.
In some embodiments of the invention, the operation index comprises: service success rate and response time.
In some embodiments of the present invention, obtaining the set of service chains of the service related to the error object includes: acquiring identification information of a service related to the error object; and acquiring the service chain of the service related to the error object according to the identification information, and summarizing to obtain the service chain set.
In the above embodiment of the present invention, each service identifier related to the error object is obtained, the service chain of each service is found through its identifier, and the chains are aggregated into the service chain set corresponding to the error object. All service chain information and node information associated with the error object can thus be obtained, so that fault analysis of the service system can derive an accurate node range from the associated service chain information and, in turn, a more accurate fault node.
In some embodiments of the present invention, when the operation index is abnormal, acquiring an error object corresponding to the abnormal operation index includes: when the service success rate is abnormal, acquiring, as error objects, error code objects whose count has increased by more than a first threshold.
In some embodiments of the present invention, acquiring a plurality of nodes in the service chain set according to a fault node judgment mechanism corresponding to the abnormal operation index includes: taking the last error-reporting node on each service chain in the service chain set as one of the plurality of nodes.
In some embodiments of the present invention, when the operation index is abnormal, acquiring an error object corresponding to the abnormal operation index further includes: when the response time is abnormal, acquiring, as error objects, service objects whose response time has increased by more than a second threshold.
In some embodiments of the present invention, acquiring the plurality of nodes in the service chain set according to the fault node judgment mechanism corresponding to the abnormal operation index further includes: taking the node with the largest change in processing time on each service chain in the service chain set as one of the plurality of nodes.
In some embodiments of the invention, the fault analysis system further comprises: an associated configuration object acquisition module, configured to acquire the multi-dimensional associated configuration object of the fault node.
In some embodiments of the invention, the multi-dimensional associated configuration object comprises at least one of: an association configuration object of an application dimension, an association configuration object of a platform dimension, an association configuration object of a network dimension, an association configuration object of a storage dimension, and an association configuration object of a host system dimension.
In some embodiments of the invention, the fault analysis system further comprises: a root cause analysis module, configured to locate the fault cause according to the characteristic index data corresponding to the multi-dimensional associated configuration object of the fault node.
By acquiring the characteristic index data corresponding to the multi-dimensional associated configuration object of the fault node, the embodiment of the invention can collect more comprehensive clues which may cause the fault, and provides a basis for the accuracy of root cause positioning.
In some embodiments of the present invention, locating the fault cause according to the characteristic index data corresponding to the multi-dimensional associated configuration object of the fault node includes: filtering and summarizing change information, alarm information and user access information of the multi-dimensional associated configuration object; acquiring the abnormal degree and abnormal form of the characteristic index data according to the health check information and the anomaly detection information corresponding to the characteristic index data; assigning a weight to the characteristic index data according to its abnormal degree and the degree of similarity between its abnormal form and the abnormal form of the operation index; determining a tracing relationship between the characteristic index data; and outputting the filtered and summarized change information, alarm information and user access information of the multi-dimensional associated configuration object, and recommending the corresponding configuration object and characteristic index as the fault cause according to the weight of the characteristic index data and the tracing relationship.
In the above embodiment of the present invention, the filtered and summarized change information, alarm information and user access information of the multi-dimensional associated configuration object are output, and the corresponding configuration object and characteristic index are recommended as the fault cause according to two factors: the weight derived from the abnormal degree of the characteristic index data and from the similarity between its abnormal form and that of the operation index, and the tracing relationship between the characteristic index data. Possible fault causes can thus be recommended according to probability and association relationship, which achieves standardized and intelligent fault cause output, reduces the root cause location deviation caused by differences in human experience, and makes root cause location standardized and efficient.
According to a third aspect of the present invention, an embodiment of the present invention provides a computer storage medium having computer-readable instructions stored thereon, which, when executed by a processor, cause a computer to perform the following operations: the operations include the steps included in the fault analysis method according to any one of the above embodiments.
According to a fourth aspect of the present invention, the present invention provides a computer device including a memory and a processor, the memory being used for storing one or more computer instructions, wherein the one or more computer instructions, when executed by the processor, can implement the fault analysis method according to any one of the above embodiments.
As can be seen from the above description, the method, the system, the storage medium, and the device for analyzing a fault of a service system according to the embodiments of the present invention obtain service chain information related to an error object when a service operation index is abnormal, and select a fault node from a plurality of nodes obtained according to the service chain set, so that a more accurate fault node can be obtained by analyzing the service chain information, and a basis is provided for accuracy of further fault root cause analysis.
Drawings
FIG. 1 is a flowchart illustrating a method for fault analysis of a business system in accordance with one embodiment of the present invention;
FIG. 2 is a flowchart illustrating a fault analysis method when the abnormal operation index is the service success rate in step S12 of FIG. 1;
FIG. 3 is a flowchart illustrating a fault analysis method when the abnormal operation index is the response time in step S12 of FIG. 1;
FIG. 4 is a detailed flowchart of step S17 in FIG. 1;
FIG. 5 is an architecture diagram of a fault analysis system of a business system according to an embodiment of the present invention.
Detailed Description
Various aspects of the invention are described in detail below with reference to the figures and the detailed description. Well-known modules, units and their interconnections, links, communications or operations with each other are not shown or described in detail. Furthermore, the described features, architectures, or functions can be combined in any manner in one or more implementations. It will be understood by those skilled in the art that the various embodiments described below are illustrative only and are not intended to limit the scope of the present invention. It will also be readily understood that the modules or units or processes of the embodiments described herein and illustrated in the figures can be combined and designed in a wide variety of different configurations.
Fig. 1 is a flowchart illustrating a method for analyzing a fault of a service system according to an embodiment of the present invention.
As shown in fig. 1, in an embodiment of the present invention, the fault analysis method may include: step S11, step S12, step S13, step S14, and step S15, which are specifically described below.
In step S11, an operation index in the service operation process is obtained. In an exemplary embodiment, the operation index may include, but is not limited to, a service success rate, a response time.
In step S12, when the operation index is abnormal, an error object corresponding to the abnormal operation index is acquired.
In step S13, a set of business chains of the business involved in the error object is obtained.
In an optional implementation manner, the service chain set may be obtained by acquiring identification information of the services related to the error object, then acquiring the service chain of each such service according to the identification information and aggregating the chains. Because each service identifier related to the error object is obtained and the service chain of each service is found through its identifier, the aggregated service chain set contains all service chain information and node information related to the error object. Fault analysis of the service system can therefore derive an accurate node range from the related service chain information and obtain a more accurate fault node.
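As a non-limiting illustration, the aggregation described above can be sketched as follows. The record layout (`trace_id`, `seq`, `node`) and the function name `collect_chain_set` are assumptions introduced only for this sketch; the embodiment does not prescribe a data format.

```python
from collections import defaultdict

def collect_chain_set(error_records, spans):
    """Aggregate the service chains of every service related to the error object."""
    # Identification information of the services related to the error object.
    trace_ids = {rec["trace_id"] for rec in error_records}
    # Find the chain of each service through its identifier.
    chains = defaultdict(list)
    for span in spans:
        if span["trace_id"] in trace_ids:
            chains[span["trace_id"]].append(span)
    # Order each chain by its sequence number and return the chain set.
    return {tid: sorted(chain, key=lambda s: s["seq"]) for tid, chain in chains.items()}

# Hypothetical sample data.
error_records = [{"trace_id": "pay-001"}, {"trace_id": "pay-003"}]
spans = [
    {"trace_id": "pay-001", "seq": 2, "node": "db"},
    {"trace_id": "pay-001", "seq": 1, "node": "gateway"},
    {"trace_id": "pay-002", "seq": 1, "node": "gateway"},
    {"trace_id": "pay-003", "seq": 1, "node": "gateway"},
]
chain_set = collect_chain_set(error_records, spans)
```

Only chains whose identifier appears in an error record enter the set, which is what keeps the candidate node range narrow.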
In an exemplary embodiment, obtaining the service chain presupposes that, each time a service is initiated, the initiating end node generates tracking number information containing the service name and a unique identifier, and that each intermediate node, up to the final node, carries the unique identifier and outputs the corresponding service timing sequence information. The tracking number information thereby becomes a globally uniform identifier that marks the service chain corresponding to the service.
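A minimal sketch of this tracking-number propagation is given below, purely for illustration. The `Node` class, the `name:uuid` format of the tracking number and the shared `log` list are hypothetical; the embodiment only requires that the tracking number contain the service name and a unique identifier.

```python
import uuid

def new_tracking_number(service_name):
    # Initiating end node: service name plus a unique identifier (format assumed).
    return f"{service_name}:{uuid.uuid4().hex}"

class Node:
    def __init__(self, name, downstream=None):
        self.name = name
        self.downstream = downstream

    def handle(self, tracking_number, log, seq=1):
        # Every node outputs timing sequence information under the same tracking number.
        log.append({"tracking_number": tracking_number, "seq": seq, "node": self.name})
        if self.downstream is not None:
            # The identifier is carried unchanged to the next node on the chain.
            self.downstream.handle(tracking_number, log, seq + 1)

log = []
entry = Node("gateway", Node("order", Node("db")))
tracking_number = new_tracking_number("pay")
entry.handle(tracking_number, log)
```

Grouping the log entries by `tracking_number` and ordering them by `seq` then reconstructs the service chain of that service.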
In step S14, a plurality of nodes are obtained from the service chain set according to the failure node determination mechanism corresponding to the abnormal operation index.
In step S15, a node, the number of occurrences of which exceeds a predetermined threshold, is selected from the plurality of nodes as a failed node.
By adopting the fault analysis method of the embodiment of the invention, the service chain information related to the error object when the service operation index is abnormal is obtained, and the fault node is selected from the plurality of nodes obtained according to the service chain set, so that more accurate fault node can be obtained by combining the service chain information analysis, and a basis is provided for the accuracy of further fault root cause analysis.
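The selection in step S15 can be illustrated with a short sketch. It assumes that step S14 yields one candidate node name per service chain; the function name `select_fault_nodes` and the sample values are illustrative only.

```python
from collections import Counter

def select_fault_nodes(candidates, threshold):
    """Return nodes whose number of occurrences among the candidates exceeds the threshold."""
    counts = Counter(candidates)
    return sorted(node for node, n in counts.items() if n > threshold)

# Hypothetical candidates, one per service chain in the chain set.
candidates = ["db", "db", "cache", "db", "order"]
fault_nodes = select_fault_nodes(candidates, threshold=2)
```

Here only `db` appears more than twice, so it is the single node reported as faulty.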
In an optional embodiment of the present invention, when the abnormal operation index in step S12 is a service success rate, as shown in fig. 2, the method for analyzing a fault with respect to the abnormal service success rate may specifically include: step S21, step S22, step S23, step S24, and step S25, which are specifically described below.
In step S21, the service success rate is abnormal. In an optional embodiment, the abnormal service success rate may be that the service success rate is not within a predetermined service success rate range, or that a decrease value of the service success rate exceeds a predetermined decrease value.
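The two alternative criteria of step S21 can be sketched as a single check. The default range and maximum decrease below are placeholder values chosen for the example, not values taken from the embodiment.

```python
def success_rate_abnormal(current, previous, allowed_range=(0.99, 1.0), max_drop=0.02):
    """Flag the success rate as abnormal by either criterion of step S21."""
    low, high = allowed_range
    # Criterion 1: the rate is outside the predetermined range.
    out_of_range = not (low <= current <= high)
    # Criterion 2: the decrease since the previous sample exceeds the predetermined value.
    dropped_too_much = (previous - current) > max_drop
    return out_of_range or dropped_too_much
```

A rate of 0.995 following 0.999 passes both checks, whereas a fall from 0.999 to 0.95 is flagged by both.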
In step S22, error code objects whose count has increased by more than a first threshold are acquired as error objects.
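One possible reading of step S22 is sketched below. It assumes that the increase is measured against a baseline count per error code; the embodiment does not state how the baseline is obtained, and all names and values are hypothetical.

```python
def find_error_codes(baseline_counts, current_counts, first_threshold):
    """Return error codes whose count rose by more than first_threshold."""
    error_objects = []
    for code, now in current_counts.items():
        # A code absent from the baseline is treated as having a count of zero.
        increase = now - baseline_counts.get(code, 0)
        if increase > first_threshold:
            error_objects.append(code)
    return sorted(error_objects)

baseline = {"E500": 3, "E404": 10}
current = {"E500": 60, "E404": 12, "E503": 25}
error_objects = find_error_codes(baseline, current, first_threshold=20)
```

`E404` rose by only 2 and is ignored, while `E500` and the newly appearing `E503` become error objects.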
In step S23, a set of business chains of the business involved in the error object is obtained.
In step S24, the last error-reporting node on each service chain in the service chain set is acquired, yielding a plurality of service nodes.
In step S25, a node whose number of occurrences exceeds a predetermined threshold is selected from the plurality of service nodes as the fault node.
Judging the fault node through the service chain information avoids both missing abnormal nodes and collecting too many of them, and allows the faulty node to be identified more accurately. It also reduces the dependence on service experts and the manual workload, enabling standardized, automatic and intelligent fault node determination.
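The judgment mechanism of step S24 can be sketched as follows, assuming each chain is a list of records with a sequence number, a node name and an error flag. This layout and the function name are assumptions made for the example.

```python
def last_error_node(chain):
    """Return the node that reported an error latest on the chain, or None."""
    errors = [span for span in chain if span["error"]]
    if not errors:
        return None
    return max(errors, key=lambda s: s["seq"])["node"]

# Hypothetical chain set: errors propagate back from the node that actually failed.
chain_set = {
    "t1": [
        {"seq": 1, "node": "gateway", "error": True},
        {"seq": 2, "node": "order", "error": True},
        {"seq": 3, "node": "db", "error": False},
    ],
    "t2": [
        {"seq": 1, "node": "gateway", "error": True},
        {"seq": 2, "node": "order", "error": True},
        {"seq": 3, "node": "db", "error": True},
    ],
}
picked = (last_error_node(chain) for chain in chain_set.values())
candidates = [node for node in picked if node is not None]
```

Upstream nodes such as `gateway` report errors only because a downstream call failed, so taking the last error-reporting node per chain keeps them out of the candidate list.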
In an optional embodiment of the present invention, when the abnormal operation index in step S12 is the response time, as shown in FIG. 3, the fault analysis method for an abnormal response time may specifically include: step S31, step S32, step S33, step S34, and step S35, which are specifically described below.
In step S31, the response time is abnormal. In alternative embodiments, the response time anomaly may be that the response time is not within a predetermined range of response times, or that the increase in response time exceeds a predetermined increase value.
In step S32, service objects whose response time has increased by more than a second threshold are acquired as error objects.
In step S33, a set of business chains of the business involved in the error object is obtained.
In step S34, the node with the largest change in processing time on each service chain in the service chain set is acquired, yielding a plurality of service nodes.
In step S35, a node whose number of occurrences exceeds a predetermined threshold is selected from the plurality of service nodes as the fault node.
Similarly, judging the fault node through the service chain information avoids both missing abnormal nodes and collecting too many of them, and allows the faulty node to be identified more accurately. It also reduces the dependence on service experts and the manual workload, enabling standardized, automatic and intelligent fault node determination.
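For the response-time branch, step S34 can be sketched as below. The sketch assumes that a per-node baseline processing time is available for comparison; the embodiment does not specify how the change is measured, and the names and numbers are hypothetical.

```python
def slowest_changing_node(chain, baseline_ms):
    """Return the node whose processing time changed most against its baseline."""
    def change(span):
        # A node without a baseline is compared against zero.
        return span["elapsed_ms"] - baseline_ms.get(span["node"], 0.0)
    return max(chain, key=change)["node"]

baseline_ms = {"gateway": 5.0, "order": 20.0, "db": 30.0}
chain_a = [
    {"node": "gateway", "elapsed_ms": 6.0},
    {"node": "order", "elapsed_ms": 25.0},
    {"node": "db", "elapsed_ms": 480.0},
]
chain_b = [
    {"node": "gateway", "elapsed_ms": 5.0},
    {"node": "order", "elapsed_ms": 200.0},
    {"node": "db", "elapsed_ms": 35.0},
]
candidates = [slowest_changing_node(c, baseline_ms) for c in (chain_a, chain_b)]
```

The resulting candidates are then counted against the predetermined threshold, as in step S35.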
Optionally, the method for analyzing a fault of a service system according to an embodiment of the present invention further includes step S16 and step S17 (as shown by a dashed box in fig. 1): step S16, acquiring a multi-dimensional associated configuration object of a fault node; and step S17, positioning the fault reason according to the characteristic index data corresponding to the multi-dimensional associated configuration object of the fault node. In an exemplary embodiment, the multi-dimensional associated configuration object may include, but is not limited to, one or more of the following: an association configuration object of an application dimension, an association configuration object of a platform dimension, an association configuration object of a network dimension, an association configuration object of a storage dimension, and an association configuration object of a host system dimension.
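The result of step S16 can be pictured as a mapping from a fault node to its configuration objects, grouped by dimension. The sketch below is illustrative only: the association table, object names and lookup function are hypothetical, and in practice the associations would come from whatever configuration repository the system maintains.

```python
from dataclasses import dataclass
from enum import Enum

class Dimension(Enum):
    APPLICATION = "application"
    PLATFORM = "platform"
    NETWORK = "network"
    STORAGE = "storage"
    HOST_SYSTEM = "host_system"

@dataclass(frozen=True)
class ConfigObject:
    name: str
    dimension: Dimension

# Hypothetical association table: fault node -> associated configuration objects.
ASSOCIATIONS = {
    "db": [
        ConfigObject("mysql-primary", Dimension.PLATFORM),
        ConfigObject("switch-7", Dimension.NETWORK),
        ConfigObject("san-vol-3", Dimension.STORAGE),
    ],
}

def associated_objects(fault_node):
    """Group the configuration objects of a fault node by dimension."""
    by_dimension = {}
    for obj in ASSOCIATIONS.get(fault_node, []):
        by_dimension.setdefault(obj.dimension, []).append(obj.name)
    return by_dimension
```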
Fig. 4 is a detailed flowchart of step S17.
As shown in fig. 4, the positioning of the fault cause according to the feature index data corresponding to the multidimensional associated configuration object of the fault node may specifically include: step S71, step S72, step S73, step S74, and step S75, which are specifically described below.
In step S71, change information, alarm information, and user access information of the multi-dimensional associated configuration object are filtered and summarized.
In step S72, the degree and form of abnormality of the feature index data are acquired from the health examination information and the abnormality detection information corresponding to the feature index data.
In step S73, a weight is assigned to the feature index data based on the degree of abnormality of the feature index data and the degree of similarity between the abnormal form of the feature index data and the abnormal form of the operation index.
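The source does not give a formula for the weight in step S73. As one possible reading, the sketch below multiplies the abnormal degree by a similarity score, where similarity is taken to be the Pearson correlation between the feature index series and the operation index series, clipped at zero. Both the product form and the use of Pearson correlation are assumptions, as are all names.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length series; 0.0 if either is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def assign_weight(abnormal_degree, feature_series, operation_series):
    """Weight = abnormal degree x shape similarity (hypothetical formula).

    abnormal_degree is assumed to be normalized to [0, 1].
    Negative correlation is treated as no similarity.
    """
    similarity = max(0.0, pearson(feature_series, operation_series))
    return abnormal_degree * similarity
```

Under this assumption, a feature index that is strongly abnormal and whose curve rises and falls together with the abnormal operation index receives a high weight, while a flat or inversely moving index receives a weight of zero.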
In step S74, a tracing relationship between the feature index data is determined. The tracing relationship can be defined as follows: when two abnormal feature index data items A and B exist and the abnormality of A causes B to become abnormal, a tracing relationship exists between A and B.
In an optional embodiment, the association relationships among the indexes, obtained from manual experience, are compiled into an index relationship tree, so that when abnormal feature index data is obtained, the tracing relationships among the abnormal feature index data are determined according to the index relationship tree.
In step S75, the filtered and summarized change information, alarm information, and user access information of the multi-dimensional associated configuration object are output, and a corresponding configuration object and a corresponding feature index are recommended as a failure cause according to the weight of the feature index data and the tracing relationship. In an optional implementation manner, a predetermined number of configuration objects and feature indexes are selected for recommendation according to the weight ranking.
In an optional embodiment, recommending according to the tracing relationship may include: according to the tracing relationships among the abnormal feature index data, when it is determined that a lower-layer abnormal feature index causes an upper-layer feature index to be abnormal, filtering out the upper-layer abnormal index, thereby screening out the root cause of the fault.
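The filtering of upper-layer abnormal indexes can be sketched as below. The index relationship tree is represented here as a dictionary from a lower-layer index to the upper-layer indexes it can make abnormal; this representation, the index names, and the assumption that the relationships contain no cycles are all illustrative.

```python
def filter_root_causes(abnormal, causes):
    """Drop abnormal indexes that are explained by another abnormal index.

    abnormal: set of abnormal index names.
    causes: dict mapping a lower-layer index to the upper-layer indexes
            it can make abnormal (hypothetical index relationship tree,
            assumed to be acyclic).
    """
    explained = set()
    for lower, uppers in causes.items():
        if lower in abnormal:
            # An upper-layer anomaly caused by an abnormal lower layer is not a root cause.
            explained.update(u for u in uppers if u in abnormal)
    return sorted(abnormal - explained)

causes = {
    "disk_io_wait": ["db_query_time"],
    "db_query_time": ["app_response_time"],
}
abnormal = {"disk_io_wait", "db_query_time", "app_response_time"}
print(filter_root_causes(abnormal, causes))  # ['disk_io_wait']
```

If the lowest-layer index were not abnormal, the next layer up would remain as the recommended candidate.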
With this method, the likely causes of the fault can be recommended according to probability and association relationships, so that fault cause output is standardized and automated, root cause positioning deviations caused by differences in personnel experience are reduced, and efficient root cause positioning is achieved.
Fig. 5 is an architecture diagram of a fault analysis system of a business system according to an embodiment of the present invention.
As shown in fig. 5, the fault analysis system includes:
An operation index obtaining module 510, configured to obtain an operation index in a service operation process. In an exemplary embodiment, the operation index may include, but is not limited to, a service success rate and a response time.
An error object obtaining module 520, configured to obtain, when the operation index is abnormal, an error object corresponding to the abnormal operation index.
In an optional implementation manner, when the service success rate is abnormal, error code objects whose number has increased significantly are acquired as error objects; and when the response time is abnormal, business objects whose response time has increased significantly are acquired as error objects. Illustratively, when the service success rate is abnormal, error code objects whose increase in number exceeds a first threshold are acquired as error objects; and when the response time is abnormal, service objects whose increase in response time exceeds a second threshold are acquired as error objects.
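The two threshold rules can be sketched as follows. The dictionaries of baseline and current values, the function names, and the sample error codes are assumptions for illustration; the source only states that an increase beyond the first or second threshold qualifies an object as an error object.

```python
def error_code_objects(baseline_counts, current_counts, first_threshold):
    """Error codes whose count increased by more than first_threshold."""
    return sorted(
        code for code, current in current_counts.items()
        if current - baseline_counts.get(code, 0) > first_threshold
    )

def slow_business_objects(baseline_ms, current_ms, second_threshold):
    """Business objects whose response time increased by more than second_threshold."""
    return sorted(
        name for name, current in current_ms.items()
        if current - baseline_ms.get(name, 0.0) > second_threshold
    )

# Hypothetical data: counts of each error code and response times per business object.
codes = error_code_objects({"E500": 3, "E404": 10}, {"E500": 120, "E404": 12, "E503": 40}, 30)
slow = slow_business_objects({"transfer": 80.0, "login": 50.0}, {"transfer": 900.0, "login": 55.0}, 200.0)
print(codes, slow)  # ['E500', 'E503'] ['transfer']
```

An error code or business object that has no baseline entry is treated here as starting from zero, which is a design choice of this sketch rather than something the source prescribes.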
A service chain set obtaining module 530, configured to obtain a service chain set of a service related to the error object.
In an optional implementation manner, the service chain set may be obtained by acquiring identification information of the services related to the error object, acquiring the service chains of those services according to the identification information, and aggregating them. That is, each service identifier related to the error object is acquired, the service chains of each service are found through the matching service identifier, and the results are summarized into the service chain set corresponding to the error object. In this way, all service chain information and node information related to the error object is obtained, so that fault analysis of the service system can be based on the relevant service chain information, yielding an accurate node range and a more accurate fault node.
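The aggregation by service identifier can be sketched as below. The flat record layout (service identifier, chain identifier, position on the chain, node name) is an assumption about how call records might be stored; the source does not describe the storage format.

```python
from collections import defaultdict

def build_chain_set(service_ids, spans):
    """Collect the service chains of the services related to an error object.

    service_ids: identifiers of the services related to the error object.
    spans: iterable of (service_id, chain_id, position, node) records
           (hypothetical layout).
    Returns a list of chains, each an ordered list of node names.
    """
    wanted = set(service_ids)
    grouped = defaultdict(list)
    for service_id, chain_id, position, node in spans:
        if service_id in wanted:
            grouped[(service_id, chain_id)].append((position, node))
    # Order the hops of each chain by position, and the chains by their key.
    return [[node for _, node in sorted(hops)]
            for _, hops in sorted(grouped.items())]

spans = [
    ("transfer", "c1", 1, "core"),
    ("transfer", "c1", 0, "gateway"),
    ("login", "c2", 0, "gateway"),
    ("transfer", "c3", 0, "gateway"),
    ("transfer", "c3", 1, "db"),
]
print(build_chain_set(["transfer"], spans))  # [['gateway', 'core'], ['gateway', 'db']]
```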
A node obtaining module 540, configured to obtain multiple nodes in the service chain set according to a failure node judgment mechanism corresponding to the abnormal operation index.
In an optional embodiment, when the service success rate is abnormal, the last error-reporting node on each service chain in the service chain set is used as one of the plurality of nodes; and when the response time is abnormal, taking the node with the largest processing time change on each service chain in the service chain set as one of the plurality of nodes.
A failed node obtaining module 550, configured to select, as a failed node, a node whose occurrence frequency exceeds a predetermined threshold from the multiple nodes.
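The cooperation of modules 540 and 550 can be sketched as a dispatch on the abnormal operation index followed by an occurrence count. The chain layout (node name, error flag, processing time change), the dictionary of mechanisms, and the reading of "last error-reporting node" as the last node in call order that reported an error are assumptions for illustration.

```python
from collections import Counter

def last_error_node(chain):
    """Last node on the chain (in call order) that reported an error, or None."""
    errors = [node for node, has_error, _ in chain if has_error]
    return errors[-1] if errors else None

def largest_time_change_node(chain):
    """Node with the largest processing-time change on the chain, or None."""
    return max(chain, key=lambda hop: hop[2])[0] if chain else None

# Hypothetical mapping from abnormal operation index to judgment mechanism.
MECHANISMS = {
    "success_rate": last_error_node,
    "response_time": largest_time_change_node,
}

def failed_nodes(chain_set, abnormal_indicator, threshold):
    """Pick one node per chain, then keep nodes occurring more than threshold times."""
    pick = MECHANISMS[abnormal_indicator]
    picked = [n for n in (pick(chain) for chain in chain_set) if n is not None]
    counts = Counter(picked)
    return sorted(node for node, n in counts.items() if n > threshold)

chain_set = [
    [("gateway", True, 1), ("order", True, 2), ("db", True, 3)],
    [("gateway", True, 1), ("db", True, 5)],
    [("gateway", False, 0), ("cache", True, 9)],
]
print(failed_nodes(chain_set, "success_rate", threshold=1))  # ['db']
```

Keeping the mechanisms in a lookup table makes it straightforward to add a judgment mechanism for a further operation index without changing the counting step.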
An associated configuration object obtaining module 560, configured to obtain a multidimensional associated configuration object of the failed node. In an exemplary embodiment, the multi-dimensional associated configuration object may include, but is not limited to, one or more of the following: an association configuration object of an application dimension, an association configuration object of a platform dimension, an association configuration object of a network dimension, an association configuration object of a storage dimension, and an association configuration object of a host system dimension.
A root cause analysis module 570, configured to locate a fault cause according to the feature index data corresponding to the multidimensional associated configuration object of the fault node.
In an optional embodiment, the locating a fault cause according to feature index data corresponding to the multidimensional associated configuration object of the fault node includes: filtering and summarizing change information, alarm information and user access information of the multi-dimensional associated configuration object; acquiring the abnormal degree and abnormal form of the characteristic index data according to the health examination information and the abnormal detection information corresponding to the characteristic index data; according to the abnormal degree of the characteristic index data and the similarity degree between the abnormal form of the characteristic index data and the abnormal form of the operation index, distributing weight to the characteristic index data; determining a retrospective relationship between the characteristic index data; and outputting the change information, the alarm information and the user access information of the filtered and summarized multi-dimensional associated configuration object, and recommending the corresponding configuration object and the corresponding characteristic index as fault reasons according to the weight of the characteristic index data and the tracing relation.
With the fault analysis system of the embodiment of the invention, the service chain information related to the error object is obtained when a service operation index is abnormal, and the fault node is selected from the plurality of nodes obtained from the service chain set. Combining the service chain information in the analysis yields a more accurate fault node and provides a sound basis for further root cause analysis. In addition, possible fault causes are recommended according to probability and association relationships, so that fault cause output is standardized and automated, root cause positioning deviations caused by differences in personnel experience are reduced, and efficient root cause positioning is achieved.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present invention can be implemented by software combined with a hardware platform. With this understanding, all or the part of the technical solutions of the present invention that contributes over the background art can be embodied in the form of a software product, which can be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, and which includes instructions for causing a computer device (which can be a personal computer, a server, a network device, etc.) to execute the methods described in the embodiments or in some parts of the embodiments.
Correspondingly, an embodiment of the invention also provides a computer-readable storage medium on which computer-readable instructions or a program are stored. When executed by a processor, the instructions or program cause a computer to perform the steps of the fault analysis method according to any of the above embodiments, which are not repeated here. The storage medium may include, for example, optical disks, hard disks, floppy disks, flash memory, or magnetic tape.
In addition, the embodiment of the present invention further provides a computer device including a memory and a processor, where the memory is used for storing one or more computer instructions or programs, and when the one or more computer instructions or programs are executed by the processor, the fault analysis method according to any one of the above embodiments can be implemented. The computer device may be, for example, a server, a desktop computer, a notebook computer, a tablet computer, or the like.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may be modified or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention. Therefore, the protection scope of the present invention should be subject to the claims.
Claims (24)
1. A fault analysis method of a business system is characterized by comprising the following steps:
acquiring an operation index in a service operation process;
when the operation index is abnormal, acquiring an error object corresponding to the abnormal operation index;
acquiring a service chain set of a service related to the error object;
acquiring a plurality of nodes from the service chain set according to a fault node judgment mechanism corresponding to the abnormal operation index;
and selecting the nodes with the occurrence frequency exceeding a preset threshold value from the plurality of nodes as fault nodes.
2. The fault analysis method of claim 1, wherein the operational indicators comprise: service success rate and response time.
3. The method of claim 1, wherein obtaining a set of business chains for the business involved in the erroneous object comprises:
acquiring identification information of a service related to the error object;
and acquiring the service chain of the service related to the error object according to the identification information, and summarizing to obtain the service chain set.
4. The fault analysis method according to claim 2, wherein when the operation index is abnormal, acquiring an error object corresponding to the abnormal operation index comprises:
and when the service success rate is abnormal, acquiring error code objects of which the number is increased to exceed a first threshold value as error objects.
5. The method of claim 4, wherein obtaining the plurality of nodes in the service chain set according to the failure node determination mechanism corresponding to the abnormal operation index comprises:
and taking the last error reporting node on each service chain in the service chain set as one of the plurality of nodes.
6. The method of analyzing a fault according to claim 4, wherein when the operation index is abnormal, acquiring an error object corresponding to the abnormal operation index further comprises:
and when the response time is abnormal, acquiring the service object with the response time increasing beyond a second threshold value as an error object.
7. The method of claim 6, wherein obtaining the plurality of nodes in the service chain set according to the failure node determination mechanism corresponding to the abnormal operation index further comprises:
and taking the node with the largest processing time change on each service chain in the service chain set as one of the plurality of nodes.
8. The fault analysis method according to any one of claims 1 to 7, wherein the faulty node diagnosis method further comprises:
and acquiring a multi-dimensional associated configuration object of the fault node.
9. The fault analysis method of claim 8, wherein the multi-dimensional associated configuration object comprises at least one of: an association configuration object of an application dimension, an association configuration object of a platform dimension, an association configuration object of a network dimension, an association configuration object of a storage dimension, and an association configuration object of a host system dimension.
10. The fault analysis method of claim 9, wherein the faulty node diagnostic method further comprises:
and positioning the fault reason according to the characteristic index data corresponding to the multi-dimensional associated configuration object of the fault node.
11. The fault analysis method according to claim 10, wherein locating the cause of the fault according to the feature index data corresponding to the multidimensional associated configuration object of the fault node comprises:
filtering and summarizing change information, alarm information and user access information of the multi-dimensional associated configuration object;
acquiring the abnormal degree and abnormal form of the characteristic index data according to the health examination information and the abnormal detection information corresponding to the characteristic index data;
according to the abnormal degree of the characteristic index data and the similarity degree between the abnormal form of the characteristic index data and the abnormal form of the operation index, distributing weight to the characteristic index data;
determining a retrospective relationship between the characteristic index data;
and outputting the change information, the alarm information and the user access information of the filtered and summarized multi-dimensional associated configuration object, and recommending the corresponding configuration object and the corresponding characteristic index as fault reasons according to the weight of the characteristic index data and the tracing relation.
12. A fault analysis system for a business system, the fault analysis system comprising:
the operation index acquisition module is used for acquiring operation indexes in the service operation process;
the error object acquisition module is used for acquiring an error object corresponding to the abnormal operation index when the operation index is abnormal;
a service chain set obtaining module, configured to obtain a service chain set of a service related to the error object;
the node acquisition module is used for acquiring a plurality of nodes in the service chain set according to a fault node judgment mechanism corresponding to the abnormal operation index;
and the fault node acquisition module is used for selecting the node with the occurrence frequency exceeding a preset threshold value from the plurality of nodes as the fault node.
13. The fault analysis system of claim 12, wherein the operational indicators comprise: service success rate and response time.
14. The fault analysis system of claim 12, wherein obtaining a set of business chains for the business involved in the erroneous object comprises:
acquiring identification information of a service related to the error object;
and acquiring the service chain of the service related to the error object according to the identification information, and summarizing to obtain the service chain set.
15. The fault analysis system according to claim 13, wherein when the operation index is abnormal, acquiring an error object corresponding to the abnormal operation index comprises:
and when the service success rate is abnormal, acquiring error code objects of which the number is increased to exceed a first threshold value as error objects.
16. The fault analysis system according to claim 15, wherein obtaining the plurality of nodes in the service chain set according to the fault node determination mechanism corresponding to the abnormal operation index comprises:
and taking the last error reporting node on each service chain in the service chain set as one of the plurality of nodes.
17. The fault analysis system of claim 15, wherein when the operation index is abnormal, obtaining an error object corresponding to the abnormal operation index further comprises:
and when the response time is abnormal, acquiring the service object with the response time increasing beyond a second threshold value as an error object.
18. The system according to claim 17, wherein obtaining the plurality of nodes in the service chain set according to the failure node determination mechanism corresponding to the abnormal operation index further comprises:
and taking the node with the largest processing time change on each service chain in the service chain set as one of the plurality of nodes.
19. The fault analysis system according to any one of claims 12-18, wherein the faulty node diagnostic system further comprises:
and the associated configuration object acquisition module is used for acquiring the multi-dimensional associated configuration object of the fault node.
20. The fault analysis system of claim 19, wherein the multi-dimensional, associated configuration object comprises at least one of: an association configuration object of an application dimension, an association configuration object of a platform dimension, an association configuration object of a network dimension, an association configuration object of a storage dimension, and an association configuration object of a host system dimension.
21. The fault analysis system of claim 20, wherein the faulty node diagnostic system further comprises:
and the root cause analysis module is used for positioning the fault cause according to the characteristic index data corresponding to the multi-dimensional associated configuration object of the fault node.
22. The fault analysis system of claim 21, wherein locating a fault cause according to the feature metric data corresponding to the multidimensional associated configuration object of the fault node comprises:
filtering and summarizing change information, alarm information and user access information of the multi-dimensional associated configuration object;
acquiring the abnormal degree and abnormal form of the characteristic index data according to the health examination information and the abnormal detection information corresponding to the characteristic index data;
according to the abnormal degree of the characteristic index data and the similarity degree between the abnormal form of the characteristic index data and the abnormal form of the operation index, distributing weight to the characteristic index data;
determining a retrospective relationship between the characteristic index data;
and outputting the change information, the alarm information and the user access information of the filtered and summarized multi-dimensional associated configuration object, and recommending the corresponding configuration object and the corresponding characteristic index as fault reasons according to the weight of the characteristic index data and the tracing relation.
23. A computer storage medium storing computer software instructions for execution by a processor to implement the fault analysis method of any one of claims 1-11.
24. A computer device comprising a memory and a processor;
wherein the memory is to store one or more computer instructions that are executed by the processor to implement the fault analysis method of any of claims 1-11.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010448356.7A CN111722952A (en) | 2020-05-25 | 2020-05-25 | Fault analysis method, system, equipment and storage medium of business system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111722952A true CN111722952A (en) | 2020-09-29 |
Family
ID=72565006
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010448356.7A Pending CN111722952A (en) | 2020-05-25 | 2020-05-25 | Fault analysis method, system, equipment and storage medium of business system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111722952A (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112052151A (en) * | 2020-10-09 | 2020-12-08 | 腾讯科技(深圳)有限公司 | Fault root cause analysis method, device, equipment and storage medium |
CN112308455A (en) * | 2020-11-20 | 2021-02-02 | 深圳前海微众银行股份有限公司 | Root cause positioning method, device, equipment and computer storage medium |
CN112579402A (en) * | 2020-12-14 | 2021-03-30 | 中国建设银行股份有限公司 | Method and device for positioning faults of application system |
CN112698975A (en) * | 2020-12-14 | 2021-04-23 | 北京大学 | Fault root cause positioning method and system of micro-service architecture information system |
CN112860508A (en) * | 2021-01-13 | 2021-05-28 | 支付宝(杭州)信息技术有限公司 | Abnormal positioning method, device and equipment based on knowledge graph |
CN113271224A (en) * | 2021-05-17 | 2021-08-17 | 中国邮政储蓄银行股份有限公司 | Node positioning method and device, storage medium and electronic device |
CN113269648A (en) * | 2021-06-10 | 2021-08-17 | 中国建设银行股份有限公司 | Fault node positioning method and device, storage medium and electronic equipment |
CN113570084A (en) * | 2021-07-29 | 2021-10-29 | 重庆允成互联网科技有限公司 | Method and system for generating fault analysis report based on equipment maintenance |
CN114498587A (en) * | 2022-03-25 | 2022-05-13 | 中国工商银行股份有限公司 | Fault service positioning method, system and device, data processor and related products |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110191630A1 (en) * | 2010-01-29 | 2011-08-04 | International Business Machines Corporation | Diagnosing a fault incident in a data center |
CN106779505A (en) * | 2017-02-28 | 2017-05-31 | 中国南方电网有限责任公司 | A kind of transmission line malfunction method for early warning driven based on big data and system |
CN108009040A (en) * | 2017-12-12 | 2018-05-08 | 杭州时趣信息技术有限公司 | A kind of definite failure root because method, system and computer-readable recording medium |
CN108833184A (en) * | 2018-06-29 | 2018-11-16 | 腾讯科技(深圳)有限公司 | Service fault localization method, device, computer equipment and storage medium |
CN108900353A (en) * | 2018-07-18 | 2018-11-27 | 平安科技(深圳)有限公司 | Fault alarming method and terminal device |
CN110166264A (en) * | 2018-02-11 | 2019-08-23 | 北京三快在线科技有限公司 | A kind of Fault Locating Method, device and electronic equipment |
CN110493042A (en) * | 2019-08-16 | 2019-11-22 | 中国联合网络通信集团有限公司 | Method for diagnosing faults, device and server |
CN110955575A (en) * | 2019-11-14 | 2020-04-03 | 国网浙江省电力有限公司信息通信分公司 | Business system fault positioning method based on correlation analysis model |
CN110995468A (en) * | 2019-11-13 | 2020-04-10 | 上海钧正网络科技有限公司 | System fault processing method, device, equipment and storage medium of system to be analyzed |
CN111190876A (en) * | 2019-12-31 | 2020-05-22 | 天津浪淘科技股份有限公司 | Log management system and operation method thereof |
-
2020
- 2020-05-25 CN CN202010448356.7A patent/CN111722952A/en active Pending
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112052151A (en) * | 2020-10-09 | 2020-12-08 | 腾讯科技(深圳)有限公司 | Fault root cause analysis method, device, equipment and storage medium |
CN112052151B (en) * | 2020-10-09 | 2022-02-18 | 腾讯科技(深圳)有限公司 | Fault root cause analysis method, device, equipment and storage medium |
CN112308455A (en) * | 2020-11-20 | 2021-02-02 | 深圳前海微众银行股份有限公司 | Root cause positioning method, device, equipment and computer storage medium |
CN112308455B (en) * | 2020-11-20 | 2024-04-09 | 深圳前海微众银行股份有限公司 | Root cause positioning method, root cause positioning device, root cause positioning equipment and computer storage medium |
CN112579402A (en) * | 2020-12-14 | 2021-03-30 | 中国建设银行股份有限公司 | Method and device for positioning faults of application system |
CN112698975A (en) * | 2020-12-14 | 2021-04-23 | 北京大学 | Fault root cause positioning method and system of micro-service architecture information system |
CN112579402B (en) * | 2020-12-14 | 2024-08-30 | 中国建设银行股份有限公司 | Method and device for positioning faults of application system |
CN112698975B (en) * | 2020-12-14 | 2022-09-27 | 北京大学 | Fault root cause positioning method and system of micro-service architecture information system |
CN112860508B (en) * | 2021-01-13 | 2023-02-28 | 支付宝(杭州)信息技术有限公司 | Abnormal positioning method, device and equipment based on knowledge graph |
CN112860508A (en) * | 2021-01-13 | 2021-05-28 | 支付宝(杭州)信息技术有限公司 | Abnormal positioning method, device and equipment based on knowledge graph |
CN113271224A (en) * | 2021-05-17 | 2021-08-17 | 中国邮政储蓄银行股份有限公司 | Node positioning method and device, storage medium and electronic device |
CN113269648A (en) * | 2021-06-10 | 2021-08-17 | 中国建设银行股份有限公司 | Fault node positioning method and device, storage medium and electronic equipment |
CN113570084B (en) * | 2021-07-29 | 2023-12-29 | 重庆允丰科技有限公司 | Method and system for generating fault analysis report based on equipment maintenance |
CN113570084A (en) * | 2021-07-29 | 2021-10-29 | 重庆允成互联网科技有限公司 | Method and system for generating fault analysis report based on equipment maintenance |
CN114498587A (en) * | 2022-03-25 | 2022-05-13 | 中国工商银行股份有限公司 | Fault service positioning method, system and device, data processor and related products |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111722952A (en) | Fault analysis method, system, equipment and storage medium of business system | |
CN111064614B (en) | Fault root cause positioning method, device, equipment and storage medium | |
US9672085B2 (en) | Adaptive fault diagnosis | |
WO2018103453A1 (en) | Network detection method and apparatus | |
WO2021179574A1 (en) | Root cause localization method, device, computer apparatus, and storage medium | |
CN107124289B (en) | Weblog time alignment method, device and host | |
EP3663919B1 (en) | System and method of automated fault correction in a network environment | |
CN112769605B (en) | Heterogeneous multi-cloud operation and maintenance management method and hybrid cloud platform | |
US9524223B2 (en) | Performance metrics of a computer system | |
CN110955550A (en) | Cloud platform fault positioning method, device, equipment and storage medium | |
CN113010389A (en) | Training method, fault prediction method, related device and equipment | |
US20240272975A1 (en) | Method and system for upgrading cpe firmware | |
CN107391335B (en) | Method and equipment for checking health state of cluster | |
CN112751711B (en) | Alarm information processing method and device, storage medium and electronic equipment | |
CN111984442A (en) | Method and device for detecting abnormality of computer cluster system, and storage medium | |
CN115514619B (en) | Alarm convergence method and system | |
CN113392893B (en) | Method, device, storage medium and computer program product for locating business fault | |
CN112699007A (en) | Method, system, network device and storage medium for monitoring machine performance | |
WO2019041870A1 (en) | Method, device, and storage medium for locating failure cause | |
CN115766402B (en) | Method and device for filtering server fault root cause, storage medium and electronic device | |
CN111913824B (en) | Method for determining data link fault cause and related equipment | |
CN115576738A (en) | Method and system for realizing equipment fault determination based on chip analysis | |
CN117170915A (en) | Data center equipment fault prediction method and device and computer equipment | |
CN110489260B (en) | Fault identification method and device and BMC | |
CN112416896A (en) | Data abnormity warning method and device, storage medium and electronic device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||