CN111722952A - Fault analysis method, system, equipment and storage medium of business system - Google Patents

Info

Publication number
CN111722952A
Authority
CN
China
Prior art keywords
abnormal
fault
service
node
acquiring
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010448356.7A
Other languages
Chinese (zh)
Inventor
段国强
郝丽萍
王艳华
谢朝杰
李世宁
杜旭
范宏伟
王欣
张明
王士强
李琪
韩广乐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Construction Bank Corp
Original Assignee
China Construction Bank Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Construction Bank Corp
Priority to CN202010448356.7A
Publication of CN111722952A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0751 Error or fault detection not based on redundancy
    • G06F11/0754 Error or fault detection not based on redundancy by exceeding limits
    • G06F11/076 Error or fault detection not based on redundancy by exceeding limits by exceeding a count or rate limit, e.g. word- or bit count limit
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Debugging And Monitoring (AREA)
  • Test And Diagnosis Of Digital Computers (AREA)

Abstract

The present disclosure provides a method, system, device and storage medium for failure analysis of a business system. The method comprises the following steps: acquiring an operation index in a service operation process; when the operation index is abnormal, acquiring an error object corresponding to the abnormal operation index; acquiring a service chain set of a service related to the error object; acquiring a plurality of nodes from the service chain set according to a fault node judgment mechanism corresponding to the abnormal operation index; and selecting the nodes with the occurrence frequency exceeding a preset threshold value from the plurality of nodes as fault nodes. According to the fault analysis method provided by the disclosure, when the service operation index is abnormal, the service chain set of the service related to the error object is obtained, and then the fault node is obtained in the service chain set, so that the range of the fault node can be judged based on the service chain information, and the fault generation reason can be positioned more accurately.

Description

Fault analysis method, system, equipment and storage medium of business system
Technical Field
The present invention relates to the field of computer application technologies, and in particular, to a method, a system, a device, and a storage medium for analyzing a fault of a business system.
Background
A service system usually operates through service chains, each of which is completed jointly by a plurality of nodes combined in order. When the operation of the service system is abnormal, the one or more nodes on the service chain that cause the abnormality need to be found quickly and accurately, so that the fault can be eliminated and the normal operation of the service function can be recovered.
In the prior art, a service expert usually needs to perform analysis and judgment according to operation and maintenance experience of the service expert to determine a service abnormal node when a service system operates abnormally.
Besides the above manual participation method, the range of the service system associated node (the nodes are not necessarily on the service chain) can be determined as much as possible, then the operation data of the nodes are collected, and the specific change of the operation data is analyzed through a specific algorithm, so as to judge whether the service system associated node is an abnormal node.
Further, for the location of the failure root cause of the service abnormal node, the root cause of the service failure is mainly determined according to the alarm information of the abnormal/failed node or the abnormal change information of the operation index.
However, the above prior art cannot accurately obtain the association relationship between nodes and narrow the range of the failed node, so that the accuracy of locating the cause of the failure according to the abnormal node is low, and the goal of quickly locating the failed node cannot be achieved due to the large calculation workload of collecting the system associated node data. Secondly, without establishing the association relationship of the configuration objects, the multidimensional association configuration objects of the node, including platform layer, network layer and device layer configuration objects, cannot be automatically obtained, and thus complete root cause analysis work cannot be performed. Meanwhile, for the root cause positioning of the fault, the root cause is positioned only according to the alarm information of the abnormal node or the abnormal change information of the operation index, so that the root cause of the fault cannot be comprehensively analyzed, and the accurate root cause positioning cannot be realized.
Disclosure of Invention
To solve some or all of the above problems in the prior art, the invention provides a fault analysis method, system, device and storage medium of a service system, which can determine the range of fault nodes based on service chain information and thereby locate the cause of a fault more accurately.
According to a first aspect of the present invention, an embodiment of the present invention provides a method for analyzing a fault of a service system, including: acquiring an operation index in a service operation process; when the operation index is abnormal, acquiring an error object corresponding to the abnormal operation index; acquiring a service chain set of a service related to the error object; acquiring a plurality of nodes from the service chain set according to a fault node judgment mechanism corresponding to the abnormal operation index; and selecting the nodes with the occurrence frequency exceeding a preset threshold value from the plurality of nodes as fault nodes.
In the above embodiment of the present invention, by acquiring the service chain information related to the error object when the service operation index is abnormal, and selecting the fault node from the plurality of nodes acquired according to the service chain set, a more accurate fault node can be acquired by combining with the service chain information analysis, thereby providing a basis for the subsequent fault root cause analysis.
In some embodiments of the invention, the operation index comprises: service success rate and response time.
In some embodiments of the present invention, obtaining the set of service chains of the service related to the error object includes: acquiring identification information of a service related to the error object; and acquiring the service chain of the service related to the error object according to the identification information, and summarizing to obtain the service chain set.
In the above embodiment of the present invention, each service identifier related to an error object is obtained, and then the service chain of the service is found through the same service identifier, and finally the service chain set corresponding to the error object is obtained by summarizing, and all service chain information and node information associated with the error object can be obtained, so that the fault analysis of the service system can obtain an accurate node range based on the associated service chain information, and further obtain a more accurate fault node.
In some embodiments of the present invention, when the operation index is abnormal, acquiring an error object corresponding to the abnormal operation index includes: when the service success rate is abnormal, acquiring, as error objects, the error code objects whose number has increased beyond a first threshold value.
In some embodiments of the present invention, acquiring a plurality of nodes from the service chain set according to a fault node judgment mechanism corresponding to the abnormal operation index includes: taking the last error-reporting node on each service chain in the service chain set as one of the plurality of nodes.
In some embodiments of the present invention, when the operation index is abnormal, acquiring an error object corresponding to the abnormal operation index further includes: when the response time is abnormal, acquiring, as an error object, the service object whose response time has increased beyond a second threshold value.
In some embodiments of the present invention, acquiring the plurality of nodes from the service chain set according to the fault node judgment mechanism corresponding to the abnormal operation index further includes: taking the node with the largest change in processing time on each service chain in the service chain set as one of the plurality of nodes.
In some embodiments of the present invention, the fault analysis method further comprises: acquiring a multi-dimensional associated configuration object of the fault node.
In some embodiments of the invention, the multi-dimensional associated configuration object comprises at least one of: an association configuration object of an application dimension, an association configuration object of a platform dimension, an association configuration object of a network dimension, an association configuration object of a storage dimension, and an association configuration object of a host system dimension.
In some embodiments of the present invention, the fault analysis method further comprises: locating the fault cause according to the characteristic index data corresponding to the multi-dimensional associated configuration object of the fault node.
By acquiring the characteristic index data corresponding to the multi-dimensional associated configuration object of the fault node, the embodiment of the invention can collect more comprehensive clues which may cause the fault, and provides a basis for the accuracy of root cause positioning.
In some embodiments of the present invention, locating a fault cause according to characteristic index data corresponding to the multi-dimensional associated configuration object of the fault node includes: filtering and summarizing change information, alarm information and user access information of the multi-dimensional associated configuration object; acquiring the abnormal degree and abnormal form of the characteristic index data according to the health check information and the anomaly detection information corresponding to the characteristic index data; assigning a weight to the characteristic index data according to its abnormal degree and the degree of similarity between its abnormal form and the abnormal form of the operation index; determining a tracing relationship between the characteristic index data; and outputting the filtered and summarized change information, alarm information and user access information of the multi-dimensional associated configuration object, and recommending the corresponding configuration object and the corresponding characteristic index as fault causes according to the weight of the characteristic index data and the tracing relationship.
In the above embodiment of the present invention, the filtered and summarized change information, alarm information and user access information of the multi-dimensional associated configuration object are output, and the corresponding configuration object and characteristic index are recommended as the fault cause according to the weight of the characteristic index data (obtained from its abnormal degree and from the degree of similarity between its abnormal form and that of the operation index) and the tracing relationship between the characteristic index data. Possible fault causes can thus be recommended according to probability and association relationship, which realizes standardized and intelligent fault cause output, reduces the root cause positioning deviation caused by differences in human experience, and makes root cause positioning standardized and efficient.
According to a second aspect of the present invention, an embodiment of the present invention provides a fault analysis system for a business system, including: the operation index acquisition module is used for acquiring operation indexes in the service operation process; the error object acquisition module is used for acquiring an error object corresponding to the abnormal operation index when the operation index is abnormal; a service chain set obtaining module, configured to obtain a service chain set of a service related to the error object; the node acquisition module is used for acquiring a plurality of nodes in the service chain set according to a fault node judgment mechanism corresponding to the abnormal operation index; and the fault node acquisition module is used for selecting the node with the occurrence frequency exceeding a preset threshold value from the plurality of nodes as the fault node.
In the above embodiment of the present invention, by acquiring the service chain information related to the error object when the service operation index is abnormal, and selecting the fault node from the plurality of nodes acquired according to the service chain set, a more accurate fault node can be acquired by combining with the service chain information analysis, thereby providing a basis for the subsequent fault root cause analysis.
In some embodiments of the invention, the operation index comprises: service success rate and response time.
In some embodiments of the present invention, obtaining the set of service chains of the service related to the error object includes: acquiring identification information of a service related to the error object; and acquiring the service chain of the service related to the error object according to the identification information, and summarizing to obtain the service chain set.
In the above embodiment of the present invention, each service identifier related to an error object is obtained, and then the service chain of the service is found through the same service identifier, and finally the service chain set corresponding to the error object is obtained by summarizing, and all service chain information and node information associated with the error object can be obtained, so that the fault analysis of the service system can obtain an accurate node range based on the associated service chain information, and further obtain a more accurate fault node.
In some embodiments of the present invention, when the operation index is abnormal, acquiring an error object corresponding to the abnormal operation index includes: when the service success rate is abnormal, acquiring, as error objects, the error code objects whose number has increased beyond a first threshold value.
In some embodiments of the present invention, acquiring a plurality of nodes from the service chain set according to a fault node judgment mechanism corresponding to the abnormal operation index includes: taking the last error-reporting node on each service chain in the service chain set as one of the plurality of nodes.
In some embodiments of the present invention, when the operation index is abnormal, acquiring an error object corresponding to the abnormal operation index further includes: when the response time is abnormal, acquiring, as an error object, the service object whose response time has increased beyond a second threshold value.
In some embodiments of the present invention, acquiring the plurality of nodes from the service chain set according to the fault node judgment mechanism corresponding to the abnormal operation index further includes: taking the node with the largest change in processing time on each service chain in the service chain set as one of the plurality of nodes.
In some embodiments of the invention, the fault analysis system further comprises: an associated configuration object acquisition module, used for acquiring the multi-dimensional associated configuration object of the fault node.
In some embodiments of the invention, the multi-dimensional associated configuration object comprises at least one of: an association configuration object of an application dimension, an association configuration object of a platform dimension, an association configuration object of a network dimension, an association configuration object of a storage dimension, and an association configuration object of a host system dimension.
In some embodiments of the invention, the fault analysis system further comprises: a root cause analysis module, used for locating the fault cause according to the characteristic index data corresponding to the multi-dimensional associated configuration object of the fault node.
By acquiring the characteristic index data corresponding to the multi-dimensional associated configuration object of the fault node, the embodiment of the invention can collect more comprehensive clues which may cause the fault, and provides a basis for the accuracy of root cause positioning.
In some embodiments of the present invention, locating a fault cause according to characteristic index data corresponding to the multi-dimensional associated configuration object of the fault node includes: filtering and summarizing change information, alarm information and user access information of the multi-dimensional associated configuration object; acquiring the abnormal degree and abnormal form of the characteristic index data according to the health check information and the anomaly detection information corresponding to the characteristic index data; assigning a weight to the characteristic index data according to its abnormal degree and the degree of similarity between its abnormal form and the abnormal form of the operation index; determining a tracing relationship between the characteristic index data; and outputting the filtered and summarized change information, alarm information and user access information of the multi-dimensional associated configuration object, and recommending the corresponding configuration object and the corresponding characteristic index as fault causes according to the weight of the characteristic index data and the tracing relationship.
In the above embodiment of the present invention, the filtered and summarized change information, alarm information and user access information of the multi-dimensional associated configuration object are output, and the corresponding configuration object and characteristic index are recommended as the fault cause according to the weight of the characteristic index data (obtained from its abnormal degree and from the degree of similarity between its abnormal form and that of the operation index) and the tracing relationship between the characteristic index data. Possible fault causes can thus be recommended according to probability and association relationship, which realizes standardized and intelligent fault cause output, reduces the root cause positioning deviation caused by differences in human experience, and makes root cause positioning standardized and efficient.
According to a third aspect of the present invention, an embodiment of the present invention provides a computer storage medium having computer-readable instructions stored thereon which, when executed by a processor, cause a computer to perform the steps of the fault analysis method according to any one of the above embodiments.
According to a fourth aspect of the present invention, the present invention provides a computer device including a memory and a processor, the memory being used for storing one or more computer instructions, wherein the one or more computer instructions, when executed by the processor, can implement the fault analysis method according to any one of the above embodiments.
As can be seen from the above description, the method, the system, the storage medium, and the device for analyzing a fault of a service system according to the embodiments of the present invention obtain service chain information related to an error object when a service operation index is abnormal, and select a fault node from a plurality of nodes obtained according to the service chain set, so that a more accurate fault node can be obtained by analyzing the service chain information, and a basis is provided for accuracy of further fault root cause analysis.
Drawings
FIG. 1 is a flowchart of a fault analysis method of a business system according to an embodiment of the present invention;
FIG. 2 is a flowchart of the fault analysis method when the abnormal operation index in step S12 of FIG. 1 is the service success rate;
FIG. 3 is a flowchart of the fault analysis method when the abnormal operation index in step S12 of FIG. 1 is the response time;
FIG. 4 is a detailed flowchart of step S17 in FIG. 1;
FIG. 5 is an architecture diagram of a fault analysis system of a business system according to an embodiment of the present invention.
Detailed Description
Various aspects of the invention are described in detail below with reference to the figures and the detailed description. Well-known modules, units and their interconnections, links, communications or operations with each other are not shown or described in detail. Furthermore, the described features, architectures, or functions can be combined in any manner in one or more implementations. It will be understood by those skilled in the art that the various embodiments described below are illustrative only and are not intended to limit the scope of the present invention. It will also be readily understood that the modules or units or processes of the embodiments described herein and illustrated in the figures can be combined and designed in a wide variety of different configurations.
Fig. 1 is a flowchart illustrating a method for analyzing a fault of a service system according to an embodiment of the present invention.
As shown in fig. 1, in an embodiment of the present invention, the fault analysis method may include: step S11, step S12, step S13, step S14, and step S15, which are specifically described below.
In step S11, an operation index in the service operation process is acquired. In an exemplary embodiment, the operation index may include, but is not limited to, a service success rate and a response time.
In step S12, when the operation index is abnormal, an error object corresponding to the abnormal operation index is acquired.
In step S13, a set of business chains of the business involved in the error object is obtained.
In an optional implementation manner, the service chain set may be obtained by acquiring identification information of the services related to the error object, acquiring the service chain of each such service according to the identification information, and aggregating the service chains. In other words, each service identifier related to the error object is acquired, the service chain of each service is found through its identifier, and the chains are summarized into the service chain set corresponding to the error object. All service chain information and node information associated with the error object can thus be obtained, so that fault analysis of the service system can determine an accurate node range from the associated service chain information and, in turn, a more accurate fault node.
In an exemplary embodiment, the service chain can be obtained on the premise that, each time a service is initiated, the service initiating end node generates tracking number information including a service name and a unique identifier, and each intermediate node through to the final node carries the unique identifier and outputs the corresponding service timing sequence information. The tracking number information thereby becomes a globally uniform identifier that marks the service chain corresponding to the service.
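The grouping described above can be sketched as follows. This is a minimal illustration only: the record layout (`trace_id`, `seq`, `node`) and the function name `build_chain_set` are assumptions made for the example and are not defined in the disclosure.

```python
from collections import defaultdict


def build_chain_set(spans, trace_ids):
    """Group node records by tracking number and order each chain by sequence.

    spans: iterable of dicts with hypothetical keys "trace_id", "seq", "node".
    trace_ids: tracking numbers of the services related to the error object.
    """
    chains = defaultdict(list)
    for span in spans:
        if span["trace_id"] in trace_ids:
            chains[span["trace_id"]].append(span)
    # Each service chain becomes the ordered list of nodes the service passed through.
    return {
        tid: [s["node"] for s in sorted(records, key=lambda s: s["seq"])]
        for tid, records in chains.items()
    }


spans = [
    {"trace_id": "pay-001", "seq": 2, "node": "core"},
    {"trace_id": "pay-001", "seq": 1, "node": "gateway"},
    {"trace_id": "pay-002", "seq": 1, "node": "gateway"},
    {"trace_id": "qry-009", "seq": 1, "node": "gateway"},
]
chain_set = build_chain_set(spans, {"pay-001", "pay-002"})
```

Records whose tracking number is not related to the error object (here `qry-009`) are left out of the service chain set.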
In step S14, a plurality of nodes are obtained from the service chain set according to the failure node determination mechanism corresponding to the abnormal operation index.
In step S15, a node, the number of occurrences of which exceeds a predetermined threshold, is selected from the plurality of nodes as a failed node.
By adopting the fault analysis method of the embodiment of the invention, the service chain information related to the error object when the service operation index is abnormal is obtained, and the fault node is selected from the plurality of nodes obtained according to the service chain set, so that more accurate fault node can be obtained by combining the service chain information analysis, and a basis is provided for the accuracy of further fault root cause analysis.
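Step S15 amounts to a frequency count over the candidate nodes collected in step S14. A minimal sketch, assuming the candidates are plain node names and that "exceeding" means strictly greater than the threshold (the function name `select_fault_nodes` is hypothetical):

```python
from collections import Counter


def select_fault_nodes(candidates, threshold):
    """Return the nodes that occur more than `threshold` times among the candidates."""
    counts = Counter(candidates)
    return sorted(node for node, n in counts.items() if n > threshold)


# One candidate node per service chain, as produced by step S14.
candidates = ["db", "core", "db", "db", "gateway", "core"]
fault_nodes = select_fault_nodes(candidates, threshold=2)
```

In this example only `db` appears more than twice, so it is the single fault node returned.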
In an optional embodiment of the present invention, when the abnormal operation index in step S12 is a service success rate, as shown in fig. 2, the method for analyzing a fault with respect to the abnormal service success rate may specifically include: step S21, step S22, step S23, step S24, and step S25, which are specifically described below.
In step S21, the service success rate is abnormal. In an optional embodiment, the abnormal service success rate may be that the service success rate is not within a predetermined service success rate range, or that a decrease value of the service success rate exceeds a predetermined decrease value.
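The two abnormality conditions of step S21 can be expressed as a simple predicate. The sketch below is illustrative only; the bounds `low`, `high` and `max_drop` are assumed example values, not values given in the disclosure.

```python
def success_rate_abnormal(current, previous, low=0.99, high=1.0, max_drop=0.005):
    """Return True if the success rate is out of range or has dropped too much.

    low/high: assumed bounds of the predetermined success rate range.
    max_drop: assumed predetermined decrease value.
    """
    out_of_range = not (low <= current <= high)
    dropped_too_much = (previous - current) > max_drop
    return out_of_range or dropped_too_much
```

For instance, a rate that stays inside the range but falls from 0.999 to 0.991 is still flagged, because the decrease exceeds the assumed `max_drop`.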
In step S22, error code objects whose number increases beyond a first threshold are acquired as error objects.
In step S23, a set of business chains of the business involved in the error object is obtained.
In step S24, the last error reporting node on each service chain in the service chain set is obtained to obtain a plurality of service nodes.
In step S25, a node whose number of occurrences exceeds a predetermined threshold is selected from the plurality of service nodes as a fault node.
Judging the fault node through the service chain information avoids both missing abnormal nodes and collecting too many of them, so the fault node can be obtained more accurately. It also reduces the dependence on service experts and the manual workload, enabling standardized, automated and intelligent fault node judgment.
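Steps S22 to S25 can be sketched end to end as below. The data layout is an assumption made for illustration: each chain carries the error code it ended with and a list of `(node, reported_error)` hops in call order, and "increases beyond a first threshold" is read here as the growth in count exceeding the threshold. All function names are hypothetical.

```python
from collections import Counter


def rising_error_codes(before, after, first_threshold):
    """S22: error codes whose count grew by more than first_threshold (assumed reading)."""
    return {code for code, n in after.items() if n - before.get(code, 0) > first_threshold}


def last_error_node(hops):
    """S24: the last node in call order that reported an error, or None."""
    for node, reported_error in reversed(hops):
        if reported_error:
            return node
    return None


def fault_nodes_for_errors(chains, error_codes, occurrence_threshold):
    """S23 to S25: collect one candidate per related chain, then count occurrences."""
    candidates = []
    for chain in chains:
        if chain["error_code"] in error_codes:  # S23: chains related to the error objects
            node = last_error_node(chain["hops"])
            if node is not None:
                candidates.append(node)
    counts = Counter(candidates)
    return sorted(n for n, c in counts.items() if c > occurrence_threshold)


before = {"E100": 3, "E200": 5}
after = {"E100": 40, "E200": 6, "E300": 2}
codes = rising_error_codes(before, after, first_threshold=10)

chains = [
    {"error_code": "E100", "hops": [("gateway", True), ("core", True), ("db", True)]},
    {"error_code": "E100", "hops": [("gateway", True), ("core", True), ("db", True)]},
    {"error_code": "E100", "hops": [("gateway", True), ("ledger", True)]},
    {"error_code": "E200", "hops": [("gateway", True), ("cache", True)]},
]
error_fault_nodes = fault_nodes_for_errors(chains, codes, occurrence_threshold=1)
```

Here only `E100` rises sharply, its three chains yield the candidates `db`, `db` and `ledger`, and `db` is the only node that occurs more than once.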
In an optional embodiment of the present invention, when the abnormal operation index in step S12 is the response time, as shown in fig. 3, the fault analysis method for an abnormal response time may specifically include: step S31, step S32, step S33, step S34, and step S35, which are specifically described below.
In step S31, the response time is abnormal. In alternative embodiments, the response time anomaly may be that the response time is not within a predetermined range of response times, or that the increase in response time exceeds a predetermined increase value.
In step S32, a business object whose response time increases beyond the second threshold is acquired as an error object.
In step S33, a set of business chains of the business involved in the error object is obtained.
In step S34, the node with the largest processing time change on each service chain in the service chain set is obtained to obtain a plurality of service nodes.
In step S35, a node whose number of occurrences exceeds a predetermined threshold is selected from the plurality of service nodes as a fault node.
Similarly, judging the fault node through the service chain information avoids both missing abnormal nodes and collecting too many of them, so the fault node can be obtained more accurately. It also reduces the dependence on service experts and the manual workload, enabling standardized, automated and intelligent fault node judgment.
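For the response-time branch, steps S34 and S35 can be sketched as follows. The sketch assumes that "processing time change" is measured against a per-node baseline and that each chain is given as a mapping from node name to observed processing time in milliseconds; both the layout and the function names are hypothetical.

```python
from collections import Counter


def slowest_changing_node(baseline_ms, observed_ms):
    """S34: the node on one chain whose processing time grew the most over its baseline."""
    return max(observed_ms, key=lambda node: observed_ms[node] - baseline_ms.get(node, 0.0))


def fault_nodes_for_latency(baseline_ms, chains, occurrence_threshold):
    """S34 and S35: one candidate per chain, then keep nodes seen often enough."""
    candidates = [slowest_changing_node(baseline_ms, chain) for chain in chains]
    counts = Counter(candidates)
    return sorted(n for n, c in counts.items() if c > occurrence_threshold)


baseline_ms = {"gateway": 5.0, "core": 20.0, "db": 30.0}
chains = [
    {"gateway": 6.0, "core": 22.0, "db": 400.0},
    {"gateway": 5.0, "core": 21.0, "db": 350.0},
    {"gateway": 50.0, "core": 20.0},
]
latency_fault_nodes = fault_nodes_for_latency(baseline_ms, chains, occurrence_threshold=1)
```

Two of the three chains point at `db` and one at `gateway`, so with an occurrence threshold of 1 only `db` is reported.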
Optionally, the method for analyzing a fault of a service system according to an embodiment of the present invention further includes step S16 and step S17 (as shown by a dashed box in fig. 1): step S16, acquiring a multi-dimensional associated configuration object of a fault node; and step S17, positioning the fault reason according to the characteristic index data corresponding to the multi-dimensional associated configuration object of the fault node. In an exemplary embodiment, the multi-dimensional associated configuration object may include, but is not limited to, one or more of the following: an association configuration object of an application dimension, an association configuration object of a platform dimension, an association configuration object of a network dimension, an association configuration object of a storage dimension, and an association configuration object of a host system dimension.
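Step S16 can be pictured as a lookup in a configuration store keyed by node. The sketch below is one possible representation, assuming a plain dictionary in place of a real configuration database; the dimension keys, object names and the function `associated_objects` are all illustrative.

```python
# The five dimensions named in the text, as hypothetical dictionary keys.
DIMENSIONS = ("application", "platform", "network", "storage", "host_system")


def associated_objects(cmdb, node):
    """Return the non-empty associated configuration objects of a node, per dimension."""
    related = cmdb.get(node, {})
    return {dim: list(related[dim]) for dim in DIMENSIONS if related.get(dim)}


cmdb = {
    "db": {
        "application": ["ledger-db"],
        "storage": ["san-lun-7"],
        "host_system": ["host-42"],
        "network": [],
    }
}
objects = associated_objects(cmdb, "db")
```

Dimensions with no associated objects are omitted, which matches the "at least one of" wording of the embodiment.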
Fig. 4 is a detailed flowchart of step S17.
As shown in fig. 4, the positioning of the fault cause according to the feature index data corresponding to the multidimensional associated configuration object of the fault node may specifically include: step S71, step S72, step S73, step S74, and step S75, which are specifically described below.
In step S71, change information, alarm information, and user access information of the multi-dimensional associated configuration object are filtered and summarized.
In step S72, the degree and form of abnormality of the feature index data are acquired from the health examination information and the abnormality detection information corresponding to the feature index data.
In step S73, a weight is assigned to the feature index data based on the degree of abnormality of the feature index data and the degree of similarity between the abnormal form of the feature index data and the abnormal form of the operation index.
In step S74, a tracing relationship between the feature index data is determined. The tracing relationship may be defined as follows: when two abnormal feature index data items A and B exist, and the abnormality of A causes B to become abnormal, a tracing relationship exists between A and B.
In an optional embodiment, the association relationships among the indexes, obtained from manual experience, are organized into an index relationship tree, so that when abnormal feature index data is obtained, the tracing relationship among the abnormal feature index data is determined according to the index relationship tree.
In step S75, the filtered and summarized change information, alarm information, and user access information of the multi-dimensional associated configuration object are output, and the corresponding configuration object and feature index are recommended as the fault cause according to the weight of the feature index data and the tracing relationship. In an optional implementation, a predetermined number of configuration objects and feature indexes are selected for recommendation according to the weight ranking.
In an optional embodiment, recommending according to the tracing relationship may include: when it is determined from the tracing relationship among the abnormal feature index data that a lower-layer abnormal feature index causes an upper-layer feature index to be abnormal, filtering out the upper-layer abnormal index, thereby screening out the root cause of the fault.
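The tracing and filtering described for steps S74 and S75 can be sketched as follows. This is only an illustration under stated assumptions: the index relationship tree is modeled as a dictionary mapping an upper-layer index to the lower-layer indexes that may cause it, a tracing relationship is assumed to hold for any abnormal descendant (not only direct children), and all index names are hypothetical.

```python
def descendants(tree, index):
    """All lower-layer indexes reachable from `index` in the relationship tree."""
    seen, stack = set(), list(tree.get(index, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(tree.get(node, []))
    return seen


def tracing_relations(tree, abnormal):
    """Pairs (lower, upper): abnormal `lower` is taken to explain abnormal `upper`."""
    abnormal = set(abnormal)
    return sorted(
        (lower, upper)
        for upper in abnormal
        for lower in descendants(tree, upper) & abnormal
    )


def root_cause_candidates(tree, abnormal):
    """Drop every abnormal index that is explained by an abnormal lower-layer index."""
    explained = {upper for _, upper in tracing_relations(tree, abnormal)}
    return sorted(set(abnormal) - explained)


# Hypothetical relationship tree: upper-layer index -> possible lower-layer causes.
tree = {
    "app.response_time": ["db.query_latency", "host.cpu_usage"],
    "db.query_latency": ["storage.io_wait"],
}
abnormal = ["app.response_time", "db.query_latency", "storage.io_wait"]
print(root_cause_candidates(tree, abnormal))
```

In this hypothetical case the two upper-layer indexes are filtered out because an abnormal lower-layer index explains them, leaving "storage.io_wait" as the remaining candidate.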
With this method, the causes that may have led to the fault are recommended according to probability and association relationships, so that fault cause output is standardized and intelligent, deviations in root cause localization caused by differences in personnel experience are reduced, and standardized, efficient root cause localization is achieved.
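The weighting and ranking of steps S73 and S75 can be illustrated with a small sketch. The text does not give a formula, so everything here is an assumption made for illustration: the abnormal degree is taken as a number in [0, 1], the similarity of abnormal forms is approximated by the Pearson correlation between an index's time series and the operation index's time series (negative values clipped to 0), and the weight is their product. The object and index names are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series; 0.0 if either is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def assign_weight(degree, series, op_series):
    """Assumed weight: abnormal degree times clipped shape similarity."""
    return degree * max(0.0, pearson(series, op_series))


def recommend(indicators, op_series, top_n):
    """Return the top_n (object, index, weight) tuples by descending weight."""
    scored = [
        (assign_weight(i["degree"], i["series"], op_series), i["object"], i["name"])
        for i in indicators
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(obj, name, round(w, 3)) for w, obj, name in scored[:top_n]]


# Hypothetical response-time series of the abnormal operation index.
op_series = [1, 1, 1, 5, 6]
indicators = [
    {"object": "switch-07", "name": "packet_loss", "degree": 0.8, "series": [5, 4, 5, 4, 5]},
    {"object": "host-01", "name": "cpu_usage", "degree": 0.9, "series": [10, 10, 11, 50, 60]},
    {"object": "db-02", "name": "connections", "degree": 0.3, "series": [1, 1, 1, 5, 6]},
]
for row in recommend(indicators, op_series, top_n=2):
    print(row)
```

Under these assumptions, an index that is strongly abnormal but whose curve does not resemble the operation index (the hypothetical "packet_loss") ranks below a mildly abnormal index whose curve matches it.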
Fig. 5 is an architecture diagram of a fault analysis system of a business system according to an embodiment of the present invention.
As shown in fig. 5, the fault analysis system includes:
an operation index obtaining module 510, configured to obtain an operation index in a service operation process. In an exemplary embodiment, the operation index may include, but is not limited to, a service success rate, a response time.
An error object obtaining module 520, configured to obtain, when the operation index is abnormal, an error object corresponding to the abnormal operation index.
In an optional implementation, when the service success rate is abnormal, error code objects whose number has increased significantly are acquired as error objects; when the response time is abnormal, service objects whose response time has increased significantly are acquired as error objects. Illustratively, when the service success rate is abnormal, error code objects whose increase in number exceeds a first threshold are acquired as error objects; when the response time is abnormal, service objects whose increase in response time exceeds a second threshold are acquired as error objects.
A service chain set obtaining module 530, configured to obtain a service chain set of a service related to the error object.
In an optional implementation, the service chain set may be obtained by acquiring identification information of the services related to the error object, acquiring the service chain of each such service according to the identification information, and aggregating the service chains. Acquiring every service identifier related to the error object, finding the service chain of each service through the same service identifier, and finally aggregating the results yields the service chain set corresponding to the error object, which covers all service chain information and node information related to the error object. Fault analysis of the service system can therefore be based on the relevant service chain information, giving an accurate node range and a more accurate fault node.
A node obtaining module 540, configured to obtain multiple nodes in the service chain set according to a failure node judgment mechanism corresponding to the abnormal operation index.
In an optional embodiment, when the service success rate is abnormal, the last error-reporting node on each service chain in the service chain set is used as one of the plurality of nodes; and when the response time is abnormal, taking the node with the largest processing time change on each service chain in the service chain set as one of the plurality of nodes.
A fault node obtaining module 550, configured to select, as a fault node, a node whose number of occurrences exceeds a predetermined threshold from the multiple nodes.
An associated configuration object obtaining module 560, configured to obtain a multi-dimensional associated configuration object of the fault node. In an exemplary embodiment, the multi-dimensional associated configuration object may include, but is not limited to, one or more of the following: an association configuration object of an application dimension, an association configuration object of a platform dimension, an association configuration object of a network dimension, an association configuration object of a storage dimension, and an association configuration object of a host system dimension.
A root cause analysis module 570, configured to locate a fault cause according to the feature index data corresponding to the multi-dimensional associated configuration object of the fault node.
In an optional embodiment, locating the fault cause according to the feature index data corresponding to the multi-dimensional associated configuration object of the fault node includes: filtering and summarizing change information, alarm information, and user access information of the multi-dimensional associated configuration object; acquiring the abnormal degree and abnormal form of the feature index data according to the health examination information and the abnormality detection information corresponding to the feature index data; assigning a weight to the feature index data according to its abnormal degree and the degree of similarity between its abnormal form and the abnormal form of the operation index; determining a tracing relationship between the feature index data; and outputting the filtered and summarized change information, alarm information, and user access information of the multi-dimensional associated configuration object, and recommending the corresponding configuration object and feature index as the fault cause according to the weight of the feature index data and the tracing relationship.
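The first stages of the module chain above (error object acquisition, service chain set aggregation, and per-chain candidate node selection) can be sketched as below. This is a hedged illustration only: the record layout (`service_id`, `seq`, `node`, `error`, `elapsed_ms`, `baseline_ms`), the function names, the index labels "success_rate" and "response_time", and the sample data are all assumptions, not part of the described system.

```python
from collections import defaultdict


def error_codes_over_threshold(before, after, first_threshold):
    """Error codes whose count grew by more than `first_threshold`."""
    return sorted(
        code for code, n in after.items()
        if n - before.get(code, 0) > first_threshold
    )


def service_chain_set(spans, service_ids):
    """Group node records by service identifier and order each chain by `seq`."""
    chains = defaultdict(list)
    for s in spans:
        if s["service_id"] in service_ids:
            chains[s["service_id"]].append(s)
    return [sorted(chain, key=lambda s: s["seq"]) for chain in chains.values()]


def last_error_node(chain):
    """Last node on the chain that reported an error, or None."""
    errors = [s["node"] for s in chain if s["error"]]
    return errors[-1] if errors else None


def largest_time_change_node(chain):
    """Node whose processing time grew most against its assumed baseline."""
    return max(chain, key=lambda s: s["elapsed_ms"] - s["baseline_ms"])["node"]


def candidate_nodes(chains, abnormal_index):
    """Apply the judgment mechanism that matches the abnormal operation index."""
    pick = last_error_node if abnormal_index == "success_rate" else largest_time_change_node
    return [n for n in (pick(c) for c in chains) if n is not None]


def span(service_id, seq, node, error, elapsed_ms, baseline_ms):
    """Build one hypothetical node record of a service chain."""
    return {"service_id": service_id, "seq": seq, "node": node,
            "error": error, "elapsed_ms": elapsed_ms, "baseline_ms": baseline_ms}


spans = [
    span("tx1", 1, "gateway", False, 20, 18),
    span("tx1", 2, "order", True, 30, 25),
    span("tx1", 3, "db", True, 400, 40),
    span("tx2", 1, "gateway", False, 22, 18),
    span("tx2", 2, "db", True, 380, 40),
    span("tx3", 1, "gateway", False, 19, 18),  # not related to the error object
]
print(error_codes_over_threshold({"E500": 3, "E404": 10},
                                 {"E500": 60, "E404": 12, "E503": 1}, 20))
chains = service_chain_set(spans, {"tx1", "tx2"})
print(candidate_nodes(chains, "success_rate"))
print(candidate_nodes(chains, "response_time"))
```

The candidate list produced here is what the fault node obtaining module would then count against its occurrence threshold.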
With the fault analysis system of the embodiment of the present invention, the service chain information related to the error object is obtained when a service operation index is abnormal, and the fault node is selected from the plurality of nodes obtained from the service chain set. Combining the analysis with service chain information yields a more accurate fault node and provides a basis for accurate further root cause analysis. In addition, the causes that may have led to the fault are recommended according to probability and association relationships, so that fault cause output is standardized and intelligent, deviations in root cause localization caused by differences in personnel experience are reduced, and standardized, efficient root cause localization is achieved.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present invention can be implemented by software combined with a hardware platform. Based on this understanding, all or part of the technical solutions of the present invention that contribute over the background art can be embodied in the form of a software product, which can be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, and which includes instructions for causing a computer device (which can be a personal computer, a server, a network device, etc.) to execute the methods according to the embodiments or some parts of the embodiments.
Correspondingly, an embodiment of the present invention further provides a computer-readable storage medium storing computer-readable instructions or a program which, when executed by a processor, cause a computer to perform operations comprising the steps of the fault analysis method according to any of the above embodiments; details are not repeated here. The storage medium may include, for example, an optical disk, a hard disk, a floppy disk, a flash memory, or a magnetic tape.
In addition, the embodiment of the present invention further provides a computer device including a memory and a processor, where the memory is used for storing one or more computer instructions or programs, and when the one or more computer instructions or programs are executed by the processor, the fault analysis method according to any one of the above embodiments can be implemented. The computer device may be, for example, a server, a desktop computer, a notebook computer, a tablet computer, or the like.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may be modified or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention. Therefore, the protection scope of the present invention should be subject to the claims.

Claims (24)

1. A fault analysis method of a business system is characterized by comprising the following steps:
acquiring an operation index in a service operation process;
when the operation index is abnormal, acquiring an error object corresponding to the abnormal operation index;
acquiring a service chain set of a service related to the error object;
acquiring a plurality of nodes from the service chain set according to a fault node judgment mechanism corresponding to the abnormal operation index;
and selecting the nodes with the occurrence frequency exceeding a preset threshold value from the plurality of nodes as fault nodes.
2. The fault analysis method of claim 1, wherein the operational indicators comprise: service success rate and response time.
3. The fault analysis method of claim 1, wherein obtaining a service chain set of a service related to the error object comprises:
acquiring identification information of a service related to the error object;
and acquiring the service chain of the service related to the error object according to the identification information, and summarizing to obtain the service chain set.
4. The fault analysis method according to claim 2, wherein when the operation index is abnormal, acquiring an error object corresponding to the abnormal operation index comprises:
and when the service success rate is abnormal, acquiring error code objects of which the number is increased to exceed a first threshold value as error objects.
5. The method of claim 4, wherein obtaining the plurality of nodes in the service chain set according to the failure node determination mechanism corresponding to the abnormal operation index comprises:
and taking the last error reporting node on each service chain in the service chain set as one of the plurality of nodes.
6. The method of analyzing a fault according to claim 4, wherein when the operation index is abnormal, acquiring an error object corresponding to the abnormal operation index further comprises:
and when the response time is abnormal, acquiring the service object with the response time increasing beyond a second threshold value as an error object.
7. The method of claim 6, wherein obtaining the plurality of nodes in the service chain set according to the failure node determination mechanism corresponding to the abnormal operation index further comprises:
and taking the node with the largest processing time change on each service chain in the service chain set as one of the plurality of nodes.
8. The fault analysis method according to any one of claims 1 to 7, wherein the fault analysis method further comprises:
and acquiring a multi-dimensional associated configuration object of the fault node.
9. The fault analysis method of claim 8, wherein the multi-dimensional associated configuration object comprises at least one of: an association configuration object of an application dimension, an association configuration object of a platform dimension, an association configuration object of a network dimension, an association configuration object of a storage dimension, and an association configuration object of a host system dimension.
10. The fault analysis method of claim 9, wherein the fault analysis method further comprises:
and positioning the fault reason according to the characteristic index data corresponding to the multi-dimensional associated configuration object of the fault node.
11. The fault analysis method according to claim 10, wherein locating the cause of the fault according to the feature index data corresponding to the multidimensional associated configuration object of the fault node comprises:
filtering and summarizing change information, alarm information and user access information of the multi-dimensional associated configuration object;
acquiring the abnormal degree and abnormal form of the characteristic index data according to the health examination information and the abnormal detection information corresponding to the characteristic index data;
according to the abnormal degree of the characteristic index data and the similarity degree between the abnormal form of the characteristic index data and the abnormal form of the operation index, distributing weight to the characteristic index data;
determining a tracing relationship between the characteristic index data;
and outputting the change information, the alarm information and the user access information of the filtered and summarized multi-dimensional associated configuration object, and recommending the corresponding configuration object and the corresponding characteristic index as fault reasons according to the weight of the characteristic index data and the tracing relationship.
12. A fault analysis system for a business system, the fault analysis system comprising:
the operation index acquisition module is used for acquiring operation indexes in the service operation process;
the error object acquisition module is used for acquiring an error object corresponding to the abnormal operation index when the operation index is abnormal;
a service chain set obtaining module, configured to obtain a service chain set of a service related to the error object;
the node acquisition module is used for acquiring a plurality of nodes in the service chain set according to a fault node judgment mechanism corresponding to the abnormal operation index;
and the fault node acquisition module is used for selecting the node with the occurrence frequency exceeding a preset threshold value from the plurality of nodes as the fault node.
13. The fault analysis system of claim 12, wherein the operational indicators comprise: service success rate and response time.
14. The fault analysis system of claim 12, wherein obtaining a service chain set of a service related to the error object comprises:
acquiring identification information of a service related to the error object;
and acquiring the service chain of the service related to the error object according to the identification information, and summarizing to obtain the service chain set.
15. The fault analysis system according to claim 13, wherein when the operation index is abnormal, acquiring an error object corresponding to the abnormal operation index comprises:
and when the service success rate is abnormal, acquiring error code objects of which the number is increased to exceed a first threshold value as error objects.
16. The fault analysis system according to claim 15, wherein obtaining the plurality of nodes in the service chain set according to the fault node determination mechanism corresponding to the abnormal operation index comprises:
and taking the last error reporting node on each service chain in the service chain set as one of the plurality of nodes.
17. The fault analysis system of claim 15, wherein when the operation index is abnormal, obtaining an error object corresponding to the abnormal operation index further comprises:
and when the response time is abnormal, acquiring the service object with the response time increasing beyond a second threshold value as an error object.
18. The system according to claim 17, wherein obtaining the plurality of nodes in the service chain set according to the failure node determination mechanism corresponding to the abnormal operation index further comprises:
and taking the node with the largest processing time change on each service chain in the service chain set as one of the plurality of nodes.
19. The fault analysis system according to any one of claims 12-18, wherein the fault analysis system further comprises:
and the associated configuration object acquisition module is used for acquiring the multi-dimensional associated configuration object of the fault node.
20. The fault analysis system of claim 19, wherein the multi-dimensional associated configuration object comprises at least one of: an association configuration object of an application dimension, an association configuration object of a platform dimension, an association configuration object of a network dimension, an association configuration object of a storage dimension, and an association configuration object of a host system dimension.
21. The fault analysis system of claim 20, wherein the fault analysis system further comprises:
and the root cause analysis module is used for positioning the fault cause according to the characteristic index data corresponding to the multi-dimensional associated configuration object of the fault node.
22. The fault analysis system of claim 21, wherein locating a fault cause according to the feature metric data corresponding to the multidimensional associated configuration object of the fault node comprises:
filtering and summarizing change information, alarm information and user access information of the multi-dimensional associated configuration object;
acquiring the abnormal degree and abnormal form of the characteristic index data according to the health examination information and the abnormal detection information corresponding to the characteristic index data;
according to the abnormal degree of the characteristic index data and the similarity degree between the abnormal form of the characteristic index data and the abnormal form of the operation index, distributing weight to the characteristic index data;
determining a tracing relationship between the characteristic index data;
and outputting the change information, the alarm information and the user access information of the filtered and summarized multi-dimensional associated configuration object, and recommending the corresponding configuration object and the corresponding characteristic index as fault reasons according to the weight of the characteristic index data and the tracing relationship.
23. A computer storage medium storing computer software instructions for execution by a processor to implement the fault analysis method of any one of claims 1-11.
24. A computer device comprising a memory and a processor;
wherein the memory is to store one or more computer instructions that are executed by the processor to implement the fault analysis method of any of claims 1-11.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010448356.7A CN111722952A (en) 2020-05-25 2020-05-25 Fault analysis method, system, equipment and storage medium of business system

Publications (1)

Publication Number Publication Date
CN111722952A true CN111722952A (en) 2020-09-29

Family

ID=72565006

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010448356.7A Pending CN111722952A (en) 2020-05-25 2020-05-25 Fault analysis method, system, equipment and storage medium of business system

Country Status (1)

Country Link
CN (1) CN111722952A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination