CN115409405A - Fault root cause positioning method and device, electronic equipment and storage medium - Google Patents

Info

Publication number
CN115409405A
Authority
CN
China
Prior art keywords
application system
health evaluation
cascade
health
evaluation value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211121300.6A
Other languages
Chinese (zh)
Inventor
程捷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Bo Hongyuan Data Polytron Technologies Inc
Original Assignee
Beijing Bo Hongyuan Data Polytron Technologies Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Bo Hongyuan Data Polytron Technologies Inc filed Critical Beijing Bo Hongyuan Data Polytron Technologies Inc
Priority to CN202211121300.6A priority Critical patent/CN115409405A/en
Publication of CN115409405A publication Critical patent/CN115409405A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0639Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06393Score-carding, benchmarking or key performance indicator [KPI] analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/22Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/20Administration of product repair or maintenance

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Theoretical Computer Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Strategic Management (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • General Engineering & Computer Science (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Development Economics (AREA)
  • Educational Administration (AREA)
  • Computer Hardware Design (AREA)
  • Game Theory and Decision Science (AREA)
  • Test And Diagnosis Of Digital Computers (AREA)

Abstract

The application discloses a fault root cause positioning method and device, electronic equipment and a storage medium. The method comprises: acquiring the cascade relation of each application system in the cascade application system where the current application system is located, and constructing a cascade relation graph; determining the health evaluation value of the current application system according to the evaluation index data of the current application system; performing health evaluation on a parent application system and a child application system of the current application system according to the health evaluation value of the current application system and the cascade relation graph; and positioning the fault root cause of the cascade application system based on the health evaluation result and the health evaluation value of the current application system. Because the health evaluation is carried out with different evaluation indexes, its comprehensiveness and accuracy are improved, and the defect in the prior art that the fault root cause cannot be positioned when a system fails without generating alarm data is overcome. The health evaluation of the parent application system and the child application system allows the fault root cause to be located comprehensively, improving the accuracy and efficiency of fault root cause location.

Description

Fault root cause positioning method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of fault root cause location technologies, and in particular, to a fault root cause location method and apparatus, an electronic device, and a storage medium.
Background
With the advent of the cloud computing era, more and more scenarios require data centers to be built, and the number of application systems based on those data centers is growing rapidly. These application systems have complex calling and access relationships with one another, which creates problems for the stability and reliability of system operation. Identifying failures in cascade application systems and locating their root causes has therefore become an active research direction.
Currently, in order to ensure the reliability of a cascade system, the prior art relies on the professional knowledge of technical personnel, who check the different application systems in the cascade system one by one and analyze the gathered information to determine the fault root cause. However, this method consumes a lot of manpower and time, the efficiency of fault root cause location is low, and the accuracy is not guaranteed.
Disclosure of Invention
The application provides a fault root cause positioning method and device, electronic equipment and a storage medium, so that the accuracy of fault identification and the efficiency of fault root cause positioning are improved.
According to an aspect of the present application, there is provided a fault root cause locating method, the method including:
acquiring the cascade relation of each application system in the cascade application system of the current application system, and constructing a cascade relation graph;
determining the health evaluation value of the current application system according to the evaluation index data of the current application system;
according to the health evaluation value and the cascade relation graph of the current application system, performing health evaluation on a parent application system and a child application system of the current application system;
and positioning the fault root of the cascade application system based on the health evaluation result and the health evaluation value of the current application system.
According to another aspect of the present application, there is provided a fault root cause locating device, including:
the cascade relation graph building module is used for acquiring the cascade relation of each application system in the cascade application system where the current application system is located and constructing a cascade relation graph;
the evaluation value determining module is used for determining the health evaluation value of the current application system according to the evaluation index data of the current application system;
the cascade evaluation module is used for carrying out health evaluation on a parent application system and a child application system of the current application system according to the health evaluation value of the current application system and the cascade relation graph;
and the fault root cause positioning module is used for positioning the fault root cause of the cascade application system based on the health evaluation result and the health evaluation value of the current application system.
According to another aspect of the present application, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to enable the at least one processor to perform the method for fault root cause localization according to any of the embodiments of the present application.
According to another aspect of the present application, there is provided a computer-readable storage medium storing computer instructions for causing a processor to implement the fault root cause locating method according to any one of the embodiments of the present application when the computer instructions are executed.
According to the technical scheme of the embodiment of the application, the health evaluation is carried out on the current application system, the parent application system and the child application system of the current application system, and the diagnosis and root cause positioning are carried out on the faults existing in the cascade application system according to the health evaluation values of all the application systems. The health evaluation of the application system by different evaluation indexes can improve the comprehensiveness and accuracy of the health evaluation, and the defect that fault root cause positioning cannot be performed because no alarm data is generated when a fault occurs in the prior art is overcome. Meanwhile, the fault root is comprehensively positioned through the health evaluation of the parent application system and the child application system, and the accuracy and the efficiency of the fault root positioning are improved.
It should be understood that the statements in this section are not intended to identify key or critical features of the embodiments of the present application, nor are they intended to limit the scope of the present application. Other features of the present application will become apparent from the following description.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings required in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a fault root cause locating method according to a first embodiment of the present application;
Fig. 2A is a flowchart of a fault root cause locating method according to a second embodiment of the present application;
Fig. 2B is a schematic diagram of a cascade application system according to the second embodiment of the present application;
Fig. 3 is a schematic structural diagram of a fault root cause locating device according to a third embodiment of the present application;
Fig. 4 is a schematic diagram of a root cause location diagnosis system according to a fourth embodiment of the present application;
Fig. 5 is a schematic structural diagram of an electronic device implementing the fault root cause locating method according to the embodiment of the present application.
Detailed Description
In order to make the technical solutions of the present application better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only some embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Moreover, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example one
Fig. 1 is a flowchart of a fault root cause positioning method provided in an embodiment of the present application. This embodiment is applicable to locating the root cause of a fault in a cascade system. The method may be executed by a fault root cause positioning apparatus, which may be implemented in hardware and/or software and may be configured in an electronic device. As shown in Fig. 1, the method includes:
S110, acquiring the cascade relation of each application system in the cascade application system where the current application system is located, and constructing a cascade relation graph.
The cascade application system may consist of a plurality of application systems that have cascade relations with one another. The cascade relation may include, but is not limited to, an upstream-downstream calling relation (that is, an application system may have a parent application system and/or a child application system), a same-level access relation, a cross-level calling relation, and the like. A cascade relation graph that reflects the mutual calling and/or access relations of the application systems is constructed according to their actual cascade relations. The cascade relations may be obtained from various data sources, which may include, but are not limited to, a call relation graph between applications, a call network topology graph, a CMDB (Configuration Management Database), or a cascade relation graph configured manually by a technician. It should be noted that any cascade relation graph construction algorithm in the prior art may be adopted, which is not limited in the embodiment of the present application.
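As a minimal sketch of what such a graph could look like in memory, the following Python keeps two adjacency maps built from (caller, callee) pairs. The function name `build_cascade_graph`, the edge format, and the system names are illustrative assumptions; the application does not prescribe a data structure or construction algorithm.

```python
from collections import defaultdict

def build_cascade_graph(call_edges):
    """Build upstream/downstream adjacency maps from (caller, callee) pairs."""
    children = defaultdict(set)  # caller -> systems it calls (downstream)
    parents = defaultdict(set)   # callee -> systems that call it (upstream)
    for caller, callee in call_edges:
        children[caller].add(callee)
        parents[callee].add(caller)
    return parents, children

# Hypothetical call records, e.g. exported from a CMDB or a call topology.
edges = [("gateway", "orders"), ("orders", "payments"), ("orders", "inventory")]
parents, children = build_cascade_graph(edges)
print(sorted(children["orders"]))  # ['inventory', 'payments']
print(sorted(parents["orders"]))   # ['gateway']
```

Keeping both directions makes it cheap to look up the parent application systems and the child application systems of any node.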
S120, determining the health evaluation value of the current application system according to the evaluation index data of the current application system.
The evaluation index data may be index data for evaluating the health state of each application system in the cascade application system, and each application system has its own evaluation index data. The evaluation index data may include, but is not limited to, service index data at the software layer, performance index data at the hardware layer, and alarm data indicating whether the operation state has a problem. The health state of the current application system is evaluated from these multi-dimensional evaluation index data. The health state may refer to the probability that the current application system fails, and this probability is represented by the health evaluation value. The health evaluation value may be expressed as a score or a grade, according to how each item of evaluation index data reflects the operating condition of the current application system.
In an optional embodiment, the evaluation index data includes service index data, performance index data, and alarm data; correspondingly, the determining the health evaluation value of the current application system according to the evaluation index data of the current application system may include: and determining the health evaluation value of the current application system according to at least one of the service index data, the performance index data and the alarm data of the current application system.
The service index data may include, but is not limited to, transaction amount, response time, and the like of the current application system; the performance index data may include, but is not limited to, a CPU (Central Processing Unit) usage rate and a memory usage rate of a current application system; the alarm data can be data information of warning or error prompt which is sent when the current application system has operation problems.
Specifically, the health evaluation value may be assigned directly according to the degree of difference between the evaluation index data of the current application system and the corresponding indexes in normal operation. Taking the service index data as an example, each item of service index data is compared with its value under actual normal operation, and the current application system is scored or graded from the perspective of the service index data. For example, if the response time of the current application system for a certain service is within 100 ms and therefore meets the normal operation index, the response time index may be scored as 100 points. If the response time for a certain service exceeds 100 ms, the index may be scored according to the degree to which 100 ms is exceeded, for example with a score between 0 and 100. The scores of all indexes belonging to the service index data may then be combined by a weighted sum to evaluate the health of the current application system from the perspective of the service index data. Similarly, the performance index data and the alarm data may be used for health evaluation of the current application system in the same manner. Scores corresponding to the different evaluation index data are thereby obtained, and the final health evaluation value is calculated as their weighted sum. The weights used in the weighted sum may be set by a relevant technician based on a large number of experiments or on experience; alternatively, the different evaluation index data may be analyzed by a pre-trained machine learning model or big data model so that the weights are determined according to the influence of each kind of evaluation index data on the application system. This is not limited in the embodiment of the present application.
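The direct-assignment approach can be sketched as follows. The text only says that a response time above 100 ms is scored between 0 and 100 according to how far it exceeds the limit; the linear fall-off, the `worst_ms` cut-off, the category scores for performance and alarm data, and the weights below are all assumptions chosen for illustration.

```python
def response_time_score(response_ms, normal_ms=100.0, worst_ms=1000.0):
    """Score 100 within the normal limit, falling linearly to 0 at worst_ms.

    The linear shape and worst_ms are assumptions, not taken from the source.
    """
    if response_ms <= normal_ms:
        return 100.0
    if response_ms >= worst_ms:
        return 0.0
    return 100.0 * (worst_ms - response_ms) / (worst_ms - normal_ms)

def weighted_sum(scores, weights):
    """Combine scores by a weighted sum; weights are assumed to sum to 1."""
    return sum(scores[name] * weights[name] for name in scores)

# Hypothetical per-category scores for the current application system.
category_scores = {
    "service": response_time_score(550.0),
    "performance": 90.0,
    "alarm": 100.0,
}
# Hypothetical weights, e.g. set by a technician from experience.
category_weights = {"service": 0.5, "performance": 0.3, "alarm": 0.2}
health_value = weighted_sum(category_scores, category_weights)
print(round(health_value, 2))
```

The same `weighted_sum` helper could be applied one level down, to combine the scores of individual indexes within a single category.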
Alternatively, points may be deducted from a full score (for example, 100 points) according to the degree of difference between each item of evaluation index data and the corresponding index in the normal operation state. For example, the service index data includes multiple indexes, and each index is compared with its value in the normal operation state; if an index fails to reach the normal operation level, points are deducted from the full score. The deductions of the multiple indexes are accumulated, and the score corresponding to the service index data is obtained once all deductions have been applied. For example, if the transaction amount per minute of the current application system is 30% lower than in the normal operation state, 10 points are deducted; if, in addition, the response time exceeds the normal 100 ms by 30 ms, another 10 points are deducted, so that the score corresponding to the service index data is 80 points. If other indexes exist, the deduction continues until all indexes have participated in the score calculation or the score has been reduced to 0 points. Similarly, the performance index data and the alarm data may be used for health evaluation of the current application system in the same manner. Scores corresponding to the different evaluation index data are thereby obtained, and the final health evaluation value is calculated as their weighted sum. The weights used in the weighted sum may be set by a relevant technician based on a large number of experiments or on experience; alternatively, the different evaluation index data may be analyzed by a pre-trained machine learning model or big data model so that the weights are determined according to the influence of each kind of evaluation index data on the application system. This is not limited in the embodiment of the present application.
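The deduction approach reduces to subtracting per-index penalties from a full score and stopping at 0. The sketch below takes the penalties as given, since the text does not define how a deviation maps to a number of points; `deduction_score` is an illustrative name.

```python
def deduction_score(deductions, full_score=100):
    """Subtract each deduction from the full score, never going below 0."""
    score = full_score
    for points in deductions:
        score = max(0, score - points)
        if score == 0:
            break  # nothing left to deduct
    return score

# The example from the text: 10 points for the transaction amount being 30%
# below normal, and 10 points for the response time being 30 ms over 100 ms.
service_score = deduction_score([10, 10])
print(service_score)  # 80
```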
S130, performing health evaluation on the parent application system and the child application system of the current application system according to the health evaluation value of the current application system and the cascade relation graph.
The parent application system may be an upstream system of the current application system, that is, the parent application system may call the current application system. Similarly, the child application system (also referred to below as a sub-application system) may be a downstream system of the current application system, that is, the current application system may call the child application system. The current application system may have a plurality of parent application systems and, at the same time, a plurality of child application systems.
After the health evaluation value of the current application system is determined in the foregoing step, it may be judged from that value whether the parent application systems and child application systems of the current application system need to be checked further. For example, the health evaluation value may indicate that the current application system has some fault but may not be the fault source; in that case it is necessary to probe the parent application systems and child application systems of the current application system and to perform health evaluation on each of them to obtain their health evaluation values. It will be appreciated that the health evaluation method for the parent and child application systems may be the same as the method for computing the health evaluation value of the current application system.
S140, positioning the fault root of the cascade application system based on the health evaluation result and the health evaluation value of the current application system.
The health evaluation result is the result of the health evaluation of each parent application system and each child application system, and the fault root cause positioning method may use any fault root cause positioning algorithm in the prior art, which is not limited in the embodiment of the present application.
In an optional implementation manner, the locating the fault root of the cascading application system based on the health evaluation result and the health evaluation value of the current application system may include: and determining the fault application system of the cascade application system according to the health evaluation value of the current application system, the health evaluation value of the parent application system and the health evaluation value of the child application system.
It will be appreciated that the health evaluation result may include the health evaluation value of each parent application system and each child application system. The health evaluation value of the current application system, the health evaluation values of the parent application systems and the health evaluation values of the child application systems are therefore compared, and the N application systems with the lowest health evaluation values (that is, the highest failure probability) are selected as the location of the fault root cause, where N may be preset by a relevant technician.
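Selecting the N systems with the lowest health evaluation values is a small top-N query. A minimal sketch, assuming the evaluation values have been collected into a dictionary keyed by system name (the names and values below are made up):

```python
import heapq

def locate_root_causes(health_values, n):
    """Return the n systems with the lowest health evaluation values."""
    return heapq.nsmallest(n, health_values, key=health_values.get)

# Hypothetical evaluation values for the current, parent and child systems.
health_values = {"current": 62.0, "parent_a": 88.0, "child_a": 35.0, "child_b": 20.0}
print(locate_root_causes(health_values, 2))  # ['child_b', 'child_a']
```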
According to the technical scheme of the embodiment of the application, the health evaluation is carried out on the current application system, the parent application system and the child application system of the current application system, and the diagnosis and root cause positioning are carried out on the faults existing in the cascade application system according to the health evaluation values of the application systems. The health evaluation of the application system by different evaluation indexes can improve the comprehensiveness and accuracy of the health evaluation, and the defect that fault root cause positioning cannot be carried out due to the fact that no alarm data is generated when a fault occurs in the prior art is overcome. Meanwhile, the fault source is comprehensively positioned through the health evaluation of the parent application system and the child application system, and the accuracy of fault root cause positioning and the efficiency of fault identification are improved.
Example two
Fig. 2A is a flowchart of a fault root cause locating method according to a second embodiment of the present application, where the present embodiment is further detailed on the basis of the foregoing embodiments of the present application for health evaluation operations of a parent application system and a child application system. As shown in fig. 2A, the method includes:
S210, acquiring the cascade relation of each application system in the cascade application system where the current application system is located, and constructing a cascade relation graph.
S220, determining the health evaluation value of the current application system according to the evaluation index data of the current application system.
S230, if the health evaluation value of the current application system meets a preset health threshold value, performing health evaluation on the parent application system and the child application system according to the cascade relation graph.
If the health evaluation value of the current application system meets the preset health threshold, it is indicated that the probability that the current application system is a failure source is low, and further probing needs to be performed on the upstream and downstream application systems. Therefore, the health evaluation is carried out on the parent application system and the child application system of the current application system according to the upstream and downstream relations in the cascade relation graph.
In an alternative embodiment, the sub-application systems include a shallow sub-application system and a deep sub-application system; correspondingly, the performing health evaluation on the sub-application system according to the cascade relationship diagram may include: determining the health evaluation value of the shallow sub-application system according to the evaluation index data of the shallow sub-application system; and if the health evaluation value of the shallow sub-application system accords with a preset health threshold value, performing health evaluation on a deep sub-application system of the shallow sub-application system.
The shallow sub-application system and the deep sub-application system are relative concepts. Because the application systems in the cascade relation graph have upstream and downstream calling relations, the current application system may have sub-application systems at multiple levels. For example, a primary sub-application system is an application system called directly by the current application system, and a secondary sub-application system is an application system called by a primary sub-application system. The shallow sub-application system may then be a primary sub-application system, and correspondingly the deep sub-application system is a secondary sub-application system. When the shallow sub-application system is a secondary sub-application system, the deep sub-application system is a tertiary sub-application system, and so on.
If the health evaluation value of the shallow sub-application system meets a preset health threshold value after the health evaluation is performed on the shallow sub-application system (namely the shallow sub-application system has low probability of being a fault source and needs to be further explored to a downstream system), the health evaluation is performed on each deep sub-application system of the shallow sub-application system.
Further, if the health evaluation value of the shallow sub-application system is greater than the preset health threshold value, the health evaluation of the deep sub-application system of the shallow sub-application system is stopped.
It will be appreciated that if the health evaluation value of a shallow sub-application system is very high (that is, the shallow sub-application system is very unlikely to be the root cause of the fault), there is no need to continue probing its downstream systems. This saves probing and computing resources, shortens the time needed for fault root cause location, narrows the range of sub-application systems to be probed, and improves the efficiency of fault root cause location.
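The shallow-to-deep probing described above can be sketched as a recursive walk that descends only when a caller-supplied rule says so. The text does not pin down the exact threshold comparison, so the rule is passed in as `should_probe_deeper`; the rule used in the example (stop when the value is 95 or more), the system names, and the values are hypothetical.

```python
def probe_children(system, children, evaluate, should_probe_deeper, results=None):
    """Evaluate sub-application systems of `system`, descending only on request."""
    if results is None:
        results = {}
    for child in children.get(system, ()):
        if child in results:
            continue  # already evaluated through another path
        results[child] = evaluate(child)
        if should_probe_deeper(results[child]):
            probe_children(child, children, evaluate, should_probe_deeper, results)
    return results

# Hypothetical downstream relations and health evaluation values.
children = {"app": ["s1", "s2"], "s1": ["s11", "s12"], "s2": ["s21"]}
values = {"s1": 70.0, "s2": 98.0, "s11": 30.0, "s12": 85.0, "s21": 90.0}

# Assumed rule: a very high value means the subtree need not be probed.
results = probe_children("app", children, values.get, lambda v: v < 95.0)
print(sorted(results))  # ['s1', 's11', 's12', 's2']
```

Here `s21` is never evaluated because its shallow parent `s2` scored high enough to stop the descent.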
In another alternative embodiment, the parent application system includes at least one hierarchy; correspondingly, the performing health evaluation on the parent application system may further include: and determining the health evaluation value of the parent application system of at least one hierarchy according to the evaluation index data of the parent application system.
It can be understood that, just as the current application system may have child application systems at multiple hierarchy levels, it may also have parent application systems at multiple hierarchy levels, each level including at least one application system. Because an upstream system calls a downstream system, the upstream system has less influence on failures of the downstream system; a comprehensive health evaluation therefore needs to be performed when the parent application systems at multiple hierarchy levels are probed.
In an alternative embodiment, the method may further comprise: if the evaluation amount of the parent application systems or the child application systems that have undergone health evaluation reaches the preset evaluation amount, stopping the health evaluation of the parent application system; wherein the evaluation amount comprises the number and/or the hierarchy levels of the application systems.
The evaluation amount may be the number of application systems, or the number of hierarchy levels, on which health evaluation has been performed. The preset evaluation amount may be a preset limit on the number of systems and/or the number of levels, and may be understood as the condition for stopping the probing. Taking the parent application systems as an example (the same applies to the child application systems), the preset evaluation amount may be set to 30 systems: the parent application systems of the current application system are probed (that is, the health evaluation value of each parent application system is determined), and probing stops once 30 application systems have been probed. The preset evaluation amount may also be set to 4 levels, in which case probing stops once the four levels of parent application systems upstream of the current application system have been probed. Of course, the number of systems and the number of levels may also be used in combination, which is not limited in the embodiment of the present application.
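A breadth-first walk is one straightforward way to apply both stop conditions, since it visits parent application systems level by level. This is a sketch under that assumption; the text does not prescribe a traversal order, and `probe_parents`, `max_systems`, and `max_levels` are illustrative names.

```python
from collections import deque

def probe_parents(start, parents, evaluate, max_systems=None, max_levels=None):
    """Evaluate upstream systems level by level until a preset limit is hit."""
    results = {}
    seen = {start}
    queue = deque([(start, 0)])  # (system, its level above the start system)
    while queue:
        system, level = queue.popleft()
        if max_levels is not None and level >= max_levels:
            continue  # level limit reached: do not go further upstream
        for parent in parents.get(system, ()):
            if parent in seen:
                continue
            if max_systems is not None and len(results) >= max_systems:
                return results  # system-count limit reached
            seen.add(parent)
            results[parent] = evaluate(parent)
            queue.append((parent, level + 1))
    return results

# Hypothetical upstream relations; every system scores 100 for simplicity.
parents = {"app": ["p1", "p2"], "p1": ["p11"], "p2": ["p21"], "p11": ["p111"]}
by_level = probe_parents("app", parents, lambda s: 100.0, max_levels=2)
by_count = probe_parents("app", parents, lambda s: 100.0, max_systems=3)
print(list(by_level))  # ['p1', 'p2', 'p11', 'p21']
print(list(by_count))  # ['p1', 'p2', 'p11']
```

Passing both limits at once gives the combined use of system count and level count that the text mentions.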
S240, locating the fault root cause of the cascade application system based on the health evaluation result and the health evaluation value of the current application system.
In one specific embodiment, as shown in fig. 2B, an application system B0 has first-level parent application systems B1, B2, and B3, and second-level parent application systems B11, B12, B21, B22, B31, and B32; it also has first-level sub-application systems B4, B5, and B6, and second-level sub-application systems B41, B42, B51, B52, B61, and B62. These application systems constitute the cascade relationship diagram shown in fig. 2B. It should be noted that fig. 2B only illustrates two levels of parent application systems and two levels of sub-application systems, and is not to be construed as limiting the number of hierarchies.
When performing fault root cause analysis on an application system B0 (equivalent to the current application system) in fig. 2B, a weighted sum over multiple kinds of evaluation index data is first computed to obtain the probability that application system B0 has failed (equivalent to the aforementioned health evaluation value). Then, all first-level parent application systems of B0 (B1, B2, and B3 in the figure) are found by searching upstream along the cascade relationship diagram, and the same weighted-sum method over multiple kinds of evaluation index data is applied to each first-level parent application system to calculate its failure probability. When the exploration termination condition is not triggered, the parents of each first-level parent application system, namely the second-level parent application systems (B11, B12, B21, B22, B31, and B32 in the figure), may continue to be explored upstream along the cascade relationship diagram, and the failure probability of each second-level parent application system is calculated by the same weighted-sum method. Similarly, as long as the termination condition is not triggered (that is, the preset evaluation amount has not been reached), the third-level, fourth-level, ... parent application systems may continue to be explored upward and their failure probabilities calculated, until the termination condition is reached.
Similarly, the failure probability of each first-level sub-application system (such as B4, B5, and B6 in the figure) is calculated by a similar weighted-sum method over multiple kinds of evaluation index data, and when the termination condition is not triggered, the failure probability calculation continues for the second-level, third-level, fourth-level, ... sub-application systems until the termination condition is reached.
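The weighted-sum calculation described above can be illustrated with a short Python sketch. The text does not give concrete weights or index scales, so the sketch makes two labeled assumptions: each kind of evaluation index data has already been reduced to a score in [0, 1], and the result is divided by the total weight so that it also stays in [0, 1]. The names failure_probability and score_level, and the index keys used in the usage note, are hypothetical.

```python
def failure_probability(indicators, weights):
    """Weighted sum of per-index scores, normalised by the total weight.

    indicators: dict of index name -> score, assumed to be scaled to [0, 1].
    weights:    dict of index name -> weight (hypothetical values).
    """
    total_weight = sum(weights.values())
    weighted = sum(weight * indicators[name] for name, weight in weights.items())
    return weighted / total_weight


def score_level(systems, indicator_data, weights):
    """Apply the same weighted-sum method to every system of one level."""
    return {name: failure_probability(indicator_data[name], weights) for name in systems}
```

For example, with assumed weights {"service": 0.5, "performance": 0.3, "alarm": 0.2} and a parent system whose scores are 1.0, 0.5, and 0.0 respectively, the weighted sum is 0.5 + 0.15 + 0 = 0.65. Calling score_level once per level (B1 to B3, then B11 to B32, and so on) reproduces the level-by-level calculation.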
It should be added that the cascading application system is not limited to that shown in fig. 2B, but may also be other cascading relationship diagrams in different forms, for example, a mesh diagram reflecting a multi-application relationship between a service provider and a service consumer, or an application relationship diagram of direct cascading and cross-layer cascading, or an application relationship diagram of cluster deployment where multiple applications in an upper stage and a lower stage are not distinguished from each other, or a mesh diagram of cluster deployment where multiple applications in a lower stage are collocated, or a mesh diagram of cluster deployment and a mesh diagram of cascade connection to another cluster deployment, which is not limited in this embodiment of the present application.
According to the technical scheme of the embodiment of the application, through the hierarchical evaluation of the parent application system and the child application system, the child application system is subjected to targeted exploration and the parent application system is subjected to comprehensive exploration, so that the fault root cause positioning and the root cause identification are more efficient and accurate.
EXAMPLE III
Fig. 3 is a schematic structural diagram of a fault root cause positioning device according to a third embodiment of the present application.
As shown in fig. 3, the apparatus 300 includes:
a cascade graph constructing module 310, configured to obtain a cascade relationship of each application system in a cascade application system in which a current application system is located, and construct a cascade relationship graph;
an evaluation value determining module 320, configured to determine a health evaluation value of the current application system according to evaluation index data of the current application system;
the cascade evaluation module 330 is configured to perform health evaluation on a parent application system and a child application system of the current application system according to the health evaluation value of the current application system and the cascade relationship diagram;
and the fault root cause positioning module 340 is configured to position a fault root cause of the cascading application system based on the health evaluation result and the health evaluation value of the current application system.
According to the technical scheme of the embodiment of the application, the health evaluation is carried out on the current application system, the parent application system and the child application system of the current application system, and the diagnosis and root cause positioning are carried out on the faults existing in the cascade application system according to the health evaluation values of the application systems. The health evaluation of the application system by different evaluation indexes can improve the comprehensiveness and accuracy of the health evaluation, and the defect that fault root cause positioning cannot be carried out due to the fact that no alarm data is generated when a fault occurs in the prior art is overcome. Meanwhile, the fault root is comprehensively positioned through the health evaluation of the parent application system and the child application system, and the accuracy and the efficiency of the fault root positioning are improved.
In an alternative embodiment, the cascade evaluation module 330 may be specifically configured to:
and if the health evaluation value of the current application system accords with the preset health threshold value, performing health evaluation on the parent application system and the child application system according to the cascade relation graph.
In an alternative embodiment, the sub-application systems include a shallow sub-application system and a deep sub-application system; accordingly, the cascade evaluation module 330 may include:
the shallow evaluation unit is used for determining the health evaluation value of the shallow sub-application system according to the evaluation index data of the shallow sub-application system;
and the deep evaluation unit is used for carrying out health evaluation on the deep sub-application system of the shallow sub-application system if the health evaluation value of the shallow sub-application system meets a preset health threshold value.
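The interplay of the shallow evaluation unit and the deep evaluation unit can be sketched in Python as a threshold-gated descent. This is only an illustration under stated assumptions: a larger health evaluation value is taken to mean a higher failure probability, "meets the preset health threshold" is read as "greater than the threshold" (as in claim 3), and the names explore_children, children_of, and score_of are hypothetical.

```python
def explore_children(children_of, score_of, current, threshold):
    """Evaluate sub-application systems, descending only below suspicious ones.

    children_of: dict mapping a system name to its direct sub-application systems.
    score_of:    callable returning the health evaluation value of a system.
    threshold:   preset health threshold (assumed: higher value = more suspicious).
    Returns a dict of every evaluated sub-application system and its value.
    """
    scores = {}

    def visit(system):
        for child in children_of.get(system, []):
            if child in scores:
                continue  # already evaluated through another path
            scores[child] = score_of(child)
            # Deep sub-application systems are only evaluated when the
            # shallow one exceeds the threshold.
            if scores[child] > threshold:
                visit(child)

    visit(current)
    return scores
```

In the layout of fig. 2B, if only B4 exceeds the threshold, then B41 and B42 are evaluated while B51, B52, B61, and B62 are never touched, which is what makes the exploration of the sub-application systems targeted.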
In an alternative embodiment, the parent application system includes at least one hierarchy; accordingly, the cascade evaluation module 330 may include:
and the parent system evaluation unit is used for determining the health evaluation value of the parent application system of at least one hierarchy according to the evaluation index data of the parent application system.
In an alternative embodiment, the apparatus 300 may be further configured to:
if the system evaluation amount of the parent application systems or the child application systems on which health evaluation has been performed meets the preset evaluation amount, stop performing health evaluation on the cascade application system; wherein the evaluation amount comprises the number and/or the hierarchy of the application systems.
In an optional implementation manner, the fault root cause locating module 340 may specifically be configured to:
and determining the fault application system of the cascade application system according to the health evaluation value of the current application system, the health evaluation value of the parent application system and the health evaluation value of the child application system.
In an optional embodiment, the evaluation index data includes service index data, performance index data, and alarm data; accordingly, the evaluation value determining module 320 may specifically be configured to:
and determining the health evaluation value of the current application system according to at least one of the service index data, the performance index data and the alarm data of the current application system.
The fault root cause positioning device provided by the embodiment of the application can execute the fault root cause positioning method provided by any embodiment of the application, and has the corresponding functional modules and beneficial effects of executing each fault root cause positioning method.
EXAMPLE IV
The fourth embodiment of the present application is a preferred embodiment of a fault root cause positioning system provided on the basis of the foregoing embodiments. The fault root cause positioning system described in this embodiment can be applied to scenarios of diagnosing and locating the fault root cause of a cascade application system, and has the functions and beneficial effects of the methods and apparatuses in the foregoing embodiments. The fault root cause positioning system is shown in fig. 4 and is described in detail as follows:
the fault root cause positioning system includes a plurality of different functional modules (the functional modules described here differ from the functional modules in the third embodiment and are used for illustration in this embodiment only), such as a data acquisition module, a data storage module, a data processing module, a fault positioning module, a result storage module, and an interface display module. The functions and the number of the above modules do not limit the present application. Those skilled in the art should understand that integrating, deleting, modifying, or replacing some modules of the positioning system does not affect the overall function of the system, and such variations also fall within the protection scope of the present application.
The types of data collected in this application include, but are not limited to: golden service index data of an application system, CMDB service architecture data supporting the service system, the various call-relationship data on which different nodes in the application system depend, performance index data reflecting the hardware or software state of each node in the service system at run time, and alarm data generated when a performance index or a golden service index is abnormal.
The main function of the data acquisition module included in the fault root cause positioning system disclosed in this application is to collect data of multiple different dimensions in real time and to ensure that data from different sources all converge on a unified data storage module. Its collection methods include, but are not limited to: accessing real-time data from a Kafka data source, acquiring real-time data through a RESTful API, and polling a data source to obtain various real-time data.
The main function of the data storage module included in the fault root cause positioning system is to persistently store the multi-dimensional data collected by the data acquisition module. Its storage carrier may be any of various relational databases, such as MySQL and Oracle, or any of various non-relational stores, such as MongoDB, Elasticsearch, HDFS, HBase, and GBase. The main consideration for the data storage module is the hard-disk capacity of the storage system. The data storage is not limited to a single database and may be a combination of several different databases, for example by storing data jointly across several databases and then establishing index mapping relationship nodes. For instance, performance index data is usually very large in volume and may be placed in ES (Elasticsearch, a distributed search and analysis engine) under a separately built index, while alarm data may be stored in relational databases such as MySQL and Oracle, or in non-relational (NoSQL) databases such as MongoDB. A data mapping relationship is then established for each node according to the type, address, port number, and so on of the database in which the data is stored, thereby constructing the index mapping relationship nodes.
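One possible shape for the index mapping relationship nodes mentioned above is sketched below in Python. The text only states that the mapping is built from the database type, address, and port number; the class name StoreNode, the lookup function locate, the data-kind keys, and the host names are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoreNode:
    """One index mapping relationship node: where a kind of data is stored."""
    db_type: str
    address: str
    port: int


# Hypothetical mapping: each kind of data points at the store that holds it.
INDEX_MAP = {
    "performance": StoreNode("elasticsearch", "es.example.internal", 9200),
    "alarm": StoreNode("mysql", "mysql.example.internal", 3306),
}


def locate(data_kind):
    """Return the storage node for a kind of data; raises KeyError if unmapped."""
    return INDEX_MAP[data_kind]
```

Keeping the mapping separate from the stores themselves lets later modules ask where a kind of data lives without knowing which database product is behind it.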
The main function of the data processing module included in the fault root cause positioning system is to perform fast stream processing on data acquired in real time; it may also perform historical batch processing on stored data. The technical frameworks adopted in the data processing module include, but are not limited to, Spark Streaming, Spark SQL, and Spark MLlib; data processing modules written in computer languages such as Python, Java, MATLAB, C++, and C# may also be used. The content of data processing includes, but is not limited to: data normalization, field alignment, field format standardization, missing-value filling, data distribution type conversion, and numerical conversion of categorical data. The data processing module is equivalent to a buffer layer, whose purpose is to ensure that the collected data is delivered to the fault positioning module after unified, standardized processing, so as to reduce the computing load on the fault positioning module and shorten the time needed to locate the fault root cause.
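Three of the processing steps listed above (field alignment, missing-value filling, and numerical conversion of categorical data) can be illustrated with a small Python sketch. The concrete rules are assumptions chosen for illustration, since the text does not prescribe any: fields are renamed through a lookup table, missing values are filled with the mean of the observed values, and categories are numbered in order of first appearance. All function names are hypothetical.

```python
def align_fields(record, field_map):
    """Rename source-specific field names to a unified schema."""
    return {field_map.get(key, key): value for key, value in record.items()}


def fill_missing(values):
    """Replace None with the mean of the observed values (one possible rule)."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed) if observed else 0.0
    return [mean if v is None else v for v in values]


def encode_categories(labels):
    """Map each distinct label to an integer code in order of first appearance."""
    codes = {}
    return [codes.setdefault(label, len(codes)) for label in labels]
```

For instance, align_fields({"cpu_pct": 71}, {"cpu_pct": "cpu_usage"}) yields a record keyed by cpu_usage, so that data arriving from different sources shares one field name before it reaches the fault positioning module.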
The fault root cause positioning system comprises a fault positioning module which has the main functions of comprehensively analyzing the multidimensional data obtained by the data processing module, calculating the fault probability (equivalent to the health evaluation value) of each level of application system in the cascade application system, finally positioning the application system with the fault and determining the application system of the fault root cause. The root cause result obtained by the fault positioning module can be further stored in the result storage module.
The main function of the result storage module included in the fault root cause positioning system is to store data such as the intermediate calculation results of the fault positioning module and the fault root cause results obtained by positioning. Because some intermediate results are only stored temporarily, they can be stored in message queues such as Kafka and RabbitMQ, or in databases such as Redis, MySQL, MongoDB, Elasticsearch, and Oracle. Because part of the data is stored temporarily, the result storage module is provided with a scheduled cleaning mechanism: under certain preset trigger conditions, the module automatically cleans its intermediate results so that the temporary results stay up to date. The preset trigger conditions may include, but are not limited to: cleaning starts when the data storage time exceeds a preset value (such as 1 month, 10 days, or 24 hours); cleaning starts when the disk space occupied by the data exceeds a preset value (such as 1 GB, 10 GB, or 100 MB); or a clear-cache button is provided on the interface, and cleaning is triggered manually when the user clicks the button.
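The three trigger conditions above can be combined into a single check, sketched here in Python. The function name should_clean and its parameter names are hypothetical, and the thresholds are passed in by the caller rather than fixed, because the text only gives example values.

```python
import time


def should_clean(oldest_entry_ts, used_bytes, max_age_seconds, max_bytes,
                 manual=False, now=None):
    """Return True when any of the cleaning triggers fires.

    oldest_entry_ts: Unix timestamp of the oldest stored intermediate result.
    used_bytes:      disk space currently occupied by the stored data.
    max_age_seconds: preset storage-time limit (e.g. 24 * 3600 for 24 hours).
    max_bytes:       preset disk-space limit (e.g. 100 * 1024 ** 2 for 100 MB).
    manual:          True when the user clicked the clear-cache button.
    now:             current time; defaults to time.time(), injectable for tests.
    """
    now = time.time() if now is None else now
    too_old = now - oldest_entry_ts > max_age_seconds
    too_large = used_bytes > max_bytes
    return manual or too_old or too_large
```

A scheduler could call this check periodically and run the actual cleaning only when it returns True.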
The main functions of the interface display module in the fault root cause positioning system are to display various data through an interface and to provide an interactive interface for human-machine communication with the user. More specifically, the interface display module can display the original, completely unprocessed multi-dimensional data (the different evaluation index data and so on), so that technicians can conveniently check the trend of the raw data and, when the accuracy of the root cause positioning algorithm does not meet requirements, diagnose and investigate the fault root cause based on manual experience. The final result obtained by the fault positioning module in this embodiment can also be displayed on the interface, so that the relevant technicians can check the positioning result and carry out troubleshooting and repair accordingly. In addition, to make it easier for technicians to debug the built-in algorithm, the interface display module can provide an interactive interface through which technicians configure different parameters for the fault root cause positioning algorithm module; the parameter configuration is saved to a back-end configuration file and used for parameter selection the next time the algorithm is run.
EXAMPLE V
FIG. 5 illustrates a schematic structural diagram of an electronic device 10 that may be used to implement embodiments of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 5, the electronic device 10 includes at least one processor 11, and a memory communicatively connected to the at least one processor 11, such as a Read Only Memory (ROM) 12, a Random Access Memory (RAM) 13, and the like, wherein the memory stores a computer program executable by the at least one processor, and the processor 11 can perform various suitable actions and processes according to the computer program stored in the Read Only Memory (ROM) 12 or the computer program loaded from a storage unit 18 into the Random Access Memory (RAM) 13. In the RAM 13, various programs and data necessary for the operation of the electronic apparatus 10 can also be stored. The processor 11, the ROM 12, and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to bus 14.
A number of components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, or the like; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. Processor 11 performs the various methods and processes described above, such as the fault root cause location method.
In some embodiments, the fault root cause location method may be implemented as a computer program tangibly embodied in a computer-readable storage medium, such as storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into RAM 13 and executed by processor 11, one or more steps of the fault root cause localization method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the fault root cause location method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for implementing the methods of the present application may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be performed. A computer program can execute entirely on a machine, partly on a machine, as a stand-alone software package partly on a machine and partly on a remote machine or entirely on a remote machine or server.
In the context of this application, a computer readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the internet.
The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical host and VPS service are overcome.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solution of the present application can be achieved, and the present invention is not limited thereto.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. A method for locating a fault root cause, the method comprising:
acquiring the cascade relation of each application system in the cascade application system of the current application system, and constructing a cascade relation graph;
determining the health evaluation value of the current application system according to the evaluation index data of the current application system;
according to the health evaluation value of the current application system and the cascade relation graph, performing health evaluation on a parent application system and a child application system of the current application system;
and positioning the fault root of the cascade application system based on the health evaluation result and the health evaluation value of the current application system.
2. The method of claim 1, wherein the performing the health evaluation on the parent application system and the child application system of the current application system according to the health evaluation value and the cascade relationship graph comprises:
and if the health evaluation value of the current application system accords with a preset health threshold value, performing health evaluation on the parent application system and the child application system according to the cascade relation graph.
3. The method of claim 2, wherein the sub-application systems comprise a shallow sub-application system and a deep sub-application system; correspondingly, the performing health evaluation on the sub-application system according to the cascade relationship diagram includes:
determining the health evaluation value of the shallow sub-application system according to the evaluation index data of the shallow sub-application system;
and if the health evaluation value of the shallow sub-application system is greater than the preset health threshold value, performing health evaluation on a deep sub-application system of the shallow sub-application system.
4. The method of claim 2, wherein the parent application system comprises at least one tier; correspondingly, the performing health evaluation on the parent application system further includes:
and determining the health evaluation value of the parent application system of at least one hierarchy according to the evaluation index data of the parent application system.
5. The method of claim 4, further comprising:
if the system evaluation amount of the parent application system or the child application system for health evaluation meets a preset evaluation amount, controlling to stop health evaluation on the cascade application system; wherein the evaluation quantity comprises the number and/or the hierarchy of the application systems.
6. The method of claim 5, wherein locating the fault root cause of the cascading application system based on the health evaluation result and the health evaluation value of the current application system comprises:
and determining a fault application system of the cascade application system according to the health evaluation value of the current application system, the health evaluation value of the parent application system and the health evaluation value of the child application system.
7. The method according to any of claims 1-6, wherein the evaluation index data comprises business index data, performance index data, and alarm data; correspondingly, the determining the health evaluation value of the current application system according to the evaluation index data of the current application system includes:
and determining the health evaluation value of the current application system according to at least one of the service index data, the performance index data and the alarm data of the current application system.
8. A fault root cause locating device, comprising:
a cascade graph constructing module, configured to obtain the cascade relationship of each application system in the cascade application system in which the current application system is located, and construct a cascade relationship graph;
an evaluation value determining module, configured to determine a health evaluation value of the current application system according to the evaluation index data of the current application system;
the cascade evaluation module is used for carrying out health evaluation on a parent application system and a child application system of the current application system according to the health evaluation value of the current application system and the cascade relation graph;
and the fault root cause positioning module is used for positioning the fault root cause of the cascade application system based on the health evaluation result and the health evaluation value of the current application system.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the fault root cause locating method of any one of claims 1-7.
10. A computer-readable storage medium storing computer instructions which, when executed, cause a processor to implement the fault root cause locating method according to any one of claims 1-7.
CN202211121300.6A 2022-09-15 2022-09-15 Fault root cause positioning method and device, electronic equipment and storage medium Pending CN115409405A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211121300.6A CN115409405A (en) 2022-09-15 2022-09-15 Fault root cause positioning method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211121300.6A CN115409405A (en) 2022-09-15 2022-09-15 Fault root cause positioning method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115409405A (en) 2022-11-29

Family

ID=84165993

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211121300.6A Pending CN115409405A (en) 2022-09-15 2022-09-15 Fault root cause positioning method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115409405A (en)

Similar Documents

Publication Publication Date Title
CN115033463B (en) System exception type determining method, device, equipment and storage medium
CN113360722B (en) Fault root cause positioning method and system based on multidimensional data map
CN116049146B (en) Database fault processing method, device, equipment and storage medium
CN115396289B (en) Fault alarm determining method and device, electronic equipment and storage medium
CN112631887A (en) Abnormality detection method, abnormality detection device, electronic apparatus, and computer-readable storage medium
CN115509797A (en) Method, device, equipment and medium for determining fault category
CN116471174A (en) Log data monitoring system, method, device and storage medium
CN115409405A (en) Fault root cause positioning method and device, electronic equipment and storage medium
CN115794473A (en) Root cause alarm positioning method, device, equipment and medium
CN115437961A (en) Data processing method and device, electronic equipment and storage medium
CN114881112A (en) System anomaly detection method, device, equipment and medium
CN114896418A (en) Knowledge graph construction method and device, electronic equipment and storage medium
CN114885014A (en) Method, device, equipment and medium for monitoring external field equipment state
CN113722141A (en) Method and device for determining delay reason of data task, electronic equipment and medium
CN113656207B (en) Fault processing method, device, electronic equipment and medium
CN113553256B (en) AB test method and device and electronic equipment
CN115077906A (en) Engine high-occurrence fault cause determination method, engine high-occurrence fault cause determination device, electronic equipment and medium
CN117608896A (en) Transaction data processing method and device, electronic equipment and storage medium
CN117670128A (en) Data processing method and device
CN117573412A (en) System fault early warning method and device, electronic equipment and storage medium
CN115146986A (en) Data center equipment maintenance method, device, equipment and storage medium
CN114971695A (en) Industry trend prediction method, apparatus, device, medium, and program product
CN117608904A (en) Fault positioning method and device, electronic equipment and storage medium
CN116502841A (en) Event processing method and device, electronic equipment and medium
CN116150024A (en) Method and device for evaluating program to be tested and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination