CN113326161B - Root cause analysis method - Google Patents


Publication number
CN113326161B
Authority
CN
China
Prior art keywords
node
event
data
data node
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110610565.1A
Other languages
Chinese (zh)
Other versions
CN113326161A (en)
Inventor
张广意
刘超
冯经宇
李华桂
伍健君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
WeBank Co Ltd
Original Assignee
WeBank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by WeBank Co Ltd filed Critical WeBank Co Ltd
Priority to CN202110610565.1A priority Critical patent/CN113326161B/en
Publication of CN113326161A publication Critical patent/CN113326161A/en
Priority to PCT/CN2021/132783 priority patent/WO2022252512A1/en
Application granted granted Critical
Publication of CN113326161B publication Critical patent/CN113326161B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis

Abstract

The embodiment of the application provides a root cause analysis method, a root cause analysis device, electronic equipment and a computer storage medium; the method comprises the following steps: acquiring a dependency relationship among the data nodes and events on the data nodes; determining the weight of the event on the first data node according to the dependency relationship among the data nodes and/or the precedence relationship of the occurrence time of the event on each data node; determining root cause relations among the events on each data node according to the dependency relations among each data node, the events on each data node and the weights of the events on each data node; in the case where the event on each data node includes a target event, a root cause of the target event is determined according to a root cause relationship between the events on each data node.

Description

Root cause analysis method
Technical Field
The present application relates to information technology of financial technology (Fintech), and relates to, but is not limited to, a root cause analysis method.
Background
With the development of computer technology, more and more technologies are being applied in the financial field, and the traditional financial industry is gradually shifting toward financial technology (Fintech). However, the security and real-time requirements of the financial industry also place higher demands on these technologies.
At present, as services expand and architecture resources grow, more and more resources need to be monitored. A great number of events (such as alarm events) often occur in this setting, and they are complicated and redundant, which makes them inconvenient for operation and maintenance personnel to process. In actual scenarios, however, the events are usually related. For example, a host crash may cause alarms for the applications on the host and, in turn, alarms for services. Because events are usually sorted and sent by time, an alarm from a lower layer is likely to be buried under alarms from upper layers when operation and maintenance personnel receive the alarm information. They then need more time to process the events, which ultimately delays service recovery.
In the related art, the root cause of an event may be determined by a root cause analysis method, which is mainly dependent on experience of an operation and maintenance engineer and a development engineer. The root cause analysis method has lower accuracy and requires higher time cost and labor cost.
Disclosure of Invention
The embodiment of the application provides a root cause analysis method, which can solve the problem of lower accuracy of the root cause analysis method in the related technology.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides a root cause analysis method, which comprises the following steps:
acquiring a dependency relationship among data nodes and events on the data nodes;
determining the weight of an event on a first data node according to the dependency relationship among the data nodes and/or the time sequence relationship of the event on the data nodes, wherein the first data node represents any one of the data nodes;
determining root cause relations among the events on the data nodes according to the dependency relations among the data nodes, the events on the data nodes and the weights of the events on the data nodes;
and under the condition that the events on the data nodes comprise target events, determining the root cause of the target events according to the root cause relation among the events on the data nodes.
In some embodiments of the present application, determining the weight of the event on the first data node according to the precedence relationship of the occurrence time of the event on each data node includes:
when a current occurrence event on a first data node is detected, the initial weight of the first data node is used as the weight of the current occurrence event.
It will be appreciated that since the initial weight of the first data node may be directly taken as the weight of the current occurrence, it may be easier to determine the weight of the current occurrence.
In some embodiments of the present application, determining the weight of the event on the first data node according to the precedence relationship of the occurrence time of the event on each data node includes:
when a current occurrence of an event on a first data node is detected and the first data node has a first historical occurrence of an event, increasing a value of a weight of the first historical occurrence of an event.
It can be appreciated that increasing the value of the weight of the first historical occurred event makes that weight greater than the weight of the current occurrence event, so that the relationship between the weights of the first historical occurred event and the current occurrence event is reflected more accurately.
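The two time-order rules above can be illustrated with a minimal Python sketch. The class names `Event` and `DataNode`, the function `add_event`, and the fixed increment `bump` are hypothetical; the text does not state by how much the weight of a historical event is raised, so an increment of 1 is assumed here.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    event_id: str
    weight: int


@dataclass
class DataNode:
    node_id: str
    initial_weight: int
    events: list = field(default_factory=list)


def add_event(node, event_id, bump=1):
    """Record a new event on `node` using the two time-order rules."""
    # Events already on the node occurred earlier, so raise their weights.
    # The size of the increase (`bump`) is an assumption of this sketch.
    for earlier in node.events:
        earlier.weight += bump
    # The new event starts with the node's initial weight.
    new_event = Event(event_id, node.initial_weight)
    node.events.append(new_event)
    return new_event


node = DataNode("node-1", initial_weight=10)
first = add_event(node, "e1")
second = add_event(node, "e2")
```

After the second call, the earlier event `first` carries a larger weight than `second`, which is the ordering the rule is meant to produce.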
In some embodiments of the present application, determining the weight of the event on the first data node according to the dependency relationship between the data nodes includes:
When a current occurrence event on a first data node is detected and at least one second historical occurred event exists on a depended node of the first data node, increasing the value of the weight of the current occurrence event; the increase in the value of the weight of the current occurrence event is greater than or equal to the sum of the weights of the at least one second historical occurred event; the depended node of the first data node represents a data node that depends on the first data node.
It can be appreciated that, because the value of the weight of the current occurrence event is increased by an amount greater than or equal to the sum of the weights of the at least one second historical occurred event, the weight of the current occurrence event on the first data node can be made greater than the weight of each event on the depended node. The magnitude relation between these weights therefore reflects more accurately the dependency relationship between the current occurrence event and the events of the depended node.
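As a small worked example of this rule, the sketch below computes the weight of a newly occurred event from the weights of the events already present on the nodes that depend on its node. The function name `weight_of_new_event` and the sample numbers are hypothetical. The sketch uses an increase exactly equal to the sum, which is the smallest increase the rule allows, and assumes a positive initial weight.

```python
def weight_of_new_event(initial_weight, depended_event_weights):
    """Return the weight for an event that has just occurred on a node.

    initial_weight: the node's initial weight (assumed positive here).
    depended_event_weights: weights of events already present on the
        nodes that depend on this node.
    """
    # The increase must be at least the sum of those weights; this sketch
    # uses exactly the sum, the smallest increase the rule allows.
    increase = sum(depended_event_weights)
    return initial_weight + increase


upper_layer = [10, 12]  # hypothetical weights of two upper-layer events
new_weight = weight_of_new_event(10, upper_layer)
```

With a positive initial weight, the result exceeds every individual upper-layer weight, so the lower-layer event sorts ahead of the events it may have caused.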
In some embodiments of the present application, determining the weight of the event on the first data node according to the dependency relationship between the data nodes includes:
When a current occurrence event on a first data node is detected and at least one third historical occurred event exists on an i-th level dependent node of the first data node, increasing the value of the weight of each third historical occurred event of the at least one third historical occurred event, wherein the increase in the value of the weight of each third historical occurred event is greater than or equal to the highest weight among the events on the first data node; wherein i represents an integer greater than or equal to 1, the first data node depends on the 1st level dependent node of the first data node, and when i is greater than 1, the (i-1)-th level dependent node of the first data node depends on the i-th level dependent node of the first data node.
It can be appreciated that, since the weight of the third historical occurred event of the i-th level dependent node of the first data node can be increased, and the value of the weight of each third historical occurred event is increased by an amount greater than or equal to the highest weight of each event on the first data node, the weight of the occurred event of the i-th level dependent node of the first data node can be made greater than the weight of each event of the first data node, and thus, the magnitude relation between the weight of the third historical occurred event and the occurred event weight of the first data node can reflect the dependency relationship between the event of the first data node and the event of the dependent node more accurately.
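The propagation to dependent nodes of every level can be sketched as a graph walk. In the Python sketch below, `bump_dependencies`, the dictionary layout, and the node names are hypothetical. It assumes an acyclic dependency graph, uses an increase exactly equal to the highest weight on the first data node, and raises each dependent node only once even if it is reachable by several paths; the text does not specify that last point.

```python
def bump_dependencies(depends_on, event_weights, node):
    """Raise the weights of existing events on every node `node` depends on.

    depends_on:    dict mapping a node to the nodes it depends on.
    event_weights: dict mapping a node to the list of its event weights.
    node:          the node on which a new event has just been recorded.
    """
    # The increase must be at least the highest weight on `node`;
    # this sketch uses exactly that value.
    increase = max(event_weights[node])
    seen = set()
    frontier = list(depends_on.get(node, []))
    while frontier:
        dep = frontier.pop()
        if dep in seen:
            continue  # assumption: raise each dependency only once
        seen.add(dep)
        event_weights[dep] = [w + increase for w in event_weights.get(dep, [])]
        # Continue to the next dependency level.
        frontier.extend(depends_on.get(dep, []))
    return event_weights


depends_on = {"app": ["host"], "host": ["disk"]}
weights = {"app": [10], "host": [5], "disk": [7]}
bump_dependencies(depends_on, weights, "app")
```

After the call, the events on `host` and `disk` both outweigh the event on `app`, matching the intent that events on lower layers rank above the events that depend on them.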
In some embodiments of the present application, determining the weight of the event on the first data node according to the dependency relationship between the data nodes includes:
when it is determined that a deleted event exists on the first data node and at least one third history occurred event exists on an i-th level dependent node of the first data node, reducing a value of a weight of each third history occurred event of the at least one third history occurred event by an amount equal to a weight of the deleted event.
It can be appreciated that, when a deleted event exists on the first data node, the weight of each third historical occurred event on the i-th level dependent node of the first data node is reduced by an amount equal to the weight of the deleted event, so that the association relationship between the events of the first data node and the events of the dependent node is reflected more accurately.
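The deletion rule mirrors the addition rule and can be sketched in the same way. The function `delete_event`, the nested dictionary layout, and the sample IDs below are hypothetical, and the dependency graph is assumed to be acyclic.

```python
def delete_event(depends_on, events, node, event_id):
    """Delete an event and lower the weights of events on its dependencies.

    depends_on: dict mapping a node to the nodes it depends on.
    events:     dict mapping a node to a dict of event_id -> weight.
    Returns the weight of the deleted event.
    """
    removed_weight = events[node].pop(event_id)
    seen = set()
    frontier = list(depends_on.get(node, []))
    while frontier:
        dep = frontier.pop()
        if dep in seen:
            continue
        seen.add(dep)
        # Reduce every existing event on this dependency by the
        # weight of the deleted event.
        for dep_event_id in events.get(dep, {}):
            events[dep][dep_event_id] -= removed_weight
        frontier.extend(depends_on.get(dep, []))
    return removed_weight


depends_on = {"app": ["host"]}
events = {"app": {"a1": 10}, "host": {"h1": 25}}
removed = delete_event(depends_on, events, "app", "a1")
```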
In some embodiments of the present application, the determining the root cause relationship between the events on the data nodes according to the dependency relationship between the data nodes, the events on the data nodes, and the weights of the events on the data nodes includes:
Setting the initial value of the level of each data node to 0, and determining the valid data nodes of the 1st level among the data nodes, wherein a valid data node of the 1st level has an occurred event;
when j is an integer greater than or equal to 1, searching for the depended nodes of the valid data nodes of the j-th level; when a depended node of a valid data node of the j-th level has an occurred event and the level of that depended node is 0, updating the level of that depended node to j+1; when the depended nodes of the valid data nodes of the j-th level have no occurred event, or when no depended node exists for the valid data nodes of the (j+1)-th level, determining a dependency link, wherein the dependency link is used for representing the level dependency of the valid data nodes of each level;
determining root cause relations among the events on each data node according to the dependency links and the events on each data node; the root cause relationship comprises a weight size relationship of each event of the valid data nodes of the same hierarchy.
It can be seen that, in the embodiment of the present application, from the valid data nodes of the 1 st hierarchy, the valid data nodes of each hierarchy are accurately determined through the dependency relationship between the valid data nodes, so that the dependency relationship link is accurately determined, which is favorable for accurately determining the root cause relationship between the events on each data node. In addition, the root cause relation among the events on each data node can intuitively embody the vertical arrangement sequence of the events among different levels and the horizontal arrangement sequence among different events among the same level, and is beneficial to the operation and maintenance personnel to accurately analyze the root cause of the target event.
In some embodiments of the present application, the method further comprises:
determining that a data node without an occurred event is an invalid data node among the data nodes; starting from the invalid data node, searching for the depended nodes of the invalid data node; and when a depended node of the invalid data node has an occurred event, using that depended node as a valid data node of the 1st level.
It can be understood that an invalid data node has no bearing on the root cause relationship between events. When a depended node of the invalid data node is a valid data node, that depended node is suitable as the head node of a dependency link, so using it as a valid data node of the 1st level helps to determine the head node accurately and, in turn, the complete dependency link.
In some embodiments of the present application, the determining the valid data node of the 1 st hierarchy among the data nodes includes:
screening valid data nodes without dependent nodes from the data nodes; and taking the valid data nodes without dependent nodes as the valid data nodes of the 1st level.
It can be appreciated that a valid data node without dependent nodes is suitable as the head node of a dependency link, so using it as a valid data node of the 1st level helps to determine the head node accurately and, in turn, the complete dependency link.
In some embodiments of the present application, the method further comprises: when a depended node of a valid data node of the j-th level has an occurred event and the level of that depended node is not 0, increasing the value of the level of that depended node by 1.
It can be seen that for the same data node that exists in different dependency links, a larger value of the hierarchy in the different dependency links can be taken, which is beneficial for accurately and uniquely determining the hierarchy of the same data node in the different dependency links.
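The level assignment described above can be approximated by a level-by-level traversal in which each node with events keeps the largest level at which it is reached. The Python sketch below is a simplified illustration, not the claimed procedure: `assign_levels` and its arguments are hypothetical, the traversal is assumed to move from a node to the nodes that depend on it, the graph is assumed to be acyclic, and the handling of invalid head nodes is omitted.

```python
def assign_levels(depended_by, has_event, heads):
    """Return a dict mapping each node to its level (0 = not on any link).

    depended_by: dict mapping a node to the nodes that depend on it.
    has_event:   set of nodes that currently have at least one event.
    heads:       level-1 nodes (nodes with events and no dependencies).
    """
    level = {node: 0 for node in depended_by}
    for head in heads:
        level[head] = 1
    current, j = list(heads), 1
    while current:
        following = []
        for node in current:
            for upper in depended_by[node]:
                if upper not in has_event:
                    continue  # the link stops at nodes without events
                if level[upper] < j + 1:
                    level[upper] = j + 1  # keep the larger level
                    following.append(upper)
        current, j = following, j + 1
    return level


depended_by = {"disk": ["host", "app"], "host": ["app"], "app": ["lb"], "lb": []}
levels = assign_levels(depended_by, has_event={"disk", "host", "app"}, heads=["disk"])
```

In this example `app` is reachable from `disk` both directly and through `host`; it ends up at level 3, the larger of the two candidate levels, while `lb` stays at 0 because it has no event.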
The embodiment of the application provides a root cause analysis device, which comprises:
the acquisition module is used for acquiring the dependency relationship among the data nodes and the events on the data nodes;
the first processing module is used for determining the weight of the event on the first data node according to the dependency relationship among the data nodes and/or the time sequence relationship of the event on the data nodes; determining root cause relations among the events on the data nodes according to the dependency relations among the data nodes, the events on the data nodes and the weights of the events on the data nodes; the first data node represents any one of the data nodes;
and the second processing module is used for determining the root cause of the target event according to the root cause relation among the events on the data nodes under the condition that the events on the data nodes comprise the target event.
An embodiment of the present application provides an electronic device, including:
a memory for storing executable instructions;
and the processor is used for realizing any root cause analysis method when executing the executable instructions stored in the memory.
The embodiment of the application provides a computer readable storage medium, which stores executable instructions for implementing any root cause analysis method when being executed by a processor.
In the embodiment of the application, firstly, the dependency relationship among the data nodes and the event on each data node are obtained; then, determining the weight of an event on a first data node according to the dependency relationship among the data nodes and/or the occurrence time sequence relationship of the event on the data nodes, wherein the first data node represents any one of the data nodes; determining root cause relations among the events on the data nodes according to the dependency relations among the data nodes, the events on the data nodes and the weights of the events on the data nodes; and finally, under the condition that the events on the data nodes comprise target events, determining the root cause of the target events according to the root cause relation among the events on the data nodes.
It can be seen that, according to the embodiment of the application, the root cause relation between the events on each data node can be determined according to the objectively existing dependency relation between each data node and the events on each data node, and experience of operation and maintenance engineers and development engineers is not needed, so that accuracy of a root cause analysis method is improved to a certain extent, and time cost and labor cost are reduced. Further, in the embodiment of the present application, priority ordering and relationship sorting may be performed on events of each data node according to dependency relationships between each data node and weights of events on each data node.
Drawings
FIG. 1 is a flow chart of a root cause analysis method according to an embodiment of the present application;
FIG. 2 is an exemplary system infrastructure diagram in accordance with an embodiment of the present application;
FIG. 3 is another exemplary system infrastructure diagram derived from FIG. 2 in accordance with an embodiment of the present application;
FIG. 4A is a diagram illustrating the association of a data structure of a data node with a data structure of an event in one embodiment of the present application;
FIG. 4B is a second schematic diagram of association of a data structure of a data node and a data structure of an event in an embodiment of the present application;
FIG. 5 is a flow chart of determining weights of events on data nodes with event addition in an embodiment of the present application;
FIG. 6 is a schematic diagram of an association relationship between a data structure including data nodes and a data structure of an event in the event addition in the embodiment of the present application;
FIG. 7 is a schematic diagram of another association relationship between a data structure including data nodes and a data structure of an event in the event addition in the embodiment of the present application;
FIG. 8 is a schematic diagram of a further association of a data structure including data nodes and a data structure of an event in the event addition case in an embodiment of the present application;
FIG. 9 is a schematic diagram of a further association of a data structure including data nodes and a data structure of an event in the event addition case in an embodiment of the present application;
FIG. 10 is a flow chart of determining the weight of an event on a data node in the event of an event deletion in an embodiment of the present application;
FIG. 11 is a schematic diagram of an association relationship between a data structure including data nodes and a data structure of an event in the event deletion case according to an embodiment of the present application;
FIG. 12 is yet another exemplary system infrastructure diagram derived in an embodiment of the present application;
FIG. 13 is a first schematic diagram of determining a link index corresponding to a data node of each level in the embodiment of the present application;
FIG. 14 is a second schematic diagram of determining a link index corresponding to a data node of each level in the embodiment of the present application;
FIG. 15 is a third schematic diagram of determining a link index corresponding to a data node of each level in the embodiment of the present application;
FIG. 16 is a diagram of an exemplary event tree relationship resulting from an embodiment of the present application;
FIG. 17 is a schematic diagram of a root cause analysis device according to an embodiment of the present application;
FIG. 18 is a schematic structural diagram of an electronic device in an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the accompanying drawings and examples. It should be understood that the examples provided herein are for the purpose of illustrating the present application only and are not intended to limit the present application. In addition, the embodiments provided below are some of the embodiments for implementing the present application, and not all of the embodiments for implementing the present application, and the technical solutions described in the embodiments of the present application may be implemented in any combination without conflict.
It should be noted that, in the embodiments of the present application, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a method or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such method or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other related elements in the method or apparatus comprising the element (for example, a step in a method or a unit in an apparatus, where a unit may be part of a circuit, part of a processor, part of a program or software, and so on).
For example, the root cause analysis method provided in the embodiment of the present application includes a series of steps, but the root cause analysis method provided in the embodiment of the present application is not limited to the described steps, and similarly, the root cause analysis apparatus provided in the embodiment of the present application includes a series of modules, but the apparatus provided in the embodiment of the present application is not limited to the explicitly described modules, and may also include modules that are required to be set for acquiring related information or performing processing based on the information.
In the related art, the root cause of an event may be determined by a root cause analysis method; the root cause analysis method in the related art is described below with an alarm event as an example. To restore service quickly when facing hundreds of alarm events, one may try to classify the alarm events starting from the uppermost alarm event. For example, alarm events may be divided into host alarm events and service alarm events. Host alarm events may be further classified by host type and operating system: the host type may be a virtual machine or a physical machine, and the operating system may be Linux, Windows, or another operating system. Alarm events may also be classified by metric into central processing unit (CPU) alarm events, memory alarm events, disk alarm events, and so on. After classification, a classification tag may be added to each alarm event, so that when an alarm event occurs, operation and maintenance personnel can immediately determine its type. For example, when a physical machine alarm event, a virtual machine alarm event, and an application alarm event occur at the same time, the root cause can be located according to the dependency relationship among the alarm events: because the service application is deployed on the virtual machine and the virtual machine is deployed on the physical machine, both the virtual machine and the service application may be affected by the physical machine, and the problem can be located quickly. In this root cause locating scheme, tags mark the type of each alarm event, which addresses the problem of alarm events being buried and provides more concise and intuitive information.
However, in the related art, the root cause analysis scheme of the alarm event has the following disadvantages: firstly, the root cause analysis scheme of the alarm event mainly depends on experience of operation and maintenance personnel, so that accuracy of the root cause positioning scheme is reduced; secondly, the association relation between the alarm events can be obtained by inquiring the data dependency information in the configuration management database (Configuration Management Database, CMDB), but the relation of the label level of each alarm event obtained in the CMDB is not consistent with the relation of the actual alarm event; thirdly, the alarm events of the same kind are arranged only by virtue of time sequence relations, and the dependency relations of the alarm events of the same kind are not easy to determine.
For example, suppose virtual machine A and virtual machine B run on physical machine C. When virtual machine A becomes abnormal, its memory consumption increases, so memory usage on physical machine C rises, the memory available to virtual machine B shrinks, and programs may even run out of memory. In this situation, virtual machine A, virtual machine B, and physical machine C all generate alarm events. If the events are categorized by virtual machine label, virtual machines A and B are grouped together and the physical machine is placed in another class, and the physical machine may be judged directly from the dependency relationship to be the root cause of the alarm events. In the actual situation, however, virtual machine A is the real root cause.
In summary, the root cause analysis scheme of the alarm event in the related technology has the problem of lower accuracy.
Aiming at the technical problems, the technical scheme of the embodiment of the application is provided. Embodiments of the present application may be applied to terminals and/or servers where the terminals may be thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, programmable consumer electronics, network personal computers, small computer systems, and the like. The servers may be small computer systems, large computer systems, and distributed cloud computing technology environments including any of the above, among others.
An electronic device such as a server may include program modules for executing computer instructions. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc., that perform particular tasks or implement particular abstract data types. The computer system/server may be implemented in a distributed cloud computing environment in which tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computing system storage media including memory storage devices.
FIG. 1 is a flowchart of a root cause analysis method according to an embodiment of the present application. As shown in FIG. 1, the flow may include:
Step 101: acquiring the dependency relationship among the data nodes and the events on the data nodes.
In this embodiment, each data node may be a data node of a data system, where the data system may include different types of data nodes such as a system process, a host, a database, a disk storage, and load balancing, and a dependency relationship between each data node in the data system is known.
The dependency relationship among the data nodes in the data system can be presented through a system basic architecture diagram, the system basic architecture diagram can comprise attribute data of each data node, and the attribute data of each data node can comprise unique identifiers of other data nodes on which the data node depends, so that the dependency relationship of different data nodes can be declared.
For example, referring to FIG. 2, a data system may include the following data nodes: the user management system process, the host vm-linux-01, the disk storage sas-01, the database mysql-01, the password management system process uniquely identified as password-system, the host vm-linux-02, and the load balancing nginx-01. The attribute data of each data node may be represented by the contents of a rectangular box. For example, in FIG. 2, the attribute data of the user management system process indicates that the host on which the process depends is the host vm-linux-01, that the database on which it depends is the database mysql-01, and that it depends on the password management system process uniquely identified as password-system.
In FIG. 2, the dependency between different data nodes may be represented by a line with an arrow, for example, a user management system process runs on host vm-linux-01, a 100GB disk of host vm-linux-01 comes from disk storage sas-01, the data of the user management system process is placed on database mysql-01, and the user management system process depends on a password management system process uniquely identified as password-system; in a practical scenario, when the user management system involves password authentication, authentication needs to be performed by using a password management system process uniquely identified as a password-system, and the password management system process uniquely identified as the password-system depends on a host vm-linux-02 and a load balancing nginx-01.
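The dependencies described for FIG. 2 can be written down as a small adjacency map, from which the reverse direction (which nodes depend on a given node) follows by inversion. In the Python sketch below, the node IDs other than `user-system` come from the description above; `user-system` is a hypothetical ID for the user management system process, whose identifier is not given, and the function name `invert` is likewise an assumption.

```python
# Node -> nodes it depends on, transcribed from the description of FIG. 2.
# "user-system" is a hypothetical ID for the user management system process.
depends_on = {
    "user-system": ["vm-linux-01", "mysql-01", "password-system"],
    "vm-linux-01": ["sas-01"],
    "password-system": ["vm-linux-02", "nginx-01"],
    "mysql-01": [],
    "sas-01": [],
    "vm-linux-02": [],
    "nginx-01": [],
}


def invert(depends_on):
    """Build the reverse map: node -> nodes that depend on it."""
    depended_by = {node: [] for node in depends_on}
    for node, dependencies in depends_on.items():
        for dependency in dependencies:
            depended_by[dependency].append(node)
    return depended_by


depended_by = invert(depends_on)
```

Keeping both directions available is convenient because the weighting rules walk toward the nodes a node depends on, while the level assignment walks toward the nodes that depend on it.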
In this embodiment of the present application, a data structure may be defined for each data node. Table 1 is the data structure table of a data node in this embodiment. In Table 1, the fields of the data structure of a data node may include a unique identification number (Identity Document, ID) field, an initial weight (PriorityValue, PV) field, a reference set field of the dependent nodes, a reference set field of the depended nodes, and an event set field of the present node. The unique ID field may also be referred to as the NodeId field, the initial weight field as the PriorityValue field, the reference set field of the dependent nodes as the RightRelatedNodes field, the reference set field of the depended nodes as the LeftRelatedNodes field, and the event set field of the present node as the Events field. The NodeId field represents the unique ID of the data node. The PriorityValue field represents the initial weight of the data node, which reflects the importance of the data node. In an actual scenario, different data nodes may differ in importance, so different initial weights may be set for them. For example, the user management system process belongs to a production application while the password management system process belongs to a test environment application, so the two processes differ in importance and may be given different initial weights. The Events field represents the events that have occurred on the data node; in this embodiment of the present application, such events may be various abnormal events, such as alarm events.
In some embodiments, the fields of the data structure of the data node may also include a link index field, which may be referred to as the ListIndex field. The link index represents the level of the data node in a dependency link, where the dependency link characterizes the level dependencies among the data nodes. Illustratively, the initial value of the link index in the data structure of each data node may be set to 0, that is, the initial value of the level of each data node is set to 0. Illustratively, the dependency links may be presented as a tree graph.
In Table 1, the LeftRelatedNodes field represents the set of depended nodes of a data node, where the depended nodes of a data node are the data nodes that depend on it; for example, referring to FIG. 2, the depended node of disk storage sas-01 is host vm-linux-01, and the depended node of database mysql-01 is the user management system process. The RightRaelatedNodes field represents the set of dependent nodes of the data node, which characterizes the data nodes on which the data node depends; for example, referring to FIG. 2, the dependent nodes of the password management system process uniquely identified as password-system include host vm-linux-02 and load balancing nginx-01. In practical applications, the LeftRelatedNodes field and the RightRaelatedNodes field need only record address reference information of the depended nodes and the dependent nodes, and need not include their detailed data structure information, which reduces the space occupied by the data structure of the data node.
It will be appreciated that by determining the dependent node and the depended node of each data node, the dependency relationship between the data nodes may be determined.
TABLE 1
In this embodiment of the present application, when an event occurs on a data node, the event may be recorded through a data structure of the event. Table 2 is a data structure table of an event in the embodiment of the present application. In Table 2, the fields of the data structure of the event may include a unique ID field, an event weight field, a corresponding data node ID field, an occurrence time field, and a sub-event set field. The unique ID field may be referred to as the EventId field, the event weight field as the PriorityValue field, the corresponding data node ID field as the MapNodeId field, the occurrence time field as the StartTime field, and the sub-event set field as the ChildrenEvents field. The EventId field represents the unique ID of the event; the PriorityValue field represents the weight of the event, which may change under the influence of other events; the MapNodeId field indicates the data node on which the event is located; the StartTime field indicates the time point at which the event occurred; and the ChildrenEvents field indicates the events of the dependent node of the node where the event is located, i.e., other events that may be caused by the event.
TABLE 2
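The two data structures described for Table 1 and Table 2 can be sketched as follows. This is a minimal illustration, not the patented implementation: the Python attribute names, the default initial weight of 1, the node ID "user-system", and the helper `link` are assumptions introduced here; only the field meanings come from the description above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Event:
    event_id: str        # EventId: unique ID of the event
    priority_value: int  # PriorityValue: current weight of the event
    map_node_id: str     # MapNodeId: ID of the data node the event is on
    start_time: float    # StartTime: time point at which the event occurred
    children_events: List["Event"] = field(default_factory=list)  # ChildrenEvents


@dataclass
class DataNode:
    node_id: str             # NodeId: unique ID of the data node
    priority_value: int = 1  # PriorityValue: initial weight (assumed default)
    depends_on: List[str] = field(default_factory=list)   # dependent-node references
    depended_by: List[str] = field(default_factory=list)  # depended-node references
    events: List[Event] = field(default_factory=list)     # Events on this node
    list_index: int = 0      # ListIndex: level in a dependency link, initially 0


def link(nodes: Dict[str, DataNode], source_id: str, target_id: str) -> None:
    """Record that source depends on target, storing only ID references."""
    nodes[source_id].depends_on.append(target_id)
    nodes[target_id].depended_by.append(source_id)


# Part of the FIG. 2 topology; "user-system" is an assumed node ID.
nodes = {n: DataNode(n) for n in ("user-system", "vm-linux-01", "sas-01", "mysql-01")}
link(nodes, "user-system", "vm-linux-01")
link(nodes, "vm-linux-01", "sas-01")
link(nodes, "user-system", "mysql-01")
```

Storing only node IDs in the two reference lists mirrors the remark that the reference set fields need not embed the full data structure of the related nodes.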
In this embodiment of the present application, the data structures of the data nodes are in one-to-one correspondence with the data nodes, so a new system basic structure diagram can be obtained by describing each data node of the system basic structure diagram with its data structure. Referring to FIG. 3, the data structure of each data node in FIG. 3 may be presented according to the fields in Table 1. In FIG. 3, the dependency between different data nodes can be represented by lines with arrows.
In this embodiment of the present application, the data structure of each data node in FIG. 3 may be associated with the data structures of the events occurring on that data node, yielding a schematic diagram of the association between the data structures of the data nodes and the data structures of the events. Referring to FIG. 4A, a memory overflow alarm event and a user management system process survivability exception alarm event exist on the user management system process node. Referring to FIG. 4B, a memory usage too high alert event exists on the host vm-linux-01 node.
Step 102: determining the weight of the event on the first data node according to the dependency relationship among the data nodes and/or the precedence relationship of the occurrence time of the event on each data node; determining root cause relations among the events on each data node according to the dependency relations among each data node, the events on each data node and the weights of the events on each data node; the first data node represents any one of the data nodes.
In some embodiments, the root cause relationships between events on each data node may be represented by an event tree relationship graph, in which the level of each data node in a dependency link may be used as the tree level of the events on the corresponding data node. That is, after the link index in the data structure of a data node is determined, the value of the link index may be used as the tree level, in the event tree relationship graph, of the events on that data node.
In some embodiments, the tree hierarchy of the event in the event tree relationship graph may be presented in the data structure of the event, e.g., in table 2, the fields of the data structure of the event may also include a tree hierarchy (TreeIndex) field that represents the tree hierarchy of the event in the event tree relationship graph.
In some embodiments, the relationships between events on each data node may be determined according to the ChildrenEvents field in the data structure of the event and the tree level of the event in the event tree relationship graph, thereby determining the event tree relationship graph; the event tree relationship graph is used for representing the root cause relationships among events on each data node.
In some embodiments, the root cause relationships between events on each data node may also be displayed; illustratively, the event tree relationship graph described above may be presented. In this way, an event handler can grasp the root cause relationships among the events more intuitively and effectively, which facilitates root cause analysis of the events by the event handler.
In the embodiment of the application, the dependency relationship between the events on data nodes of adjacent levels in the dependency link can be determined according to the dependency relationship between the data nodes and the events on the data nodes. The priority order of the events on data nodes of the same level in the dependency link can be determined according to the events on each data node and the weights of those events. Furthermore, the root cause relationship among the events on the data nodes of the same level in the dependency link can be determined according to the dependency relationship between the events on data nodes of adjacent levels and the priority order of the events on data nodes of the same level, thereby achieving the purpose of root cause analysis.
Step 103: in the case where the event on each data node includes a target event, a root cause of the target event is determined according to a root cause relationship between the events on each data node.
In the embodiment of the application, under the condition of obtaining the event tree-like relation diagram, determining the root cause of the target event according to the event tree-like relation diagram.
In practical applications, steps 101 to 103 may be implemented based on a processor of an electronic device, where the processor may be at least one of an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a digital signal processor (Digital Signal Processor, DSP), a digital signal processing device (Digital Signal Processing Device, DSPD), a programmable logic device (Programmable Logic Device, PLD), a field programmable gate array (Field Programmable Gate Array, FPGA), a CPU, a controller, a microcontroller, and a microprocessor. It will be appreciated that other electronic devices may also implement the above-described processor function, and embodiments of the present application are not limited in this respect.
It can be seen that, according to the embodiment of the application, the root cause relation between the events on each data node can be determined according to the objectively existing dependency relation between each data node and the events on each data node, and experience of operation and maintenance engineers and development engineers is not needed, so that accuracy of a root cause analysis method is improved to a certain extent, and time cost and labor cost are reduced. Further, in the embodiment of the present application, priority ordering and relationship sorting may be performed on events of each data node according to dependency relationships between each data node and weights of events on each data node.
In the embodiment of the present application, the weight of the event on the first data node may be determined according to the event adding case and the event deleting case; the following description will be given separately.
1) Event addition.
In some embodiments, an implementation of determining the weight of the event on each data node according to the precedence relationship of the occurrence times of the events on each data node may include:
when a current occurrence event is detected on the first data node, taking the initial weight of the first data node as the weight of the current occurrence event.
In some embodiments, referring to FIG. 5, when it is detected that an event has occurred on a data node, a data structure of the current occurrence event may be established with reference to the above description, and the current occurrence event is matched to its data node, that is, the first data node is determined. The weight of the current occurrence event on the first data node may then be determined; this weight is the initial weight of the first data node.
In some embodiments, referring to fig. 6, when an alarm event occurs for which the memory usage rate of the host vm-linux-01 is too high, the alarm event may be matched to the host vm-linux-01, and may be added in the reference set field of the present node of the data structure of the host vm-linux-01. The initial weight PV (vm-linux-01) of the host vm-linux-01 is taken as the weight PV (vm-linux-01-alarm-01) of the alarm event with too high memory usage, and the initial weight PV (vm-linux-01) of the host vm-linux-01 is 1.
It will be appreciated that, since the initial weight of the first data node may be taken directly as the weight of the current occurrence event, the weight of the current occurrence event is easy to determine.
In some embodiments, when a current occurrence of an event on the first data node is detected and the first data node has a first historical occurrence of the event, the value of the weight of the first historical occurrence of the event is increased.
In this embodiment of the present application, the first historical event represents an event that has occurred before the current event on the first data node, and the first historical event may be one event or multiple events.
The amount of increase in the value of the weight of the first historical occurred event may be set according to actual needs, and the amount of increase in the value of the weight of the first historical occurred event may be greater than or equal to the weight of the current occurred event, for example, the amount of increase in the value of the weight of a historical occurred event may be 1 or other integer greater than 1.
For example, referring to FIG. 5, after the weight of the current occurrence event is determined, it may be determined whether a first historical occurred event exists on the first data node. If so, the weight of the first historical occurred event is increased by 1, and it is then determined whether the first data node has a depended node; if no first historical occurred event exists on the first data node, it may be determined directly whether the first data node has a depended node.
For example, referring to FIG. 7, when a survivability exception alarm event occurs in the user management system process, the survivability exception alarm event may be matched to the user management system process and added to the reference set field of the present node of the user management system process. The initial weight 1 of the user management system process is taken as the weight of the survivability exception alarm event. Since a memory overflow alarm event already exists on the user management system process, different weights may be set for the memory overflow alarm event and the survivability exception alarm event in order to distinguish the importance of alarm events on the same data node. Since the alarm event that occurs first is generally likely to be the source alarm event, the weight of the memory overflow alarm event should be higher than the weight of the survivability exception alarm event; for example, referring to FIG. 7, the weight of the memory overflow alarm event is increased by 1 and becomes 2.
It can be appreciated that, by increasing the value of the weight of the first historical occurred event, the weight of the first historical occurred event becomes greater than the weight of the current occurrence event, so that the relationship between the weights of the first historical occurred event and the current occurrence event is reflected more accurately.
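The same-node rule above (a new event takes the node's initial weight, and every earlier event on that node is raised) can be sketched as below. The function name, the dictionary representation, and the event labels are assumptions for illustration; the default increase equals the new event's weight, which is the smallest amount the description allows.

```python
def add_event_on_node(node_events, initial_weight, event_id, increase=None):
    """Add an event to one node's event map (event ID -> weight).

    Every event already on the node is raised by `increase`, which must be
    greater than or equal to the weight of the new event, so that an earlier
    event always outranks a later one on the same node.
    """
    if increase is None:
        increase = initial_weight  # smallest increase permitted by the rule
    if increase < initial_weight:
        raise ValueError("increase must be >= weight of the new event")
    for earlier_id in node_events:
        node_events[earlier_id] += increase
    node_events[event_id] = initial_weight
    return node_events


# FIG. 7 scenario: memory overflow occurs first, survivability exception second.
user_system_events = {}
add_event_on_node(user_system_events, 1, "memory-overflow")
add_event_on_node(user_system_events, 1, "survivability-exception")
print(user_system_events)
# {'memory-overflow': 2, 'survivability-exception': 1}
```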
In some embodiments, the value of the weight of the current occurrence event may be increased when the current occurrence event on the first data node is detected and at least one second historical occurrence event exists on the depended node of the first data node; the increase in the value of the weight of the current occurrence event is greater than or equal to the sum of the weights of the at least one second historical occurrence event; the depended node of the first data node represents a data node that depends on the first data node.
For example, referring to FIG. 5, it may be determined whether the first data node has a depended node. If the first data node has no depended node, it is then determined whether the first data node has a dependent node. If the first data node has a depended node, it is further determined whether a second historical occurrence event exists on the depended node; if no second historical occurrence event exists on the depended node, it is then determined whether the first data node has a dependent node; if a second historical occurrence event exists on the depended node, the value of the weight of the current occurrence event is increased, and it is then determined whether the first data node has a dependent node.
For example, referring to FIG. 8, when a capacity usage too high alert event occurs on disk storage sas-01, the event may be matched to disk storage sas-01 and added to the Events field of disk storage sas-01. The initial value of the weight of the capacity usage too high alert event may be set to the initial weight 1 of disk storage sas-01. Since disk storage sas-01 is depended on by host vm-linux-01, i.e., disk storage sas-01 has a depended node, the sum of the weights of the events on host vm-linux-01 can be calculated; in FIG. 8 this sum is 2, so the weight of the capacity usage too high alert event is increased by 2. That is, the weight PV (sas-01-alarm-01) of the capacity usage too high alert event is equal to the sum of the initial weight PV (sas-01) of disk storage sas-01 and PV (vm-linux-alarm-01), where PV (vm-linux-alarm-01) represents the sum of the weights of the events on host vm-linux-01.
It can be appreciated that, since the value of the weight of the current occurrence event can be increased by an amount greater than or equal to the sum of the weights of the at least one second historical occurrence event, the weight of the current occurrence event on the first data node can be made greater than the weight of each event on the depended node. The magnitude relation between the weight of the current occurrence event on the first data node and the event weights of the depended node therefore reflects the dependency relationship between the current occurrence event and the events of the depended node more accurately.
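The depended-node rule can be checked against the FIG. 8 numbers with a short sketch. The function name and the dictionary layout are assumptions for illustration, and the sketch uses the minimum permitted increase (exactly the sum of the weights on the direct depended nodes).

```python
def weight_of_new_event(node_id, initial_weights, depended_by, events):
    """Weight of a new event on `node_id`.

    Starts from the node's initial weight and adds the sum of the weights of
    all events already present on the nodes that directly depend on it.
    """
    increase = sum(
        weight
        for depended_node in depended_by.get(node_id, [])
        for weight in events.get(depended_node, {}).values()
    )
    return initial_weights[node_id] + increase


# FIG. 8 scenario: host vm-linux-01 depends on sas-01 and already carries
# events whose weights sum to 2.
initial_weights = {"sas-01": 1, "vm-linux-01": 1}
depended_by = {"sas-01": ["vm-linux-01"]}
events = {"vm-linux-01": {"memory-usage-too-high": 2}}

pv_sas_alarm = weight_of_new_event("sas-01", initial_weights, depended_by, events)
print(pv_sas_alarm)  # 3
```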
In some embodiments, when a current occurrence of an event on the first data node is detected, and at least one third historical occurrence of the event exists in the i-th level dependent node of the first data node, increasing a value of a weight of each third historical occurrence of the event by an amount greater than or equal to a highest weight of the events on the first data node; where i represents an integer greater than or equal to 1, the first data node depends on a level 1 dependency node of the first data node, and when i is greater than 1, the i-1 level dependency node of the first data node depends on the i-th level dependency node of the first data node.
Referring to FIG. 3, the level 1 dependent nodes of the user management system process include a host vm-linux-01, a database mysql-01, and a password management system process; since the host vm-linux-01 depends on the disk storage sas-01, the database mysql-01 depends on the disk storage sas-02, and the password management system process depends on the host vm-linux-02 and the load balancing nginx-01, the level 2 dependent nodes of the user management system process comprise the disk storage sas-01, the disk storage sas-02, the host vm-linux-02 and the load balancing nginx-01; since the host vm-linux-02 depends on the disk storage sas-02, the disk storage sas-02 is also a level 3 dependent node of the user management system process.
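The level-by-level enumeration in the FIG. 3 example can be reproduced with a small traversal. The node ID "user-system" and the function name are assumptions; the edges are the ones stated above. The sketch assumes the dependency graph has no cycles.

```python
# Edges stated for FIG. 3: key depends on each node in its list.
DEPENDS_ON = {
    "user-system": ["vm-linux-01", "mysql-01", "password-system"],
    "vm-linux-01": ["sas-01"],
    "mysql-01": ["sas-02"],
    "password-system": ["vm-linux-02", "nginx-01"],
    "vm-linux-02": ["sas-02"],
}


def dependency_levels(start, depends_on):
    """Return {i: set of i-th level dependency nodes of `start`}.

    A node may appear at more than one level when it is reachable through
    paths of different lengths. Assumes the graph is acyclic.
    """
    levels = {}
    frontier = {start}
    i = 0
    while True:
        frontier = {dep for node in frontier for dep in depends_on.get(node, [])}
        if not frontier:
            return levels
        i += 1
        levels[i] = frontier


levels = dependency_levels("user-system", DEPENDS_ON)
for i in sorted(levels):
    print(i, sorted(levels[i]))
```

Here sas-02 shows up at level 2 (through mysql-01) and again at level 3 (through password-system and vm-linux-02), matching the description.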
Illustratively, referring to FIG. 5, it may be determined whether the first data node has a dependent node. If the first data node has no dependent node, the flow ends. If the first data node has a dependent node, it is determined whether a third historical occurred event exists on the dependent node; if not, the flow ends; if a third historical occurred event exists on the dependent node, the value of the weight of each third historical occurred event on the dependent node is increased.
For example, referring to fig. 7, after the value of the weight of the memory overflow alert event is changed to 2, the weight of the memory overflow alert event is the highest weight of each alert event on the user management system process, in which case, since the host vm-linux-01 is the 1 st level dependent node of the user management system process, the disk storage sas-01 is the 2 nd level dependent node of the user management system process, and the memory usage too high alert event already exists on the host vm-linux-01, the capacity usage too high alert event already exists on the disk storage sas-01, and therefore, the weight of the memory usage too high alert event on the host vm-linux-01 and the capacity usage too high alert event on the disk storage sas-01 may be increased by 2, the weight of the memory usage too high alert event on the host vm-linux-01 becomes 4, and the weight of the capacity usage too high alert event on the disk storage sas-01 becomes 5.
For example, referring to FIG. 9, when a memory overflow alert event occurs in the user management system process, the memory overflow alert event may be matched to the user management system process and added to the reference set field of the present node of the user management system process. The initial weight 1 of the user management system process is taken as the weight of the memory overflow alert event. Since host vm-linux-01 is the level 1 dependent node of the user management system process, and a memory usage too high alert event already exists on host vm-linux-01, the weight of the memory usage too high alert event on host vm-linux-01 can be increased by 1 and becomes 2. That is, the weight PV (vm-linux-01-alarm-01) of the memory usage too high alert event on host vm-linux-01 is the sum of the initial weight PV (vm-linux-01) of host vm-linux-01 and the weight PV (user-system-alarm-01) of the memory overflow alert event.
It can be appreciated that, since the weight of the third historical occurred event of the i-th level dependent node of the first data node can be increased, and the value of the weight of each third historical occurred event is increased by an amount greater than or equal to the highest weight of each event on the first data node, the weight of the occurred event of the i-th level dependent node of the first data node can be made greater than the weight of each event of the first data node, and thus, the magnitude relation between the weight of the third historical occurred event and the occurred event weight of the first data node can reflect the dependency relationship between the event of the first data node and the event of the dependent node more accurately.
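The propagation to dependency nodes of every level can be sketched as follows, using the FIG. 7 numbers. This is one possible reading, not the definitive procedure: the function name and data layout are assumptions, the increase is taken as exactly the highest weight on the first data node, and each dependency node is raised once even if it is reachable at several levels (the description does not settle that case).

```python
def raise_dependency_weights(node_id, depends_on, events):
    """Raise every event on the dependency nodes (all levels) of `node_id`.

    The increase equals the highest event weight currently on `node_id`,
    the minimum the rule allows. Returns the amount applied.
    """
    amount = max(events[node_id].values())
    seen = set()
    stack = list(depends_on.get(node_id, []))
    while stack:
        dep = stack.pop()
        if dep in seen:
            continue  # assumption: raise each dependency node only once
        seen.add(dep)
        for event_id in events.get(dep, {}):
            events[dep][event_id] += amount
        stack.extend(depends_on.get(dep, []))
    return amount


# FIG. 7 scenario after the memory overflow weight has become 2.
depends_on = {"user-system": ["vm-linux-01"], "vm-linux-01": ["sas-01"]}
events = {
    "user-system": {"memory-overflow": 2, "survivability-exception": 1},
    "vm-linux-01": {"memory-usage-too-high": 2},
    "sas-01": {"capacity-usage-too-high": 3},
}
applied = raise_dependency_weights("user-system", depends_on, events)
print(applied, events["vm-linux-01"], events["sas-01"])
# 2 {'memory-usage-too-high': 4} {'capacity-usage-too-high': 5}
```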
2) Event deletion.
In some embodiments, when it is determined that a deleted event exists on the first data node and that the i-th level dependent node of the first data node has at least one third history of occurred events, the value of the weight of each third history of occurred events is reduced by an amount equal to the weight of the deleted event.
In the embodiment of the application, after the event is added, the event can be deleted; for example, after the alarm event is added, if the alarm event is restored (i.e., there is no alarm event), the corresponding event may be deleted in the reference set field of the present node of the data structure of the first data node. The weight of the occurred event on the i-th level dependent node of the first data node may then be changed for the deleted event on the first data node.
In some embodiments, referring to FIG. 10, when there is a deleted event, an attempt may be made to match the deleted event to a data node. If no data node is matched, the flow ends; if a data node is matched, it is determined whether the matched data node has a dependent node. Taking the matched data node as the first data node as an example: if the first data node has no dependent node, the flow ends after the event is deleted from the first data node. If the first data node has a dependent node, it is determined whether the dependent node has an occurred event; if not, the flow ends after the event is deleted from the first data node; if so, the value of the weight of the occurred event on the dependent node is reduced, and the flow returns to the step of determining whether a dependent node exists. It can be seen that, by reducing the weights of the occurred events on the dependent nodes of the first data node, the weights of occurred events do not simply keep increasing.
For example, referring to FIG. 11, the event connected to the user management system process by a dotted line is the recovered event. When the memory overflow alarm event of the user management system process is recovered, the memory overflow alarm event may be matched to the user management system process and then deleted from the reference set field of the present node in the data structure of the user management system process. Because the level 1 dependent node of the user management system process is host vm-linux-01 and the level 2 dependent node is disk storage sas-01, the weight of the memory usage too high alarm event on host vm-linux-01 and the weight of the capacity usage too high alarm event on disk storage sas-01 may each be reduced by 2; that is, the former becomes 2 and the latter becomes 3.
As can be appreciated, when a deleted event exists on the first data node, since the weight of the third-history occurred event of the i-th-stage dependent node of the first data node can be reduced, and the reduction amount of the value of the weight of each third-history occurred event is equal to the weight of the deleted event, the association relationship between the event of the first data node and the event of the dependent node can be reflected more accurately.
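Event deletion mirrors the addition case and can be sketched with the FIG. 11 numbers. The function name and data layout are assumptions, and, as an additional assumption, each dependency node is reduced once even if reachable at several levels.

```python
def delete_event(node_id, event_id, depends_on, events):
    """Delete an event and lower the weights on all dependency nodes.

    Returns the weight of the deleted event, or None if the event could not
    be matched to the node (in which case nothing is changed).
    """
    weight = events.get(node_id, {}).pop(event_id, None)
    if weight is None:
        return None
    seen = set()
    stack = list(depends_on.get(node_id, []))
    while stack:
        dep = stack.pop()
        if dep in seen:
            continue  # assumption: reduce each dependency node only once
        seen.add(dep)
        for dep_event_id in events.get(dep, {}):
            events[dep][dep_event_id] -= weight
        stack.extend(depends_on.get(dep, []))
    return weight


# FIG. 11 scenario: the memory overflow alarm (weight 2) is recovered.
depends_on = {"user-system": ["vm-linux-01"], "vm-linux-01": ["sas-01"]}
events = {
    "user-system": {"memory-overflow": 2, "survivability-exception": 1},
    "vm-linux-01": {"memory-usage-too-high": 4},
    "sas-01": {"capacity-usage-too-high": 5},
}
removed = delete_event("user-system", "memory-overflow", depends_on, events)
print(removed, events["vm-linux-01"], events["sas-01"])
# 2 {'memory-usage-too-high': 2} {'capacity-usage-too-high': 3}
```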
In summary, in the embodiment of the present application, the weights of the events on the data nodes may be dynamically changed according to the event adding and event deleting situations, that is, the weights of the events on the data nodes may be timely and accurately determined, so that the relationships between the events of the data nodes may be timely and accurately determined.
In some embodiments, an implementation of determining the root cause relationships between events on the data nodes according to the dependency relationships between the data nodes, the events on the data nodes, and the weights of the events on the data nodes may include:
setting the initial value of the level of each data node to 0, and determining the valid data nodes of the 1st level among the data nodes, wherein a valid data node of the 1st level has an occurred event;
for j being an integer greater than or equal to 1, searching for the depended node of the valid data node of the j-th level; when the depended node of the valid data node of the j-th level has an occurred event and the level of that depended node is 0, updating the level of that depended node to j+1; and determining a dependency link when the depended node of the valid data node of the j-th level has no occurred event or the depended node of the valid data node of the (j+1)-th level does not exist, wherein the dependency link is used for representing the level dependency of the valid data nodes of each level;
and determining the root cause relationships among the events on each data node according to the dependency link and the events on each data node.
In the embodiment of the application, the type of a data node can be determined by judging whether an event has occurred on the data node. If an occurred event exists on the data node (the reference set field of the present node in the data structure of the data node is not empty), the data node is determined to be a valid data node; if no occurred event exists on the data node (the reference set field of the present node in the data structure of the data node is empty), the data node is determined to be an invalid data node.
Since the embodiment of the application is used for determining the root cause relationship between the events and the event which has occurred does not exist in the invalid data nodes, the invalid data nodes are not significant to the root cause relationship between the events, so that the invalid data nodes do not need to be considered when determining the root cause relationship between the events, and the value of the ListIndex field of the data structure of the invalid data nodes is always 0, namely, the invalid data nodes are not in the dependency relationship link.
In some embodiments, valid data nodes without dependent nodes can be screened out of the data nodes, and each valid data node without a dependent node is taken as a valid data node of the 1st level.
In the embodiment of the application, whether a data node has a dependent node can be judged through the RightRaelatedNodes field of the data structure of the data node; if the RightRaelatedNodes field is empty, the data node does not have a dependent node; if the RightRaelatedNodes field is not empty, the data node has a dependent node.
It can be appreciated that a valid data node without a dependent node is suitable as the head node of a dependency link, so taking such a node as a valid data node of the 1st level helps to determine the head node of the dependency link accurately and, in turn, the complete dependency link.
In this embodiment of the present application, when j is an integer greater than or equal to 1, whether the valid data node of the j-th level has a depended node may be determined through the LeftRelatedNodes field of the data structure of the valid data node of the j-th level. If the valid data node of the j-th level does not have a depended node, a dependency link may be determined, where the dependency link includes the data node link formed by the valid data node of the 1st level through the valid data node of the j-th level.
If the valid data node of the j-th level has a depended node, but that depended node is an invalid data node, a dependency link may be determined, the dependency link including the data node link formed by the valid data node of the 1st level through the valid data node of the j-th level.
In this embodiment of the present application, if the level of the depended node of the valid data node of the j-th level is 0, the depended node does not currently exist in any other dependency link. In this case, the level of the depended node may be updated to j+1, that is, a valid data node of the (j+1)-th level is determined, so that the dependency link can be determined step by step by continuing to determine the valid data node of the next level.
It can be seen that, for the same dependency link, the values of the link indexes used for representing the levels in different data nodes are not the same, and the link indexes corresponding to the data nodes with higher levels in the same dependency link are higher.
In the embodiment of the present application, after determining the dependency link, the level of each data node of the dependency link may be used as the level of the event on the corresponding data node, and then, in combination with the dependency of each data node in the dependency link and the level of the event on each data node in the dependency link, an event tree relationship graph that characterizes the root cause relationship between the events is determined.
It can be seen that, in the embodiment of the present application, from the valid data nodes of the 1 st hierarchy, the valid data nodes of each hierarchy are accurately determined through the dependency relationship between the valid data nodes, so that the dependency relationship link is accurately determined, which is favorable for accurately determining the root cause relationship between the events on each data node.
In some embodiments, a data node on which no event has occurred may be determined, among the data nodes, to be an invalid data node; starting from the invalid data node, the depended node of the invalid data node is searched for; and when the depended node of the invalid data node has an occurred event, the depended node of the invalid data node is taken as a valid data node of the 1st level.
Here, as is clear from the above description, when searching for the dependent node of the valid data node of the j-th hierarchy, if the dependent node of the valid data node of the j-th hierarchy is an invalid data node, the dependent node of the invalid data node may be searched for starting from the invalid data node; in this embodiment of the present application, each time an invalid data node is found, a dependent node of the invalid data node may be found, and if the dependent node of the invalid data node is a valid data node, the dependent node of the invalid data node may be used as a valid data node of level 1.
It can be understood that, since the invalid data node has no meaning to the root cause relationship between events, when the dependent node of the invalid data node is a valid data node, the dependent node of the invalid data node is suitable to be used as the head node of the dependency link, so that the dependent node of the invalid data node is used as the valid data node of the 1 st level, which is beneficial to accurately determining the head node of the dependency link, thereby being beneficial to accurately determining the complete dependency link.
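The selection of 1st-level valid data nodes described above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of the present application: the function name `level1_heads`, the dictionary layouts `depends_on` and `events`, and all node names are hypothetical, and every node is assumed to appear as a key of `depends_on`.

```python
def level1_heads(depends_on, events):
    """Return the valid data nodes that can head a dependency link.

    depends_on: node -> list of nodes it depends on (assumed layout).
    events:     node -> list of events that have occurred on it.
    """
    # Invert the declared dependencies to obtain each node's depended nodes,
    # i.e. the nodes that depend on it.
    depended = {node: [] for node in depends_on}
    for node, deps in depends_on.items():
        for dep in deps:
            depended[dep].append(node)

    heads = []
    for node, deps in depends_on.items():
        if events.get(node):
            # Valid node that depends on nothing: 1st level.
            if not deps and node not in heads:
                heads.append(node)
        else:
            # Invalid node: its valid depended nodes become 1st level.
            for child in depended[node]:
                if events.get(child) and child not in heads:
                    heads.append(child)
    return heads


# Hypothetical system: two disks, two hosts, one application.
depends_on = {
    "disk-a": [],
    "disk-b": [],
    "host-a": ["disk-a"],
    "host-b": ["disk-b"],
    "app": ["host-a", "host-b"],
}
events = {"disk-b": ["capacity"], "host-a": ["memory"], "app": ["crash"]}
heads = level1_heads(depends_on, events)
```

In this hypothetical system, `disk-b` is a head because it is valid and depends on nothing, while `host-a` and `app` are heads because each depends on an invalid node (`disk-a` and `host-b`, respectively).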
In some embodiments, the value of the level of the depended node of the valid data node of the j-th level may be increased by 1 when there is an event that has occurred in the depended node of the valid data node of the j-th level and the level of the depended node of the valid data node of the j-th level is not 0.
In this embodiment of the present application, if the depended node of the valid data node of the j-th level is a valid data node and its level is not 0, this indicates that the depended node currently exists in another dependency link. In this case, the level of the depended node may take the larger of its levels in the different dependency links, that is, the level of the depended node is updated to j+1. In practical implementation, the link index corresponding to the depended node of the valid data node of the j-th level may be increased by 1, which helps to reduce conflicts between the link indexes corresponding to the data nodes.
It can be seen that for the same data node that exists in different dependency links, a larger value of the hierarchy in the different dependency links can be taken, which is beneficial for accurately and uniquely determining the hierarchy of the same data node in the different dependency links.
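The two level-update cases can be condensed into a single rule, sketched here under assumptions. The function name `updated_level` is hypothetical, and the sketch follows the "take the larger value" formulation; the description also mentions simply adding 1 to the link index, which gives the same result whenever the existing level equals j.

```python
def updated_level(current_level: int, j: int) -> int:
    """Level of a valid depended node reached from a valid node of level j."""
    if current_level == 0:
        # Not yet on any dependency link: it becomes a node of level j+1.
        return j + 1
    # Already on another dependency link: keep the larger of the two levels.
    return max(current_level, j + 1)
```

For example, a node that already has level 3 and is reached again from a level-3 node moves to level 4, whereas a node that already has level 4 and is reached from a level-1 node keeps level 4 under this reading.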
In some embodiments, in the case of determining a root cause relationship between events on each data node according to a dependency relationship between each data node, an event on each data node, and a weight of an event on each data node, the root cause relationship may reflect a weight magnitude relationship of each event of an effective data node of the same hierarchy.
In the embodiment of the present application, the root cause relationship between the events on each data node may be determined according to the dependency link, the events on each data node, and the weights of the events on each data node. The arrangement sequence of the effective data nodes in the same level can be determined according to the magnitude relation of the event weights, so that the root cause relation among the events on each data node can intuitively embody the vertical arrangement sequence of the events among different levels and the horizontal arrangement sequence among the different events among the same level, and the operation and maintenance personnel can accurately analyze the root cause of the target event.
Further, according to the embodiment of the application, the root cause relation among the events without loop dependency can be determined according to the weight magnitude relation of each event of the effective data nodes of the same hierarchy and the inherent unidirectional dependency relation among the data nodes.
The root cause relation positioning scheme of the embodiment of the application is further described by an embodiment.
Tables 3 through 8 show the data structures of 6 events, which are the disk storage sas-02 capacity usage too high alarm event, the host vm-linux-02 memory usage too high alarm event, the password management system process exception alarm event, the database mysql-01 slow query too many alarm event, the user management system process memory overflow alarm event, and the user management system process survivability exception alarm event, respectively. From the above description, the weights of the 6 events can be determined; in tables 3 to 8, the sub-event set of each event is initialized to null, and the initial value of the tree hierarchy of each event is 0.
Table 3
Table 4
Table 5
Table 6
Table 7
Table 8
Referring to FIG. 12, the event set of the user management system process includes two events, and the event sets of the database mysql-01, the disk storage sas-02, the password management system process, and the host vm-linux-02 each include one event; no event exists on the host vm-linux-01, the disk storage sas-01, or the load balancing nginx-01. The initial value of the link index corresponding to each data node is 0.
For the dependency relationship between the data nodes shown in FIG. 12, the data nodes whose dependency-node reference set field is empty can be screened out from FIG. 12; they comprise the disk storage sas-01, the disk storage sas-02, and the load balancing nginx-01.
Valid data nodes, namely data nodes on which an event has occurred, are then screened out from the data nodes whose dependency-node reference set field is empty: the disk storage sas-01 and the load balancing nginx-01 are invalid data nodes, and the disk storage sas-02 is a valid data node.
After the disk storage sas-02 is taken as the valid data node of the 1st level, the link index corresponding to the disk storage sas-02 is set to 1. Then, starting from the disk storage sas-02, the valid data nodes of each level are searched for step by step as described above; after a valid data node of the j-th level is found, the link index corresponding to it is set to j. For example, referring to FIG. 13, the depended nodes of the disk storage sas-02 are the database mysql-01 and the host vm-linux-02, so the database mysql-01 and the host vm-linux-02 are both data nodes of the 2nd level, and the link indexes corresponding to them can be set to 2.
Referring to FIG. 14, starting from the database mysql-01 and the host vm-linux-02, the depended nodes of the database mysql-01 and the host vm-linux-02 are searched for; based on the above description, the link indexes corresponding to the user management system process and the password management system process may be set to 3.
Referring to FIG. 15, the depended-node reference set field of the user management system process is empty, which indicates that the user management system process is an end point of the dependency link. The depended-node reference set of the password management system process contains user-system, which indicates that the user management system process can be found from the password management system process. However, the link index corresponding to the user management system process is not 0, which indicates that the user management system process is already in another link. In this case, the value of the link index corresponding to the user management system process is increased by 1 and becomes 4.
In connection with fig. 12-15, a dependency link may be determined.
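The walk through FIGS. 12 to 15 can be reproduced with a short sketch. Several things here are assumptions rather than facts from the figures: the function name `assign_link_indexes`, the identifier `password-system`, and the edge from vm-linux-02 to the password management system process (the text only states that both processes reach index 3 from the 2nd-level nodes). The sketch also assumes the dependencies contain no loops.

```python
from collections import deque

# node -> its depended nodes (the nodes that depend on it)
DEPENDED = {
    "sas-02": ["mysql-01", "vm-linux-02"],
    "mysql-01": ["user-system"],
    "vm-linux-02": ["password-system"],  # assumed edge, not stated in the text
    "password-system": ["user-system"],
    "user-system": [],
}
HAS_EVENT = set(DEPENDED)  # every node on this link has at least one event


def assign_link_indexes(depended, has_event, heads):
    """Assign a link index (level) to every valid node reachable from heads."""
    index = {node: 0 for node in depended}  # initial value of every index is 0
    queue = deque()
    for head in heads:
        index[head] = 1
        queue.append(head)
    while queue:
        node = queue.popleft()
        j = index[node]
        for child in depended[node]:
            if child not in has_event:
                continue  # invalid node: this link is not extended through it
            # Covers both cases: index 0 becomes j+1, and a node already on
            # another link is raised to the larger value j+1.
            if index[child] < j + 1:
                index[child] = j + 1
                queue.append(child)
    return index


link_index = assign_link_indexes(DEPENDED, HAS_EVENT, ["sas-02"])
```

Under these assumptions, the user management system process is first given index 3 through mysql-01 and is then raised to 4 when it is reached again from the password management system process, matching the values stated for FIG. 15.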
After the dependency link is determined, a value of a link index corresponding to each data node may be assigned to a tree-level field of the event that has occurred on the data node, and each event may be associated through a set of sub-events in the data structure of the event.
Referring to FIG. 15, the tree level of the disk storage sas-02 capacity usage too high alarm event equals the link index 1 corresponding to the disk storage sas-02, and the tree level of the database mysql-01 slow query too many alarm event equals the link index 2 corresponding to the database mysql-01. Because the database mysql-01 is in the depended-node reference set field of the disk storage sas-02, the database mysql-01 slow query too many alarm event can be added to the sub-event set of the disk storage sas-02 capacity usage too high alarm event.
In this embodiment of the present application, starting from the disk storage sas-02 and referring to the above description, the tree level and sub-event set field values in the data structure of each event on the data nodes of the dependency link may be determined step by step; Tables 9 to 14 show the resulting data structure of each event.
Table 9
Table 10
Table 11
Table 12
Table 13
Table 14
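The way the tree level and sub-event set fields are filled in can be sketched as follows. The `Event` class, its field names, and the choice to store child event names in the sub-event set are assumptions made for illustration; the actual field layout is given by the tables, which are not reproduced in this text. Only the first two levels of the example are included.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    name: str
    node: str                     # data node on which the event occurred
    tree_level: int = 0           # initial value 0
    sub_events: list = field(default_factory=list)  # initially empty


def link_events(events, link_index, depended):
    """Copy each node's link index to its events and fill sub-event sets."""
    by_node = {}
    for ev in events:
        ev.tree_level = link_index[ev.node]
        by_node.setdefault(ev.node, []).append(ev)
    for ev in events:
        # Events on a node's depended nodes become sub-events of its events.
        for child_node in depended[ev.node]:
            for child in by_node.get(child_node, []):
                ev.sub_events.append(child.name)
    return events


disk = Event("sas-02 capacity usage too high", "sas-02")
db = Event("mysql-01 slow query too many", "mysql-01")
host = Event("vm-linux-02 memory usage too high", "vm-linux-02")

link_events(
    [disk, db, host],
    link_index={"sas-02": 1, "mysql-01": 2, "vm-linux-02": 2},
    depended={"sas-02": ["mysql-01", "vm-linux-02"], "mysql-01": [], "vm-linux-02": []},
)
```

After the call, the disk storage event sits at tree level 1 and lists the database and host events, both at tree level 2, in its sub-event set.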
After determining the data structures of the events shown in tables 9 to 14, determining an event tree-like relation diagram according to the tree levels corresponding to the events; in the event tree relation diagram, the sequence from top to bottom reflects the sequence from small to large of the tree levels of the events, the smaller the tree level is, the more likely the tree level is the source event, and in the same tree level, different event data can be ordered from left to right through the weight of the event, and the events among different tree levels can be associated through sub-event set fields in the data structure of the event.
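The top-to-bottom and left-to-right ordering of the event tree relationship diagram can be sketched as below. The function name `tree_rows` is hypothetical, the weights are illustrative placeholders rather than the values in the tables, and placing the highest weight leftmost is an assumption, since the text does not state the direction of the ordering.

```python
from collections import defaultdict


def tree_rows(events):
    """Group (name, tree_level, weight) tuples into ordered rows."""
    rows = defaultdict(list)
    for name, level, weight in events:
        rows[level].append((weight, name))
    ordered = []
    for level in sorted(rows):                      # smaller level first
        row = sorted(rows[level], key=lambda item: -item[0])  # heavier first
        ordered.append((level, [name for _, name in row]))
    return ordered


# Tree levels follow the worked example; the weights are made up.
rows = tree_rows([
    ("sas-02 capacity usage too high", 1, 9),
    ("mysql-01 slow query too many", 2, 4),
    ("vm-linux-02 memory usage too high", 2, 5),
    ("password management process exception", 3, 3),
    ("user management process memory overflow", 4, 2),
    ("user management process survivability exception", 4, 1),
])
```

Each entry of `rows` is one tree level, so the first row holds the most likely source event and later rows hold the events it may have caused.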
FIG. 16 is an exemplary event tree relationship diagram finally obtained in an embodiment of the present application. As shown in FIG. 16, the disk storage sas-02 capacity usage too high alarm event has the highest weight. This event may cause the host vm-linux-02 memory usage too high alarm event and the database mysql-01 slow query too many alarm event, which in turn cause the password management system process exception alarm event, and finally cause the memory of the user management system process, which depends on the password management system process and uses the database mysql-01, to overflow, resulting in the user management system process survivability exception alarm event.
On the basis of the root cause analysis method provided by the embodiment, the embodiment of the application also provides a root cause analysis device; fig. 17 is a schematic diagram of an alternative composition structure of a root cause analysis device according to an embodiment of the present application, and as shown in fig. 17, the root cause analysis device 170 may include:
an obtaining module 171, configured to obtain a dependency relationship between data nodes and an event on each data node;
a first processing module 172, configured to determine a weight of an event on a first data node according to a dependency relationship between the data nodes and/or a precedence relationship of occurrence times of events on the data nodes; determining root cause relations among the events on the data nodes according to the dependency relations among the data nodes, the events on the data nodes and the weights of the events on the data nodes; the first data node represents any one of the data nodes;
The second processing module 173 is configured to determine, when the event on each data node includes a target event, a root cause of the target event according to a root cause relationship between the events on each data node.
In some embodiments, the first processing module 172 is configured to determine weights of events on the data nodes according to a precedence relationship of occurrence times of the events on the data nodes, and includes:
when a current occurrence event on a first data node is detected, the initial weight of the first data node is used as the weight of the current occurrence event.
In some embodiments, the first processing module 172 is configured to determine a weight of the event on the first data node according to a precedence relationship of occurrence times of the events on the data nodes, and includes:
when a current occurrence of an event on a first data node is detected and the first data node has a first historical occurrence of an event, increasing a value of a weight of the first historical occurrence of an event.
In some embodiments, the first processing module 172 is configured to determine a weight of an event on the first data node according to a dependency relationship between the data nodes, including:
When detecting a current occurrence event on a first data node, and at least one second historical occurrence event exists in a dependent node of the first data node, increasing the value of the weight of the current occurrence event; the increase in the value of the weight of the current occurrence event is greater than or equal to the sum of the weights of the at least one second historically occurring event; the depended node of the first data node represents a data node that depends on the first data node.
In some embodiments, the first processing module 172 is configured to determine a weight of an event on the first data node according to a dependency relationship between the data nodes, including:
when detecting a current occurrence event on a first data node, and at least one third historical occurrence event exists in an ith-stage dependent node of the first data node, increasing the value of the weight of each third historical occurrence event in the at least one third historical occurrence event, wherein the increase of the value of the weight of each third historical occurrence event is larger than or equal to the highest weight of each event on the first data node; wherein i represents an integer greater than or equal to 1, the first data node depends on a level 1 dependent node of the first data node, and when i is greater than 1, the i-1 dependent node of the first data node depends on the i-th dependent node of the first data node.
In some embodiments, the first processing module 172 is configured to determine a weight of an event on the first data node according to a dependency relationship between the data nodes, including:
when it is determined that a deleted event exists on the first data node and at least one third history occurred event exists on an i-th level dependent node of the first data node, reducing a value of a weight of each third history occurred event of the at least one third history occurred event by an amount equal to a weight of the deleted event.
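The weight rules listed above for the first processing module 172 can be combined into one sketch. Much of it is assumed: the names `Event`, `upstream`, `record_event`, and `delete_event`, the initial weight of 1.0, the fixed increment `step` for earlier events on the same node, and the order in which the rules are applied. Where the text only gives a lower bound for an increase ("greater than or equal to"), the sketch uses exactly that bound.

```python
from dataclasses import dataclass


@dataclass
class Event:
    name: str
    weight: float


def upstream(node, depends_on):
    """Every node that `node` depends on, directly (i = 1) or transitively (i > 1)."""
    found, stack = [], list(depends_on[node])
    while stack:
        current = stack.pop()
        if current not in found:
            found.append(current)
            stack.extend(depends_on[current])
    return found


def record_event(node, name, events, depends_on, initial_weight=1.0, step=1.0):
    """Apply the weight rules when a new event is detected on `node`."""
    new = Event(name, initial_weight)          # starts at the initial weight
    for old in events.setdefault(node, []):
        old.weight += step                     # earlier events on the same node gain weight
    depended = [n for n, deps in depends_on.items() if node in deps]
    # Gain the summed weight of events already present on nodes that depend on `node`.
    new.weight += sum(e.weight for n in depended for e in events.get(n, []))
    events[node].append(new)
    highest = max(e.weight for e in events[node])
    for up in upstream(node, depends_on):
        for e in events.get(up, []):
            e.weight += highest                # events on nodes that `node` depends on gain weight
    return new


def delete_event(node, event, events, depends_on):
    """Remove an event and subtract its weight from events on the nodes `node` depends on."""
    events[node].remove(event)
    for up in upstream(node, depends_on):
        for e in events.get(up, []):
            e.weight -= event.weight


depends_on = {"db": ["disk"], "disk": []}      # hypothetical two-node system
events = {}
slow = record_event("db", "slow-query", events, depends_on)
full = record_event("disk", "capacity", events, depends_on)
late = record_event("db", "timeout", events, depends_on)
delete_event("db", late, events, depends_on)
```

In this hypothetical run the disk event ends with a higher weight than the database event even though it was detected later, which reflects the intent that events on nodes other nodes depend on rank as more likely root causes.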
In some embodiments, the first processing module 172 is configured to determine a root cause relationship between events on the data nodes according to the dependency relationship between the data nodes, the events on the data nodes, and the weights of the events on the data nodes, and includes:
setting the initial value of the hierarchy of each data node to 0, and determining the effective data node of the 1 st hierarchy in each data node, wherein the effective data node of the 1 st hierarchy has an occurred event;
searching the depended node of the effective data node of the j-th level when j is an integer greater than or equal to 1; when an event occurs to the dependent node of the effective data node of the j-th level and the level of the dependent node of the effective data node of the j-th level is 0, updating the level of the dependent node of the effective data node of the j-th level to j+1; determining a dependency link when the depended node of the effective data node of the j-th level does not exist an event or the depended node of the effective data node of the j+1-th level does not exist, wherein the dependency link is used for representing the level dependency of the effective data node of each level;
Determining root cause relations among the events on each data node according to the dependency links and the events on each data node; the root cause relationship comprises a weight size relationship of each event of the valid data nodes of the same hierarchy.
In some embodiments, the first processing module 172 is further configured to:
determining that a node among the data nodes on which no event has occurred is an invalid data node; starting from the invalid data node, searching for the depended node of the invalid data node; and when an event has occurred on the depended node of the invalid data node, using the depended node of the invalid data node as a valid data node of the 1st level.
In some embodiments, the first processing module 172 is configured to determine a valid data node of level 1 from the data nodes, including:
screening out, from the data nodes, valid data nodes that have no dependency node; and taking the valid data nodes that have no dependency node as the valid data nodes of the 1st level.
In some embodiments, the first processing module 172 is further configured to increase the value of the level of the depended node of the valid data node of the j-th level by 1 when there is an event that has occurred in the depended node of the valid data node of the j-th level and the level of the depended node of the valid data node of the j-th level is not 0.
In some embodiments, the second processing module 173 is further configured to show root cause relationships between events on the data nodes.
In practical applications, the obtaining module 171, the first processing module 172, and the second processing module 173 may be implemented by a processor of an electronic device, where the processor may be at least one of an ASIC, a DSP, a DSPD, a PLD, an FPGA, a CPU, a controller, a microcontroller, and a microprocessor. It will be appreciated that other electronic devices may also be used to implement the above processor functions, and the embodiments of the present application are not limited in this respect.
It should be noted that the description of the above device embodiments is similar to the description of the method embodiments described above, with similar advantageous effects as the method embodiments. For technical details not disclosed in the device embodiments of the present application, please refer to the description of the method embodiments of the present application for understanding.
It should be noted that, in the embodiments of the present application, if the above method is implemented in the form of a software functional module and sold or used as a separate product, it may also be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a terminal, a server, etc.) to execute all or part of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, an optical disk, or other various media capable of storing program code. Thus, embodiments of the present application are not limited to any specific combination of hardware and software.
Correspondingly, the embodiment of the application further provides a computer program product, which comprises computer executable instructions for implementing any root cause analysis method provided by the embodiment of the application.
Accordingly, an embodiment of the present application further provides a computer storage medium, where computer executable instructions are stored on the computer storage medium, where the computer executable instructions are configured to implement any one of the root cause analysis methods provided in the foregoing embodiments.
An electronic device is further provided in the embodiments of the present application, and fig. 18 is a schematic diagram of an optional composition structure of the electronic device provided in the embodiments of the present application, as shown in fig. 18, where the electronic device 180 includes:
a memory 181 for storing executable instructions;
and a processor 182, configured to implement any of the root cause analysis methods described above when executing the executable instructions stored in the memory 181.
The processor 182 may be at least one of ASIC, DSP, DSPD, PLD, FPGA, CPU, a controller, a microcontroller, and a microprocessor.
The computer readable storage medium or memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), a ferromagnetic random access memory (FRAM), a flash memory, a magnetic surface memory, an optical disk, or a compact disc read-only memory (CD-ROM); it may also be any of various terminals, such as mobile phones, computers, tablet devices, and personal digital assistants, that include one or any combination of the above memories.
It should be noted here that: the description of the storage medium and apparatus embodiments above is similar to that of the method embodiments described above, with similar benefits as the method embodiments. For technical details not disclosed in the embodiments of the storage medium and the apparatus of the present application, please refer to the description of the method embodiments of the present application for understanding.
It should be appreciated that reference throughout this specification to "some embodiments" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present application. Thus, the appearances of the phrase "in some embodiments" in various places throughout this specification are not necessarily referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. It should be understood that, in various embodiments of the present application, the sequence numbers of the foregoing processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic thereof, and should not constitute any limitation on the implementation process of the embodiments of the present application. The foregoing embodiment numbers of the present application are merely for describing, and do not represent advantages or disadvantages of the embodiments.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The device embodiments described above are only illustrative. For example, the division of the units is only a division by logical function, and there may be other divisions in practice: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical, or in other forms.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units; can be located in one place or distributed to a plurality of network units; some or all of the units may be selected according to actual needs to achieve the purposes of the embodiments of the present application.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may be separately used as one unit, or two or more units may be integrated in one unit; the integrated units may be implemented in hardware or in hardware plus software functional units.
Alternatively, the integrated units described above may be stored in a computer readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product. Based on such understanding, the technical solutions of the embodiments of the present application may be essentially or partly contributing to the related art, embodied in the form of a software product stored in a storage medium, including several instructions for causing an apparatus automatic test line to perform all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a removable storage device, a ROM, a magnetic disk, or an optical disk.
The methods disclosed in the several method embodiments provided in the present application may be arbitrarily combined without conflict to obtain a new method embodiment.
The features disclosed in the several method or apparatus embodiments provided in the present application may be arbitrarily combined without conflict to obtain new method embodiments or apparatus embodiments.
The foregoing is merely an embodiment of the present application, but the protection scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes and substitutions are intended to be covered in the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A root cause analysis method, the method comprising:
acquiring a dependency relationship among data nodes in a data system and an event on each data node;
the dependency relationship among the data nodes in the data system is presented through a system basic architecture diagram, the system basic architecture diagram comprises attribute data of each data node, and the attribute data of each data node comprises unique identifiers of other data nodes on which the data node depends, so that the dependency relationship of different data nodes is declared;
determining the weight of an event on a first data node according to the dependency relationship among the data nodes and/or the time sequence relationship of the event on the data nodes, wherein the first data node represents any one of the data nodes;
determining root cause relations among the events on the data nodes according to the dependency relations among the data nodes, the events on the data nodes and the weights of the events on the data nodes;
and under the condition that the events on the data nodes comprise target events, determining the root cause of the target events according to the root cause relation among the events on the data nodes.
2. The method of claim 1, wherein determining the weight of the event on the first data node based on the chronological relationship of the event on each data node, comprises:
when a current occurrence event on a first data node is detected, the initial weight of the first data node is used as the weight of the current occurrence event.
3. The method of claim 1, wherein determining the weight of the event on the first data node based on the chronological relationship of the event on each data node, comprises:
when a current occurrence of an event on a first data node is detected and the first data node has a first historical occurrence of an event, increasing a value of a weight of the first historical occurrence of an event.
4. The method of claim 1, wherein determining the weight of the event on the first data node based on the dependency between the data nodes comprises:
when detecting a current occurrence event on a first data node, and at least one second historical occurrence event exists in a dependent node of the first data node, increasing the value of the weight of the current occurrence event; the increase in the value of the weight of the current occurrence event is greater than or equal to the sum of the weights of the at least one second historically occurring event; the depended node of the first data node represents a data node that depends on the first data node.
5. The method of claim 1, wherein determining the weight of the event on the first data node based on the dependency between the data nodes comprises:
when detecting a current occurrence event on a first data node, and at least one third historical occurrence event exists in an ith-stage dependent node of the first data node, increasing the value of the weight of each third historical occurrence event in the at least one third historical occurrence event, wherein the increase of the value of the weight of each third historical occurrence event is larger than or equal to the highest weight of each event on the first data node; wherein i represents an integer greater than or equal to 1, the first data node depends on a level 1 dependent node of the first data node, and when i is greater than 1, the i-1 dependent node of the first data node depends on the i-th dependent node of the first data node.
6. The method of claim 5, wherein determining the weight of the event on the first data node based on the dependency between the data nodes comprises:
when it is determined that a deleted event exists on the first data node and at least one third history occurred event exists on an i-th level dependent node of the first data node, reducing a value of a weight of each third history occurred event of the at least one third history occurred event by an amount equal to a weight of the deleted event.
7. The method according to any one of claims 1 to 6, wherein determining the root cause relationship between the events on the data nodes according to the dependency relationship between the data nodes, the events on the data nodes, and the weights of the events on the data nodes comprises:
setting the initial value of the level of each data node to 0, and determining the valid data nodes of the 1st level among the data nodes, wherein a valid data node of the 1st level has an occurred event;
for each integer j greater than or equal to 1, searching for the depended-on node of the valid data node of the j-th level; when the depended-on node of the valid data node of the j-th level has an occurred event and the level of that depended-on node is 0, updating the level of that depended-on node to j+1; and determining a dependency link when the depended-on node of the valid data node of the j-th level has no occurred event or no depended-on node of the valid data node of the (j+1)-th level exists, wherein the dependency link represents the level dependency among the valid data nodes of each level;
determining the root cause relationship between the events on the data nodes according to the dependency link and the events on the data nodes; wherein the root cause relationship comprises a weight magnitude relationship among the events of the valid data nodes of the same level.
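The level assignment of claim 7 can be sketched as a breadth-first walk from the level-1 nodes toward the nodes they depend on. This is a minimal sketch, not the patented implementation: `deps` (node to the nodes it depends on), `events` (node to a map of event name to weight), `seeds` (the level-1 valid data nodes), `assign_levels` and `rank_events_by_level` are all hypothetical names, and the dependency link is represented simply as the resulting level map.

```python
def assign_levels(deps, events, seeds):
    """Give each node a level: 0 = not on the link, 1 = seed, j+1 = upstream."""
    nodes = set(deps) | {d for ds in deps.values() for d in ds} | set(events)
    level = {n: 0 for n in nodes}  # initial level of every data node is 0
    for s in seeds:
        level[s] = 1
    frontier, j = list(seeds), 1
    while frontier:  # stops when no node of level j+1 was found
        nxt = []
        for node in frontier:
            for up in deps.get(node, ()):
                # Only a depended-on node with an event and level 0 is promoted.
                if events.get(up) and level[up] == 0:
                    level[up] = j + 1
                    nxt.append(up)
        frontier, j = nxt, j + 1
    return level


def rank_events_by_level(level, events):
    """Order the events of same-level valid nodes by descending weight."""
    by_level = {}
    for node, lv in level.items():
        if lv > 0:
            for ev, w in events.get(node, {}).items():
                by_level.setdefault(lv, []).append((w, node, ev))
    return {lv: sorted(items, reverse=True) for lv, items in by_level.items()}


# Usage: app depends on db and cache; db depends on disk (which has no event).
deps = {"app": ["db", "cache"], "db": ["disk"]}
events = {"app": {"timeout": 1}, "db": {"slow": 2}, "cache": {"miss": 3}}
level = assign_levels(deps, events, ["app"])
ranking = rank_events_by_level(level, events)
```

In this sketch the highest level holds the candidate root causes, and within a level the heaviest event is listed first.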
8. The method of claim 7, wherein the method further comprises:
determining that a node without an occurred event among the data nodes is an invalid data node; starting from the invalid data node, searching for the depended-on node of the invalid data node; and when the depended-on node of the invalid data node has an occurred event, taking the depended-on node of the invalid data node as a valid data node of the 1st level.
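The seed selection of claim 8 can be sketched as follows, assuming a hypothetical `deps` map (node to the nodes it depends on) and a hypothetical `events` map (node to its occurred events). The function name `seeds_from_invalid_nodes` is likewise an illustration, not part of the patent.

```python
def seeds_from_invalid_nodes(deps, events):
    """Return level-1 valid nodes reached directly from nodes with no events."""
    nodes = set(deps) | {d for ds in deps.values() for d in ds} | set(events)
    seeds = set()
    for node in nodes:
        if events.get(node):  # node has an occurred event, so it is not invalid
            continue
        for up in deps.get(node, ()):
            if events.get(up):  # depended-on node with an event becomes a seed
                seeds.add(up)
    return seeds


# Usage: web (no event) depends on api; api depends on db.
deps = {"web": ["api"], "api": ["db"]}
events = {"api": {"5xx": 1}, "db": {"slow": 2}}
seeds = seeds_from_invalid_nodes(deps, events)
```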
9. The method of claim 7, wherein said determining a valid data node of level 1 among said data nodes comprises:
selecting, from the data nodes, valid data nodes that have no dependent nodes; and taking the valid data nodes that have no dependent nodes as the valid data nodes of the 1st level.
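One possible reading of claim 9 is sketched below. It assumes that "without dependent nodes" means that no other data node depends on the node in question; the claim wording does not settle this, so the interpretation is an assumption, as are the names `deps`, `events` and `seeds_without_dependents`.

```python
def seeds_without_dependents(deps, events):
    """Return nodes that have an occurred event and that no node depends on."""
    # Every node that appears as someone's dependency.
    depended_on = {d for ds in deps.values() for d in ds}
    return {n for n, evs in events.items() if evs and n not in depended_on}


# Usage: web depends on api; api depends on db.
deps = {"web": ["api"], "api": ["db"]}
events = {"web": {"err": 1}, "api": {"5xx": 1}}
seeds = seeds_without_dependents(deps, events)
```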
10. The method of claim 7, wherein the method further comprises: when the depended-on node of the valid data node of the j-th level has an occurred event and the level of that depended-on node is not 0, increasing the level of that depended-on node by 1.
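The additional rule of claim 10 can be sketched by extending a breadth-first level assignment so that a depended-on node reached a second time is pushed one level deeper. This is a hedged sketch only: `deps`, `events`, `seeds` and `assign_levels_with_bump` are hypothetical names, a missing entry in `level` stands for the initial level 0, and the choice not to revisit a node after its level is increased is an assumption made to guarantee termination.

```python
def assign_levels_with_bump(deps, events, seeds):
    """Assign levels; a depended-on node seen again gets its level raised by 1."""
    level = {s: 1 for s in seeds}  # absent key means level 0
    frontier, j = list(seeds), 1
    while frontier:
        nxt = []
        for node in frontier:
            for up in deps.get(node, ()):
                if not events.get(up):  # no occurred event: not on the link
                    continue
                if level.get(up, 0) == 0:
                    level[up] = j + 1  # first visit (claim 7)
                    nxt.append(up)
                else:
                    level[up] += 1  # already has a level (claim 10)
        frontier, j = nxt, j + 1
    return level


# Usage: app depends on db and cache; cache also depends on db.
deps = {"app": ["db", "cache"], "cache": ["db"]}
events = {"app": {"timeout": 1}, "db": {"slow": 2}, "cache": {"miss": 1}}
level = assign_levels_with_bump(deps, events, ["app"])
```

In the usage example `db` is reached both directly from `app` and through `cache`, so it ends up one level deeper than `cache`.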
CN202110610565.1A 2021-06-01 2021-06-01 Root cause analysis method Active CN113326161B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110610565.1A CN113326161B (en) 2021-06-01 2021-06-01 Root cause analysis method
PCT/CN2021/132783 WO2022252512A1 (en) 2021-06-01 2021-11-24 Root cause analysis method and apparatus, electronic device, medium, and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110610565.1A CN113326161B (en) 2021-06-01 2021-06-01 Root cause analysis method

Publications (2)

Publication Number Publication Date
CN113326161A (en) 2021-08-31
CN113326161B (en) 2024-02-06

Family

ID=77423202

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110610565.1A Active CN113326161B (en) 2021-06-01 2021-06-01 Root cause analysis method

Country Status (2)

Country Link
CN (1) CN113326161B (en)
WO (1) WO2022252512A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113326161B (en) * 2021-06-01 2024-02-06 深圳前海微众银行股份有限公司 Root cause analysis method
CN114710397B (en) * 2022-04-24 2024-02-06 中国工商银行股份有限公司 Service link fault root cause positioning method and device, electronic equipment and medium
CN116128571B (en) * 2023-04-12 2023-07-07 花瓣云科技有限公司 Advertisement exposure analysis method and related device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110888755A (en) * 2019-11-15 2020-03-17 亚信科技(中国)有限公司 Method and device for searching abnormal root node of micro-service system
CN112087334A (en) * 2020-09-09 2020-12-15 中移(杭州)信息技术有限公司 Alarm root cause analysis method, electronic device and storage medium
CN112118141A (en) * 2020-09-21 2020-12-22 中山大学 Communication network-oriented alarm event correlation compression method and device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10263833B2 (en) * 2015-12-01 2019-04-16 Microsoft Technology Licensing, Llc Root cause investigation of site speed performance anomalies
CN113326161B (en) * 2021-06-01 2024-02-06 深圳前海微众银行股份有限公司 Root cause analysis method

Also Published As

Publication number Publication date
CN113326161A (en) 2021-08-31
WO2022252512A1 (en) 2022-12-08

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant