CN114205222A - Fault network element positioning method and system and computer readable storage medium - Google Patents

Info

Publication number
CN114205222A
Authority
CN
China
Prior art keywords
network element
calling
node
determining
call
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010904574.7A
Other languages
Chinese (zh)
Inventor
陈力
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ZTE Corp
Original Assignee
ZTE Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ZTE Corp
Priority to CN202010904574.7A
Publication of CN114205222A

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0677 Localisation of faults

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a method and a system for locating a faulty network element, and a computer-readable storage medium. The method comprises the following steps: acquiring a service call chain corresponding to a service request; determining abnormal network element calling pairs in the service call chain; determining an abnormal call subchain according to the abnormal network element calling pairs; and determining a fault root cause network element according to the abnormal call subchain and a preset rule. With the scheme provided by the embodiments of the invention, the fault root cause network element can be located automatically in the service call chain corresponding to the service request, which effectively improves troubleshooting efficiency.

Description

Fault network element positioning method and system and computer readable storage medium
Technical Field
The present invention relates to, but is not limited to, the field of microservice technologies, and in particular to a method and a system for locating a faulty network element, and a computer-readable storage medium.
Background
Compared with a traditional monolithic architecture, microservices are built around business capabilities. Different microservices can be implemented with different programming technologies and can depend on different external storage, and a service request is processed through the cooperation of multiple microservices, which greatly simplifies system development and maintenance. As services grow, the number of network elements and nodes bearing microservices in the system keeps increasing, and microservice call chains become more and more complex. To facilitate tracing, a distributed link tracing technology is usually adopted: different call chains are distinguished by a trace identifier (TraceID), and the parent-child call relationships of the microservices are recorded within a call chain by span identifiers (SpanID).
When a network element in the system fails, the abnormal time period of a service request can be quickly determined from the golden metrics of the service. A service expert then searches the abnormal logs for the TraceID, queries the specific call chain by that TraceID, and locates the faulty network element by manual analysis. Although fault location can be achieved in this way, manual analysis depends heavily on the experience of service experts and consumes more time as the call chain becomes more complex, so that service recovery takes too long.
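As an illustration of the tracing model described above, the following minimal sketch groups span records by trace identifier and recovers the parent-child call relationships from the span identifiers. The record fields (`trace_id`, `span_id`, `parent_span_id`, `node`) and the sample records are hypothetical and chosen only for this example; the source does not specify a record format.

```python
from collections import defaultdict

# Hypothetical span records; field names and values are assumptions for illustration.
spans = [
    {"trace_id": "t1", "span_id": "1", "parent_span_id": None, "node": "osb_001@os_021"},
    {"trace_id": "t1", "span_id": "2", "parent_span_id": "1", "node": "csf_001@docker_003"},
    {"trace_id": "t2", "span_id": "1", "parent_span_id": None, "node": "osb_001@os_022"},
]

def parent_child_calls(spans):
    """Return {trace_id: [(parent_node, child_node), ...]} for every trace."""
    # Index the spans of each trace by span ID so parents can be looked up.
    by_trace = defaultdict(dict)
    for span in spans:
        by_trace[span["trace_id"]][span["span_id"]] = span

    calls = {}
    for trace_id, by_span in by_trace.items():
        pairs = []
        for span in by_span.values():
            parent = by_span.get(span["parent_span_id"])
            # A span without a known parent is treated as the root of its chain.
            pairs.append((parent["node"] if parent else None, span["node"]))
        calls[trace_id] = pairs
    return calls

calls = parent_child_calls(spans)
```

Each trace then yields a list of (parent node, child node) pairs, which is the kind of structure a call chain can be reduced to for later analysis.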
Disclosure of Invention
The following is a summary of the subject matter described in detail herein. This summary is not intended to limit the scope of the claims.
The embodiment of the invention provides a fault network element positioning method, a fault network element positioning system and a computer readable storage medium, which can realize automatic positioning of a fault network element in a call chain and improve the efficiency of troubleshooting.
In a first aspect, an embodiment of the present invention provides a method for locating a faulty network element, including:
acquiring a service call chain corresponding to the service request;
determining an abnormal network element calling pair in the service calling chain;
determining an abnormal call subchain according to the abnormal network element call pair;
and determining a fault root cause network element according to the abnormal call subchain and a preset rule.
In a second aspect, an embodiment of the present invention further provides a system for locating a faulty network element, including: memory, processor and computer program stored on the memory and executable on the processor, characterized in that the processor implements the method for locating a faulty network element as described above when executing the computer program.
In a third aspect, an embodiment of the present invention further provides a computer-readable storage medium, which stores computer-executable instructions, where the computer-executable instructions are used to execute the method for locating a faulty network element as described above.
The embodiment of the invention comprises the following steps: acquiring a service call chain corresponding to the service request; determining an abnormal network element calling pair in the service calling chain; determining an abnormal call subchain according to the abnormal network element call pair; and determining a fault root cause network element according to the abnormal call subchain and a preset rule. According to the scheme provided by the embodiment of the invention, the automatic positioning of the fault root cause network element can be realized in the service call chain corresponding to the service request, and the troubleshooting efficiency is effectively improved.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the example serve to explain the principles of the invention and not to limit the invention.
Fig. 1 is a flowchart of a method for locating a faulty network element according to an embodiment of the present invention;
Fig. 2 is a flowchart of determining an abnormal network element calling pair in the method for locating a faulty network element according to another embodiment of the present invention;
Fig. 3 is a schematic diagram of a service call chain in an abnormal time period in the method for locating a faulty network element according to another embodiment of the present invention;
Fig. 4 is a flowchart of determining an abnormal network element calling pair according to a reference calling pair in the method for locating a faulty network element according to another embodiment of the present invention;
Fig. 5 is a flowchart of determining an abnormal call subchain in the method for locating a faulty network element according to another embodiment of the present invention;
Fig. 6 is a schematic diagram of an abnormal call subchain in the method for locating a faulty network element according to another embodiment of the present invention;
Fig. 7 is a flowchart of determining a fault root cause network element in the method for locating a faulty network element according to another embodiment of the present invention;
Fig. 8 is a flowchart of determining the network element chaos of a calling node in the method for locating a faulty network element according to another embodiment of the present invention;
Fig. 9 is a flowchart of determining the network element chaos of a calling node according to the network element ratio in the method for locating a faulty network element according to another embodiment of the present invention;
Fig. 10 is a flowchart of determining a root cause network element in the method for locating a faulty network element according to another embodiment of the present invention;
Fig. 11 is a flowchart of determining a root cause network element according to a preset rule in the method for locating a faulty network element according to another embodiment of the present invention;
Fig. 12 is a flowchart of traversing an abnormal call subchain to determine a fault root cause network element in the method for locating a faulty network element according to another embodiment of the present invention;
Fig. 13A is a schematic diagram of the calling nodes of the calling layers in an abnormal call subchain in the method for locating a faulty network element according to another embodiment of the present invention;
Fig. 13B is a schematic diagram of the network elements of the calling layers in an abnormal call subchain in the method for locating a faulty network element according to another embodiment of the present invention;
Fig. 14 is a schematic diagram of an apparatus of a system for locating a faulty network element according to another embodiment of the present invention;
Fig. 15 is a schematic diagram of an apparatus of a system for locating a faulty network element according to another embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
It should be noted that although functional blocks are partitioned in a schematic diagram of an apparatus and a logical order is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the partitioning of blocks in the apparatus or the order in the flowchart. The terms "first," "second," and the like in the description, in the claims, or in the drawings described above, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
The invention provides a fault network element positioning method, a system thereof and a computer readable storage medium, wherein the fault network element positioning method comprises the following steps: acquiring a service call chain corresponding to the service request; determining an abnormal network element calling pair in the service calling chain; determining an abnormal call subchain according to the abnormal network element call pair; and determining a fault root cause network element according to the abnormal call subchain and a preset rule. According to the scheme provided by the embodiment of the invention, the automatic positioning of the fault root cause network element can be realized in the service call chain corresponding to the service request, and the troubleshooting efficiency is effectively improved.
The embodiments of the present invention will be further explained with reference to the drawings.
As shown in fig. 1, fig. 1 is a flowchart of a method for locating a faulty network element according to an embodiment of the present invention, where the method for locating a faulty network element includes, but is not limited to, step S110, step S120, step S130, and step S140.
Step S110, a service call chain corresponding to the service request is obtained.
In an embodiment, the service request may be any request in any service, and may have a continuous call chain, and the micro service module called according to the call chain may implement a specific function, for example, the service request may be a service request implementing a single function, or a service request implementing one sub-service of a multi-function service.
In an embodiment, the calls of microservices cannot be traced back while the service is executing, so the method for locating a faulty network element in this embodiment can be used to mine the data in the service operation log after the service has stopped executing, so as to obtain the fault root cause network element, thereby solving the problem in the prior art that fault location relying on manual work is inefficient.
In an embodiment, a service request may include any number of service call chains. To ensure that the fault root cause network element is found, all service call chains of the service request may be acquired to form a service call chain set, and the specific fault root cause network element is then located from that set. Alternatively, the service call chains may be divided into normal call chains and abnormal call chains by means of existing data indexes, and the fault root cause network element is located only for the abnormal call chains, which reduces the amount of computation.
Step S120, determine an abnormal network element calling pair in the service calling chain.
In an embodiment, an abnormal network element calling pair is a network element calling pair in the service call chain whose operation index does not conform to its normal value. Most microservices are each borne by one network element, so when a network element fails, the functions on the same call chain as the faulty network element are usually affected as well, because the microservices are called layer by layer. The number of abnormal network element calling pairs in the service call chain is therefore usually large. To ensure accurate location of the fault root cause network element, all determinable abnormal network element calling pairs in the service call chain can be obtained and collected into a set for further analysis.
And step S130, determining the abnormal call subchain according to the abnormal network element call pair.
Based on the above embodiment, when a network element bearing a microservice fails, the microservices on the same call chain as the faulty network element are all affected to some extent, so that the abnormal network element calling pairs can form a subchain of consecutive abnormal calls. It should be noted that the case where multiple faulty network elements occur in the same call chain is outside the scope of the present disclosure: if multiple network elements fail, too many factors contribute to the abnormal network element calls, and a unique faulty network element cannot be located. Of course, if a service request has multiple service call chains and each service call chain has one faulty network element, the method for locating a faulty network element according to the embodiment of the present invention may also be used to locate the fault root cause network element of each service call chain separately.
In an embodiment, the abnormal call subchain is determined according to the abnormal network element calling pairs. The abnormal call subchain may be formed by splicing the calling pairs according to their parent-child relationships, or the call chain in which the abnormal network element calling pairs are located may be directly determined as the abnormal call subchain. The specific manner may be selected according to actual requirements and is not limited in this embodiment.
And step S140, determining a fault root cause network element according to the abnormal call subchain and a preset rule.
In an embodiment, once the abnormal call subchain has been determined, the fault root cause network element is known to be located in that subchain, so it can be determined according to any preset rule. For example, network elements may be eliminated one by one in a traversal order, or screened out on the basis of data by setting a plurality of thresholds. The specific manner may be set according to actual requirements, as long as one fault root cause network element can be determined from the abnormal call subchain.
Based on the above embodiment, the abnormality of the fault root cause network element may spread through the service call chain, and the direction in which it spreads is not fixed. For example, when a network element in the middle of the call chain fails, the network elements on both sides of it in the call chain may call the faulty network element in turn, so that the resulting abnormal influence spreads along the call chain. The network element at which the abnormal influence converges can therefore be determined in the abnormal call subchain according to the preset rule. This convergence can be characterized by the network elements in a calling node being numerous and mixed, that is, by a high degree of chaos. Alternatively, the microservice in which the abnormality actually occurs can be determined in the abnormal call subchain according to the preset rule, and the network element bearing that microservice is set as the fault root cause network element. The specific manner can be adjusted through the preset rule and is not unduly limited in this embodiment.
In addition, referring to fig. 2, in an embodiment, the step S120 in the embodiment shown in fig. 1 further includes, but is not limited to, the following steps:
step S210, determining a calling abnormal time period, and acquiring a calling chain set of which the calling time in a service calling chain accords with the calling abnormal time period;
step S220, obtaining each network element calling pair in the calling chain set;
step S230, obtaining a reference calling pair with the same calling relation as the network element calling pair;
step S240, determining an abnormal network element calling pair according to the network element calling pair and the reference calling pair.
In an embodiment, in a process of service operation, the same service request may be executed multiple times, and therefore, the determination of the abnormal invocation time period may be in any manner, for example, a detection model is trained by using data of normal operation of the service request, and the abnormal invocation time period is directly identified by the detection model.
In an embodiment, the abnormal time period may be any length, for example, the detection model in the above embodiment is used to accurately match the time period in which the abnormality occurs, or in consideration of the time required for calling the network element, after the time period is matched according to the detection model, a period of time is obtained in the forward and backward directions to form the abnormal time period, so as to ensure the integrity of the data. It should be noted that, for one service request, a plurality of abnormal time periods may be matched through the detection model, and in this case, the fault root cause network elements may be located one by one for each abnormal time period, and after a plurality of fault root cause network elements are located, further analysis is performed through the prior art to obtain a specific reason for the fault in the system, which is not described herein again.
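The idea of extending a detected period forward and backward, and then keeping only the call chains whose call time falls inside it, can be sketched as follows. The function names, the `call_time` field, the one-minute margin, and the sample timestamps are all assumptions made for illustration; the source does not prescribe how much time to add or how call chains are represented.

```python
from datetime import datetime, timedelta

def pad_period(start, end, margin):
    """Extend a detected abnormal period by `margin` on both sides."""
    return start - margin, end + margin

def chains_in_period(chains, period):
    """Keep the call chains whose call time lies inside the period (inclusive)."""
    start, end = period
    return [chain for chain in chains if start <= chain["call_time"] <= end]

# Hypothetical period reported by a detection model, padded by one minute.
detected = (datetime(2020, 1, 1, 10, 0), datetime(2020, 1, 1, 10, 5))
period = pad_period(*detected, margin=timedelta(minutes=1))

# Hypothetical call chains, reduced to a trace ID and a call time.
chains = [
    {"trace_id": "t1", "call_time": datetime(2020, 1, 1, 9, 59, 30)},
    {"trace_id": "t2", "call_time": datetime(2020, 1, 1, 10, 7)},
]
selected = chains_in_period(chains, period)
```

In this example the chain `t1` is only captured because of the added margin, which is the data-integrity argument made above.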
Based on the above embodiment, since a service request may have multiple abnormal time periods, the call chain set in step S210 is the call chain set within one and the same abnormal time period. It should be noted that the fault root cause network elements of different abnormal time periods may differ, and the embodiment of the present invention only addresses the location of a single fault root cause network element in the service request; the subsequent embodiments therefore assume by default that one call chain set contains only one fault root cause network element, which is not repeated later. It should also be noted that a service request in an abnormal time period may have a plurality of service call chains, not all of which are abnormal. To ensure the integrity of the data, this embodiment determines the call chain set within the abnormal time period and analyzes that set further in order to locate the fault root cause network element.
In an embodiment, referring to fig. 3, fig. 3 is a schematic diagram of a service call chain in an abnormal time period in a method for locating a faulty network element according to another embodiment of the present invention. In the service call chain, each calling layer may have a plurality of nodes, and the nodes are connected in sequence according to their parent-child relationships so as to form a plurality of call subchains, that is, the call chain set of the above embodiment. It should be noted that the naming and arrangement of the nodes in fig. 3 are only examples and do not limit the embodiment of the present invention. For convenience of description, nodes in the embodiment of the present invention are named in the form "microservice@network element". For example, the node name "csf_001@docker_003" means that the name of the microservice is "csf_001" and the name of the network element bearing the microservice is "docker_003". Unless otherwise stated, the same naming mode is adopted in the following examples.
In an embodiment, after each network element call pair in the call chain set is obtained in step S220, the call chain set may be stored in a form of a list or a data dictionary, for example, according to the call chain set shown in fig. 3, the call chain set may be stored in a form shown in table 1, which is convenient for performing subsequent one-to-one matching, and a specific recording manner may be selected according to actual requirements, which is not limited in this embodiment.
Table 1 Example of storing a call chain set (the table is an image in the original publication and is not reproduced here)
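Since Table 1 is not recoverable, the following is only a hedged sketch of one way network element calling pairs could be stored in a data dictionary for later one-to-one matching. The record layout (parent node, child node, latency in milliseconds) and the sample values are assumptions and do not reproduce the contents of Table 1.

```python
from collections import defaultdict

# Hypothetical call records from one abnormal period:
# (parent node, child node, latency in ms). Values are invented.
records = [
    (None, "osb_001@os_021", 120.0),
    ("osb_001@os_021", "csf_001@docker_003", 95.0),
    ("osb_001@os_021", "csf_001@docker_003", 105.0),
]

def build_call_pair_dict(records):
    """Map each (parent, child) calling pair to the metric values observed for it."""
    pairs = defaultdict(list)
    for parent, child, value in records:
        pairs[(parent, child)].append(value)
    return dict(pairs)

call_pairs = build_call_pair_dict(records)
```

Keying the dictionary by the (parent, child) tuple means that a reference calling pair with the same calling relationship can later be found with a single lookup.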
In an embodiment, since the service request also has a call normal time period, and the call chains in the call normal time period all work normally, the network element call pair in the call normal time period can be acquired as a reference call pair, and a data basis is provided for matching of the abnormal network element call pair, for example, a parent-child call relationship of the network element call pair can be acquired, and then a call pair with the same parent-child call relationship is selected from the network element call pair in the normal time period as a reference call pair.
In addition, referring to fig. 4, in an embodiment, the step S240 in the embodiment shown in fig. 2 further includes, but is not limited to, the following steps:
step S241, obtaining a calling index value of a network element calling pair;
step S242, obtaining a reference call index value of the reference call pair;
and step S243, when the difference value between the calling index value and the reference calling index value is greater than or equal to a preset calling index threshold value, determining the network element calling pair as an abnormal network element calling pair.
In an embodiment, the calling index value may be any value capable of reflecting call performance, for example a common golden metric such as the number of service requests, the success rate, or the service time consumption. The specific calling index value is selected according to actual requirements, provided that it allows normal and abnormal network element calling pairs to be distinguished numerically.
Based on the embodiment shown in fig. 3, since the reference call pair is a call pair of the network element in the normal time period, the call index value of the reference call pair can be used as the reference call index value, and the call index value of the call pair of the network element in the abnormal call time period is compared with the reference index value, so that whether the call pair of the network element is an abnormal call pair of the network element can be determined. It should be noted that, the above-mentioned invoking index value and the reference index value may be compared in any statistical manner, such as an average value, a maximum value, a minimum value, and the like, for example, the invoking index value and the average value of a plurality of reference index values are compared, and if a difference between the two values is greater than a preset threshold, the network element invoking pair is determined to be an abnormal network element invoking pair.
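Steps S241 to S243 can be sketched as follows, under several assumptions: the calling index is a latency-like number, both sides are summarized by their average value, the difference is taken as an absolute value, and the comparison uses "greater than or equal to" as in step S243. The function names, node names, and sample values are hypothetical.

```python
from statistics import mean

def is_abnormal_pair(call_values, reference_values, threshold):
    """Flag a pair when |mean(call) - mean(reference)| >= threshold."""
    return abs(mean(call_values) - mean(reference_values)) >= threshold

def find_abnormal_pairs(period_pairs, reference_pairs, threshold):
    """Return the calling pairs of the abnormal period that deviate from their reference."""
    abnormal = []
    for key, values in period_pairs.items():
        # The reference calling pair has the same (parent, child) calling relationship.
        reference = reference_pairs.get(key)
        if reference and is_abnormal_pair(values, reference, threshold):
            abnormal.append(key)
    return abnormal

# Hypothetical latency values in ms for the abnormal and the normal period.
period_pairs = {
    ("A@ne1", "B@ne2"): [300.0, 340.0],
    ("A@ne1", "C@ne3"): [52.0, 48.0],
}
reference_pairs = {
    ("A@ne1", "B@ne2"): [100.0, 110.0, 90.0],
    ("A@ne1", "C@ne3"): [50.0, 50.0],
}
abnormal_pairs = find_abnormal_pairs(period_pairs, reference_pairs, threshold=100.0)
```

Other statistics mentioned above, such as the maximum or minimum value, could be substituted for `mean` without changing the structure of the comparison.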
In addition, referring to fig. 5, in an embodiment, the step S130 in the embodiment shown in fig. 1 further includes, but is not limited to, the following steps:
step S310, obtaining calling parent-child relation of each abnormal network element calling pair in calling abnormal time period;
step S320, determining the abnormal calling subchain according to the calling parent-child relationship.
In an embodiment, since the microservices in the service call chain are called in sequence, every network element other than the root node and the bottom nodes calls one network element while itself being called by another. Those skilled in the art will understand that this call relationship is the calling parent-child relationship, in which the node that makes the call is the parent calling node and the node that is called is the child calling node; the same principle applies to every calling node and is not described again in this embodiment. It should be noted that, because the network elements of each abnormal network element calling pair in the service call chain are in a parent-child relationship, once the abnormal network element calling pairs have been determined they can be spliced according to their parent-child relationships to obtain an abnormal call subchain, for example the abnormal call subchain shown in fig. 6. That abnormal call subchain is formed by some of the network element calling pairs in the call chain set shown in fig. 3; in other words, after the calling pairs shown in fig. 6 are determined to be abnormal network element calling pairs, the abnormal call subchain shown in fig. 6 can be obtained by splicing.
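The splicing of abnormal calling pairs by parent-child relationship can be sketched as building an adjacency map from parent to children and identifying the nodes that are never called within the abnormal set. The function name and the placeholder node names are assumptions for illustration; they do not correspond to the nodes in fig. 6.

```python
from collections import defaultdict

def splice_subchain(abnormal_pairs):
    """Build a parent -> children map from (parent, child) pairs and find the root nodes."""
    children = defaultdict(list)
    called = set()
    for parent, child in abnormal_pairs:
        children[parent].append(child)
        called.add(child)
    # A root of the subchain calls other nodes but is not called by any abnormal pair.
    roots = [parent for parent in children if parent not in called]
    return dict(children), roots

# Hypothetical abnormal calling pairs.
pairs = [("A@ne1", "B@ne2"), ("B@ne2", "C@ne3"), ("B@ne2", "D@ne4")]
children, roots = splice_subchain(pairs)
```

If the abnormal pairs do not form a single connected chain, this sketch returns several roots, one per spliced subchain.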
In addition, referring to fig. 7, in an embodiment, the step S140 in the embodiment shown in fig. 1 further includes, but is not limited to, the following steps:
step S410, determining a calling layer in the abnormal calling subchain and a calling node in the calling layer, and acquiring node network element information of the calling node;
step S420, determining the network element chaos of the calling node in the calling layer according to the node network element information;
step S430, determining a root cause network element of a calling layer according to the network element chaos and a preset rule;
step S440, determining a fault root cause network element in the abnormal calling subchain according to the root cause network element of the calling layer.
In an embodiment, a calling layer is a calling level in the call chain, and the calling nodes can be divided into parent calling nodes and child calling nodes according to the calling parent-child relationship. For example, when the abnormal call subchain shown in fig. 6 is divided into calling layers, the parent calling nodes and child calling nodes of each calling layer may be as shown in fig. 13A. The abnormal call subchain includes three calling layers; for example, the parent calling node of the first layer is "None" and its child calling node is "osb_001@os_021", and so on, which is not described herein again.
In an embodiment, the network element chaos may be used to express how many network elements there are in a calling node and how mixed they are. Common measures of network element chaos are information entropy and Gini impurity, and the specific type may be selected according to actual requirements.
In an embodiment, since the calling nodes in each calling layer are called according to the parent-child relationship, the node causing a fault can be determined according to the network element chaos, and the corresponding root cause network element can be determined from it. However, that root cause network element is only the root cause of its calling layer, not the root cause of the abnormality of the whole call subchain, so in this embodiment, after the root cause network elements are determined, the fault root cause network element needs to be further determined according to the root cause network element of each calling layer. It should be noted that, after a network element fails, the failure gradually propagates to other network elements in the call chain, and a network element that calls the fault root cause network element may, because of the failure, call a plurality of other network elements and thereby cause further call errors. The network element chaos of the fault root cause network element itself is therefore relatively low; that is, following the root cause network element of each calling layer, the root cause finally converges to one network element, namely the fault root cause network element.
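Dividing a subchain into calling layers can be sketched as a level-by-level traversal that records, for every layer, the (parent calling node, child calling node) pairs at that depth. The sketch assumes the subchain is acyclic and has a single root. The root name follows the example in the text, while the other node names and the structure are invented and do not reproduce fig. 13A.

```python
def call_layers(children, root):
    """Return a list of layers; each layer lists the (parent, child) pairs at that depth."""
    # The first layer holds the root, whose parent calling node is None.
    layers = [[(None, root)]]
    frontier = [root]
    while frontier:
        layer = [(parent, child) for parent in frontier for child in children.get(parent, [])]
        if not layer:
            break
        layers.append(layer)
        frontier = [child for _, child in layer]
    return layers

# Hypothetical parent -> children map of an abnormal call subchain (assumed acyclic).
children = {
    "osb_001@os_021": ["csf_001@docker_001", "csf_001@docker_002"],
    "csf_001@docker_001": ["db_001@docker_005"],
}
layers = call_layers(children, "osb_001@os_021")
```

The network element of each node can then be read from the part of the node name after "@" when the per-layer statistics are computed.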
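The two measures named above have standard definitions, which the following sketch applies to the list of network elements observed in one calling layer. The function names are chosen for this example; the input lists reuse the network element names from the worked example for fig. 13B.

```python
from collections import Counter
from math import log2

def entropy(network_elements):
    """Shannon entropy (base 2) of the network element distribution."""
    total = len(network_elements)
    return -sum((n / total) * log2(n / total) for n in Counter(network_elements).values())

def gini_impurity(network_elements):
    """Gini impurity: 1 minus the sum of squared proportions."""
    total = len(network_elements)
    return 1.0 - sum((n / total) ** 2 for n in Counter(network_elements).values())

pure = ["os_022", "os_022"]
mixed = ["docker_002"] * 4 + ["docker_001"] * 4
```

Both measures are 0 when a layer contains a single kind of network element and grow as the network elements become more evenly mixed, which matches the notion of chaos used here.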
Additionally, referring to fig. 8, in an embodiment, step S410 in the embodiment shown in fig. 7 further includes, but is not limited to, the following steps:
step S510, determining the total number of the network elements of a calling layer where the calling node is located and the number of the same type network elements of various network elements in the calling layer where the calling node is located according to the node network element information;
step S520, obtaining the network element ratio of the number of the same type of network elements to the total number of the network elements;
step S530, determining the network element chaos degree of the calling node in the calling layer where the calling node is located according to the network element proportion value.
In an embodiment, since the network element chaos is the degree to which network elements are mixed in a calling node, and a calling node may contain multiple network elements, this embodiment first calculates the network element ratio, that is, the proportion of each type of network element in the calling layer of a calling node, so as to better reflect the degree of mixing.
The following describes the method of this embodiment by way of a specific example with reference to fig. 13B, which is a schematic diagram of the network elements in the parent calling nodes and child calling nodes shown in fig. 13A. The example takes the parent calling nodes of the calling layers shown in fig. 13B; the calculation for the child calling nodes is the same and is not repeated:
as shown in fig. 13B, in the first layer of the parent calling node there is one network element, "None", so the number of same-type network elements is 1, the total number of network elements is 1, and the network element ratio in the first layer is 1/1 = 1. In the second layer of the parent calling node there are two network elements, both "os_022", so the number of same-type network elements is 2, the total number of network elements is 2, and the network element ratio in the second layer is 2/2 = 1. In the third layer of the parent calling node there are eight network elements, four "docker_002" and four "docker_001", so the number of same-type network elements is 4 for each type, the total number of network elements is 8, and the network element ratio in the third layer is 4/8 = 0.5.
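The ratio calculation of steps S510 and S520 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the implementation of the embodiment: the function name `network_element_ratios` and the representation of a calling layer as a list of network element names are hypothetical, and the identifiers are written without the spaces introduced by text extraction.

```python
from collections import Counter


def network_element_ratios(layer_elements):
    """Return {network element type: share of the layer} for one calling layer.

    layer_elements is an assumed representation: one name per network element
    found in the calling layer of a calling node.
    """
    total = len(layer_elements)          # total number of network elements (S510)
    if total == 0:
        return {}                        # assumption: an empty layer has no ratios
    counts = Counter(layer_elements)     # number of same-type network elements (S510)
    return {ne: n / total for ne, n in counts.items()}  # ratio per type (S520)


# Parent calling nodes of the three calling layers described for fig. 13B.
parent_layers = [
    ["None"],
    ["os_022", "os_022"],
    ["docker_002"] * 4 + ["docker_001"] * 4,
]
ratios = [network_element_ratios(layer) for layer in parent_layers]
```

With this input the three layers give ratios of 1, 1 and 0.5 per type, matching the worked example.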
In addition, referring to fig. 9, in an embodiment, step S530 in the embodiment shown in fig. 8 further includes, but is not limited to, the following steps including:
step S610, when the total number of the network elements is less than or equal to a preset network element number threshold and the maximum value of the network element ratios is greater than or equal to a preset ratio threshold, determining the network element chaos of the calling node in the calling layer where the calling node is located to be low;
or,
step S620, when the total number of the network elements is less than or equal to the preset network element number threshold and the maximum value of the network element ratios is less than the preset ratio threshold, determining the network element chaos of the calling node in the calling layer where the calling node is located to be high;
or,
step S630, when the total number of the network elements is greater than the preset network element number threshold, obtaining a mapping relation between the network element ratio and the preset network element chaos, and determining the network element chaos of the calling node in the calling layer where the calling node is located according to the mapping relation and the network element ratio.
In an embodiment, the specific values of the network element number threshold and the ratio threshold may be adjusted according to actual requirements, which is not limited in this embodiment. It should be noted that step S610 and step S620 only compare values against the thresholds and involve no calculation other than the network element ratio.
In an embodiment, the mapping relationship may be a formula that calculates the chaos from the network element ratios. For example, when the network element chaos is measured by the Gini impurity, the mapping relationship may adopt the Gini impurity formula of the prior art:
$$\mathrm{Gini}(p) = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2$$
where $p_k$ is the network element ratio of the k-th type of network element, and $K$ is the number of types of network elements in the calling node. It should be noted that the lower the Gini impurity, the purer the network element composition and the lower the network element chaos; the higher the Gini impurity, the higher the network element chaos.
The following describes, by way of a specific example, the calculation of the network element chaos of the parent calling node of each calling layer shown in fig. 13B. For convenience of description, in this example the network element number threshold is 5, the ratio threshold is 0.8, and the network element chaos is measured by the Gini impurity. Since the calculated Gini impurity is a numerical value, a purity threshold may be set to decide whether the network element chaos is high or low; this example uses 0.3 as the purity threshold:
It should be noted that the network element ratio of each calling layer is calculated as in the embodiment shown in fig. 8, and details are not described here again.
For the first calling layer, there is only one network element, which is less than the network element number threshold, and the network element ratio is 1, which is not less than the ratio threshold, so the network element chaos is low;
for the second calling layer, there are two network elements, which is less than the network element number threshold, and the network element ratio is 1, so the network element chaos is low;
for the third calling layer, there are eight network elements, which is greater than the network element number threshold, so the Gini impurity is calculated by the formula. The calling layer includes two types of network elements, "docker_001" and "docker_002", whose network element ratios are both 0.5, so the Gini impurity is
$$\mathrm{Gini} = 1 - (p_1^2 + p_2^2) = 1 - (0.5^2 + 0.5^2) = 0.5$$
where $p_1$ is the network element ratio of the network element "docker_001" and $p_2$ is the network element ratio of the network element "docker_002"; since 0.5 is greater than the purity threshold 0.3, the network element chaos is high.
Referring to the above calculation manner, the network element chaos of the child calling nodes shown in fig. 13B can be calculated in the same way, which is not described herein again.
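The decision of steps S610 to S630 with the example thresholds (5, 0.8 and 0.3) might be sketched as below. The names `chaos_level`, `gini_impurity` and the three constants are hypothetical. Two points are assumptions of this sketch: the Gini impurity is taken in its usual form, one minus the sum of squared ratios, and an impurity exactly equal to the purity threshold is treated as low, a case the description does not specify.

```python
from collections import Counter

NE_COUNT_THRESHOLD = 5   # example network element number threshold
RATIO_THRESHOLD = 0.8    # example ratio threshold
PURITY_THRESHOLD = 0.3   # example purity threshold applied to the Gini impurity


def gini_impurity(ratios):
    """Gini impurity of a list of network element ratios (assumed standard form)."""
    return 1.0 - sum(p * p for p in ratios)


def chaos_level(layer_elements):
    """Classify the network element chaos of one non-empty calling layer."""
    total = len(layer_elements)
    ratios = [n / total for n in Counter(layer_elements).values()]
    if total <= NE_COUNT_THRESHOLD:
        # S610 / S620: small layers only compare the largest ratio to the threshold.
        return "low" if max(ratios) >= RATIO_THRESHOLD else "high"
    # S630: larger layers use the mapping relation, here the Gini impurity.
    # Assumption: equality with the purity threshold counts as "low".
    return "high" if gini_impurity(ratios) > PURITY_THRESHOLD else "low"


parent_levels = [
    chaos_level(["None"]),
    chaos_level(["os_022", "os_022"]),
    chaos_level(["docker_002"] * 4 + ["docker_001"] * 4),
]
```

For the parent calling nodes of the example this yields low, low and high, in line with the text above.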
It should be noted that, with the above calculation method, the network element chaos of the parent calling nodes and the child calling nodes in fig. 13B is as shown in Table 2:
Calling layer    Parent calling node    Child calling node
First layer      Low                    Low
Second layer     Low                    High
Third layer      High                   Low
Table 2. Network element chaos of the parent and child calling nodes
In addition, referring to fig. 10, in an embodiment, the step S430 in the embodiment shown in fig. 7 further includes, but is not limited to, the following steps:
step S710, determining calling parent-child relationship among all calling nodes in a calling layer;
step S720, determining a father calling node and a son calling node in a calling layer according to the calling father-son relationship;
step S730, acquiring the network element chaos of a father calling node and the network element chaos of a son calling node;
step S740, determining a root cause network element of the calling layer according to the network element chaos of the parent calling node, the network element chaos of the child calling node, and the preset rule.
It should be noted that the parent calling node and the child calling node may be different nodes or the same node. For example, when one calling node includes two network elements and, according to the service call chain, one of these network elements needs to call a microservice carried in the other, the parent calling node and the child calling node determined from the calling parent-child relationship are both that calling node. As will be understood by those skilled in the art, the calling parent-child relationship is defined for each network element calling pair. For example, referring to fig. 6 and fig. 13A, for the abnormal call subchain shown in fig. 6, in the first calling layer the calling node "osb_001@os_022" is the head node, so it has no parent calling node, and its parent calling node is set to "None". In the second calling layer, the calling node "osb_001@os_022" is called by the calling nodes "csf_001@docker_001" and "csf_001@docker_002" respectively, so the second calling layer has two parent-child pairs: the parent calling node "osb_001@os_022" with the child calling node "csf_001@docker_001", and the parent calling node "osb_001@os_022" with the child calling node "csf_001@docker_002". The relationships of the third calling layer follow by analogy and are not described here again.
In an embodiment, referring to fig. 13A and fig. 13B, the parent calling node and the child calling node may be determined according to the specific calling manner of the network elements. For example, in this embodiment the network element corresponding to the calling node "osb_001@os_022" is "os_022"; on this basis, the correspondence between parent and child calling nodes shown in fig. 13B can be obtained from the parent-child relationship between the calling nodes of the calling layers shown in fig. 13A. It can be understood by those skilled in the art that this embodiment takes one network element per calling node as an example; when a calling node contains a plurality of network elements, the parent calling node and the child calling node are determined one by one according to the calling relationships between the network elements, which is not described herein again.
It should be noted that the network element chaos of the parent call node and the child call node may be obtained separately, and the network element chaos is an attribute of the network element itself, so that the network element chaos is not affected by the parent-child call relationship.
In an embodiment, in step S740, the network element chaos of the parent call node and the network element chaos of the child call node may be respectively calculated, and then the root cause network element of the call layer is determined according to a preset rule, and if a mode of traversing the call layer by layer is adopted, the calculation may be performed when the call layer is traversed, and the adjustment may be performed according to actual requirements.
In addition, referring to fig. 11, in an embodiment, the preset rule may include any one of the following rules:
when the network element chaos of the parent calling node is low and the network element chaos of the child calling node is high, updating the root cause network element to the network element in the parent calling node;
or,
when the network element chaos of the parent calling node is high and the network element chaos of the child calling node is low, updating the root cause network element to the network element in the child calling node;
or,
when the network element chaos of the parent calling node and of the child calling node are both low, updating the root cause network element to the network element in the child calling node;
or,
when the network element chaos of the parent calling node and of the child calling node are both high, not updating the root cause network element.
In an embodiment, the preset rule may be formulated according to the traversal manner adopted, as long as the fault root cause network element of the abnormal call subchain can be determined from the root cause network element of each calling layer. For example, when the calling layers are traversed from the top layer and the root cause network element is updated layer by layer, the preset rule described in this embodiment may be adopted. A person skilled in the art may adjust the preset rule according to the traversal manner and the network element chaos, which is not described herein again.
For example, referring to fig. 13B, the network element in the parent calling node of the first calling layer is "None", and the network element in the child calling node is "os_022"; the network element in the parent calling node of the second calling layer is "os_022", and the network elements in the child calling nodes are "docker_002" and "docker_001"; the network elements in the parent calling nodes of the third calling layer are "docker_002" and "docker_001", and the network element in the child calling node is "docker_008". Referring to the network element chaos of the parent calling nodes and child calling nodes shown in Table 2, the root cause network element of each calling layer can be obtained as shown in Table 3:
Figure BDA0002660949030000091
Table 3. Root cause network element of each calling layer
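How the preset rule yields a per-layer root cause network element from the chaos values of Table 2 can be sketched as follows. The function name `layer_root_cause`, the "low"/"high" strings, and the use of `None` (the Python value) to mean "not updated" are assumptions of this sketch; writing a node with two network element types as "docker_002/docker_001" is likewise only a placeholder.

```python
def layer_root_cause(parent_chaos, child_chaos, parent_ne, child_ne):
    """Return the network element selected by the preset rule for one calling layer.

    Returns None when the rule says the root cause network element is not updated.
    """
    if parent_chaos == "low" and child_chaos == "high":
        return parent_ne   # low parent, high child: take the parent's network element
    if parent_chaos == "high" and child_chaos == "low":
        return child_ne    # high parent, low child: take the child's network element
    if parent_chaos == "low" and child_chaos == "low":
        return child_ne    # both low: take the child's network element
    return None            # both high: no update


# (parent chaos, child chaos, parent network element, child network element)
layers = [
    ("low", "low", "None", "os_022"),
    ("low", "high", "os_022", "docker_002/docker_001"),
    ("high", "low", "docker_002/docker_001", "docker_008"),
]
layer_results = [layer_root_cause(*row) for row in layers]
```

Under these assumptions the three layers select "os_022", "os_022" and "docker_008" respectively.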
It should be noted that, with the method for calculating the network element chaos in the foregoing embodiment, the network element chaos of a calling node containing many network elements can be set to high by setting the chaos threshold. When the network element chaos of both the parent calling node and the child calling node is high, the root cause network element is not updated, and the next network element calling pair is used for further judgment.
Referring to fig. 12, in an embodiment, step S440 in the embodiment shown in fig. 7 further includes, but is not limited to, the following steps:
step S810, traversing the calling layer in the abnormal calling subchain;
step S820, determining a fault root network element in the abnormal calling child chain according to the root network element of the calling layer.
In an embodiment, in the service call chain, one failed network element may affect the performance of the entire call chain, so the fault root cause network element cannot be determined simply from index data. Based on this, all network element calling pairs of the abnormal call subchain need to be traversed at least once to determine the fault root cause network element. It should be noted that the direction of traversal may be selected according to actual requirements, for example, starting from the top calling layer, from the bottom calling layer, or from a middle calling layer. For example, with the preset rule of this embodiment, the fault root cause network element can be located by traversing from the top layer.
Traversal starting from the top layer is illustrated below in conjunction with Table 3 and fig. 13B:
according to the root cause network element of each calling layer obtained in Table 3, after the first layer is traversed, the candidate fault root cause network element is "os_022"; after the second layer is traversed, the candidate remains "os_022"; after the third layer is traversed, the candidate is updated to "docker_008". The traversal is then complete, so "docker_008" is determined to be the fault root cause network element of the abnormal call subchain.
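The top-down traversal of steps S810 and S820 can be sketched as a fold over the per-layer root cause network elements. The function name `locate_fault_root_cause` and the convention that `None` marks a layer whose rule result is "not updated" are assumptions made for illustration only.

```python
def locate_fault_root_cause(layer_root_causes):
    """Traverse the calling layers from the top and keep the latest update.

    layer_root_causes lists, top layer first, the root cause network element
    of each calling layer, or None where the preset rule made no update.
    """
    candidate = None
    for layer_result in layer_root_causes:
        if layer_result is not None:
            candidate = layer_result   # update the candidate layer by layer
    return candidate


fault_root = locate_fault_root_cause(["os_022", "os_022", "docker_008"])
```

For the example above the candidate moves from "os_022" to "docker_008" at the third layer, which is the result stated in the text.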
As can be understood by those skilled in the art, after the fault root cause network element is located, the cause of its fault needs to be further mined and analyzed. The mining and analysis method is not an improvement made in this embodiment; the technical solution of the present invention only needs to locate the fault root cause network element, replacing the manual locating method of the prior art and thereby improving the efficiency of locating the fault root cause network element.
In addition, referring to fig. 14, an embodiment of the present invention further provides a faulty network element positioning system, including a service request anomaly detection module and a faulty network element positioning module. The service request anomaly detection module is configured to detect an anomaly occurring in a service request, for example, to execute step S110 in the embodiment shown in fig. 1, and may also be configured to execute the detailed steps thereof, for example, step S210 in the embodiment shown in fig. 2. The faulty network element positioning module includes an abnormal call subchain identification module and a root cause network element acquisition module. The abnormal call subchain identification module is configured to identify and acquire the abnormal call subchain in the service request, for example, to execute step S120 and step S130 in the embodiment shown in fig. 1, and may also be configured to execute the corresponding detailed steps, for example, steps S220 to S240 in the embodiment shown in fig. 2, method steps S310 and S320 in fig. 5, and method steps S710 and S720 in fig. 10. The root cause network element acquisition module is configured to determine the fault root cause network element in the subchain identified by the abnormal call subchain identification module, for example, to execute step S140 in the embodiment shown in fig. 1, and may also be configured to execute the corresponding detailed steps, for example, method steps S241 to S243 in fig. 4, method steps S410 to S440 in fig. 7, method steps S510 to S530 in fig. 8, and method steps S610 to S630 in fig. 9.
It should be noted that the faulty network element positioning module may also be connected to a faulty network element root cause diagnosis module for diagnosing the root cause of the fault. The faulty network element root cause diagnosis module may be a prior-art functional module that further mines and diagnoses the fault cause of the located fault root cause network element; its specific method is outside the scope of improvement of this embodiment and is not described herein again.
It should be noted that each module of the above faulty network element positioning system may be a functional module arranged in a terminal, a server, or another device that can be used for root cause diagnosis, and those skilled in the art may select specific hardware or software functional modules according to actual needs, which is not limited in this embodiment.
In addition, referring to fig. 15, an embodiment of the present invention further provides a system 1000 for locating a faulty network element, where the system 1000 for locating a faulty network element includes: a memory 1020, a processor 1010, and a computer program stored on the memory 1020 and executable on the processor 1010.
The processor 1010 and the memory 1020 may be connected by a bus or other means.
Non-transitory software programs and instructions required to implement the method for locating a faulty network element of the above-described embodiment are stored in the memory 1020, and when executed by the processor 1010, perform the locating of a faulty network element of the above-described embodiment, for example, perform the above-described method steps S110 to S140 in fig. 1, method steps S210 to S240 in fig. 2, method steps S241 to S243 in fig. 4, method steps S310 to S320 in fig. 5, method steps S410 to S440 in fig. 7, method steps S510 to S530 in fig. 8, method steps S610 to S630 in fig. 9, method steps S710 to S740 in fig. 10, and method steps S810 to S820 in fig. 12.
The above-described embodiments of the apparatus are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may also be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
Further, an embodiment of the present invention also provides a computer-readable storage medium storing computer-executable instructions. When the computer-executable instructions are executed by a processor or controller, for example by a processor in the above-described embodiment of the faulty network element positioning system, the processor is caused to execute the method for locating a faulty network element in the above embodiment, for example, to execute the above-described method steps S110 to S140 in fig. 1, method steps S210 to S240 in fig. 2, method steps S241 to S243 in fig. 4, method steps S310 to S320 in fig. 5, method steps S410 to S440 in fig. 7, method steps S510 to S530 in fig. 8, method steps S610 to S630 in fig. 9, method steps S710 to S740 in fig. 10, and method steps S810 to S820 in fig. 12. One of ordinary skill in the art will appreciate that all or some of the steps, systems, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art.
Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as known to those skilled in the art.
While the preferred embodiments of the present invention have been described in detail, it will be understood by those skilled in the art that the foregoing and various other changes, omissions and deviations in the form and detail thereof may be made without departing from the scope of this invention.

Claims (12)

1. A method for positioning a fault network element comprises the following steps:
acquiring a service call chain corresponding to the service request;
determining an abnormal network element calling pair in the service calling chain;
determining an abnormal call subchain according to the abnormal network element call pair;
and determining a fault root cause network element according to the abnormal call subchain and a preset rule.
2. The method as claimed in claim 1, wherein said determining an abnormal network element call pair in said service call chain comprises:
determining a calling abnormal time period, and acquiring a calling chain set of which the calling time in the service calling chain accords with the calling abnormal time period;
acquiring each network element calling pair in the calling chain set;
acquiring a reference calling pair with the same calling relation as the network element calling pair;
and determining an abnormal network element calling pair according to the network element calling pair and the reference calling pair.
3. The method according to claim 2, wherein said determining an abnormal network element calling pair according to said network element calling pair and said reference calling pair comprises:
obtaining a calling index value of the network element calling pair;
acquiring a reference calling index value of the reference calling pair;
and when the difference value of the calling index value and the reference calling index value is greater than or equal to a preset calling index threshold value, determining the network element calling pair as an abnormal network element calling pair.
4. The method according to claim 2 or 3, wherein the determining an abnormal call subchain according to the abnormal network element calling pair comprises:
acquiring the calling parent-child relationship of each abnormal network element calling pair in the calling abnormal time period;
and determining the abnormal call subchain according to the call parent-child relationship.
5. The method according to claim 1, wherein the determining the fault root cause network element according to the abnormal call subchain and the preset rule comprises:
determining a calling layer in the abnormal calling subchain and a calling node in the calling layer, and acquiring node network element information of the calling node;
determining the network element chaos of the calling node in the calling layer according to the node network element information;
determining a root cause network element of the calling layer according to the network element chaos and a preset rule;
and determining a fault root cause network element in the abnormal calling subchain according to the root cause network element of the calling layer.
6. The method as claimed in claim 5, wherein said determining the network element chaos of the calling node in the calling layer according to the node network element information comprises:
determining the total number of the network elements of a calling layer where the calling node is located and the number of the same type network elements of various network elements in the calling layer where the calling node is located according to the node network element information;
acquiring the network element ratio of the number of the same type of network elements to the total number of the network elements;
and determining the network element chaos of the calling node in the calling layer where the calling node is located according to the network element ratio.
7. The method as claimed in claim 6, wherein the determining the network element chaos of the calling node in the calling layer where the calling node is located according to the network element ratio comprises:
when the total number of the network elements is less than or equal to a preset network element number threshold and the maximum value in the network element ratio is greater than or equal to a preset ratio threshold, determining the network element chaos of the calling node in the calling layer where the calling node is located to be low;
or,
when the total number of the network elements is less than or equal to a preset network element number threshold value, and the maximum value in the network element ratio value is less than a preset ratio threshold value, determining that the network element chaos degree of the calling node in a calling layer where the calling node is located is high;
or,
and when the total number of the network elements is greater than the preset network element number threshold, obtaining a mapping relation between the network element proportion value and a preset network element chaos degree, and determining the network element chaos degree of the calling node in a calling layer where the calling node is located according to the mapping relation and the network element proportion value.
8. The method as claimed in claim 7, wherein said determining a root cause network element of the calling layer according to the network element chaos and the preset rule comprises:
determining calling parent-child relations among all calling nodes in the calling layer;
determining a father calling node and a child calling node in the calling layer according to the calling father-son relationship;
acquiring the network element chaos of the father calling node and the network element chaos of the son calling node;
and determining the root cause network element of the calling layer according to the network element chaos of the father calling node, the network element chaos of the son calling node and the preset rule.
9. The method as claimed in claim 8, wherein the preset rule comprises:
when the network element chaos degree of the father calling node is low and the network element chaos degree of the son calling node is high, updating the root cause network element into the network element in the father calling node;
or,
when the network element chaos of the father calling node is high and the network element chaos of the son calling node is low, updating the root cause network element into the network element in the son calling node;
or,
when the network element chaos degrees of the father calling node and the son calling node are both low, updating the root cause network element into the network element in the son calling node;
or,
and when the network element chaos degrees of the father calling node and the son calling node are both high, the root cause network element is not updated.
10. The method according to any one of claims 5 to 9, wherein the determining a faulty root cause network element in the abnormal call child chain according to the root cause network element of the call layer includes:
traversing a calling layer in the abnormal calling subchain;
and determining a fault root cause network element in the abnormal calling subchain according to the root cause network element of the calling layer.
11. A faulty network element location system, comprising: memory, processor and computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of locating a faulty network element according to any one of claims 1 to 10 when executing the computer program.
12. A computer-readable storage medium storing computer-executable instructions for performing the method of any one of claims 1 to 10 for locating a faulty network element.
CN202010904574.7A 2020-09-01 2020-09-01 Fault network element positioning method and system and computer readable storage medium Pending CN114205222A (en)


Publications (1)

Publication Number Publication Date
CN114205222A true CN114205222A (en) 2022-03-18


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160134465A1 (en) * 2014-11-12 2016-05-12 Huawei Technologies Co., Ltd. Service Chain Management Method, Delivery Node, Controller, and Value-Added Service Node
CN108833184A (en) * 2018-06-29 2018-11-16 腾讯科技(深圳)有限公司 Service fault localization method, device, computer equipment and storage medium
CN110351136A (en) * 2019-07-04 2019-10-18 阿里巴巴集团控股有限公司 Fault locating method and device
CN110442641A (en) * 2019-08-06 2019-11-12 中国工商银行股份有限公司 Link topology graph display method, device, storage medium and equipment
CN111597070A (en) * 2020-07-27 2020-08-28 北京必示科技有限公司 Fault positioning method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
US8230269B2 (en) Monitoring data categorization and module-based health correlations
US11269718B1 (en) Root cause detection and corrective action diagnosis system
US9612892B2 (en) Creating a correlation rule defining a relationship between event types
US8655623B2 (en) Diagnostic system and method
US11138163B2 (en) Automatic root cause diagnosis in networks based on hypothesis testing
CN110502494A (en) Log processing method, device, computer equipment and storage medium
US20220148674A1 (en) Memory fault handling method and apparatus, device, and storage medium
US11030038B2 (en) Fault prediction and detection using time-based distributed data
US10177984B2 (en) Isolation of problems in a virtual environment
US8631280B2 (en) Method of measuring and diagnosing misbehaviors of software components and resources
US9122784B2 (en) Isolation of problems in a virtual environment
CN107678908B (en) Log recording method and device, computer equipment and storage medium
US20210065083A1 (en) Method for changing device business and business change system
CN110955550A (en) Cloud platform fault positioning method, device, equipment and storage medium
CN113657715A (en) Root cause positioning method and system based on kernel density estimation calling chain
CN110597655A (en) Fast predictive restoration method for coupling migration and erasure code-based reconstruction and implementation
US20170010948A1 (en) Monitoring a computing environment
CN116414661B (en) Processing method and device for solid state disk of distributed storage
CN114205222A (en) Fault network element positioning method and system and computer readable storage medium
CN115878052A (en) RAID array inspection method, inspection device and electronic equipment
US11914465B2 (en) Tool-guided computing triage probe
US20130173777A1 (en) Mining Execution Pattern For System Performance Diagnostics
CN116318386A (en) Failure prediction method of optical module, system and storage medium thereof
CN109388418A (en) Method and system with the outer firmware for refreshing BOX node server and FRU
CN112486771B (en) Distributed system management method, system, device and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination