CN115348147A - Fault analysis method, apparatus, device, storage medium and program product - Google Patents

Fault analysis method, apparatus, device, storage medium and program product

Info

Publication number
CN115348147A
Authority
CN
China
Prior art keywords
fault
node
network card
causal relationship
leaf node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110522995.8A
Other languages
Chinese (zh)
Inventor
陈斌 (Chen Bin)
陈功 (Chen Gong)
韩见伟 (Han Jianwei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN202110522995.8A
Publication of CN115348147A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H04L41/0636 Root cause analysis based on a decision tree analysis
    • H04L41/0677 Localisation of faults

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Debugging And Monitoring (AREA)

Abstract

Embodiments of the present disclosure provide a fault analysis method, apparatus, device, storage medium, and program product, relating to the field of network communication and in particular to fault analysis of a network card device. For example, the present disclosure provides a fault analysis method. The method may include obtaining data to be analyzed from the network card device. The method may further include evaluating the data to be analyzed and, if the data is determined to satisfy a fault determination condition, determining the leaf node in the causal relationship information that corresponds to that condition. The causal relationship information is used to analyze faults of the network card device, and each causal relationship node in it indicates one fault of the network card device. The method may further include determining an associated node of the leaf node in the causal relationship information, where the leaf node and the associated node are both causal relationship nodes in the causal relationship information. Finally, the method may include determining a fault analysis result for the network card device based at least on the fault indicated by the associated node. With this scheme, faults of the network card device can be detected and analyzed automatically, thereby solving many problems of manual operation and maintenance.

Description

Fault analysis method, apparatus, device, storage medium and program product
Technical Field
Embodiments of the present disclosure generally relate to the field of information technology. More particularly, embodiments of the present disclosure relate to a method, an apparatus, a device, a computer-readable storage medium, and a computer program product for failure analysis of a network card device.
Background
When a network card device fails, operation and maintenance personnel usually collect a series of network metrics and service operation data with various tools, and determine the location and cause of the failure based on the collected data and personal maintenance experience. With the continuing growth of system data, fault alarms, and network data, this manual operation and maintenance mechanism faces serious challenges.
Disclosure of Invention
To automate the failure analysis of network card devices, embodiments of the present disclosure provide a scheme for analyzing failures of a network card device.
In a first aspect of the present disclosure, a fault analysis method is provided. The method comprises: acquiring data to be analyzed from the network card device; if the data to be analyzed is determined to satisfy a fault determination condition, determining the leaf node in the causal relationship information that corresponds to the fault determination condition, where the causal relationship information is used to analyze faults of the network card device and each causal relationship node in it indicates one fault of the network card device; determining an associated node of the leaf node in the causal relationship information, where the leaf node and the associated node are both causal relationship nodes in the causal relationship information; and determining a fault analysis result for the network card device based on the fault indicated by the associated node.
This scheme correlates each item of the data to be analyzed with the fault of a leaf node in the causal relationship information, so that whether the fault indicated by the corresponding leaf node has occurred can be judged from one or more items of that data. Faults of the network card device can therefore be detected and analyzed automatically, solving many problems of manual operation and maintenance. On the one hand, the fault detection process can traverse all possible faults, improving its accuracy and comprehensiveness. On the other hand, after the leaf node corresponding to a fault is determined, the upper-layer associated nodes of the leaf node are further derived from the causal relationship information, so that the root cause of the network card device fault can be determined accurately. The troubleshooting process of the present disclosure thus reduces both the difficulty and the time of troubleshooting for users such as operation and maintenance personnel.
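The flow described above can be sketched in a few lines. This is a minimal illustrative sketch, not the patented implementation: the metric names, node ids, thresholds, and the flat parent table are all hypothetical.

```python
# Hypothetical sketch: match monitored data against fault determination
# conditions, find the triggered leaf node, then derive the upper-layer
# associated nodes layer by layer up to the root cause.

FAULT_CONDITIONS = {
    # leaf-node id -> (metric name, threshold); all values are illustrative
    "gw_short_pkt_loss": ("gateway_short_loopback_loss", 10),
    "gw_short_rtt": ("gateway_short_loopback_rtt_ms", 50),
}

PARENT = {
    # causal relationship: node -> its upper-layer associated node
    "gw_short_pkt_loss": "gateway_loopback_fault",
    "gw_short_rtt": "gateway_loopback_fault",
    "gateway_loopback_fault": "nic_link_fault",
}

def analyze(data: dict) -> list[str]:
    """Return the chain of fault nodes from leaf to root, or [] if healthy."""
    for leaf, (metric, threshold) in FAULT_CONDITIONS.items():
        if data.get(metric, 0) > threshold:
            chain = [leaf]
            while chain[-1] in PARENT:          # walk up the causal tree
                chain.append(PARENT[chain[-1]])
            return chain
    return []

print(analyze({"gateway_short_loopback_loss": 25}))
# -> ['gw_short_pkt_loss', 'gateway_loopback_fault', 'nic_link_fault']
```

The chain returned here corresponds to the leaf node plus its associated nodes; the last element plays the role of the root cause.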
In an implementation of the first aspect, the data to be analyzed of the network card device may be obtained by a user or a computing unit. The computing unit may then determine whether the data to be analyzed satisfies the fault determination condition and, if so, determine the leaf node in the causal relationship information corresponding to that condition. The computing unit may further determine an associated node of the leaf node in the causal relationship information, and automatically determine and output a fault analysis result for the network card device based on the fault indicated by the associated node. Because the data to be analyzed usually has a considerable volume, monitoring it with the computing unit allows the fault analysis result to be determined promptly and accurately without manual intervention.
In one implementation of the first aspect, the causal relationship nodes further include an additional leaf node, the associated node is an upper node of both the leaf node and the additional leaf node, and determining the associated node may include: determining the logical relationship between the leaf node and the additional leaf node based on the causal relationship information; and, if the logical relationship is determined to be an OR, determining the associated node based on the leaf node alone. That is, when the logical relationship is an OR, the associated node, as their upper node, can be determined directly once the fault of the leaf node is established. If instead the logical relationship is determined to be an AND, the computing unit may further determine whether the data to be analyzed also satisfies the fault determination condition corresponding to the additional leaf node; if so, the associated node, as their upper node, may be determined based on both the leaf node and the additional leaf node. In other words, under an AND relationship the associated node can be derived upward only when both leaf nodes satisfy their respective fault determination conditions. By distinguishing the OR and AND logical relationships, the derivation precision of upper-layer associated nodes can be improved, the diversity of the fault tree enriched, and the accuracy of fault analysis improved.
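The OR/AND derivation just described can be sketched as follows. The gate table and node names are illustrative assumptions, not the patented data structure.

```python
# Hypothetical sketch of upward derivation through OR / AND gates.
# A leaf "fires" when its fault determination condition is satisfied.

GATES = {
    # parent (associated) node -> (gate type, child leaf nodes)
    "link_loopback_fault": ("OR",  ["link_short_pkt_loss", "link_long_pkt_loss"]),
    "gw_loopback_fault":   ("AND", ["gw_short_rtt_high", "gw_long_rtt_high"]),
}

def derive(fired_leaves: set[str]) -> list[str]:
    """Return parent nodes whose gate condition is met by the fired leaves."""
    derived = []
    for parent, (gate, children) in GATES.items():
        hits = [c for c in children if c in fired_leaves]
        if gate == "OR" and hits:                           # one child suffices
            derived.append(parent)
        elif gate == "AND" and len(hits) == len(children):  # all must fire
            derived.append(parent)
    return derived

print(derive({"link_short_pkt_loss", "gw_short_rtt_high"}))
# -> ['link_loopback_fault']   (the AND gate is not satisfied: one child fired)
```

Only when both children of the AND gate fire would `gw_loopback_fault` also be derived, matching the "both leaf nodes satisfy their respective conditions" rule above.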
In one implementation of the first aspect, the associated node may include the root node in the causal relationship information and at least one node between the leaf node and the root node. In other words, the fault nodes can be layered, enabling layer-by-layer derivation of faults, which improves the accuracy of fault derivation and reduces its difficulty.
In one implementation of the first aspect, determining the fault analysis result based at least on the fault indicated by the associated node includes: determining a fault graph based on a first fault indicated by the associated node and a second fault indicated by the leaf node, the fault graph comprising the first fault, the second fault, and the causal relationship between them; and taking the fault graph as the fault analysis result. By presenting the fault analysis result to the user in the form of a fault graph, the user can grasp the associations between the fault nodes more intuitively and thus analyze faults more comprehensively.
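A fault graph of this kind can be represented as a small node-and-edge structure, for example as below. The function and field names are illustrative, not from the patent.

```python
# Hypothetical sketch: assemble the fault analysis result as a fault graph,
# i.e. the two faults plus the causal edge between them.

def build_fault_graph(first_fault: str, second_fault: str) -> dict:
    """first_fault: fault indicated by the associated (upper) node;
    second_fault: fault indicated by the triggered leaf node."""
    return {
        "nodes": [first_fault, second_fault],
        # edge direction: leaf-level symptom -> upper-layer cause
        "edges": [(second_fault, first_fault)],
    }

graph = build_fault_graph("gateway loopback fault",
                          "short-packet loss over threshold")
print(graph["edges"])
# -> [('short-packet loss over threshold', 'gateway loopback fault')]
```

With more layers, each associated node contributes one more node and one more causal edge, so the result grows into the fault chain or fault graph mentioned in the description.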
In one implementation manner of the first aspect, the method may further include: comparing the fault analysis result with an actual analysis result, the actual analysis result being determined by a user; and if the fault analysis result is different from the actual analysis result, applying the actual analysis result to a decision condition update model to determine the updated fault decision condition, wherein the decision condition update model is trained based on a training data set, and the training data set comprises a reference analysis result and a labeled reference fault decision condition. In this way, it is possible to update the failure determination condition based on the actual analysis result when the failure analysis result determined by the calculation unit is not satisfactory, thereby making the subsequent failure analysis result more accurate.
In one implementation of the first aspect, the fault determination condition may include at least one of the following: the gateway-loopback short-packet loss count of the network card device is greater than a first threshold count; the gateway-loopback short-packet response time of the network card device is greater than a first threshold response time; the gateway-loopback long-packet loss count of the network card device is greater than a second threshold count; the gateway-loopback long-packet response time of the network card device is greater than a second threshold response time; the link-loopback short-packet loss count of the network card device is greater than a third threshold count; the link-loopback short-packet response time of the network card device is greater than a third threshold response time; the link-loopback long-packet loss count of the network card device is greater than a fourth threshold count; and the link-loopback long-packet response time of the network card device is greater than a fourth threshold response time. By defining determination conditions for each fault classification of the loopback-anomaly fault in detail, automatic and fine-grained fault determination can be realized.
In one implementation of the first aspect, the above fault determination conditions are preloaded into a fault determination module of the computing unit. The fault determination conditions may take the form of a configuration file written in advance by a user, the file containing a number of specific fault determination conditions, each corresponding to a specific fault of the network card device.
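Such a configuration file might look like the JSON fragment below, loaded by the fault determination module at startup. All field names and threshold values are illustrative assumptions, not taken from the patent.

```python
import json

# Hypothetical config covering the eight loopback determination conditions
# from the description: each entry maps a condition id to a metric name and
# a threshold. All ids, metric names, and values are illustrative.
CONFIG_JSON = """
{
  "gw_loopback_short_loss":   {"metric": "gw_short_loss_pkts", "gt": 5},
  "gw_loopback_short_rtt":    {"metric": "gw_short_rtt_ms",    "gt": 20},
  "gw_loopback_long_loss":    {"metric": "gw_long_loss_pkts",  "gt": 5},
  "gw_loopback_long_rtt":     {"metric": "gw_long_rtt_ms",     "gt": 40},
  "link_loopback_short_loss": {"metric": "lk_short_loss_pkts", "gt": 5},
  "link_loopback_short_rtt":  {"metric": "lk_short_rtt_ms",    "gt": 10},
  "link_loopback_long_loss":  {"metric": "lk_long_loss_pkts",  "gt": 5},
  "link_loopback_long_rtt":   {"metric": "lk_long_rtt_ms",     "gt": 30}
}
"""

def satisfied_conditions(data: dict) -> list[str]:
    """Return the ids of all fault determination conditions the data meets."""
    conditions = json.loads(CONFIG_JSON)
    return [cid for cid, c in conditions.items()
            if data.get(c["metric"], 0) > c["gt"]]

print(satisfied_conditions({"gw_short_rtt_ms": 35, "lk_long_loss_pkts": 7}))
# -> ['gw_loopback_short_rtt', 'link_loopback_long_loss']
```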
In one implementation of the first aspect, a number of specific faults and corresponding fault determination conditions, predetermined by a user based on experience or historical data, may be received, and the configuration file of fault determination conditions may be periodically optimized based on them. In this way, the fault determination conditions can be kept accurately set, enabling accurate fault analysis.
In one implementation of the first aspect, the causal relationship information may be a fault tree for performing fault derivation operations. A fault tree is a tree-structured logical cause-and-effect graph; with it, the associated nodes and upper-level nodes of the leaf nodes can be organized systematically, so that more complex fault linkage relationships can be defined.
In an implementation of the first aspect, at least one of the following items of data to be detected is monitored: the gateway-loopback short-packet loss count of the network card device; the gateway-loopback short-packet response time of the network card device; the gateway-loopback long-packet loss count of the network card device; the gateway-loopback long-packet response time of the network card device; the link-loopback short-packet loss count of the network card device; the link-loopback short-packet response time of the network card device; the link-loopback long-packet loss count of the network card device; and the link-loopback long-packet response time of the network card device. If the at least one item of data to be detected is abnormal, it is determined that the item satisfies the corresponding fault determination condition. By monitoring the abnormal states of the related data, faults can be located accurately and promptly.
In one implementation of the first aspect, the fault analysis result for the network card device may be determined based only on the fault indicated by the associated node. Determining the fault analysis result from the fault indicated by the associated node, or by at least part of the associated nodes, exposes the deeper cause of the fault to users, particularly operation and maintenance personnel, and helps them determine a solution quickly.
In one implementation of the first aspect, the fault analysis result may be determined based on the faults indicated by both the leaf node and the associated node. Determining the result from the faults indicated by the associated nodes and the leaf nodes allows a comprehensive fault graph to be presented to users, particularly operation and maintenance personnel, which helps locate the fault accurately.
In an implementation of the first aspect, the computing unit may implement the data acquisition module, the fault determination module, and the fault derivation module at least partly in hardware logic. Alternatively, the computing unit may implement the corresponding functions by reading and executing software code or program instructions in which these modules are stored. In this way, the cause of a network card device failure can be determined automatically.
In one implementation of the first aspect, the data to be analyzed may include operation data and/or log data associated with the network card device. The operation data generally refers to the detected performance or other working parameters of the network card device. The log data generally refers to log event records generated in the operation process of the network card device. In this way, the computing unit can monitor the working state of the network card device more comprehensively, and particularly can detect a fault which can be judged only when the operation data and the log data are abnormal at the same time.
In one implementation of the first aspect, the mapping relationship between the leaf nodes of the fault tree and the fault determination conditions may be loaded into the fault determination module of the computing unit together with the conditions themselves, or the mapping may be stored in a storage module of the computing unit, with the fault determination module querying it on demand. In this way, the correspondence between fault nodes and fault determination conditions is established, enabling automatic fault determination and, in turn, automatic fault derivation.
In a second aspect of the present disclosure, an apparatus for fault analysis is provided. The apparatus comprises functional modules for implementing the first aspect or any one of the implementation manners of the first aspect.
In a third aspect of the disclosure, an electronic device is provided. The electronic device includes: at least one computing unit; and at least one memory coupled to the at least one computing unit and storing instructions for execution by the at least one computing unit, which, when executed by the at least one computing unit, cause the electronic device to perform the method of the first aspect or of any implementation of the first aspect.
In a fourth aspect of the disclosure, a computer-readable storage medium is provided. The computer readable storage medium stores one or more computer instructions for execution by the processor to implement the first aspect or the method in any one of the implementations of the first aspect.
In a fifth aspect of the disclosure, a computer program product is provided. The computer program product comprises computer executable instructions which, when executed by a processor, cause a computer to perform the steps of the first aspect or part or all of the steps of the method in any one of the implementations of the first aspect.
It is to be understood that the apparatus for failure analysis of the second aspect, the electronic device of the third aspect, the computer storage medium of the fourth aspect, or the computer program product of the fifth aspect provided above are all adapted to implement the method provided by the first aspect. Therefore, explanations or illustrations regarding the first aspect are equally applicable to the second, third, fourth, and fifth aspects. In addition, the beneficial effects achieved by the second aspect, the third aspect, the fourth aspect and the fifth aspect may refer to the beneficial effects in the corresponding method, and are not described herein again.
These and other aspects of the invention will be apparent from and elucidated with reference to the embodiment(s) described hereinafter.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. In the drawings, like or similar reference characters designate like or similar elements, and wherein:
FIG. 1 illustrates a schematic diagram of an example system in which various embodiments of the present disclosure can be implemented;
FIG. 2 shows a schematic diagram of a computing unit according to an embodiment of the present disclosure;
FIG. 3 shows a schematic diagram of a mapping of a fault tree to fault decision conditions, according to an embodiment of the present disclosure;
FIG. 4 shows a flow diagram of a process for fault analysis according to an embodiment of the present disclosure;
FIG. 5 shows a schematic block diagram of an apparatus for fault analysis in accordance with some embodiments of the present disclosure; and
FIG. 6 illustrates a block diagram of a computing unit capable of implementing various embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and the embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
In describing embodiments of the present disclosure, the term "include" and its derivatives should be interpreted as open-ended, i.e., "including but not limited to". The term "based on" should be understood as "based at least in part on". The term "one embodiment" or "the embodiment" should be understood as "at least one embodiment". The terms "first," "second," and the like may refer to different or to the same objects. Other explicit and implicit definitions may also appear below. Further, "and/or" herein denotes at least one of several objects; for example, "A and/or B" means "A", "B", or "A and B".
As discussed above, with the growth of system data, fault alarms, and network data, the traditional manual operation and maintenance mechanism suffers from a huge workload and long fault-location times and cannot meet the daily requirements of network systems. In addition, because network failures, and network card device failures in particular, come in many varieties, users need rich operation and maintenance experience and technical skill, which raises the technical threshold of network operation and maintenance work.
To solve the above problems of manual operation and maintenance and to automate the fault analysis of network card devices, the present disclosure provides a scheme that analyzes network card device faults in a fault-layered manner. By way of example, the present disclosure automates fault analysis by defining the various fault patterns hierarchically and by correlating the bottom-layer fault patterns with system data collected in real time. Specifically, when at least part of the system data is determined to satisfy the determination condition corresponding to a bottom-layer fault pattern, the fault patterns that may exist at each layer can be derived upward, layer by layer, from that bottom-layer pattern, forming a complete fault graph or fault chain.
In this way, the fault patterns at every layer can be analyzed and determined automatically and quickly, operation and maintenance personnel no longer need to deduce faults from experience, and more detailed reference information can be provided for the user's subsequent fault handling.
To more accurately describe the concepts of the present disclosure, an example system according to an embodiment of the present disclosure is described in detail below in conjunction with fig. 1.
Example System
According to various embodiments of the present disclosure, a scheme for analyzing network card device failures is provided. In embodiments of the disclosure, data to be analyzed is first acquired from the network card device. Alternatively or additionally, causal relationship information may also be obtained; it is used to analyze faults of the network card device, and each causal relationship node in it may indicate one fault of the network card device. The computing unit may then determine whether the data to be analyzed satisfies a fault determination condition. If the data is determined to satisfy the fault determination condition corresponding to at least one leaf node of the causal relationship information, the computing unit may determine that at least one leaf node. Furthermore, the computing unit may determine the associated node of the at least one leaf node in the causal relationship information. It should be understood that the leaf node and the associated node are both causal relationship nodes in the causal relationship information. A fault analysis result for the network card device is then determined based at least on the fault indicated by the associated node.
Preferably, the fault analysis result for the network card device may be determined based on the fault indicated by the leaf node in addition to the fault indicated by the associated node. It should be understood that a fault analysis result based only on the faults indicated by the associated nodes mainly exposes the deeper cause of the fault, which helps determine a solution quickly; a result based on the faults indicated by both the associated nodes and the leaf nodes can present workers with a comprehensive fault graph, which helps locate the fault accurately. Of course, to save computational resources, the fault analysis result may be generated based only on those associated nodes whose occurrence probability exceeds a threshold probability.
In this manner, embodiments of the present disclosure can systematically determine each fault and its deeper cause, so that every fault situation of the network card device is fully considered and faults can be derived based on the fault tree. In this process, the computing device can monitor the relevant data of the network card device and automatically determine the fault cause and its root cause when the data is abnormal; operation and maintenance personnel need not collect the relevant data, understand the associations among internal faults, or determine specific faults from experience, which reduces operation and maintenance difficulty and saves troubleshooting time.
Fig. 1 illustrates a schematic diagram of an example system 100 in which various embodiments of the present disclosure can be implemented.
As shown in fig. 1, system 100 includes a source machine 110, a gateway 120, and a destination machine 130. The source machine 110 may be a physical machine. A physical machine generally refers to a hardware system as an entity, and may be a hardware device such as a server, a personal computer, and the like. It should be understood that source machine 110 may contain computing resources such as processors and storage resources such as memory in order to implement the respective computing functionality. Further, as shown in fig. 1, the source machine 110 may include a network card device 140 implemented in hardware as a physical machine. It should be understood that although not shown, the source machine 110 may also include an operating system/driver module and a network process module for network interaction therein. The network card device 140 is used to send data packets to the gateway 120 or receive data packets from the gateway 120. Gateway 120 may forward received packets to destination machine 130 or pass data from destination machine 130 to source machine 110 as a connector between different networks. Similar to the source machine, the destination machine 130 may also be a physical machine. It should be understood that a switch is also typically provided between the source 110 and the gateway 120, and the gateway 120 may be connected to the destination 130 typically via a network.
The foregoing describes a conventional physical machine-based, cross-gateway communication deployment in which source machine 110 is communicatively coupled to destination machine 130 via gateway 120. Besides, the source machine 110 and the destination machine 130 can be communicatively connected through a virtual machine-physical machine communication deployment mode. As an example, source machine 110 may be a virtual machine and destination machine 130 may be a physical machine. A virtual machine generally refers to a virtual system having complete hardware system functions, which is simulated by software and runs in a completely isolated environment. As shown in fig. 1, the source machine 110 as a virtual machine may include a virtual network card device 140. The network card device 140 is used to send data packets to the gateway 120 or receive data packets from the gateway 120. Gateway 120 may forward received packets to destination machine 130 or pass data from destination machine 130 to source machine 110 as a connector between different networks. Similarly to the above, a switch is also typically provided between the source machine 110 and the gateway 120, and the gateway 120 may be connected to the destination machine 130 typically via a network. In this example, the source 110, the gateway 120, and the switches disposed therebetween collectively form a source network. Thereby, network communication of the virtual machines across the physical machines can be realized.
Alternatively, the source machine 110 and the destination machine 130 may also be communicatively connected in a mode in which virtual machines on the same physical machine communicate directly over the network. As an example, as shown in fig. 1, the source machine 110 may be a virtual machine and the destination machine 130 may also be a virtual machine, and the source machine 110 and the destination machine 130 may communicate through respective switches and the gateway 120 deployed in the same physical machine. Additionally, the source machine 110 and the destination machine 130 may be communicatively connected through other virtual machine on physical machine modes. As an example, as shown in fig. 1, the source machine 110 may be a virtual machine and the destination machine 130 may also be a virtual machine, and the source machine 110 and the destination machine 130 may communicate through a virtual switch and a virtual gateway 120 deployed in the same physical machine.
It should be noted that the system 100 may also include a computing unit 150, as shown in fig. 1. The computing unit 150 is configured to perform fault detection on the network card device based on the data to be analyzed acquired from the network card device 140. When a processor or other computing resource is disposed in the network card device 140, the computing unit 150 may be disposed in the processor of the network card device 140. Furthermore, the computing unit 150 may also be arranged in a processor of the source machine 110. Alternatively or additionally, the computing unit 150 may also be arranged in the cloud. It should be understood that, wherever it is disposed, the computing unit 150 is communicatively connected to the network card device 140. In some embodiments, the computing unit 150, when disposed in the cloud, may be implemented by a personal computer, server computer, handheld or laptop device, mobile device (such as a mobile phone, personal digital assistant (PDA), or media player), consumer electronics, minicomputer, mainframe computer, cloud computing resource, and the like. It should be understood that the computing unit 150 may be provided in a device for implementing fault analysis or in other devices for implementing corresponding functions through fault monitoring.
To facilitate a better understanding of the present disclosure, an example structure of the computing unit 150 will be described in detail below with reference to fig. 2.
Example Structure of computing Unit
Fig. 2 shows a schematic diagram of the computing unit 150 according to an embodiment of the present disclosure. As shown in fig. 2, the computing unit 150 may implement at least the hardware logic of a data collection module 210, a fault determination module 220, and a fault derivation module 230. Alternatively, the computing unit 150 may implement the corresponding functions by reading and executing software code or program instructions implementing the data collection module 210, the fault determination module 220, and the fault derivation module 230. As an example, when the computing unit 150 is implemented by a processor of the network card device 140, the data collection module 210, the fault determination module 220, and the fault derivation module 230 may be stored in a memory of the network card device 140, and the computing unit 150 reads their software code or program instructions from that memory to implement the corresponding functions.
In some embodiments, the data collection module 210 may transmit the data to be analyzed 240 acquired from the network card device 140 to the failure determination module 220. As an example, when the computing unit 150 implements the hardware logics of the data acquisition module 210, the failure determination module 220, and the failure derivation module 230, the data acquisition module 210 of the hardware logics may be in communication connection with the network card device 140, so as to obtain the data 240 to be analyzed from the network card device 140 in real time. If the data collection module 210 is located in the network card device 140, the data collection module 210 may obtain the data to be analyzed 240 from a log storage module of the network card device 140 or a data transmission line. As another example, when the computing unit 150 reads the software codes or program instructions of the data collection module 210, the fault determination module 220, and the fault derivation module 230 from the memory of the network card device 140 to implement the corresponding functions, the computing unit 150 may implement the data collection function by reading the software codes or program instructions of the data collection module 210 from the corresponding storage unit of the network card device 140, thereby acquiring the data to be analyzed 240 from the log storage module or the data transmission line of the network card device 140.
To implement real-time monitoring of network card failures, the data collection module 210 may transmit the data to be analyzed 240 acquired in real time to the fault determination module 220 via, for example, a data transmission path. As an example, in the case that the computing unit 150 implements the hardware logic of the data collection module 210, the fault determination module 220, and the fault derivation module 230, the hardware logic may receive the data to be analyzed 240 of the network card device 140 in real time. As another example, when the computing unit 150 reads the software code or program instructions of these modules from the memory of the network card device 140 to realize the corresponding functions, the computing unit 150 may realize the fault determination function by reading the software code or program instructions of the fault determination module 220, thereby performing fault determination on the acquired data to be analyzed 240.
It should be understood that the data to be analyzed 240 may include operation data and/or log data associated with the network card device 140. The operation data generally refers to detected performance or other operating parameters of the network card device 140. As an example, the operation data may be the number of gateway loopback short packet losses of the network card device 140, that is, the number of short packets lost in the process of sending short packets from the network card device 140 to the gateway 120 and receiving them back. In addition, the log data generally refers to log event records generated during the operation of the network card device 140. As an example, the log data may indicate that the connection status of the network card device 140 is "Unlink". The data to be analyzed 240 is the basis of the fault analysis process performed by the computing unit 150. By monitoring both the operation data and the log data of the network card device, possible faults of the network card device can be detected more comprehensively, in particular faults that can only be determined when the operation data and the log data are abnormal at the same time.
In some embodiments, after receiving the data to be analyzed 240 from the data collection module 210, the fault determination module 220 performs a fault determination process on the data to be analyzed 240. In some embodiments, the fault determination module 220 may compare each item or each group of data in the data to be analyzed 240, one by one or in parallel, with the preloaded fault determination conditions 250 to determine whether there is data in the data to be analyzed 240 that meets the fault determination conditions 250. As an example, the preloaded fault determination conditions 250 may be provided as a configuration file pre-written by a user that contains several specific fault determination conditions corresponding to specific faults of the network card device 140. For example, after writing a configuration file for the fault determination conditions 250, the user may load the configuration file into the computing unit 150, and the computing unit 150 may store the fault determination conditions 250 in a corresponding storage module. Alternatively, the user may store the written configuration file directly in the corresponding storage module so that the computing unit 150 can invoke the fault determination conditions 250 therein. Thus, the fault determination module 220 can acquire the latest version of the fault determination conditions 250 from the storage module via a data transmission path such as a bus in order to perform fault determination on the data to be analyzed 240. It should be understood that the plurality of fault determination conditions and the associated specific faults in the fault determination conditions 250 may be predetermined by a user.
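As an illustration of this comparison step, the following minimal Python sketch evaluates sample data against preloaded conditions; all metric names, thresholds, and function names here are illustrative assumptions rather than part of the disclosure:

```python
# Hypothetical sketch: matching collected data against preloaded
# fault determination conditions. Names and thresholds are illustrative.

# Each condition maps a metric in the data to a predicate.
fault_conditions = {
    "gateway loopback short packet loss": lambda v: v > 0,
    "gateway short packet loopback response time (ms)": lambda v: v > 10,
}

def match_conditions(data_to_analyze, conditions):
    """Return the names of all conditions satisfied by the data."""
    return [name for name, pred in conditions.items()
            if name in data_to_analyze and pred(data_to_analyze[name])]

sample = {"gateway loopback short packet loss": 3,
          "gateway short packet loopback response time (ms)": 4}
print(match_conditions(sample, fault_conditions))
# → ['gateway loopback short packet loss']
```

In such a sketch, reloading the configuration file would simply rebuild the `fault_conditions` dictionary, mirroring how the module fetches the latest version of the conditions from the storage module.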
For example, the computing unit 150 may receive a plurality of specific faults and corresponding fault determination conditions predetermined by a user based on experience or historical data, and periodically optimize the configuration file of the fault determination conditions 250 according to a predetermined policy based on the specific faults and the corresponding fault determination conditions. The optimization process of the fault determination condition 250 will be described in detail below.
Thus, if the fault determination module 220 determines that the data to be analyzed 240 satisfies a certain one of the fault determination conditions 250, the fault node corresponding to that condition may be recorded as the leaf node information 260. In turn, the fault derivation module 230 can further determine the associated node of the faulty node in the fault tree 270 based on the determined leaf node information 260 and the preloaded fault tree 270. It should be understood that the present disclosure is not limited to using the fault tree 270 for fault derivation; other data structures, such as a lookup table, may also be used. By way of example, it should be understood that each node in the fault tree 270 may be used to indicate a fault of the network card device 140. Thus, the computing unit 150 may derive an associated node indicating a higher level fault based on the one or more detected specific faults and the preloaded fault tree 270. The computing unit 150 may output a fault analysis result 280 for the network card device 140 based on the fault indicated by the associated node.
To facilitate a better understanding of the present disclosure, an example structure of the fault tree 270 loaded at the fault derivation module 230 will be described in detail below with reference to fig. 3.
Example Structure of Fault Tree and Fault determination Condition
Fig. 3 shows a schematic diagram of a mapping 300 between the fault tree 270 and the fault determination conditions 250 according to an embodiment of the disclosure. It should be understood that the mapping 300 may be loaded in the fault determination module 220 along with the fault determination conditions 250, or the mapping 300 may be stored in a storage module of the computing unit 150 and the fault determination module 220 may query the mapping 300 from the storage module as needed.
As shown in fig. 3, the fault tree 270 may contain nodes at multiple levels. As an example, the fault tree 270 may include a root level node (typically a single node, such as the "network card transmission fault" node in fig. 3) 271, a plurality of high level nodes 272, a plurality of middle level nodes 273, and a plurality of leaf nodes 274. The root level node 271 may represent the broadest classification of the corresponding fault type. As an example, when the network card device 140 is the object to be monitored, the broadest classification of its fault type may be "network card transmission fault". The plurality of high level nodes 272, as downstream branches of this broadest classification, may contain several more detailed fault classifications. For example, the root level node 271 "network card transmission fault" may branch into a plurality of high level nodes 272 such as "loopback exception", "buffer fault", "too many error messages", "too many short packets", "network bandwidth overload", "network card misconfiguration", and "driver exception".
Accordingly, each high level node 272 may further branch into more detailed fault classifications. As shown in fig. 3, the "loopback exception" node as a high level node 272 may correspond to the "gateway loopback exception" node and the "destination loopback exception" node as middle level nodes 273. The "buffer fault" node as a high level node 272 may correspond to the "receive buffer exception" node and the "transmit buffer exception" node as middle level nodes 273. The "too many error messages" node as a high level node 272 may correspond to the "too many received packet errors" node and the "too many transmitted packet errors" node as middle level nodes 273.
It should be understood that the downstream nodes of a high level node 272 may include some leaf nodes 274 in addition to middle level nodes 273. As shown in fig. 3, the "too many short packets" node as a high level node 272 may correspond to the "too many short packets received" node and the "too many short packets transmitted" node as leaf nodes 274. The "network bandwidth overload" node as a high level node 272 may correspond to the "receiving network bandwidth overload" node and the "transmitting network bandwidth overload" node as leaf nodes 274. The "network card misconfiguration" node as a high level node 272 may correspond to the "working state fault" node and the "network speed adaptive configuration exception" node as leaf nodes 274. The "driver exception" node as a high level node 272 may correspond to the "driver version mismatch" node as a leaf node 274.
Accordingly, a middle level node 273 may further branch into more detailed fault classifications. As shown in fig. 3, the "gateway loopback exception" node (i.e., an anomaly occurring while a data packet sent from the source machine 110 to the gateway 120 is returned to the source machine 110) as a middle level node 273 may correspond to the "short packet loss" node, the "short packet response time too long" node, the "long packet loss" node, and the "long packet response time too long" node as leaf nodes 274. The "destination loopback exception" node (i.e., an anomaly occurring while a data packet sent from the source machine 110 to the destination machine 130 is returned to the source machine 110) as a middle level node 273 may correspond to the "short packet loss" node, the "short packet response time too long" node, the "long packet loss" node, and the "long packet response time too long" node as leaf nodes 274. The "receive buffer exception" node as a middle level node 273 may correspond to the "receive packet exception discard" node and the "receive overflow error" node as leaf nodes 274. The "transmit buffer exception" node as a middle level node 273 may correspond to the "transmit packet exception discard" node and the "transmit overflow error" node as leaf nodes 274. The "too many received packet errors" node as a middle level node 273 may correspond to the "too many received error packets" node and the "too many received frame errors" node as leaf nodes 274. The "too many transmitted packet errors" node as a middle level node 273 may correspond to the "too many transmission error packets" node, the "too many transmission collision packets" node, and the "too many transmission carrier errors" node as leaf nodes 274.
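The multi-level structure described above can be sketched as a simple child-to-parent map; the node names follow fig. 3, while the dict encoding and function names are assumptions for illustration only:

```python
# Hypothetical sketch of the fault tree 270 as a child -> parent map.
# Node names follow fig. 3; the dict encoding is an assumption.
fault_tree_parent = {
    "gateway loopback exception": "loopback exception",
    "destination loopback exception": "loopback exception",
    "loopback exception": "network card transmission fault",
    "too many received error packets": "too many received packet errors",
    "too many received frame errors": "too many received packet errors",
    "too many received packet errors": "too many error messages",
    "too many error messages": "network card transmission fault",
}

def path_to_root(node):
    """Walk parent links from a leaf node up to the root node."""
    path = [node]
    while path[-1] in fault_tree_parent:
        path.append(fault_tree_parent[path[-1]])
    return path

print(path_to_root("too many received frame errors"))
# → ['too many received frame errors', 'too many received packet errors',
#    'too many error messages', 'network card transmission fault']
```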
It should be understood that, by querying the mapping 300, each leaf node 274 has a one-to-one correspondence with a respective one of the fault determination conditions 250. As an example, the leaf node "short packet loss" (gateway) corresponds to the determination condition "number of gateway loopback short packet losses > 0". The leaf node "short packet response time too long" (gateway) corresponds to the determination condition "gateway short packet loopback response time > 10 ms". The leaf node "long packet loss" (gateway) corresponds to the determination condition "number of gateway loopback long packet losses > 0". The leaf node "long packet response time too long" (gateway) corresponds to the determination condition "gateway long packet loopback response time > 20 ms". The leaf node "short packet loss" (link) corresponds to the determination condition "number of link loopback short packet losses > 0". The leaf node "short packet response time too long" (link) corresponds to the determination condition "link short packet loopback response time > 50 ms". The leaf node "long packet loss" (link) corresponds to the determination condition "number of link loopback long packet losses > 0". The leaf node "long packet response time too long" (link) corresponds to the determination condition "link long packet loopback response time > 50 ms". The leaf node "receive packet exception discard" corresponds to the determination condition "number of receive buffer lost packets / number of received packets > 0.1%". The leaf node "receive overflow error" corresponds to the determination condition "number of receive buffer overflow packets / number of received packets > 0.1%". The leaf node "transmit packet exception discard" corresponds to the determination condition "number of transmit buffer lost packets / number of transmitted packets > 0.1%". The leaf node "transmit overflow error" corresponds to the determination condition "number of transmit buffer overflow packets / number of transmitted packets > 0.1%". The leaf node "too many received error packets" corresponds to the determination condition "number of received damaged packets / number of received packets > 0.1%". The leaf node "too many received frame errors" corresponds to the determination condition "number of received frame errors / number of received packets > 0.1%". The leaf node "too many transmission error packets" corresponds to the determination condition "number of transmitted damaged packets / number of transmitted packets > 0.1%". The leaf node "too many transmission collision packets" corresponds to the determination condition "number of transmission collisions / number of transmitted packets > 0.1%". The leaf node "too many transmission carrier errors" corresponds to the determination condition "number of transmission carrier errors / number of transmitted packets > 0.01%".
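A few of the ratio-based determination conditions listed above can be encoded, for illustration, as a table of (numerator counter, denominator counter, threshold) entries; all counter names here are hypothetical:

```python
# Hypothetical encoding of some leaf-node determination conditions.
# Each entry: (numerator counter, denominator counter, ratio threshold).
# 0.1% = 0.001 and 0.01% = 0.0001, per the conditions listed above.
ratio_conditions = {
    "receive packet exception discard": ("rx_buffer_dropped", "rx_packets", 0.001),
    "receive overflow error": ("rx_buffer_overflow", "rx_packets", 0.001),
    "too many transmission carrier errors": ("tx_carrier_errors", "tx_packets", 0.0001),
}

def faulty_leaves(counters):
    """Return leaf nodes whose ratio condition is met by the counters."""
    hits = []
    for leaf, (num, den, threshold) in ratio_conditions.items():
        if counters.get(den, 0) > 0 and counters.get(num, 0) / counters[den] > threshold:
            hits.append(leaf)
    return hits

counters = {"rx_packets": 10000, "rx_buffer_dropped": 25,
            "rx_buffer_overflow": 0, "tx_packets": 10000, "tx_carrier_errors": 0}
print(faulty_leaves(counters))
# → ['receive packet exception discard']  (25/10000 = 0.25% > 0.1%)
```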
By defining in detail the determination conditions for each of the fault classifications, automatic and fine-grained fault determination can be realized. In addition, by layering the fault nodes, faults can be derived layer by layer, which improves the accuracy of fault derivation and reduces its difficulty.
It should be understood that when the failure determination module 220 of the computing unit 150 determines that at least part of the data to be analyzed 240 satisfies part of the failure determination conditions 250, a corresponding leaf node may be determined from the leaf nodes 274, and the failure derivation module 230 of the computing unit 150 may determine an upper-level associated node, e.g., a parent node, of the above-mentioned leaf node based on the failure tree 270 described in detail above.
In some embodiments, as shown in fig. 3, each leaf node in the fault tree 270 corresponds to exactly one fault determination condition, and a combination of one or more lower level nodes may derive an upper level associated node. As an example of the "or" relationship, as shown in fig. 3, either the leaf node "too many received error packets" or the leaf node "too many received frame errors" can by itself derive the upper level associated node "too many received packet errors".
It should be appreciated that, when deriving associated nodes upward from leaf nodes, if the relationships along the derivation path are all "or", the root node in the fault tree 270 may be derived directly from one or more leaf nodes. As an example, the computing unit 150 may determine the logical relationship between the leaf node "too many received error packets" and the leaf node "too many received frame errors" according to causal relationship information such as the fault tree 270. If the logical relationship is determined to be "or", the computing unit 150 may determine the associated node based on either of these two leaf nodes. That is, once the fault of one of the leaf nodes is determined, the associated node that is its upper node can be determined directly.
Further, although not shown in fig. 3, an upper level associated node may also be derived through an "and" relationship. As an example, the upper level associated node "too many network card receive collision messages" can be derived only when the conditions of the leaf node "the number of collision messages received per second reaches a predetermined threshold" and the leaf node "the network card operating mode is half duplex" are both satisfied. It should be understood that, when an "and" relationship exists in the derivation path, the operation of deriving associated nodes upward may stop when the corresponding derivation relationship is not satisfied. Alternatively or additionally, the number of levels derived upward, whether through an "or" or an "and" relationship, may be limited by a threshold, for example, to derive faults of at most three levels. As an example, if the logical relationship is determined to be "and", the computing unit 150 may determine whether the leaf node "the number of collision messages received per second reaches a predetermined threshold" and the leaf node "the network card operating mode is half duplex" both satisfy their conditions, and if so, may determine "too many network card receive collision messages" as the associated node that is their upper node based on the two leaf nodes.
By defining the "or" and "and" logical relations, the derivation precision of upper level associated nodes can be improved, the diversity of the fault tree can be enriched, and the accuracy of fault analysis can be improved.
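The layer-by-layer derivation with "or" and "and" relations described above can be sketched as follows; the gate encoding, node names, and function names are illustrative assumptions:

```python
# Hypothetical sketch of layer-by-layer upward derivation with "or"/"and"
# gates. Each associated node is derived from its children through a gate.
gates = {
    "too many received packet errors": (
        "or", ["too many received error packets",
               "too many received frame errors"]),
    "too many network card receive collision messages": (
        "and", ["collisions per second reach threshold",
                "network card operating mode is half duplex"]),
}

def derive(faulty):
    """Derive upper level associated nodes from a set of faulty leaves."""
    derived = set(faulty)
    changed = True
    while changed:
        changed = False
        for node, (gate, children) in gates.items():
            if node in derived:
                continue
            hits = [c for c in children if c in derived]
            # An "or" gate fires from any child; "and" needs every child.
            if (gate == "or" and hits) or (gate == "and" and len(hits) == len(children)):
                derived.add(node)
                changed = True
    return derived

print("too many received packet errors"
      in derive({"too many received frame errors"}))           # → True
print("too many network card receive collision messages"
      in derive({"collisions per second reach threshold"}))    # → False
```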
Optimization of fault determination conditions
As previously described, the fault determination conditions 250 are predetermined by the user based on experience or historical data. Thus, an inaccurate fault analysis result 280 may still be determined based on individual ones of the fault determination conditions 250. For this reason, optimization of the fault determination conditions is required.
In certain embodiments, updated fault determination conditions 250 may be determined based on fault analysis results 280 by a pre-trained machine learning model. Specifically, the update optimization of the failure determination condition 250 may be divided into two phases: a training phase and an application phase of the machine learning model.
In the model training phase, the machine learning model may be trained using a training data set based on labeled historical data. As an example, when the user finds that a fault analysis result 280 is inaccurate or unsatisfactory, the fault analysis result may be corrected. Thus, a large amount of historical data may be collected, and the machine learning model may be trained using the corrected fault analysis results as inputs of the machine learning model and the corresponding labeled fault determination conditions as outputs of the machine learning model.
In the model application phase, a fault analysis result 280 determined to be inaccurate may be input into the trained machine learning model, and the result output by the machine learning model is the updated fault determination condition 250. In this way, automatic optimization of the fault determination conditions can be realized, improving the accuracy of fault analysis.
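As a loose illustration of the two phases (not the actual model, which the disclosure leaves unspecified), a trivial lookup can stand in for the decision-condition update model; every name and value here is hypothetical:

```python
# Highly simplified stand-in for the decision-condition update model:
# a lookup "trained" from (corrected analysis result, labeled condition)
# pairs. A real implementation would use an actual machine learning model.
def train(history):
    """history: list of (corrected_result, labeled_condition) pairs."""
    return dict(history)

def apply_model(model, inaccurate_result):
    """Return the updated fault determination condition, if known."""
    return model.get(inaccurate_result)

model = train([("gateway loopback short packet loss, no real fault",
                "gateway loopback short packet loss > 5")])
print(apply_model(model, "gateway loopback short packet loss, no real fault"))
# → gateway loopback short packet loss > 5
```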
Example Processes, apparatus, and devices
Fig. 4 shows a flow diagram of a process 400 for fault analysis according to an embodiment of the present disclosure. In certain embodiments, the process 400 may be implemented in the computing unit 150 of figs. 1 and 2, as well as in the device illustrated in fig. 6. The process 400 for fault analysis according to an embodiment of the present disclosure is now described with reference to figs. 1 and 2. For ease of understanding, the specific examples set forth in the following description are intended to be illustrative and are not intended to limit the scope of the disclosure.
At block 402, the data to be analyzed of the network card device 140 may be obtained by a user or by the computing unit 150. In addition, the computing unit 150 may further obtain causal relationship information, such as the fault tree 270, for analyzing faults of the network card device 140. Each node in the fault tree 270 may be used to indicate a type of fault of the network card device 140. In some embodiments, the computing unit 150 may obtain the operation data of the network card device 140 in real time through the data collection module 210. As an example, the data of the network card device 140 may include operation data and/or log data of the network card device 140. The operation data may be performance or other operating parameters of the network card device 140, and the log data may be log event records generated by the network card device 140 during operation. Alternatively or additionally, the computing unit 150 may also obtain additional modules defined by the user, so that fault determination conditions applicable to different application scenarios may be added to the computing unit 150. As an example, the additional module may be a loopback anomaly detection module. That is, when a user needs to update or modify the fault determination conditions 250, the user may mount additional modules to the fault determination module 220 in the computing unit 150 in order to update one or more of the fault determination conditions 250.
At block 404, the computing unit 150 may compare the data to be analyzed with the plurality of fault determination conditions 250 corresponding to the fault tree 270 to determine whether data meeting the fault determination conditions 250 exists in the data to be analyzed. If it is determined that the data to be analyzed meets the fault determination condition corresponding to at least one leaf node of the fault tree 270, the process proceeds to block 406. As an example, the data to be analyzed may be the number of gateway loopback short packet losses of the network card device 140. If the obtained number of gateway loopback short packet losses is greater than 0, it can be determined that the data to be analyzed meets a fault determination condition. Thus, at block 406, the computing unit 150 may determine the leaf node corresponding to that fault determination condition from the fault tree 270 and determine the parent node of the leaf node based on the fault tree 270; if the parent node in turn has an upper level node, the computing unit 150 may continue upward until the root node, or a node located at a higher level of the fault tree, i.e., the associated node, is derived.
At block 408, the computing unit 150 may further determine the associated node of the leaf node in the fault tree 270. It should be understood that the leaf node and the associated node are both causal nodes in causal relationship information such as the fault tree 270. Further, at block 410, the computing unit 150 may output the fault analysis result 280 for the network card device 140 based at least on the fault indicated by the associated node (and possibly also on the fault indicated by the leaf node). As an example, as shown in fig. 3, if the computing unit 150 determines the leaf nodes "too many received error packets" and "too many transmission error packets" based on the fault determination conditions, the computing unit 150 may determine, based on the fault tree 270, the middle level node "too many received packet errors" as the parent of the former and the middle level node "too many transmitted packet errors" as the parent of the latter, may further determine the high level node "too many error messages" based on either middle level node, and may finally determine the root level node "network card transmission fault". Thus, the computing device 150 may present to the user, through an output unit such as a display screen, a fault map or fault chain such as "too many received error packets and too many transmission error packets → too many received packet errors and too many transmitted packet errors → too many error messages → network card transmission fault" as the fault analysis result 280.
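The derivation in this example can be sketched end to end as follows; the node names follow fig. 3, while the encoding and function names are assumptions for illustration:

```python
# Hypothetical sketch producing a "fault chain" from faulty leaf nodes.
# Child -> parent links follow the example in fig. 3.
parents = {
    "too many received error packets": "too many received packet errors",
    "too many transmission error packets": "too many transmitted packet errors",
    "too many received packet errors": "too many error messages",
    "too many transmitted packet errors": "too many error messages",
    "too many error messages": "network card transmission fault",
}

def fault_chain(leaves):
    """Group faulty nodes level by level into a fault chain string."""
    levels, frontier = [sorted(leaves)], set(leaves)
    while frontier:
        frontier = {parents[n] for n in frontier if n in parents}
        if frontier:
            levels.append(sorted(frontier))
    return " -> ".join(" and ".join(level) for level in levels)

print(fault_chain({"too many received error packets",
                   "too many transmission error packets"}))
```

Running this prints the four-level chain from the two leaf nodes up to the root node "network card transmission fault", analogous to the fault chain presented to the user above.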
In this manner, the problems of manual operation and maintenance are avoided, and automated fault analysis of the network card device is realized. By associating each piece of the data to be analyzed of the network card device 140 with a fault of a leaf node in the fault tree 270, it may be determined, based on one or more pieces of the data to be analyzed, whether the fault indicated at the corresponding leaf node has occurred. Therefore, the fault modes at all levels can be analyzed and determined automatically and quickly, users such as operation and maintenance personnel do not need to perform fault derivation from experience, and more detailed reference information can be provided for the users' subsequent fault handling. Specifically, on the one hand, the fault detection process provided by the above embodiment can effectively traverse all possible faults, thereby improving the accuracy and comprehensiveness of the fault detection process. On the other hand, after determining the leaf node corresponding to the fault, the embodiment further derives the upper level associated node of the leaf node based on the fault tree, so that the root cause of the fault of the network card device can be determined accurately. The fault derivation process thus reduces the troubleshooting difficulty for users such as operation and maintenance personnel and shortens troubleshooting time.
In some embodiments, the fault analysis result 280 may be presented to the user in a brief mode. As an example, only the nodes at each level of the fault tree 270 that satisfy a fault condition may be presented. Alternatively or additionally, the fault analysis result 280 may also be presented in a detailed mode. As an example, the user is presented with the complete fault tree 270 regardless of whether any faults are detected, while the nodes determined to be faulty may be distinguished by a special symbol or color. In this way, the user can observe the association relationships between faults, namely the "fault map" or the "fault chain", which provides detailed reference information for the subsequent fault resolution process.
Fig. 5 shows a schematic block diagram of an apparatus 500 for fault analysis according to some embodiments of the present disclosure. As shown in fig. 5, the apparatus 500 may include an obtaining module 502 configured to obtain the data 240 to be analyzed of the network card device 140 and the fault tree 270 for analyzing the fault of the network card device 140. In some embodiments, each node in the fault tree 270 indicates a fault of the network card device 140. The apparatus 500 may also include a leaf node determination module 504. When it is determined that the data to be analyzed 240 satisfies the failure determination condition 250, the leaf node determination module 504 may determine a leaf node in the failure tree 270 corresponding to the failure determination condition 250. The apparatus 500 may further include an associated node determining module 506 for determining associated nodes of leaf nodes in the fault tree 270. The apparatus 500 may further include a fault analysis result determination module 508. The failure analysis result determination module 508 may determine the failure analysis result 280 for the network card device 140 based on the failure indicated by the associated node.
In some embodiments, the leaf nodes may include a first leaf node and a second leaf node, and the associated node determination module 506 may be further configured to: determine the associated node if it is determined that the data to be analyzed 240 satisfies the fault determination condition corresponding to the first leaf node and the fault determination condition corresponding to the second leaf node.
In some embodiments, the associated nodes may include at least one of a root node of the fault tree 270, a parent node of the leaf node, or another node between the leaf node and the root node.
In some embodiments, the data to be analyzed 240 may include at least one of operation data and log data of the network card device 140.
In some embodiments, the fault determination condition may include at least one of the following: the number of lost packets of a gateway loopback short packet of the network card device is greater than a first threshold number; the gateway short-packet loopback response time of the network card device is greater than a first threshold response time; the number of lost packets of a gateway loopback long packet of the network card device is greater than a second threshold number; the gateway long-packet loopback response time of the network card device is greater than a second threshold response time; the number of lost packets of a link loopback short packet of the network card device is greater than a third threshold number; the link short-packet loopback response time of the network card device is greater than a third threshold response time; the number of lost packets of a link loopback long packet of the network card device is greater than a fourth threshold number; and the link long-packet loopback response time of the network card device is greater than a fourth threshold response time.
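Conditions of this shape are simple threshold comparisons over collected metrics. The sketch below is a hedged illustration: the metric names and threshold values are invented, and a real deployment would tune the thresholds per device.

```python
# Hypothetical threshold table: metric name -> threshold. A fault
# determination condition is satisfied when the measured value exceeds
# its threshold. All names and values here are illustrative.
THRESHOLDS = {
    "gw_short_loopback_lost_pkts": 0,    # e.g. a "first threshold number"
    "gw_short_loopback_rt_ms": 10.0,     # e.g. a "first threshold response time"
    "link_long_loopback_lost_pkts": 5,   # e.g. a "fourth threshold number"
}

def check_conditions(metrics):
    """Return the names of all fault determination conditions the data satisfies."""
    return [name for name, limit in THRESHOLDS.items()
            if metrics.get(name, 0) > limit]

sample = {"gw_short_loopback_lost_pkts": 3, "gw_short_loopback_rt_ms": 4.2,
          "link_long_loopback_lost_pkts": 12}
print(check_conditions(sample))
```

Each satisfied condition name would then be mapped to its corresponding leaf node in the fault tree.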
In some embodiments, the apparatus 500 may further include: an analysis result comparison module configured to compare the fault analysis result with an actual analysis result determined by a user; and an updating module configured to, if the fault analysis result differs from the actual analysis result, apply the actual analysis result to a determination condition update model to determine an updated fault determination condition, wherein the determination condition update model is trained with reference analysis results as its input and labeled reference fault determination conditions as its output.
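The feedback loop above can be sketched as follows. Note the heavy caveat: the trained update model is stubbed here as a trivial lookup, and every name, condition, and mapping is an assumption made for illustration only.

```python
# Illustrative feedback loop: when the automated result disagrees with the
# user's actual analysis, feed the actual result to a (stubbed) decision-
# condition update model to obtain an updated fault determination condition.
def update_model(actual_result):
    """Stub for the trained model mapping an analysis result to a condition."""
    learned = {"link_fault": {"link_long_loopback_lost_pkts": 3}}
    return learned.get(actual_result)  # None if the model has no update

def maybe_update(predicted, actual, current_condition):
    if predicted == actual:
        return current_condition        # results agree: keep current condition
    updated = update_model(actual)      # results differ: derive updated condition
    return updated if updated is not None else current_condition

cond = {"link_long_loopback_lost_pkts": 5}
print(maybe_update("no_fault", "link_fault", cond))
```

In the described embodiment this loop lets the determination conditions track operator judgment over time instead of staying fixed at their initial thresholds.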
Fig. 6 illustrates a schematic block diagram of an example device 600 that can be used to implement embodiments of the present disclosure. As shown in fig. 6, the device 600 comprises a computing unit 601, which may perform various suitable actions and processes according to computer program instructions stored in a Random Access Memory (RAM) 603 and/or a Read Only Memory (ROM) 602, or loaded from a storage unit 608 into the RAM 603 and/or ROM 602. In the RAM 603 and/or the ROM 602, various programs and data required for the operation of the device 600 can also be stored. The computing unit 601 and the RAM 603 and/or the ROM 602 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
A number of components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, a mouse, and the like; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 601 performs the various methods and processes described above, such as the process 400. For example, in some embodiments, process 400 may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the RAM 603 and/or the ROM 602 and/or the communication unit 609. When the computer program is loaded into RAM 603 and/or ROM 602 and executed by computing unit 601, one or more steps of process 400 described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the process 400 in any other suitable manner (e.g., by way of firmware).
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. This program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the discussion above, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate implementations can also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple implementations separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims (17)

1. A method of fault analysis, the method comprising:
acquiring data to be analyzed of a network card device;
if the data to be analyzed satisfies a fault determination condition, determining a leaf node corresponding to the fault determination condition in causal relationship information, wherein the causal relationship information is used for analyzing a fault of the network card device, and each causal relationship node in the causal relationship information indicates one fault of the network card device;
determining an associated node of the leaf node in the causal relationship information, wherein the leaf node and the associated node are both causal relationship nodes in the causal relationship information; and
determining a fault analysis result for the network card device based at least on the fault indicated by the associated node.
2. The method of claim 1, wherein the causal relationship nodes further comprise an additional leaf node, wherein the associated node is an upper-layer node of both the leaf node and the additional leaf node, and wherein determining the associated node comprises:
determining a logical relationship between the leaf node and the additional leaf node based on the causal relationship information; and
if the logical relationship is determined to be OR, determining the associated node based on the leaf node.
3. The method of claim 1, wherein the associated nodes comprise a root node in the causal relationship information and at least one node between the leaf node and the root node.
4. The method of claim 1, wherein acquiring the data to be analyzed comprises:
acquiring operation data of the network card device from an operating system interacting with the network card device; and
acquiring log data related to the network card device by analyzing a log system of the operating system.
5. The method of claim 1, wherein determining the fault analysis result based at least on the fault indicated by the associated node comprises:
determining a fault graph based on a first fault indicated by the associated node and a second fault indicated by the leaf node, wherein the fault graph comprises the first fault, the second fault, and a causal relationship between the first fault and the second fault; and
determining the fault graph as the fault analysis result.
6. The method of claim 1, further comprising:
comparing the fault analysis result with an actual analysis result, the actual analysis result being determined by a user; and
applying the actual analysis result to a determination condition update model to determine an updated fault determination condition if the fault analysis result is different from the actual analysis result,
wherein the determination condition update model is trained based on a training data set comprising reference analysis results and labeled reference fault determination conditions.
7. The method of claim 1, wherein the causal relationship information is a fault tree for performing fault derivation.
8. An apparatus for fault analysis, comprising:
the acquisition module is configured to acquire data to be analyzed of the network card device;
a leaf node determination module configured to determine a leaf node corresponding to a fault determination condition in causal relationship information if it is determined that the data to be analyzed satisfies the fault determination condition, where the causal relationship information is used for analyzing a fault of the network card device, and each causal relationship node in the causal relationship information indicates a fault of the network card device;
an association node determination module configured to determine an association node of the leaf node in the causal relationship information, wherein the leaf node and the association node are both causal relationship nodes in the causal relationship information; and
a failure analysis result determination module configured to determine a failure analysis result for the network card device based on at least a failure indicated by the associated node.
9. The apparatus of claim 8, wherein the causal relationship nodes further comprise an additional leaf node, wherein the associated node is an upper-layer node of both the leaf node and the additional leaf node, and wherein the associated node determination module is further configured to:
determine a logical relationship between the leaf node and the additional leaf node based on the causal relationship information; and
if the logical relationship is determined to be OR, determine the associated node based on the leaf node.
10. The apparatus of claim 8, wherein the associated nodes comprise a root node in the causal relationship information and at least one node between the leaf node and the root node.
11. The apparatus of claim 8, wherein the acquisition module is further configured to:
acquire operation data of the network card device from an operating system interacting with the network card device; and
acquire log data related to the network card device by analyzing a log system of the operating system.
12. The apparatus of claim 8, wherein the fault analysis result determination module is further configured to:
determine a fault graph based on a first fault indicated by the associated node and a second fault indicated by the leaf node, wherein the fault graph comprises the first fault, the second fault, and a causal relationship between the first fault and the second fault; and
determine the fault graph as the fault analysis result.
13. The apparatus of claim 8, further comprising:
an analysis result comparison module configured to compare the fault analysis result with an actual analysis result, the actual analysis result being determined by a user; and
an update module configured to apply the actual analysis result to a determination condition update model to determine an updated fault determination condition if the fault analysis result is different from the actual analysis result,
wherein the determination condition update model is trained based on a training data set comprising reference analysis results and labeled reference fault determination conditions.
14. The apparatus of claim 8, wherein the causal relationship information is a fault tree for performing fault derivation.
15. An electronic device, comprising:
at least one computing unit;
at least one memory coupled to the at least one computing unit and storing instructions for execution by the at least one computing unit, the instructions, when executed by the at least one computing unit, causing the electronic device to perform the method of any of claims 1-7.
16. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-7.
17. A computer program product comprising computer executable instructions, characterized in that the computer executable instructions, when executed by a processor, implement the method according to any of claims 1-7.
CN202110522995.8A 2021-05-13 2021-05-13 Fault analysis method, apparatus, device, storage medium and program product Pending CN115348147A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110522995.8A CN115348147A (en) 2021-05-13 2021-05-13 Fault analysis method, apparatus, device, storage medium and program product


Publications (1)

Publication Number Publication Date
CN115348147A true CN115348147A (en) 2022-11-15

Family

ID=83946780



Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117234806A (en) * 2023-09-22 2023-12-15 深圳市联瑞电子有限公司 Automatic restarting method and system for network card
CN117234806B (en) * 2023-09-22 2024-04-30 深圳市联瑞电子有限公司 Automatic restarting method and system for network card

Similar Documents

Publication Publication Date Title
CN110493042B (en) Fault diagnosis method and device and server
WO2020001642A1 (en) Operation and maintenance system and method
US11348023B2 (en) Identifying locations and causes of network faults
US11706079B2 (en) Fault recovery method and apparatus, and storage medium
CN111507363A (en) Method, device and equipment for predicting fault of optical module
CN115118581B (en) Internet of things data all-link monitoring and intelligent guaranteeing system based on 5G
US10447561B2 (en) BFD method and apparatus
CN106209405A (en) Method for diagnosing faults and device
CN113516244B (en) Intelligent operation and maintenance method and device, electronic equipment and storage medium
CN116719664B (en) Application and cloud platform cross-layer fault analysis method and system based on micro-service deployment
CN115529595A (en) Method, device, equipment and medium for detecting abnormity of log data
KR20200128144A (en) Method and apparatus for determining the state of network devices
CN115348147A (en) Fault analysis method, apparatus, device, storage medium and program product
CN116723136B (en) Network data detection method applying FCM clustering algorithm
CN113468022A (en) Automatic operation and maintenance method for centralized monitoring of products
CN110609761B (en) Method and device for determining fault source, storage medium and electronic equipment
CN116974805A (en) Root cause determination method, apparatus and storage medium
CN112838942A (en) Network operation and maintenance method, electronic equipment and storage medium
CN112162528B (en) Fault diagnosis method, device, equipment and storage medium of numerical control machine tool
CN114139747A (en) AIOps intelligent operation and maintenance system based on artificial intelligence technology
KR20200063343A (en) System and method for managing operaiton in trust reality viewpointing networking infrastucture
CN110413431B (en) Intelligent identification early warning method for large data platform fault
Gao et al. The diagnosis of wired network malfunctions based on big data and traffic prediction: An overview
EP3772834B1 (en) A method of predicting the time course of a plurality of data relative to a telephony infrastructure for network function virtualization
US11477070B1 (en) Identifying root causes of network service degradation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination