WO2023273637A1 - Fault detection method and device - Google Patents

Fault detection method and device

Info

Publication number
WO2023273637A1
Authority
WO
WIPO (PCT)
Prior art keywords
component
components
node
parent
child
Prior art date
Application number
PCT/CN2022/092738
Other languages
English (en)
French (fr)
Inventor
董凌
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2023273637A1

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01R MEASURING ELECTRIC VARIABLES; MEASURING MAGNETIC VARIABLES
    • G01R31/00 Arrangements for testing electric properties; Arrangements for locating electric faults; Arrangements for electrical testing characterised by what is being tested not provided for elsewhere
    • G01R31/08 Locating faults in cables, transmission lines, or networks
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0677 Localisation of faults

Definitions

  • the present application relates to the field of computer technology, in particular to a fault detection method and device.
  • the present application provides a fault detection method and device, which are used for locating components that may have faults, so as to provide maintenance guidance to users and improve maintenance efficiency.
  • the embodiment of the present application provides a fault detection method, which can be executed by a fault detection device, and the method can be applied to computer equipment.
  • the fault detection device obtains a component topology diagram, which describes the components in the computer equipment and the connection relationships between them; determines whether other components that have a connection relationship, in the component topology diagram, with a first component reporting an error may fail; and outputs the second components that may fail, where the second components are a subset of the other components and the first component.
  • after detecting that the first component reports a fault, the fault detection device can detect, based on the component topology diagram, which of the components associated with the first component may be faulty, and output the fault detection result to provide maintenance guidance to the user. Since the second component may or may not have a fault sensor, the technical solution of the present application can improve maintenance efficiency without increasing hardware costs, and is applicable to a wider range of scenarios.
  • the component topology diagram is used to describe the hardware connection relationship between components using the same communication protocol.
  • outputting the second component includes: outputting the second component through a graphical interface. Either the graphical interface displays the component topology diagram, the component topology diagram includes a plurality of node identifiers in one-to-one correspondence with the components in the computer device, and the node identifier corresponding to the second component is highlighted; or the graphical interface displays a hardware physical map of the components of the computer device, the hardware physical map includes multiple controls in one-to-one correspondence with the components in the computer device, each control displays the hardware of one component, and the control corresponding to the second component is highlighted.
  • in this way, the components that may fail can be displayed to the user more intuitively. Further, if the components that may fail are displayed through the hardware physical map, the user can more quickly determine the physical location of these components, which improves the user experience.
  • the second component is determined by a neural network model, where the neural network model is used to determine, according to the component that reported the error, whether other components that have a connection relationship with that component may fail, and to determine the order of the components that may fail.
  • based on training data, the neural network model can continuously learn the rules for identifying other components that may fail given the component that reported the fault, as well as the rules for ordering multiple components that may fail.
  • the neural network model can adapt to different devices and application scenarios, and learn different detection rules and sorting rules, which is conducive to improving the accuracy of fault detection and has a wide range of applications.
  • the other components include an upstream component of the first component and a downstream component of the first component in the component topology diagram.
  • determining whether other components that have a connection relationship with the first component reporting an error in the component topology diagram may fail includes: for any component among the other components, if at least one next-level component of that component is detected as possibly faulty, determining that the component may be faulty.
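  • The rule above can be illustrated with a minimal Python sketch. The function name `may_be_faulty`, the `children` map and the `suspected` set are hypothetical names chosen for illustration; the application does not prescribe a data structure.

```python
from typing import Dict, List, Set


def may_be_faulty(component: str,
                  children: Dict[str, List[str]],
                  suspected: Set[str]) -> bool:
    """Return True if at least one next-level (child) component is suspected.

    `children` maps a component to its next-level components and
    `suspected` holds components already considered possibly faulty.
    Both structures are assumptions made for this sketch.
    """
    return any(child in suspected for child in children.get(component, []))


# Hypothetical three-level topology; component 000 has reported an error.
children = {"0": ["00", "01"], "00": ["000"], "01": []}
suspected = {"000"}
```

  • With this data, `may_be_faulty("00", children, suspected)` is true because child 000 is suspected, while component 01 has no suspected child.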
  • the number of the second components is greater than 1, and outputting the second components specifically includes: sorting probabilities of failures of the multiple second components; and outputting the sorted multiple second components.
  • the nodes that are more likely to fail can be arranged in front by sorting, so as to guide the maintenance sequence to the user and improve the maintenance efficiency of the user.
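  • As a rough illustration of this ordering step, the sketch below sorts candidate components by an assumed failure probability, highest first. The component names and probability values are invented for the example; how probabilities are actually assigned is not specified at this point in the text.

```python
from typing import List, Tuple


def rank_by_failure_probability(candidates: List[Tuple[str, float]]) -> List[str]:
    """Order (component, probability) pairs from most to least likely to fail."""
    ordered = sorted(candidates, key=lambda item: item[1], reverse=True)
    return [name for name, _ in ordered]


# Hypothetical candidates and probabilities, for illustration only.
ranking = rank_by_failure_probability(
    [("communication cable", 0.2), ("interface adapter card", 0.5), ("hard disk", 0.3)]
)
```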
  • the component set includes a parent component and one or more child components of the parent component, and the sorting includes: if the parent component has no sensor and the number of child components is greater than 1, determining that the probability of failure of the parent component is greater than the probability of failure of the child components.
  • the component set includes a parent component and one or more child components of the parent component, and the sorting includes: if the parent component has no sensor and the number of child components is equal to 1, determining that the probability of failure of the parent component is the same as the probability of failure of the child component.
  • the component set includes a parent component and one or more child components of the parent component, and the sorting includes: if the parent component has a sensor and the sensor of the parent component reports an error, determining that the probability of failure of the parent component is greater than the probability of failure of the child components.
  • the component set includes a parent component and one or more child components of the parent component, and the sorting includes: if the parent component has a sensor, the sensor of the parent component does not report an error, and the number of child components is greater than 1, determining that the probability of failure of the parent component is greater than the probability of failure of the child components.
  • the component set includes a parent component and one or more child components of the parent component, and the sorting includes: if the parent component has a sensor, the sensor of the parent component does not report an error, and the number of child components is equal to 1, determining that the probability of failure of the parent component is less than the probability of failure of the child component.
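  • The five parent/child ordering rules above can be summarised in one decision function. This is a sketch, not the claimed implementation: the names `Order` and `compare_parent_to_children` are hypothetical, and `num_children` is taken to be the number of child components in the component set, as the text states.

```python
from enum import Enum


class Order(Enum):
    PARENT_HIGHER = "parent > child"
    EQUAL = "parent == child"
    PARENT_LOWER = "parent < child"


def compare_parent_to_children(has_sensor: bool,
                               sensor_reports_error: bool,
                               num_children: int) -> Order:
    """Relative failure probability of a parent versus its child components."""
    if num_children < 1:
        raise ValueError("the component set needs at least one child component")
    if not has_sensor:
        # No sensor: several children point at the shared parent,
        # a single child is ranked equal to the parent.
        return Order.PARENT_HIGHER if num_children > 1 else Order.EQUAL
    if sensor_reports_error:
        # The parent's own sensor reports an error.
        return Order.PARENT_HIGHER
    # Sensor present but silent.
    return Order.PARENT_HIGHER if num_children > 1 else Order.PARENT_LOWER
```

  • Encoding the rules as a pure function keeps them easy to test and to replace, for example by the learned ordering that the neural network variant describes.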
  • outputting the second component includes: outputting the second component through a graphical interface; the graphical interface further includes a number used to indicate the sorting of the second component, and the number is located in a preset area.
  • the sorting result can be displayed for the user more intuitively, and the user experience can be improved.
  • the first component has a sensor; and the method further includes: determining that the first component has failed according to the sensor.
  • the embodiment of the present application also provides a fault detection device, which has the function of realizing the behavior in the method example of the first aspect above, and the beneficial effect can be referred to the description of the first aspect, and will not be repeated here.
  • the functions described above may be implemented by hardware, or may be implemented by executing corresponding software on the hardware.
  • the hardware or software includes one or more modules corresponding to the above functions.
  • the structure of the fault detection device includes an acquisition module, a determination module and an output module. These modules can perform the corresponding functions in the method example of the first aspect above. For details, refer to the detailed description in the method example, and details are not repeated here.
  • the present application also provides a fault detection device, the fault detection device includes a processor and a memory, and may also include a communication interface, and the processor executes the program instructions in the memory to perform the above-mentioned first aspect or The method provided by any possible implementation of the first aspect.
  • the fault detection device may be an independent module in computer equipment, such as a baseboard management controller (BMC).
  • the memory is coupled with the processor, and stores necessary program instructions and data during the fault detection process (such as storing component topology diagrams).
  • the communication interface is used for communicating with other devices.
  • the present application provides a computer-readable storage medium.
  • when the program stored in the computer-readable storage medium is executed by a computing device, the computing device performs the method provided by the first aspect or any possible implementation of the first aspect.
  • the program is stored in the storage medium.
  • the storage medium includes but not limited to volatile memory, such as random access memory, and nonvolatile memory, such as flash memory, hard disk drive (hard disk drive, HDD), and solid state drive (solid state drive, SSD).
  • the present application provides a program product for a computing device.
  • the program product includes computer instructions, and when the instructions are executed by a computing device, the computing device performs the method provided in the first aspect or any possible implementation of the first aspect.
  • the computer program product may be a software installation package; if the method provided in the first aspect or any possible implementation of the first aspect needs to be used, the computer program product may be downloaded and executed on a computing device.
  • the present application also provides a computer chip; the chip is connected to a memory, and the chip is used to read and execute the software program stored in the memory to implement the method provided in the first aspect or any possible implementation of the first aspect.
  • FIG. 1A is a schematic diagram of a possible system architecture provided by an embodiment of the present application.
  • FIG. 1B is a functional schematic diagram of a fault detection device 140 provided by an embodiment of the present application.
  • FIG. 2 is a schematic flow chart corresponding to the fault detection method provided in the embodiment of the present application.
  • FIG. 3 is a component topology diagram provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of the main detection process in the fault detection method provided by the embodiment of the present application.
  • FIG. 5 is a schematic diagram of the non-sensing node detection process in the fault detection method provided by the embodiment of the present application.
  • FIG. 6 is a schematic diagram of the detection process for a sensing node that does not report a fault in the fault detection method provided by the embodiment of the present application.
  • FIG. 7 is a schematic diagram of the training and correction process based on the neural network model provided by the embodiment of the present application.
  • FIG. 8 is a schematic diagram of a graphical interface provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of another graphical interface provided by the embodiment of the present application.
  • FIG. 10 is a schematic diagram of a third graphical interface provided by the embodiment of the present application.
  • FIG. 11 is a schematic diagram of the hardware structure of some components in the computer device 10 provided by the embodiment of the present application.
  • FIG. 12 is a component topology diagram of a computer device 10 provided by an embodiment of the present application.
  • FIG. 13 is a schematic structural diagram of a fault detection device provided by the present application.
  • the fault detection method provided by the present application can be applied to computer equipment.
  • when a component reports a fault, the method can check the components associated with that component based on the component topology diagram, discover the set of components that may be faulty, and output these components to guide users in performing maintenance and improve maintenance efficiency.
  • the computer equipment in this application includes but not limited to: server, storage equipment, computing equipment, user equipment (user equipment, UE) and so on.
  • UE includes desktop computers, notebook computers, tablet computers, mobile phones, handheld devices, in-vehicle devices, wearable devices, and more.
  • the embodiment of the present application does not limit the type and structure of the computer device, and any device with electronic components is applicable to the embodiment of the present application.
  • FIG. 1A is a schematic structural diagram of a computer device 10 provided by an embodiment of the present application.
  • the computer device 10 includes a processor 110 , a memory 120 , an external memory 130 , a fault detection device 140 , and a bus 150 .
  • the processor 110 , the memory 120 , the external memory 130 and the fault detection device 140 are connected through a bus 150 .
  • the processor 110 may be a central processing unit (CPU), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), an artificial intelligence (AI) chip, a system on chip (SoC), a complex programmable logic device (CPLD), a graphics processing unit (GPU), etc.
  • the memory 120 is an internal memory that exchanges data directly with the processor 110. It can be read and written at any time at high speed, and serves as temporary data storage for the operating system and other programs running on the processor 110.
  • the memory includes volatile memory, such as random access memory (RAM) and dynamic random access memory (DRAM); it may also include non-volatile memory, such as storage class memory (SCM), or a combination of volatile and non-volatile memory.
  • the external memory 130, which can also be referred to as auxiliary memory, can be a non-volatile memory, such as a read-only memory (ROM), a hard disk drive (HDD), or a solid-state drive (SSD).
  • fault sensors are integrated in some components of the computer device 10 .
  • for example, the CPU and the hard disks (such as HDD and SSD) in FIG. 1A each have their own fault sensors.
  • the fault sensor can be located inside the component to detect the running state of the component.
  • the running state includes normal operation and fault.
  • the fault sensor can indicate these two states through different values. For example, a fault sensor value of 1 indicates normal operation and a value of 0 indicates a fault.
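  • A minimal sketch of decoding such a sensor value, using the example encoding from the text (1 for normal operation, 0 for fault). The names `RunState` and `decode_sensor_value` are hypothetical, and a real sensor interface may use a different encoding.

```python
from enum import Enum


class RunState(Enum):
    NORMAL = 1  # example encoding from the text: 1 means normal operation
    FAULT = 0   # example encoding from the text: 0 means fault


def decode_sensor_value(value: int) -> RunState:
    """Map a raw sensor value to a running state.

    Raises ValueError for any value other than 0 or 1.
    """
    return RunState(value)
```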
  • upon detecting a component failure, the fault sensor generates a signal indicative of the component failure (hereinafter referred to as a fault signal).
  • the operating status indicator light on an electronic device is displayed as a green light when the device is operating normally, and as a red light when the device is operating abnormally.
  • in the following, a component whose fault sensor sends a fault signal is said to report a fault.
  • the fault detection device 140 is a management subsystem running independently in the computer device 10, which can obtain fault signals of other components in the computer device 10 to execute the fault detection method provided in the embodiment of the present application.
  • the fault detection device 140 may be a new component integrated in the computer device 10, and the new component has the function of the fault detection method provided in the embodiment of the present application.
  • the fault detection device 140 may also be an existing component in the computer device 10 that has the function of the fault detection method provided by the embodiment of the present application, such as a BMC, which is a key component of a server and is an independent management subsystem running in the server.
  • as a platform management system, the BMC has a series of monitoring and control functions. Its hardware is the first component to power on on the server mainboard, and it is an out-of-band management system. The following describes the fault detection device 140 by taking the BMC as an example with reference to FIG. 1A and FIG. 1B.
  • the fault detection device 140 includes a processor 151, a memory 152, an external storage 153, and a communication interface 154, wherein the processor 151, the memory 152, the external storage 153, and the communication interface 154 are connected by a bus.
  • the processor 151 is configured to process and calculate data, for example, the processor 151 may run the fault detection method provided in the embodiment of the present application.
  • the processor 151 is similar to the processor 110, for example, the processor 151 may be a CPU, ASIC, FPGA, AI chip, SoC, CPLD, or GPU.
  • an operating system runs on the processor 151, and the operating system may be X86, Arm, UNIX, a lightweight system or a custom operating system, etc., which is not limited in this embodiment of the present application.
  • the operating system running on the processor 151 and the operating system running on the processor 110 are independent of each other. That is to say, when a component in the computer device 10 fails, such as the processor 110 fails, it will not affect the failure detection device 140 .
  • the memory 152 refers to an internal memory directly exchanging data with the processor 151. It can read and write data at any time, and the speed is very fast. It is used as a temporary data storage for the operating system running on the processor 151 or other running programs.
  • the component topology diagram of the computer device 10 may be stored in the memory 152 , and the component topology diagram of the computer device 10 may be obtained from the memory 152 when the processor 151 executes the fault detection method provided in the present application.
  • the external memory 153 is used to provide storage resources, which can be non-volatile memory, such as ROM, HDD, SSD, flash memory, etc.
  • a BMC generally uses flash memory chips in place of a hard disk. Unlike the memory, the external storage has a slower read and write speed and is usually used to store data persistently.
  • the component topology diagram of the computer device 10 can also be persistently stored in the external storage 153; when the processor 151 executes the fault detection method provided in this application, the component topology diagram can be loaded from the external storage 153 into the memory 152, and the processor 151 acquires the component topology diagram from the memory 152.
  • the communication interface 154 is used for communicating with other components in the computer device 10 or devices outside the computer device 10 .
  • the processor 151 acquires fault signals generated by fault sensors of the processor 110 , the memory 120 , and the external memory 130 through the communication interface 154 .
  • the processor 151 may send the detection result to the display device through the communication interface 154 .
  • the display device is a user-side device, as shown in FIG. 1B.
  • the display device can be, for example, a web page display of the BMC; a mobile terminal device such as a mobile phone or a tablet computer; a device running specific software such as tool software, network management software, or cloud operation and maintenance software; or any of various displays such as a liquid crystal display (LCD) or a light emitting diode (LED) screen. The embodiment of this application does not limit the display device; any device with a display screen is applicable.
  • the processor 151 can also communicate with other processors, such as an artificial intelligence engine. The artificial intelligence engine can be deployed inside or outside the computer device 10 and can be used to assist the fault detection device 140 in executing the fault detection method provided by the embodiment of the present application, which is introduced in detail below.
  • FIG. 1A only shows a small number of components of the computer device 10 to keep it simple.
  • the computer device 10 may have more components than those shown in FIG. 1A.
  • the computer device 10 may also include a network card, a motherboard, etc.
  • the computer device 10 may also have fewer components, and the embodiment of the present application does not limit the structure of the computer device 10 .
  • FIG. 2 is a schematic flowchart corresponding to the fault detection method provided by the embodiment of the present application.
  • the method can be executed by the fault detection device 140 (or the processor 151 ) in FIG. 1A , and is used for detecting components in the computer device 10 that may have faults.
  • in the following, the fault detection device 140 is assumed to be a BMC by way of example. That is to say, the BMC hereinafter can be replaced by the fault detection device 140.
  • the method includes the following steps:
  • in step 201, the BMC obtains the component topology diagram of the computer device 10.
  • the component topology diagram of the computer device 10 is used to describe the components in the computer device 10 and the connection relationship between the components.
  • the connection relationship may be a logical connection relationship between components, or a physical connection relationship between components.
  • the component topology diagram of the software system is generated based on the logical connection relationship, while the component topology diagram of the hardware system is generated based on the physical connection relationship between components.
  • each node in the component topology diagram is used to represent a replaceable component, such as CPU, memory, hard disk, network card, communication cable, interface adapter card, etc. .
  • the component may be a component with a fault sensor or a component without a fault sensor.
  • CPU, memory, hard disk, etc. have fault sensors, and communication cables, interface adapter cards, etc. do not have fault sensors.
  • in the following, nodes with fault sensors in the component topology diagram are called sensing nodes, and nodes without fault sensors are called non-sensing nodes.
  • Each component in the component topology diagram can be represented by a node identifier, which is used to uniquely identify a component.
  • the node ID can be composed of one or more items such as numbers and letters.
  • the number of digits in a node identifier can also indicate the level of the component in the component topology diagram. For example, a single-digit node identifier indicates that the node is at the first level, and the node at the first level in the component topology diagram is the root node.
  • a node ID of two digits indicates that the node is at the second level
  • a node ID of three digits indicates that the node is at the third level, and so on.
  • the first level is the upper level of the second level
  • the second level is the upper level of the third level
  • the second level is the next level of the first level
  • the third level is the next level of the second level, and so on.
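  • Under this example convention, a node's level is simply the length of its identifier. The helper names below are hypothetical, and the sketch only applies to identifier schemes that encode the level this way.

```python
def node_level(node_id: str) -> int:
    """Level of a node under the example scheme: one character per level."""
    if not node_id:
        raise ValueError("node identifier must not be empty")
    return len(node_id)


def is_next_level(upper_id: str, lower_id: str) -> bool:
    """True if `lower_id` sits exactly one level below `upper_id`."""
    return node_level(lower_id) == node_level(upper_id) + 1
```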
  • the upstream nodes of node A include all nodes passed by the path from the root node to node A, and the downstream nodes of node A include the child nodes of node A, the child nodes of node A's child nodes, etc., until the end node.
  • the associated nodes of node A include the upstream nodes of node A and the downstream nodes of node A.
  • FIG. 3 is a schematic diagram of a component topology diagram provided by an embodiment of the present application.
  • the node identifier of the root node is 0, and the node identifiers of the child nodes of the root node 0 are 00, 01, ..., 0i in sequence, where i takes a positive integer.
  • the child nodes of node 00 are 000, 001, 002, ..., 00j, where j takes a positive integer.
  • the parent node of node 0000 is 000
  • the child nodes of node 0000 include node 00000, and may also include node 00001, node 00002, etc. (not shown in FIG. 3 ).
  • the upstream nodes of node 0000 include node 000, node 00, and node 0.
  • the downstream nodes of node 0000 include node 00000 and so on.
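  • With the identifiers of FIG. 3, upstream and downstream nodes can be derived from the identifiers alone, because each child identifier extends its parent's identifier by one character. The sketch below relies on that assumption (so it only covers up to ten children per node when digits are used); the function names and the sample identifier list are hypothetical.

```python
from typing import List


def upstream_nodes(node_id: str) -> List[str]:
    """All ancestors of a node, nearest first, taken as proper prefixes of its ID."""
    return [node_id[:k] for k in range(len(node_id) - 1, 0, -1)]


def downstream_nodes(node_id: str, all_ids: List[str]) -> List[str]:
    """All descendants of a node: IDs that extend this node's ID."""
    return [n for n in all_ids if n != node_id and n.startswith(node_id)]


def associated_nodes(node_id: str, all_ids: List[str]) -> List[str]:
    """Upstream nodes followed by downstream nodes."""
    return upstream_nodes(node_id) + downstream_nodes(node_id, all_ids)


# Subset of the identifiers used in the FIG. 3 example.
ids = ["0", "00", "01", "000", "001", "0000", "00000"]
```

  • For node 0000 this yields the upstream nodes 000, 00 and 0 and the downstream node 00000, matching the example in the text.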
  • the above scheme for constructing node identifiers is only an example. Any node identifier that uniquely represents a component is applicable to the embodiment of this application, and the embodiment does not limit the function of the node identifier: it need not express the level, and it may or may not indicate, through shared characters, whether nodes are associated.
  • for example, the node identifier of the root node is abc,
  • the node identifier of a child node of the root node is def, and so on. Since the above scheme for constructing node identifiers is easy to understand and remember, the following description continues to use it.
  • the component topology map is introduced above.
  • in step 201, there are many ways for the BMC to obtain the component topology diagram. For example, it can be generated based on a configuration script imported or burned into the BMC.
  • the configuration script describes the node identifiers of the components included in the component topology diagram and the connection relationships between nodes. The configuration script may also include the correspondence between components and node identifiers, as well as other information about the nodes, such as whether they are sensing nodes.
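  • The application does not specify the format of the configuration script. Purely for illustration, the sketch below assumes a JSON document with a hypothetical `nodes` list whose entries carry `id`, `component`, `parent` and `has_sensor` fields, and builds an in-memory topology from it. The component names and sensor flags are example values, not taken from the figures.

```python
import json
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical configuration text; the real script format is not specified.
CONFIG_TEXT = """
{
  "nodes": [
    {"id": "0",   "component": "motherboard",            "parent": null, "has_sensor": true},
    {"id": "00",  "component": "interface adapter card", "parent": "0",  "has_sensor": false},
    {"id": "000", "component": "hard disk",              "parent": "00", "has_sensor": true}
  ]
}
"""


@dataclass
class Node:
    node_id: str
    component: str
    parent: Optional[str]
    has_sensor: bool  # True for a sensing node, False for a non-sensing node


def load_topology(text: str) -> Dict[str, Node]:
    """Parse the assumed JSON format into a node-ID-to-Node mapping."""
    topology: Dict[str, Node] = {}
    for entry in json.loads(text)["nodes"]:
        node = Node(entry["id"], entry["component"], entry["parent"], entry["has_sensor"])
        if node.node_id in topology:
            raise ValueError(f"duplicate node identifier: {node.node_id}")
        topology[node.node_id] = node
    return topology


topology = load_topology(CONFIG_TEXT)
```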
  • in step 202, the BMC detects that a first node reports a fault.
  • the first node here can be any sensing node in the component topology diagram.
  • the BMC detecting that the first node reports a fault means that the BMC obtains the fault signal generated by the fault sensor of the first node.
  • the BMC may detect a failure of the processor 110 , the memory 120 or the external memory 130 .
  • the BMC may detect that one or more nodes report faults.
  • the following description takes a single node reporting a fault as an example.
  • in step 203, the BMC detects, based on the component topology diagram, second nodes that may be faulty.
  • when the BMC detects that the first node reports a fault, it searches for the associated nodes of the first node (including the upstream nodes and the downstream nodes of the first node) based on the component topology diagram, and detects whether any of them may be faulty (such a node is a second node). It should be understood that a second node here represents a detected node that may be faulty; therefore, there may be one or more second nodes, that is, this step may detect one or more second nodes. Furthermore, those skilled in the art will know that even if a fault sensor reports a fault, the fault is not necessarily in the node itself.
  • therefore, each node that may have a fault is assigned a failure probability here (similar descriptions below are not repeated). The cause of the fault can then be determined based on the failure probability.
  • for example, when the BMC detects that the first node reports a fault and the first node is node mno, the BMC takes node mno as the target node and then traces upward based on the component topology diagram to detect whether the parent node mn of the target node mno may be faulty.
  • different detections are performed depending on whether the parent node mn is a sensing node or a non-sensing node. If the parent node mn is a non-sensing node, the non-sensing node detection method provided by the embodiment of this application is performed.
  • if the parent node is a sensing node, the BMC further judges whether the parent node reports a fault; if not, it executes the detection method, provided by the embodiment of this application, for a sensing node that does not report a fault. If a fault phenomenon related to the parent node is found through the above tests, the parent node mn is considered possibly faulty as well, and the parent node mn is more likely to be faulty than node mno; that is, the fault of node mno may be caused by the parent node mn.
  • in this case, the parent node mn is a second node. Afterwards, node mn is used as the new target node, and the above process is repeated until all associated nodes of node mno have been detected.
  • FIG. 4 is a schematic diagram of the main detection process provided by the embodiment of the present application. As shown in Figure 4, the method includes the following steps:
  • Step 401: a node that may be faulty is detected; if the node reports a fault, the node is used as the target node.
  • The highlighting here may mean recording or marking the node as a node that may have a fault, or visually highlighting the node in a graphical interface that contains the component topology diagram; similar descriptions are not repeated below.
  • x here is a reference value and can take any value.
  • highlighted nodes represent nodes that may have a fault (phenomenon)
  • unhighlighted nodes represent nodes that are considered not to have a fault.
  • Step 403: search for the parent node of the target node based on the component topology diagram.
  • Step 404: judge whether the parent node is a sensing node. If yes, execute the sensing node detection process (continue with step 405 to step 408); otherwise, execute the non-sensing node detection process (see the process shown in FIG. 5), that is, jump to step 501.
  • Step 405: judge whether the parent node reports a fault. If yes, execute step 406; otherwise, execute the detection process for a sensing node that does not report a fault (see the method flow shown in FIG. 6), that is, jump to step 601.
  • If the parent node reports a fault, it is determined that the parent node may be faulty and that the fault of the child node may be caused by the parent node; therefore, the failure probability of the parent node is greater than that of the child node.
  • The embodiments of the present application are not limited to +1; any algorithm that can express the relative magnitude of the two failure probabilities is applicable.
  • Step 407: judge whether the parent node is the root node. If yes, execute step 408; otherwise, use the parent node as the new target node and return to step 403.
  • Step 408: each node that may have a fault and the failure probability of each node are obtained.
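  • The upward trace of the FIG. 4 process can be sketched as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the `Node` class, its field names, and the integer score that stands in for the failure probability (starting at x = 1) are all hypothetical. The sketch covers only the branch in which each parent is a sensing node that reports a fault; it stops where the FIG. 5 or FIG. 6 process would take over.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Node:
    """Hypothetical node record; field names are illustrative only."""
    node_id: str
    parent: Optional["Node"] = None
    has_sensor: bool = True       # True for a sensing node
    reports_fault: bool = False   # True if its fault sensor raised a fault


def trace_upward(target: Node, x: int = 1) -> Dict[str, int]:
    """Walk from the fault-reporting node toward the root.

    Returns {node_id: score}; a larger score stands for a larger
    failure probability. The walk stops at the first parent that
    would need the FIG. 5 or FIG. 6 process (not covered here).
    """
    scores = {target.node_id: x}           # highlight the target node
    current = target
    while current.parent is not None:      # step 407: stop at the root
        parent = current.parent            # step 403: find the parent
        if not (parent.has_sensor and parent.reports_fault):
            break                          # steps 404/405: other branches
        # A reporting parent is scored above its child (the "+1" rule).
        scores[parent.node_id] = scores[current.node_id] + 1
        current = parent                   # parent becomes the new target
    return scores


m = Node("m", reports_fault=True)
mn = Node("mn", parent=m, reports_fault=True)
mno = Node("mno", parent=mn, reports_fault=True)
print(trace_upward(mno))  # {'mno': 1, 'mn': 2, 'm': 3}
```

  • The "+1" increment is only one choice; as noted above, any scoring that keeps the parent's value above the child's would serve the same purpose.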
  • FIG. 5 is a schematic diagram of a non-sensing node detection process provided by an embodiment of the present application. As shown in Figure 5, the method includes the following steps:
  • Step 501: find the child nodes of the non-sensing node based on the component topology diagram.
  • Step 502: judge whether the number of child nodes of the non-sensing node is ≥ 1. If it is ≥ 1, that is, the non-sensing node has at least one child node, perform step 503; if it is < 1, that is, the non-sensing node has no child nodes, perform step 509.
  • Step 503: traverse the child nodes [0, 1, 2...] of the non-sensing node one by one.
  • The child nodes [0, 1, 2...] shown here are only for illustration; this does not mean that the non-sensing node must have child nodes, nor that it has at least 3 child nodes.
  • The non-sensing node may also have only 1 child node, 2 child nodes, or more, which is not limited in this embodiment of the present application. Similar descriptions are not repeated below.
  • The traversal of the child nodes of the non-sensing node is executed multiple times, for example in parallel by multiple threads or serially by one thread. Regardless of whether the last child node goes through the sensing node detection process (step 503 to step 506) or the detection process for a sensing node that does not report a fault, once detection of the last child node is complete, jump to step 507.
  • When traversing, a child node can be selected as the current child node in the order of the node identifiers of the child nodes.
  • Step 504: judge whether the current child node is a sensing node. If yes, execute step 505; otherwise, return to step 501.
  • step 504 can be used to perform iterative detection.
  • For example, the child nodes of node mno include node mno0 and node mno1.
  • Node mno0 is traversed first, and it is determined whether node mno0 is a sensing node.
  • If so, continue to detect whether node mno0 reports a fault (refer to step 505). If node mno0 is a non-sensing node, return to step 501: find the child nodes of node mno0, for example node mno00 and node mno01, and continue with the subsequent process, such as detecting the number of child nodes of node mno00, and so on. Details are not repeated below.
  • FIG. 5 shows the steps of a single iteration; in other words, the current child node in step 505 is the same node as the current child node in step 504.
  • Step 505: check whether the current child node reports a fault. If yes, execute step 506; otherwise, execute the detection process for a sensing node that does not report a fault (see the method flow shown in FIG. 6), that is, jump to step 601.
  • The current child node is a child node of the parent node of the target node in step 401, that is, a sibling node of the target node. Therefore, when the current child node reports a fault, its failure probability is equal to that of the target node.
  • Step 507: after the traversal of the child nodes of the non-sensing node is completed, determine the number of highlighted child nodes among the child nodes [0, 1, 2...] of the non-sensing node, and judge whether the number is ≥ 1. If it is ≥ 1, execute step 508; if it is < 1, that is, none of the child nodes is highlighted, the non-sensing node is not highlighted.
  • If the non-sensing node has only one highlighted child node, the fault of the highlighted child node may be caused by a fault of the non-sensing node; therefore, the non-sensing node may also have a fault. Since the non-sensing node has no fault sensor, it cannot be clearly determined whether it has a fault. Therefore, its failure probability can be set equal to the failure probability of the child node.
  • If the non-sensing node has at least two highlighted child nodes, the faults of these child nodes are likely to be caused by a fault of the non-sensing node; therefore, the non-sensing node may also have a fault, and its failure probability is higher than that of the child nodes.
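  • The scoring rule for a non-sensing node described above can be sketched as follows. This is a hedged illustration: the function name `score_non_sensing`, the use of `None` for a child that is not highlighted, and the `+1` increment used to express "higher than the child nodes" are assumptions made for the example.

```python
from typing import List, Optional


def score_non_sensing(child_scores: List[Optional[int]]) -> Optional[int]:
    """Score a non-sensing node from the scores of its child nodes.

    child_scores holds one entry per child: the child's score if the
    child is highlighted, or None if it is not highlighted.
    Returns None when the non-sensing node itself stays unhighlighted.
    """
    highlighted = [s for s in child_scores if s is not None]
    if not highlighted:
        return None                  # step 507: no highlighted child
    if len(highlighted) == 1:
        return highlighted[0]        # one child: equal failure probability
    return max(highlighted) + 1      # two or more: higher than the children


print(score_non_sensing([None, None]))   # None
print(score_non_sensing([2, None]))      # 2
print(score_non_sensing([1, 1, None]))   # 2
```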
  • FIG. 6 is a schematic flow diagram of a detection process in which a sensing node does not report a fault provided by an embodiment of the present application. As shown in Figure 6, the method includes the following steps:
  • Step 601: based on the component topology diagram, find the child nodes [0, 1...] of the sensing node that has not reported a fault.
  • Step 602: judge whether the number of child nodes of the sensing node is ≥ 1. If it is ≥ 1, that is, the sensing node has at least one child node, perform step 603; otherwise, that is, the sensing node has no child nodes, the sensing node is not highlighted.
  • Step 603: traverse the child nodes [0, 1...] of the sensing node one by one.
  • The traversal of the child nodes of the sensing node is executed multiple times, and it can be performed in parallel by multiple threads or serially by one thread.
  • Regardless of whether the last child node goes through the sensing node detection process (step 603 to step 605) or the non-sensing node detection process, once detection of the last child node is complete, jump to step 607.
  • When traversing, a child node can be selected as the current child node in the order of the node identifiers of the child nodes.
  • Step 604: judge whether the current child node is a sensing node. If yes, execute step 605; otherwise, execute the non-sensing node detection process (see the method flow shown in FIG. 5), that is, jump to step 501.
  • Step 605: judge whether the current child node reports a fault. If yes, execute step 606; otherwise, use the child node as a new sensing node that has not reported a fault, and return to step 601.
  • For example, the child nodes of node mn include node mn0 and node mn1.
  • Node mn0 is traversed first, and it is determined whether node mn0 is a sensing node. If so, continue to detect whether node mn0 reports a fault (see step 605). If node mn0 does not report a fault, return to step 601: find the child nodes of node mn0, for example node mn00 and node mn01, and continue with the subsequent process, such as detecting the number of child nodes of node mn00, and so on. Details are not repeated below.
  • FIG. 6 shows the steps of a single iteration; in other words, the current child node in step 606 is the same node as the current child node in step 605.
  • Refer to the explanation of step 506, which is not repeated here.
  • Step 607: after the traversal of the child nodes of the sensing node is completed, determine the number of highlighted child nodes among the child nodes [0, 1...] of the sensing node, and judge whether the number is ≥ 1. If it is ≥ 1, execute step 608; if it is < 1, that is, the number of highlighted child nodes is 0, the sensing node is not highlighted.
  • Since the sensing node has a highlighted child node, that is, a child node with a fault phenomenon, the fault of the child node may be caused by a fault of the sensing node. Therefore, even if the sensing node does not report a fault, it can still be considered possibly faulty, but its failure probability is lower than that of the child node that reports the fault.
  • the non-sensing node may also have a fault, And its failure probability is higher than that of child nodes.
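  • The downward check of FIG. 6 for a sensing node that has not reported a fault can be sketched recursively. This is a simplified sketch under stated assumptions: every descendant is treated as a sensing node (the jump to the FIG. 5 process in step 604 is omitted), the `Node` class and the function name `check_silent` are hypothetical, and halving the child's score is just one arbitrary way to express "lower than that of the child node".

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical sensing node; field names are illustrative only."""
    node_id: str
    reports_fault: bool = False
    children: List["Node"] = field(default_factory=list)


def check_silent(node: Node, x: float = 1.0) -> Optional[float]:
    """Return a score for a sensing node that did not report a fault,
    or None if the node should stay unhighlighted."""
    child_scores: List[float] = []
    for child in node.children:              # step 603: traverse children
        if child.reports_fault:              # step 605: child reports
            child_scores.append(x)           # step 606: highlight child
        else:
            score = check_silent(child, x)   # child becomes the new
            if score is not None:            # silent node (step 601)
                child_scores.append(score)
    if not child_scores:                     # steps 602/607: nothing found
        return None
    return min(child_scores) / 2             # lower than any highlighted child


mn00 = Node("mn00", reports_fault=True)
mn0 = Node("mn0", children=[mn00])
mn1 = Node("mn1")
mn = Node("mn", children=[mn0, mn1])
print(check_silent(mn))  # 0.25
```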
  • In practice, the BMC may detect that multiple nodes report faults, and each fault-reporting node can be used as a target node in the above process.
  • In addition, other highlighted nodes may be determined in the above process, and the BMC can use a highlighted node as a new target node and continue to trace the possibly faulty nodes related to it.
  • When the above process is executed repeatedly, a node that has already been detected and has a failure probability value does not need to be detected again. Specific examples are given below with reference to embodiments.
  • In the above manner, based on the first node that reports a fault, the BMC can detect one or more second nodes that may be faulty among the nodes associated with the first node.
  • For ease of description, the first node and the one or more second nodes are referred to below as faulty nodes.
  • Step 204: the BMC outputs a fault detection result.
  • the BMC may output faulty nodes whose failure probabilities exceed a preset threshold, that is, the fault detection results include faulty nodes whose failure probabilities exceed a preset threshold.
  • the preset threshold ⁇ 0.
  • The BMC can also sort the faulty nodes included in the fault detection result and output the sorted faulty nodes.
  • For ease of description, the sorted faulty nodes are referred to below as a fault sequence. The fault sequence can be used to indicate a maintenance order to the user. Since the faults of some nodes may be caused by other nodes, repairing one node may clear the faults of other nodes, which improves maintenance efficiency.
  • The BMC can sort the faulty nodes according to their sorting variables to obtain the fault sequence.
  • The sorting variables include, but are not limited to, the failure probability f1, a sorting variable f2 indicating maintenance difficulty, and a sorting variable f3 indicating failure rate. The failure probability f1 of a faulty node is determined in step 203. The sorting variable f2 indicates how difficult it is to repair or replace the node.
  • The value of the sorting variable f2 of each node can be preset. In practice, it can be determined based on factors such as the installation location, volume, working environment, cost price, and maintenance price of the hardware corresponding to the node.
  • The sorting variable f3 refers to the failure rate of the node.
  • The failure rate can be the inherent failure rate of the node obtained through experimental testing, or it can be collected statistically during operation. In the former case, the sorting variable f3 of the node is a preset value; in the latter case, the sorting variable f3 of the node can change. It should be noted that the sorting variables listed above are only examples and are not limiting in this embodiment of the present application; any factor related to node failure or maintenance difficulty can be used as a sorting variable.
  • Each node can have one or more sorting variables.
  • The BMC can sort the faulty nodes according to the sorting variables of each faulty node. Several sorting methods are listed below.
  • Sorting method 1: sort by failure probability.
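  • Filtering by the preset threshold and then sorting by failure probability can be sketched as follows. This is a minimal sketch: the function name `fault_detection_result`, the dictionary layout, and the sample probabilities are assumptions made for illustration, and "exceed" is read here as strictly greater than the threshold.

```python
from typing import Dict, List


def fault_detection_result(probabilities: Dict[str, float],
                           threshold: float = 0.0) -> List[str]:
    """Keep faulty nodes whose failure probability exceeds the
    threshold, then order them from most to least likely."""
    kept = {node: p for node, p in probabilities.items() if p > threshold}
    return sorted(kept, key=kept.get, reverse=True)


sample = {"m": 0.8, "mn": 0.2, "mn0": 0.6}   # illustrative values only
print(fault_detection_result(sample, threshold=0.5))  # ['m', 'mn0']
```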
  • Sorting method 2: weighted sorting.
  • Each sorting variable is given a preset weight value.
  • The BMC can determine the composite fault value of each faulty node through the following formula 1:
  • y represents the composite fault value
  • fi represents the sorting variable fi
  • wi represents the weight value of the sorting variable fi.
  • i is a positive integer.
  • Each faulty node may have one or more sorting variables, and the sets of sorting variables of different faulty nodes may be identical, entirely different, or partly different. For example, the sorting variables of node mn include f1 and f2, and the sorting variables of node m include f1 and f3, which is not limited in this embodiment of the present application.
  • The BMC calculates the composite fault value of each faulty node through formula 1 above and sorts the faulty nodes by composite fault value in descending order to obtain the fault sequence.
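  • The weighted sorting can be sketched as follows. Formula 1 itself is not reproduced in this text, so the sketch assumes that the per-node value y is the weighted sum y = Σ wi·fi over the sorting variables that the node actually has; this reading follows from the definitions of y, fi, and wi but is an assumption. The function names, the weight values, and the sample data are hypothetical.

```python
from typing import Dict, List


def composite_value(variables: Dict[str, float],
                    weights: Dict[str, float]) -> float:
    """Assumed form of formula 1: y = sum of w_i * f_i over the
    sorting variables present on this node."""
    return sum(weights[name] * value for name, value in variables.items())


def weighted_fault_sequence(nodes: Dict[str, Dict[str, float]],
                            weights: Dict[str, float]) -> List[str]:
    """Order faulty nodes by composite value, largest first."""
    return sorted(nodes,
                  key=lambda n: composite_value(nodes[n], weights),
                  reverse=True)


weights = {"f1": 0.5, "f2": 0.25, "f3": 0.25}   # illustrative weights
nodes = {
    "m": {"f1": 0.8, "f3": 0.2},    # node m has f1 and f3
    "mn": {"f1": 0.2, "f2": 0.4},   # node mn has f1 and f2
}
print(weighted_fault_sequence(nodes, weights))  # ['m', 'mn']
```

  • A node that lacks a sorting variable simply contributes nothing for that variable in this sketch; other conventions are equally possible.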
  • Sorting method 3: priority sorting.
  • Each sorting variable is given a preset priority, that is, a priority order. The BMC can first sort the faulty nodes by the value of the sorting variable with the highest priority; if several nodes have equal values, it continues to sort them by the value of the sorting variable with the second priority, and so on.
  • For example, the priority order of f1, f2, and f3 is f1 > f2 > f3.
  • The BMC first sorts the faulty nodes by their f1 values; if several nodes have the same f1 value, it continues to sort those nodes by their f2 values, and so on until all nodes are sorted.
  • For example, the sorting variables of node m include: f1 (value 0.8), f2 (value 0.6), f3 (value 0.2);
  • the sorting variables of node mn include: f1 (value 0.2), f2 (value 0.4);
  • the sorting variables of node mn0 include: f1 (value 0.6), f3 (value 0.1);
  • the sorting variables of node mn01 include: f1 (value 0.2), f2 (value 0.9).
  • The BMC first sorts by the value of f1, which has the highest priority, and determines that m > mn0. Since the f1 values of mn and mn01 are equal, it continues to sort these two nodes by the value of f2, which has the second priority, and determines that the fault sequence is m>mn0>mn01>mn. It should be noted that the above values are only examples and are not intended to reflect values that would occur in practice.
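  • The priority sorting of this example can be reproduced with a tuple sort key. This is a sketch, not the BMC implementation: the function name is hypothetical, and treating a missing sorting variable as 0.0 is an assumption that the text does not specify.

```python
from typing import Dict, List

PRIORITY = ("f1", "f2", "f3")   # priority order f1 > f2 > f3


def priority_fault_sequence(nodes: Dict[str, Dict[str, float]]) -> List[str]:
    """Sort by f1 first; ties are broken by f2, then by f3."""
    def key(node_id: str):
        # Assumption: a missing sorting variable counts as 0.0.
        return tuple(nodes[node_id].get(name, 0.0) for name in PRIORITY)
    return sorted(nodes, key=key, reverse=True)


nodes = {
    "m": {"f1": 0.8, "f2": 0.6, "f3": 0.2},
    "mn": {"f1": 0.2, "f2": 0.4},
    "mn0": {"f1": 0.6, "f3": 0.1},
    "mn01": {"f1": 0.2, "f2": 0.9},
}
print(priority_fault_sequence(nodes))  # ['m', 'mn0', 'mn01', 'mn']
```

  • Python compares tuples element by element, which matches the rule of moving to the next-priority variable only when the higher-priority values are equal.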
  • Sorting method 4: sorting based on a neural network model.
  • The BMC can use a neural network model to assist in deciding the fault sequence and in training and correcting it. That is, the model can be used to determine the fault sequence, and can also be used to train on and correct a fault sequence determined as described above.
  • For example, the fault sequence determined by the BMC is m>mn0>mn01>mn.
  • The fault sequence can be output to the user as the fault detection result to indicate the maintenance order. For example, the user first checks node m based on this fault sequence. If the faults of the other nodes are resolved after node m is repaired, node m is confirmed as the cause of the fault.
  • If the fault is not resolved, the user continues with the next node in the fault sequence, node mn0. In the same way, if the fault is resolved after node mn0 is repaired, node mn0 is confirmed as the cause of the fault; if not, the user continues with the next node, and so on.
  • FIG. 7 shows a schematic flow diagram of the training and correction, which includes:
  • Step 701: select the first node in the fault sequence.
  • Step 702: based on the inspection result, judge whether the faults of the other nodes are resolved after the node is repaired. If so, execute step 703; otherwise, execute step 704.
  • Step 703: the confidence of the node is increased by 1.
  • Step 704: the confidence of the node is decreased by 1.
  • Step 705: determine whether the node is the last node in the fault sequence. If yes, end the process; otherwise, execute step 706.
  • Step 706: select the next node in the fault sequence in order, and return to step 702.
  • For example, the first node in the fault sequence is node m. If the inspection result indicates that node m is the cause of the fault, the confidence of node m in this scenario is increased by 1, and the confidence of the other nodes remains unchanged. As another example, if the inspection result indicates that node mn0 is the cause of the fault, that is, the order of the above fault sequence is wrong, the confidence of node m in this scenario is decreased by 1 and the confidence of node mn0 is increased by 1.
  • As a further example, if the inspection result indicates that node mn01 is the cause of the fault, the confidence of node m in this scenario is decreased by 1, the confidence of node mn0 is decreased by 1, the confidence of node mn01 is increased by 1, and the confidence of the other nodes remains unchanged, and so on.
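  • The confidence update in these examples can be sketched as follows. The sketch follows the worked examples, in which nodes after the confirmed cause keep their confidence unchanged, so the loop stops once the cause is reached. The function name, the plain dictionary used to hold confidences, and the default starting confidence of 0 are assumptions; keeping a separate dictionary per scenario is left out for brevity.

```python
from typing import Dict, List


def update_confidence(fault_sequence: List[str],
                      confirmed_cause: str,
                      confidence: Dict[str, int]) -> Dict[str, int]:
    """Apply one round of training correction for one scenario.

    Nodes checked before the confirmed cause lose 1, the confirmed
    cause gains 1, and later nodes are left unchanged.
    """
    for node_id in fault_sequence:
        if node_id == confirmed_cause:
            confidence[node_id] = confidence.get(node_id, 0) + 1
            break
        confidence[node_id] = confidence.get(node_id, 0) - 1
    return confidence


sequence = ["m", "mn0", "mn01", "mn"]
print(update_confidence(sequence, "mn01", {}))
# {'m': -1, 'mn0': -1, 'mn01': 1}
```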
  • The scenario referred to here includes two conditions: 1) the node that first reports the fault, that is, the fault-reporting node detected in step 401; and 2) the fault sequence obtained from the fault detection triggered by that node.
  • For example, node mn01 reports a fault and the fault sequence is m>mn0>mn01>mn; together these form one complete scenario. The reason is that fault reports from different nodes may yield the same fault sequence, while the fault sequences after training and correction may differ.
  • the fault sequence is m>mn0>mn01>mn. The scenario therefore includes the fault-reporting node that triggers the fault detection.
  • The neural network model can determine the confidence of each node's position in the fault sequence. If, after training and correction, the confidence of a node exceeds a first preset value, the node is moved forward in the sequence; if it falls below a second preset value, the node is moved backward. For example, if after many rounds of training and correction the confidence of node m is lower than the second preset value, node m is moved behind node mn0, giving the corrected fault sequence mn0>m>mn01>mn. It can be understood that the confidence reflects the position of the node: the greater the confidence, the earlier the node's position in the fault sequence.
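  • The position correction can be illustrated with a simple rule-based sketch. This does not model the neural network itself; it only shows the threshold rule described above. Moving a node by exactly one position per correction, the function name, and the preset values used in the example are all assumptions.

```python
from typing import Dict, List


def correct_sequence(fault_sequence: List[str],
                     confidence: Dict[str, int],
                     first_preset: int,
                     second_preset: int) -> List[str]:
    """Move a node one place forward if its confidence exceeds
    first_preset, or one place backward if it is below second_preset."""
    result = list(fault_sequence)
    for node_id in fault_sequence:          # visit nodes in original order
        i = result.index(node_id)
        c = confidence.get(node_id, 0)
        if c > first_preset and i > 0:
            result[i - 1], result[i] = result[i], result[i - 1]
        elif c < second_preset and i < len(result) - 1:
            result[i], result[i + 1] = result[i + 1], result[i]
    return result


sequence = ["m", "mn0", "mn01", "mn"]
print(correct_sequence(sequence, {"m": -5}, first_preset=3, second_preset=-3))
# ['mn0', 'm', 'mn01', 'mn']
```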
  • The BMC can combine the fault sequence generated by the above sorting methods with the neural network model to determine the fault sequence that is finally output to the user in this scenario.
  • For ease of description, the fault sequence generated by sorting methods 1 to 3 is called the first fault sequence,
  • the fault sequence determined by the neural network model is called the second fault sequence, and
  • the fault sequence that is finally output to the user is called the target fault sequence. If the first fault sequence is the same as the second fault sequence, the target fault sequence is either of them. If the first fault sequence differs from the second fault sequence, the target fault sequence is the second fault sequence.
  • The BMC may also use the neural network model alone to determine the target fault sequence.
  • The neural network model can be deployed in the BMC or in another processor, for example an FPGA, and the BMC can communicate with that processor to obtain the fault sequence determined by the neural network model.
  • Step 204: the BMC outputs a fault detection result.
  • The BMC can display the fault detection result to the user through a graphical interface to indicate the components that need repair and the repair order.
  • For example, the BMC can generate an image of the component topology diagram.
  • An image of the component topology diagram means that the image contains a control for each node in the component topology diagram, and the controls correspond one-to-one to the node identifiers.
  • The control of a faulty node can be located in the image and highlighted.
  • Highlighting is only one form of indication; other ways can also be used to distinguish faulty nodes from non-faulty nodes, such as text, different colors, or flashing, which is not limited in this embodiment of the present application. If the BMC has already generated the image during the processes of FIG. 4 to FIG. 6 and the faulty nodes are highlighted in it, the BMC can use that image directly.
  • The BMC can also post-process the image based on the fault sequence, for example by connecting the highlighted nodes to form a fault path and assigning each faulty node a number that indicates its position in the fault sequence. For example, in the fault sequence m>mn0>mn01>mn, the number of node m is 1, the number of node mn0 is 2, the number of node mn01 is 3, and so on.
  • FIG. 8 is a schematic diagram of a graphical interface provided by an embodiment of the present application. As shown in FIG. 8, the graphical interface displays the component topology diagram shown in FIG. 3, together with the fault path and the number of each faulty node marked on the component topology diagram.
  • The component topology diagram in the above graphical interface may also be replaced with a physical hardware diagram corresponding to the component topology diagram.
  • FIG. 9 is a schematic diagram of another graphical interface provided by an embodiment of the present application.
  • The graphical interface displays the fault detection result on the physical hardware diagram of the computer device 10.
  • Each control in the physical hardware diagram represents the hardware of one component, and the control is bound to the node identifier of that component.
  • The controls and the node identifiers correspond one-to-one. Based on the node identifier of a faulty node and this correspondence, the corresponding control can be located in the physical hardware diagram.
  • The difference between FIG. 9 and FIG. 8 is that the control used to indicate the node identifier is replaced with a control used to indicate the physical hardware corresponding to the node identifier.
  • FIG. 8 and FIG. 9 are only examples. The graphical interface of the embodiments of the present application may contain more or less information than FIG. 8 or FIG. 9, such as the name of the faulty node, its IP address, and the time of the fault, which is not limited in this embodiment of the present application.
  • the above-mentioned method of displaying the fault detection result through the image interface is only an example.
  • the application can also display the fault detection result in other ways, as shown in FIG. , the fault detection result can also be displayed by means such as video, animation, voice, etc., which is not limited in the embodiment of the present application, and any method that can display the fault detection result is applicable to the embodiment of the present application.
  • The BMC can also send the fault detection result to another device or component, such as the processor 110 or a display device with computing capability, which then generates the image representing the fault detection result in the manner described above for the BMC. This reduces the computing power required of the BMC.
  • If the image is generated by a device other than the BMC, that device should hold the same component topology diagram as the BMC. For example, the diagram can be sent to the device by the BMC or imported by the user in another way, which is not limited here.
  • In the above manner, the BMC can detect, based on the component topology diagram, second nodes that may be faulty among the nodes associated with the first node, and output the fault detection result to provide maintenance guidance to the user. Since a second node may or may not have a fault sensor, the technical solution of the present application can improve maintenance efficiency without increasing hardware cost and is applicable to a wider range of scenarios.
  • The computer device 10 shown in FIG. 1A is taken as an example below to illustrate the fault detection method provided by the embodiments of the present application.
  • The processor 110, the memory 120, the external memory 130, and the fault detection device 140 in the computer device 10 are connected through the bus 150.
  • The bus 150 is introduced as follows:
  • The bus 150 includes, but is not limited to, a double data rate (DDR) bus, a peripheral component interconnect express (PCIe) bus, a serial attached SCSI (SAS) bus, and a serial advanced technology attachment (SATA) bus.
  • the DDR bus is faster than the PCIe bus, and the PCIe bus is faster than the SAS bus and SATA bus.
  • the processor 110 and the memory 120 are connected through a DDR bus.
  • the processor 110 and the fault detection device 140 may be connected through a PCIe bus.
  • the processor 110 and the external memory 130 may be connected via a SATA bus or a SAS bus.
  • the internal connection of the computer device 10 may be more complicated, which will be described in detail below.
  • FIG. 11 shows a physical connection method of the computer device 10 shown in FIG. 1A .
  • The motherboard, also called the mainboard, is the core of the computer hardware system, and the components in the computer device 10 are connected through the motherboard.
  • the motherboard is a printed circuit board (PCB), with CPU slots, memory slots, and other slots (such as graphics card slots) on the motherboard.
  • the processor 110 can be installed on the CPU slot of the motherboard, and the memory 120 can be installed on the memory slot of the motherboard.
  • the connection between the slots is implemented inside the motherboard through a bus (such as a DDR bus, a PCIe bus, etc.).
  • the CPU socket and the DDR socket can be connected through a DDR bus, so as to realize the connection between the processor 110 and the memory 120 .
  • USB interface universal serial bus
  • PCIe interface Peripheral Component Interconnect Express interface
  • PCIe riser PCIe interface riser
  • The PCIe riser is an adapter for the PCIe interface on the motherboard. In hardware, the PCIe riser has two interfaces, both of which are PCIe interfaces.
  • The front-end interface is connected to the PCIe interface on the motherboard, and the back-end interface can be connected to other components with PCIe interfaces, thereby providing the adapter function.
  • Although both ends are PCIe interfaces, the back-end interface can be adapted to components with different interface forms or different installation methods, and the riser can have multiple back-end interfaces to connect multiple components with PCIe interfaces.
  • The PCIe riser is used for data transmission and has no data processing function, similar to a communication cable.
  • The read/write speed of a hard disk is slower than that of memory.
  • The memory is directly connected to the processor 110, while a hard disk is usually connected to the processor 110 indirectly through a SAS bus or a SATA bus.
  • If the hard disk has an interface such as a non-volatile memory express (NVMe) interface, it can also be connected directly to the processor 110 through the PCIe bus, which improves the read/write speed of the hard disk, although its performance is still lower than that of memory.
  • In the indirect access mode, a hard disk usually needs to be connected to the processor 110 through components such as a disk array (redundant array of independent disks, RAID) and a PCIe riser. The RAID has a protocol conversion function. For example, the RAID has a SAS interface and a PCIe interface, receives SAS messages through the SAS interface, receives PCIe messages through the PCIe interface, and can convert between SAS messages and PCIe messages to enable communication between the devices on both sides.
  • The SAS bus and the HDD are taken as examples below to introduce how a hard disk is connected to the processor.
  • HDDs are usually inserted into slots of the HDD backplane. Each slot of the HDD backplane (also referred to as an interface connector, CNN) is used to connect one HDD, and the number of slots determines the number of hard disks that can be integrated into the external memory.
  • One end of the CNN is connected to the HDD, and the other end is connected to the SAS interface of the RAID through a SAS cable (SAS CABLE).
  • The front-end interface of the riser is connected to a built-in PCIe interface of the motherboard, which completes the connection between the HDD and the processor.
  • A SAS CABLE is usually of the 1*4 type, that is, one SAS CABLE can connect 4 HDDs to the RAID in parallel. It should be noted that these 4 HDDs are independent of each other. Unlike the soldered traces on the motherboard, the SAS CABLE is a replaceable, independent cable, and damage to it may cause an HDD fault. For a 1*4 SAS CABLE, the whole cable must be replaced no matter which HDD's SAS line is damaged. Therefore, the SAS CABLE is treated as one component rather than four.
  • the component topology diagram is used to represent the physical connection relationship between components, the components constituting the component topology diagram may be connected using the same bus protocol. That is to say, a component topology diagram cannot contain two or more buses with different attributes.
  • A, B, C, and D are interconnected through a SAS bus, and E, F, and G are interconnected through a PCIe bus.
  • A, B, C, and D belong to the same component topology diagram, but they do not belong to the same component topology diagram as E, F, and G.
  • E, F, and G can form another component topology diagram.
  • FIG. 12 is a component topology diagram of the computer device 10, which is used to describe the connection relationship of HDD, HDD backplane, SAS CABLE, and RAID in the computer device 10.
  • each SAS CABLE is 1*4
  • each RAID includes 8 SAS channels. That is, each RAID can have at least two SAS CABLEs.
  • the component topology diagram shown in FIG. 12 shows component names, which may actually be node identifiers.
  • Scenario 1: Assume that HDD2 reports a fault.
  • Step 401: HDD2 is detected to report a fault, and HDD2 is used as the target node.
  • Step 403: Based on the component topology diagram, determine the parent node of the target node (HDD2), that is, CNN2 of the HDD backplane (hereinafter referred to as CNN2).
  • CNN in the following refers to a CNN on the HDD backplane.
  • Step 404: Determine whether the parent node CNN2 is a sensing node. CNN2 is a non-sensing node, so step 501 (the non-sensing node detection process) is executed.
  • Step 501: Look up the child node of CNN2, that is, HDD2, based on the component topology diagram.
  • Step 502: Determine whether the number of child nodes of CNN2 is ≥ 1. Since CNN2 has 1 child node (HDD2), step 503 is executed.
  • Step 503: Traverse the child node HDD2 of CNN2.
  • Step 504: The child node HDD2 is a sensing node, so step 505 is executed.
  • Step 505: The child node HDD2 reports a fault, so step 506 is executed.
  • A node may already have been highlighted during upward or downward tracing. If a node has already been highlighted, it does not need to be highlighted again; that is, when a highlighted node is reached during traversal, it need not be traversed again, and steps 504 to 506 above may be skipped.
  • In one possible implementation, the BMC may traverse only nodes that are not highlighted. In another possible implementation, the BMC may record the traversed nodes (including nodes that were traversed but not highlighted), and traversed nodes do not need to be traversed again.
  • Step 507: After the traversal of the child nodes of CNN2 is completed, determine the number of highlighted child nodes of CNN2, which is one (HDD2), and execute step 509.
  • The BMC then uses CNN2 as a new target node, repeats the above process, and traces upward to check whether the parent node of CNN2 may be faulty, as follows:
  • Step 401: The BMC takes CNN2 as the target node.
  • Step 403: Look up the parent node of CNN2, SAS CABLE1, based on the component topology diagram.
  • Step 404: Determine whether the parent node SAS CABLE1 is a sensing node. Since SAS CABLE1 is a non-sensing node, continue with step 501.
  • Step 501: Find the child nodes of SAS CABLE1 based on the component topology diagram; they are CNN1, CNN2, CNN3 and CNN4.
  • Step 502: Determine whether the number of child nodes of SAS CABLE1 is ≥ 1. SAS CABLE1 has 4 child nodes, so step 503 is executed.
  • Step 503: Traverse CNN1, CNN2, CNN3 and CNN4 one by one.
  • Step 504: Select CNN1 first and determine whether CNN1 is a sensing node. Because CNN1 is a non-sensing node, return to step 501.
  • Step 501: Look up the child node of CNN1, that is, HDD1, based on the component topology diagram.
  • Step 503: Traverse HDD1.
  • Step 504: Determine whether HDD1 is a sensing node. HDD1 is a sensing node, so step 505 is executed.
  • Step 505: Check whether HDD1 reports a fault. According to the assumed Scenario 1, HDD1 does not report a fault, so step 601 (the process for a sensing node that does not report a fault) is executed.
  • Step 601: Find the child nodes of HDD1 based on the component topology diagram.
  • Step 602: Determine whether the number of child nodes of HDD1 is ≥ 1. Since HDD1 has no child nodes, that is, the number is 0, step 609 is executed.
  • Step 609: HDD1 is not highlighted.
  • Step 507: Determine whether the number of highlighted child nodes of CNN1 is ≥ 1. CNN1 has no highlighted child nodes, so the number is 0 and CNN1 is not highlighted.
  • CNN2, CNN3 and CNN4 are then traversed in turn. CNN1, CNN3 and CNN4 are not highlighted, and the earlier process showed that CNN2 is highlighted.
  • Step 507 of this iteration for SAS CABLE1 is then executed.
  • Step 507: Determine the number of highlighted child nodes of SAS CABLE1. Only CNN2 is highlighted, so the number is 1; therefore, after the check in step 508, step 509 is executed.
  • Step 401: The BMC takes SAS CABLE1 as the target node.
  • Step 403: Look up the parent node of SAS CABLE1, RAID1, based on the component topology diagram.
  • Step 404: Determine whether the parent node RAID1 is a sensing node. Since RAID1 is a sensing node, continue with step 405.
  • Step 405: Determine whether RAID1 reports a fault. According to the assumed Scenario 1, RAID1 does not report a fault, so step 601 (the process for a sensing node that does not report a fault) is executed.
  • Step 601: Look up the child nodes of RAID1, namely SAS CABLE1 and SAS CABLE2, based on the component topology diagram.
  • Step 602: The number of child nodes of RAID1 is 2, so step 603 is executed.
  • Step 603: Traverse the child nodes SAS CABLE1 and SAS CABLE2 of RAID1 one by one.
  • SAS CABLE1 has already been highlighted and traversed (see steps 501 to 509 above), so only SAS CABLE2 needs to be traversed here.
  • Step 604: Determine whether SAS CABLE2 is a sensing node. Since SAS CABLE2 is a non-sensing node, execute step 501 (the non-sensing node detection process).
  • Step 501: Find the child nodes of SAS CABLE2, namely CNN5 to CNN8, based on the component topology diagram.
  • Step 502: The number of child nodes of SAS CABLE2 is 4, so step 503 is executed.
  • Step 503: Traverse the child nodes CNN5 to CNN8 of SAS CABLE2 one by one.
  • Step 504: CNN5 comes first. CNN5 is a non-sensing node, so return to step 501.
  • Step 501: Look up the child node of CNN5, that is, HDD5, based on the component topology diagram.
  • Step 503: Traverse HDD5.
  • Step 504: HDD5 is a sensing node, so step 505 is executed.
  • Step 505: According to the assumed Scenario 1, HDD5 does not report a fault, so step 601 (the process for a sensing node that does not report a fault) is executed.
  • Step 601: Find the child nodes of HDD5 based on the component topology diagram.
  • Step 602: HDD5 has no child nodes, so HDD5 is not highlighted.
  • Step 507: Determine the number of highlighted child nodes of CNN5. CNN5 has only one child node, HDD5, which is not highlighted, so the number of highlighted child nodes of CNN5 is 0 and CNN5 is not highlighted.
  • CNN6, CNN7 and CNN8 are then traversed in turn in the same way as CNN5, which is not repeated here. According to the assumed Scenario 1, HDD6 to HDD8 do not report faults, so CNN6, CNN7 and CNN8 are not highlighted.
  • Step 507: Determine the number of highlighted child nodes of SAS CABLE2. Since CNN5 to CNN8 are not highlighted, the number is 0, so SAS CABLE2 is not highlighted.
  • Step 607: Determine the number of highlighted child nodes of RAID1. SAS CABLE1 is highlighted and SAS CABLE2 is not, so the number is 1 and step 608 is executed.
  • Scenario 2: Assume that HDD1 and HDD2 report faults.
  • The BMC takes CNN2 as a new target node and determines the parent node of CNN2, SAS CABLE1, based on the component topology diagram.
  • SAS CABLE1 is a non-sensing node, so the non-sensing node detection process is performed: find the child nodes of SAS CABLE1, namely CNN1 to CNN4, based on the component topology diagram, and traverse them one by one. CNN1 comes first; since CNN1 is a non-sensing node, the non-sensing node detection process is performed on it: find the child node of CNN1, HDD1, which is a sensing node, and determine whether HDD1 reports a fault, based on the foregoing
  • The BMC can continue to traverse CNN3, which is a non-sensing node, and look up the child node of CNN3, HDD3, which is a sensing node. According to Scenario 2, HDD3 has not reported a fault, so HDD3 is not highlighted. The traversal of the child nodes of CNN3 is then complete, and the number of highlighted child nodes of CNN3 is 0, so CNN3 is not highlighted either. CNN4 is traversed next and, similarly, is not highlighted.
  • The parent node RAID1 is a sensing node, so check whether RAID1 reports a fault.
  • RAID1 has not reported a fault, so the process for a sensing node that does not report a fault is performed: look up the child nodes of RAID1, namely SAS CABLE1 and SAS CABLE2, and traverse them. SAS CABLE1 comes first, but it has already been traversed, so the traversal continues with SAS CABLE2. Scenario 2 also assumes that HDD5 to HDD8 do not report faults, so the process of traversing SAS CABLE2 is the same as in Scenario 1 and is not repeated here. As shown by that process, SAS CABLE2 is not highlighted.
  • Because HDD1 and HDD2 belong to the same parent node, HDD2 does not need to be used as a target node to repeat the upward tracing. If HDD2 and HDD1 were not associated, HDD2 would be processed in the same way as HDD1.
  • In that case, the new target node is traced upward to detect whether any node related to HDD2 may be faulty. For the specific process, refer to the detection process above with HDD1 as the target node, which is not repeated here.
  • HDD1 and HDD2 report faults
  • the highlighted nodes include HDD1, HDD2, CNN1, CNN2, SAS CABLE1, and RAID1.
  • Scenario 3: Assume that HDD1, HDD2, and RAID1 report faults.
  • the highlighted nodes include HDD1, HDD2, CNN1, CNN2, SAS CABLE1, and RAID1.
  • If a sensing node among the upstream nodes does not report a fault and the number of its highlighted child nodes is 1, the fault probability of that sensing node is lower than that of the target node.
  • If a sensing node among the upstream nodes does not report a fault and the number of its highlighted child nodes is 0, that sensing node is not highlighted.
  • the embodiment of the present application further provides a fault detection device, which is used to implement the method executed by the BMC in the above method embodiment.
  • the fault detection device 1300 includes an acquisition module 1301 , a determination module 1302 and an output module 1303 .
  • the modules are connected through communication channels.
  • The acquisition module 1301 is configured to obtain a component topology diagram, which describes each component in the computer device and the connection relationships between the components; for the specific implementation, refer to the description of step 201 in FIG. 2, which is not repeated here.
  • The determination module 1302 is configured to determine whether other components in the component topology diagram that are connected to the first component that reported the error may fail; for the specific implementation, refer to the description of steps 202 to 203 in FIG. 2, which is not repeated here.
  • the output module 1303 is configured to output the second component that may fail, where the second component is a subset of other components and the first component.
  • the component topology diagram is used to describe the hardware connection relationship between components using the same communication protocol.
  • The output module 1303 is specifically configured to output the second component through a graphical interface, where the graphical interface displays the component topology diagram, the component topology diagram includes multiple node identifiers in one-to-one correspondence with the components in the computer device, and the node identifier corresponding to the second component is highlighted in the component topology diagram; or the graphical interface displays a physical hardware picture of the components of the computer device, the physical hardware picture includes multiple controls in one-to-one correspondence with the components in the computer device, each control displays the hardware of one component, and the control corresponding to the second component is highlighted in the physical hardware picture.
  • The second component is determined through a neural network model, where the neural network model is used to determine, according to the component that reported the error, whether other components that have a connection relationship with that component may fail, and to determine the ranking of the components that may fail.
  • other components include upstream components of the first component and downstream components of the first component in the component topology diagram.
  • the determining module 1302 is specifically configured to, for any one of the other components, determine that the component may be faulty if there is at least one lower-level component that may be faulty.
  • the number of second components is greater than 1;
  • the determining module 1302 is also used to rank the probabilities of failures of multiple second components; refer to the descriptions in FIG. 4 to FIG. 7 , and details will not be repeated here.
  • the output module 1303 is also used to output the sorted multiple second components.
  • the component set includes a parent component and one or more child components of the parent component;
  • the determining module 1302 is specifically configured to, if the parent component does not have a sensor, and the number of one or more child components is greater than 1, then determine that the failure probability of the parent component is greater than the failure probability of the child component.
  • the component set includes a parent component and one or more child components of the parent component;
  • the determining module 1302 is specifically configured to, if the parent component has no sensor and the number of sub-components is equal to 1, determine that the failure probability of the parent component is the same as the failure probability of the sub-component.
  • the component set includes a parent component and one or more child components of the parent component;
  • the determining module 1302 is specifically configured to, if the parent component has a sensor, and the sensor of the parent component reports an error, then determine that the failure probability of the parent component is greater than the failure probability of the child component.
  • the component set includes a parent component and one or more child components of the parent component;
  • the determining module 1302 is specifically configured to determine that the probability of failure of the parent component is greater than the probability of failure of the child component if the parent component has a sensor, and the sensor of the parent component does not report an error, and the number of child components is greater than 1.
  • the component set includes a parent component and one or more child components of the parent component;
  • the determining module 1302 is specifically configured to determine that the probability of failure of the parent component is less than the probability of failure of the child component if the parent component has a sensor, and the sensor of the parent component does not report an error, and the number of child components is equal to 1.
  • the output module 1303 is specifically configured to output the second component through a graphical interface; wherein, the graphical interface further includes a number used to indicate the sorting of the second component, and the number is located in a preset area.
  • the first component has a sensor; the determining module 1302 is further configured to determine that the first component has failed according to the sensor.
  • all or part of them may be implemented by software, hardware, firmware or any combination thereof.
  • When implemented using software, it may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on the computer, the processes or functions according to the embodiments of the present application will be generated in whole or in part.
  • the computer can be a general purpose computer, a special purpose computer, a computer network, or other programmable devices.
  • The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (for example, infrared, radio, or microwave).
  • the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device including a server, a data center, and the like integrated with one or more available media.
  • the available medium may be a magnetic medium (such as a floppy disk, a hard disk, or a magnetic tape), an optical medium (such as a DVD), or a semiconductor medium (such as a solid state disk (Solid State Disk, SSD)), etc.
  • the various illustrative logic units and circuits described in the embodiments of the present application can be implemented by a general-purpose processor, a digital signal processor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic devices, Discrete gate or transistor logic, discrete hardware components, or any combination of the above designed to implement or operate the described functions.
  • the general-purpose processor may be a microprocessor, and optionally, the general-purpose processor may also be any conventional processor, controller, microcontroller or state machine.
  • A processor may also be implemented by a combination of computing devices, such as a digital signal processor and a microprocessor, multiple microprocessors, one or more microprocessors combined with a digital signal processor core, or any other similar configuration.
  • the steps of the method or algorithm described in the embodiments of the present application may be directly embedded in hardware, a software unit executed by a processor, or a combination of both.
  • the software unit may be stored in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, removable disk, CD-ROM or any other storage medium in the art.
  • the storage medium can be connected to the processor, so that the processor can read information from the storage medium, and can write information to the storage medium.
  • the storage medium can also be integrated into the processor.
  • the processor and storage medium can be provided in an ASIC.

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Test And Diagnosis Of Digital Computers (AREA)
  • Debugging And Monitoring (AREA)

Abstract

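A fault detection method and device. The method may be performed by a fault detection device (140) in a computer device (10). In the method, the fault detection device (140) obtains a component topology diagram (201), determines whether other components in the component topology diagram that have a connection relationship with a first component that reports an error may fail (203), and outputs second components that may fail (204), where the second components are a subset of the other components and the first component. In this way, when a component in the computer device (10) reports a fault, the components associated with that component can be checked for possible faults based on the component topology diagram, so that a series of possibly faulty components is found and output to guide the user's maintenance and improve maintenance efficiency.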

Description

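Fault detection method and device
Cross-reference to related applications
This application claims priority to Chinese Patent Application No. 202110732299.X, filed with the Intellectual Property Office of the People's Republic of China on June 30, 2021 and entitled "Fault detection method and device", which is incorporated herein by reference in its entirety.
Technical field
This application relates to the field of computer technologies, and in particular, to a fault detection method and device.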
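Background
With the development of electronic technology, computer devices contain more and more components, and the connection relationships between components are increasingly complex. Some components have built-in fault sensors that raise an alarm when a possible fault is detected, but most components have no fault sensor. Moreover, because components may affect one another, a fault in one component may cause faults in other components.
For reasons of cost and product implementation, it is currently not possible to equip every component with a fault sensor, so when a component fails it is increasingly difficult to locate which components may be faulty.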
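Summary
This application provides a fault detection method and device for locating components that may be faulty, so as to provide maintenance guidance to users and improve maintenance efficiency.
According to a first aspect, an embodiment of this application provides a fault detection method. The method may be performed by a fault detection device and may be applied to a computer device. In the method, the fault detection device obtains a component topology diagram, which describes the components in the computer device and the connection relationships between them; determines whether other components in the component topology diagram that have a connection relationship with a first component that reports an error may fail; and outputs second components that may fail, where the second components are a subset of the other components and the first component.
With this method, after detecting that the first component reports a fault, the fault detection device can detect, based on the component topology diagram, possibly faulty components among the components associated with the first component, and output the fault detection result to provide maintenance guidance to the user. Because a second component may or may not have a fault sensor, the technical solution of this application improves maintenance efficiency without increasing hardware cost and applies to a wider range of scenarios.
In a possible implementation, the component topology diagram describes the hardware connection relationships between components that use the same communication protocol.
With this method, because components that use the same communication protocol interact more frequently, possibly faulty components are easier to find, which improves fault detection efficiency.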
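In a possible implementation, outputting the second components includes: outputting the second components through a graphical interface, where the graphical interface displays the component topology diagram, the component topology diagram includes multiple node identifiers in one-to-one correspondence with the components in the computer device, and the node identifiers corresponding to the second components are highlighted in the component topology diagram; or the graphical interface displays a physical hardware picture of the components of the computer device, the physical hardware picture includes multiple controls in one-to-one correspondence with the components in the computer device, each control displays the hardware of one component, and the controls corresponding to the second components are highlighted in the physical hardware picture.
With this method, the possibly faulty components are shown to the user more intuitively. Further, if they are shown on the physical hardware picture, the user can more quickly determine where these hardware components are located, which improves the user experience.
In a possible implementation, the second components are determined through a neural network model, where the neural network model is used to determine, according to the component that reports an error, whether other components that have a connection relationship with that component may fail, and to determine the ranking of the components that may fail. Based on training data, the neural network model can keep learning the rules for deriving other possibly faulty components from the fault-reporting component, as well as the rules for ranking multiple possibly faulty components.
With this method, the neural network model can adapt to different devices and application scenarios and learn different detection and ranking rules, which helps improve fault detection accuracy and gives the method a wide range of application.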
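In a possible implementation, the other components include the upstream components and the downstream components of the first component in the component topology diagram.
In a possible implementation, determining whether other components in the component topology diagram that have a connection relationship with the first component that reports an error may fail includes: for any one of the other components, if the component has at least one next-level component that may be faulty, determining that the component may be faulty.
With this method, the other components that have a connection relationship with the error-reporting component are determined based on the component topology diagram, so the fault detection scope can be located quickly and fault detection efficiency is improved.
In a possible implementation, the number of second components is greater than 1, and outputting the second components specifically includes: ranking the multiple second components by their probability of failure, and outputting the ranked second components.
With this method, the nodes that are more likely to be faulty are ranked first, which guides the user's maintenance order and improves maintenance efficiency.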
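In a possible implementation, for any component set among the multiple second components, where the component set includes a parent component and one or more child components of the parent component, ranking the multiple second components by their probability of failure includes: if the parent component has no sensor and the number of child components is greater than 1, determining that the probability of failure of the parent component is greater than that of the child components.
In a possible implementation, for any component set among the multiple second components, where the component set includes a parent component and one or more child components of the parent component, ranking the multiple second components by their probability of failure includes: if the parent component has no sensor and the number of child components is equal to 1, determining that the probability of failure of the parent component is the same as that of the child component.
With this method, components without sensors can be checked without adding hardware overhead.
In a possible implementation, for any component set among the multiple second components, where the component set includes a parent component and one or more child components of the parent component, ranking the multiple second components by their probability of failure includes: if the parent component has a sensor and the sensor of the parent component reports an error, the probability of failure of the parent component is greater than that of the child components.
In a possible implementation, for any component set among the multiple second components, where the component set includes a parent component and one or more child components of the parent component, ranking the multiple second components by their probability of failure includes: if the parent component has a sensor, the sensor of the parent component does not report an error, and the number of child components is greater than 1, determining that the probability of failure of the parent component is greater than that of the child components.
With this method, fault detection does not rely solely on sensors: possibly faulty nodes are found in time, missed detections caused by sensor failures are avoided, and the user's maintenance efficiency is improved.
In a possible implementation, for any component set among the multiple second components, where the component set includes a parent component and one or more child components of the parent component, ranking the multiple second components by their probability of failure includes: if the parent component has a sensor, the sensor of the parent component does not report an error, and the number of child components is equal to 1, determining that the probability of failure of the parent component is less than that of the child component.
In a possible implementation, outputting the second components includes: outputting the second components through a graphical interface, where the graphical interface further includes numbers indicating the ranking of the second components, and the numbers are located in a preset area.
With this method, the ranking result is shown to the user more intuitively, which improves the user experience.
In a possible implementation, the first component has a sensor, and the method further includes: determining, according to the sensor, that the first component has failed.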
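According to a second aspect, an embodiment of this application further provides a fault detection device that has the function of implementing the behavior in the method examples of the first aspect. For the beneficial effects, refer to the description of the first aspect; they are not repeated here. The function may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the function. In a possible design, the structure of the fault detection device includes an acquisition module, a determination module, and an output module. These modules can perform the corresponding functions in the method examples of the first aspect; for details, refer to the detailed description in the method examples, which is not repeated here.
According to a third aspect, this application further provides fault detection equipment. The fault detection equipment includes a processor and a memory and may further include a communication interface. The processor executes program instructions in the memory to perform the method provided in the first aspect or any possible implementation of the first aspect. The fault detection equipment may be an independent module in a computer device, such as a baseboard management controller (BMC). The memory is coupled to the processor and stores the program instructions and data needed during fault detection (for example, the component topology diagram). The communication interface is used to communicate with other devices.
According to a fourth aspect, this application provides a computer-readable storage medium. When the computer-readable storage medium is executed by a computing device, the computing device performs the method provided in the first aspect or any possible implementation of the first aspect. A program is stored in the storage medium. The storage medium includes but is not limited to volatile memory, such as random access memory, and non-volatile memory, such as flash memory, a hard disk drive (HDD), or a solid state drive (SSD).
According to a fifth aspect, this application provides a computing device program product. The computing device program product includes computer instructions that, when executed by a computing device, cause the computing device to perform the method provided in the first aspect or any possible implementation of the first aspect. The computer program product may be a software installation package; when the method provided in the first aspect or any possible implementation of the first aspect is needed, the computer program product can be downloaded and executed on the computing device.
According to a sixth aspect, this application further provides a computer chip. The chip is connected to a memory and is configured to read and execute a software program stored in the memory, so as to perform the method described in the first aspect and each possible implementation of the first aspect.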
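Brief description of the drawings
FIG. 1A is a schematic diagram of a possible system architecture according to an embodiment of this application;
FIG. 1B is a functional schematic diagram of a fault detection device 140 according to an embodiment of this application;
FIG. 2 is a schematic flowchart of the fault detection method according to an embodiment of this application;
FIG. 3 is a component topology diagram according to an embodiment of this application;
FIG. 4 is a schematic flowchart of the main detection process in the fault detection method according to an embodiment of this application;
FIG. 5 is a schematic flowchart of the non-sensing node detection process in the fault detection method according to an embodiment of this application;
FIG. 6 is a schematic flowchart of the detection process for a sensing node that does not report a fault in the fault detection method according to an embodiment of this application;
FIG. 7 is a schematic flowchart of training and correction based on a neural network model according to an embodiment of this application;
FIG. 8 is a schematic diagram of a graphical interface according to an embodiment of this application;
FIG. 9 is a schematic diagram of another graphical interface according to an embodiment of this application;
FIG. 10 is a schematic diagram of a third graphical interface according to an embodiment of this application;
FIG. 11 is a schematic diagram of the hardware structure of some components in the computer device 10 according to an embodiment of this application;
FIG. 12 is a component topology diagram of the computer device 10 according to an embodiment of this application;
FIG. 13 is a schematic structural diagram of a fault detection device according to this application.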
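Detailed description
The fault detection method provided in this application may be applied to a computer device. When a component in the computer device reports a fault, the method checks the components associated with that component based on the component topology diagram, finds a series of components that may be faulty, and outputs them to guide the user's maintenance and improve maintenance efficiency.
Computer devices in this application include but are not limited to servers, storage devices, computing devices, and user equipment (UE). UE includes desktop computers, laptops, tablets, mobile phones, handheld devices, in-vehicle devices, wearable devices, and so on. The embodiments of this application do not limit the type or structure of the computer device; any device with electronic components is applicable.
FIG. 1A is a schematic structural diagram of a computer device 10 according to an embodiment of this application. As shown in FIG. 1A, the computer device 10 includes a processor 110, a memory 120, an external memory 130, a fault detection device 140, and a bus 150. The processor 110, the memory 120, the external memory 130, and the fault detection device 140 are connected through the bus 150.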
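The processor 110 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA), an artificial intelligence (AI) chip, a system on chip (SoC), a complex programmable logic device (CPLD), a graphics processing unit (GPU), or the like.
The memory 120 is the internal memory that exchanges data directly with the processor 110. It can be read and written at any time at high speed, and serves as temporary data storage for the operating system or other programs running on the processor 110. The memory includes volatile memory, such as random access memory (RAM) and dynamic random access memory (DRAM); it may also include non-volatile memory, such as storage class memory (SCM), or a combination of volatile and non-volatile memory.
The external memory 130, also called auxiliary storage, may be non-volatile memory, such as read-only memory (ROM), a hard disk drive (HDD), or a solid state drive (SSD).
Note that some components in the computer device 10 also have integrated fault sensors (not shown in FIG. 1A). For example, the CPU and the hard disks (such as HDDs and SSDs) in FIG. 1A each have their own fault sensor.
A fault sensor may be located inside a component and is used to detect the running state of the component, which is either normal or faulty. The fault sensor can indicate these two states with different values; for example, a value of 1 indicates normal operation and a value of 0 indicates a fault. When a component fault is detected, the fault sensor generates a signal indicating the fault (hereinafter referred to as a fault signal) to indicate that the node is faulty. An example is the running-status indicator on an electronic device, which is green when the device runs normally and red when it runs abnormally. For ease of description, a component's fault sensor sending a fault signal is hereinafter referred to as the component reporting a fault.
The fault detection device 140 is a management subsystem that runs independently inside the computer device 10. It can obtain the fault signals of other components in the computer device 10 in order to perform the fault detection method provided in the embodiments of this application.
In this application, the fault detection device 140 may be a new component integrated into the computer device 10 that provides the fault detection method of the embodiments of this application. Alternatively, the fault detection device 140 may be an existing component in the computer device 10 that has this function, for example a BMC. The BMC is a key part of a server and is a management subsystem that runs independently inside the server. As a platform management system, the BMC provides a series of monitoring and control functions; its hardware is the first component on the server motherboard to power on and start, and it serves as the out-of-band management system. The fault detection device 140 is described below with reference to FIG. 1A and FIG. 1B, using the BMC as an example.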
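As shown in FIG. 1A, in terms of hardware, the fault detection device 140 includes a processor 151, a memory 152, an external memory 153, and a communication interface 154, which are connected through a bus.
The processor 151 is used to process data, perform computation, and so on; for example, the processor 151 can run the fault detection method provided in the embodiments of this application. The processor 151 is similar to the processor 110 and may be a CPU, ASIC, FPGA, AI chip, SoC, CPLD, GPU, or the like. At the software level, an operating system runs on the processor 151. The operating system may be X86, Arm, UNIX, a lightweight system, a custom operating system, and so on, which is not limited in the embodiments of this application. Note that the operating system run by the processor 151 is independent of the operating system run by the processor 110. That is, when a component in the computer device 10 fails, for example the processor 110, the fault detection device 140 is not affected.
The memory 152 is the internal memory that exchanges data directly with the processor 151. It can be read and written at any time at high speed, and serves as temporary data storage for the operating system or other programs running on the processor 151. For example, the memory 152 can store the component topology diagram of the computer device 10; when the processor 151 performs the fault detection method provided in this application, it can obtain the component topology diagram of the computer device 10 from the memory 152.
The external memory 153 provides storage resources and may be non-volatile memory, such as ROM, an HDD, an SSD, or flash memory. A typical BMC uses flash chips to serve as its hard disk. Unlike memory, a hard disk reads and writes more slowly and is usually used to store data persistently. In this application, the component topology diagram of the computer device 10 may also be stored persistently in the external memory 153; when the processor 151 performs the fault detection method provided in this application, the component topology diagram can be moved from the external memory 153 to the memory 152, and the processor 151 obtains it from the memory 152.
The communication interface 154 is used to communicate with other components in the computer device 10 or with devices outside the computer device 10. For example, the processor 151 obtains, through the communication interface 154, the fault signals generated by the fault sensors of the processor 110, the memory 120, and the external memory 130. As another example, the processor 151 can send the detection result to a display device through the communication interface 154. The display device is a user-side device. As shown in FIG. 1B, it may be, for example, the BMC's web display, a mobile terminal such as a mobile phone or tablet, a device running specific software such as tool software, network management software, or cloud operations software, or a display such as a liquid crystal display (LCD) or light emitting diode (LED) screen. The embodiments of this application do not limit the display device; any device with a display screen is applicable.
The processor 151 can also communicate with other processors, such as an artificial intelligence engine. The artificial intelligence engine may be deployed inside or outside the computer device 10 and can assist the fault detection device 140 in performing the fault detection method provided in the embodiments of this application, as described in detail later.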
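The fault detection method provided in the embodiments of this application is described below with reference to FIG. 2, using the computer device 10 shown in FIG. 1A as an example. Note that FIG. 1A shows only a few components of the computer device 10 for brevity. In practice, the computer device 10 may have more components than shown in FIG. 1A, for example a network card and a motherboard, or it may have fewer components. The embodiments of this application do not limit the structure of the computer device 10.
FIG. 2 is a schematic flowchart of the fault detection method according to an embodiment of this application. The method may be performed by the fault detection device 140 (or the processor 151) in FIG. 1A to detect possibly faulty components in the computer device 10. For ease of description, the following takes the case in which the fault detection device 140 is a BMC as an example; that is, every mention of the BMC below can be replaced with the fault detection device 140. As shown in FIG. 2, the method includes the following steps: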
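Step 201: The BMC obtains the component topology diagram of the computer device 10.
The component topology diagram of the computer device 10 describes the components in the computer device 10 and the connection relationships between them. The connection relationship may be a logical connection relationship or a physical connection relationship between components. Usually, the component topology diagram of a software system is generated based on logical connection relationships, while that of a hardware system is generated based on the physical connection relationships between components.
Taking the component topology diagram of a hardware system as an example, each node in the diagram represents a replaceable component, such as a CPU, memory, hard disk, network card, communication cable, or interface adapter card. The component may or may not have a fault sensor; for example, CPUs, memory, and hard disks have fault sensors, while communication cables and interface adapter cards do not. Hereinafter, a node in the component topology diagram that has a fault sensor is called a sensing node, and a node without a fault sensor is called a non-sensing node.
Each component in the component topology diagram can be represented by a node identifier, which uniquely identifies one component. In the component topology diagram, "component" and "node" are the same concept and can be used interchangeably. A node identifier may consist of one or more of digits, letters, and so on. The number of characters in a node identifier can also indicate the level of the component in the component topology diagram. For example, a one-character node identifier indicates that the node is at the first level, and the node at the first level of the component topology diagram is the root node. A two-character node identifier indicates the second level, a three-character node identifier indicates the third level, and so on.
The first level is the level above the second level, the second level is the level above the third level, and so on. Correspondingly, the second level is the level below the first level, the third level is the level below the second level, and so on. Taking a node A as an example, the relationships between nodes are as follows: in the component topology diagram, a node in the level above node A that is connected to node A is the parent node of node A, and a node in the level below node A that is connected to node A is a child node of node A. The upstream nodes of node A include all nodes on the path from the root node to node A. The downstream nodes of node A include the child nodes of node A, the child nodes of those child nodes, and so on, down to the end nodes. In this application, the associated nodes of node A include its upstream nodes and its downstream nodes.
Refer to FIG. 3, which is a schematic diagram of a component topology diagram according to an embodiment of this application. As shown in FIG. 3, the node identifier of the root node is 0, and the node identifiers of the child nodes of root node 0 are 00, 01, …, 0i, where i is a positive integer. Taking node 00 as an example, its child nodes are 000, 001, 002, …, 00j, where j is a positive integer. The parent node of node 0000 is 000, and the child nodes of node 0000 include node 00000 and may also include node 00001, node 00002, and so on (not shown in FIG. 3). The upstream nodes of node 0000 include node 000, node 00, and node 0. The downstream nodes of node 0000 include node 00000 and so on.
It follows from this identifier scheme that, for any node such as mno, removing the last character gives its parent node mn, and appending 0, 1, 2, … after the last character of mno matches its child nodes mno0, mno1, ….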
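The identifier scheme above can be sketched in a few lines of Python. This is only an illustration under the assumption that each level contributes exactly one character to the identifier; the function names (`parent_id`, `upstream`, `children_of`, `downstream`) are hypothetical and do not come from the application.

```python
from typing import Iterable, List, Optional


def parent_id(node_id: str) -> Optional[str]:
    """Drop the last character to get the parent; the one-character root has none."""
    return node_id[:-1] if len(node_id) > 1 else None


def upstream(node_id: str) -> List[str]:
    """Ancestors of a node, from its parent up to the root."""
    return [node_id[:i] for i in range(len(node_id) - 1, 0, -1)]


def children_of(node_id: str, all_ids: Iterable[str]) -> List[str]:
    """Known identifiers that extend node_id by exactly one character."""
    return sorted(n for n in all_ids
                  if len(n) == len(node_id) + 1 and n.startswith(node_id))


def downstream(node_id: str, all_ids: Iterable[str]) -> List[str]:
    """All known descendants: longer identifiers that start with node_id."""
    return sorted(n for n in all_ids
                  if len(n) > len(node_id) and n.startswith(node_id))


# A few of the identifiers named for FIG. 3.
ids = {"0", "00", "01", "000", "001", "0000", "00000"}
print(parent_id("0000"))         # parent of node 0000
print(upstream("0000"))          # upstream nodes of node 0000
print(children_of("00", ids))    # child nodes of node 00
print(downstream("0000", ids))   # downstream nodes of node 0000
```

With one character per level, prefix matching is enough to recover parent, child, upstream, and downstream relationships without storing explicit links.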
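Note that the identifier scheme above is only an example. Any node identifier that uniquely represents a component is applicable to the embodiments of this application, and the embodiments do not limit what a node identifier expresses: it may or may not indicate the level through its number of characters, and it may or may not indicate an association between nodes through shared characters. For example, the node identifier of the root node may be abc and the node identifier of a child of the root node may be def. Because the scheme described above is easy to understand and remember, the following description continues to use it.
The component topology diagram has been introduced above. In step 201, the BMC can obtain the component topology diagram in several ways. For example, it can be generated from a configuration script imported or burned into the BMC. The configuration script describes the node identifiers of the components in the component topology diagram and the connection relationships between nodes; it may also include the correspondence between components and node identifiers, and other node information such as whether a node is a sensing node.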
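As a rough illustration of such a configuration script, the sketch below loads a small topology from JSON. The JSON layout, the field names (`nodes`, `links`, `component`, `sensing`), and the function `load_topology` are all assumptions made for this example; the application does not specify a script format.

```python
import json

# Hypothetical configuration: node identifiers, component names, sensor flags, links.
CONFIG_TEXT = """
{
  "nodes": [
    {"id": "0",    "component": "RAID1",      "sensing": true},
    {"id": "00",   "component": "SAS CABLE1", "sensing": false},
    {"id": "000",  "component": "CNN1",       "sensing": false},
    {"id": "0000", "component": "HDD1",       "sensing": true}
  ],
  "links": [["0", "00"], ["00", "000"], ["000", "0000"]]
}
"""


def load_topology(text):
    """Return (names, sensing, children) parsed from the hypothetical JSON layout."""
    cfg = json.loads(text)
    names = {n["id"]: n["component"] for n in cfg["nodes"]}
    sensing = {n["id"] for n in cfg["nodes"] if n["sensing"]}
    children = {node_id: [] for node_id in names}
    for parent, child in cfg["links"]:   # each link is [parent id, child id]
        children[parent].append(child)
    return names, sensing, children


names, sensing, children = load_topology(CONFIG_TEXT)
print(children["0"], sorted(sensing))
```

Keeping the sensor flag next to each node lets later detection steps decide between the sensing and non-sensing branches without another lookup.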
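Step 202: The BMC detects that a first node reports a fault.
The first node may be any sensing node in the component topology diagram. The BMC detecting that the first node reports a fault means that the BMC obtains the fault signal generated by the fault sensor of the first node; this is not explained again below.
For example, in the computer device 10 shown in FIG. 1A, the BMC may detect a fault of the processor 110, the memory 120, or the external memory 130. The BMC may detect that one or more nodes report faults; the following description takes one node as an example.
Step 203: The BMC detects other possibly faulty second nodes based on the component topology diagram.
After detecting that the first node reports a fault, the BMC looks up the associated nodes of the first node (its upstream and downstream nodes) based on the component topology diagram and checks whether any of them may be faulty (such as a second node). A second node here means a node detected as possibly faulty, so there may be one or more second nodes; that is, this step may detect one or more second nodes. Furthermore, a person skilled in the art knows that even when a fault sensor reports a fault, the node itself is not necessarily faulty; a fault in another node may have caused this node's fault sensor to report. Therefore, each possibly faulty node is given a fault probability, which is not explained again below. The cause of the fault can later be determined from the fault probabilities.
For example, the BMC detects that a first node, say node mno, reports a fault. The BMC takes node mno as the target node and traces upward based on the component topology diagram to check whether the parent node mn of the target node mno is faulty. The check performed depends on whether the parent node mn is a sensing node or a non-sensing node. If the parent node mn is a non-sensing node, the non-sensing node detection process provided in the embodiments of this application is performed. If the parent node is a sensing node, the BMC further determines whether the parent node reports a fault; if it does not, the detection process for a sensing node that does not report a fault is performed. If this series of checks finds fault symptoms related to the parent node, the parent node mn is considered possibly faulty, with a higher probability of failure than node mno; in other words, the fault of node mno may have been caused by the parent node mn. The parent node mn is then a second node. After that, node mn is taken as the new target node and the above process is repeated until all associated nodes of node mno have been checked.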
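The detection processes above are described in detail below with reference to FIG. 4 to FIG. 6. The methods in FIG. 4 to FIG. 6 are all performed by the BMC.
Refer to FIG. 4, which is a schematic flowchart of the main detection process according to an embodiment of this application. As shown in FIG. 4, the method includes the following steps:
Step 401: A node is detected as possibly faulty, for example because it reports a fault, and is taken as the target node.
Step 402: Highlight the target node and record its fault probability f1 = x.
Highlighting here may mean recording or marking the node as possibly faulty, or highlighting the node in a graphical interface that contains the component topology diagram; this is not explained again below. Here x is a reference value and can take any value. In the following, a highlighted node is a node that may be faulty (shows fault symptoms), and a node that is not highlighted is considered not faulty.
Step 403: Look up the parent node of the target node based on the component topology diagram.
Step 404: Determine whether the parent node is a sensing node. If yes, perform the sensing node detection process (see steps 405 to 408); otherwise, perform the non-sensing node detection process (see the process shown in FIG. 5), that is, jump to step 501.
Step 405: Determine whether the parent node reports a fault. If yes, perform step 406; otherwise, perform the detection process for a sensing node that does not report a fault (see the process shown in FIG. 6), that is, jump to step 601.
Step 406: Highlight the parent node and define its fault probability f1 = x + 1.
If the parent node reports a fault, the parent node may be faulty, and the fault of the child node may also have been caused by the parent node, so the fault probability of the parent node is greater than that of the child node. Here f1 = x + 1 means that the fault probability of the parent node is the fault probability of the child node plus 1, indicating that the parent is more likely to be faulty than the child. The embodiments of this application are not limited to adding 1; any algorithm that expresses the relative order of the two is applicable.
Step 407: Determine whether the parent node is the root node. If yes, perform step 408; otherwise, take the parent node as the new target node and return to step 403.
Step 408: Obtain the possibly faulty nodes and the fault probability of each node.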
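Refer to FIG. 5, which is a schematic flowchart of the non-sensing node detection process according to an embodiment of this application. As shown in FIG. 5, the method includes the following steps:
Step 501: Look up the child nodes of the non-sensing node based on the component topology diagram.
Step 502: Determine whether the number of child nodes of the non-sensing node is ≥ 1. If it is ≥ 1, that is, the non-sensing node has at least one child node, perform step 503; if it is < 1, that is, the non-sensing node has no child nodes, perform step 509.
Step 503: Traverse the child nodes [0, 1, 2, …] of the non-sensing node one by one.
Note that the child nodes [0, 1, 2, …] shown here are only illustrative. They do not mean that the non-sensing node necessarily has child nodes, nor that it has at least 3 child nodes. The non-sensing node may have only 1, 2, or more child nodes, which is not limited in the embodiments of this application. This is not explained again below.
When the non-sensing node has at least two child nodes, the child-node traversal is executed multiple times, either in parallel by multiple threads or serially by one thread. After the last child node has been checked, regardless of whether it went through the sensing node detection process (steps 503 to 506) or the detection process for a sensing node that does not report a fault, the process jumps to step 507.
During traversal, a child node can be selected as the current child node in the order of the child nodes' node identifiers.
Step 504: Determine whether the current child node is a sensing node. If yes, perform step 505; otherwise, return to step 501.
Note that step 504 allows iterative detection. For example, the child nodes of node mno include node mno0 and node mno1. Node mno0 is traversed first, and it is determined whether node mno0 is a sensing node. If it is, the process continues by checking whether node mno0 reports a fault (see step 505). If mno0 is a non-sensing node, the process returns to step 501 to look up the child nodes of node mno0, for example node mno00 and node mno01, and continues with the subsequent steps, such as checking the number of child nodes of node mno00, and so on.
Note that FIG. 5 shows the steps of a single iteration; in other words, the current child node in step 505 is the same node as the current child node in step 504.
Step 505: Check whether the current child node reports a fault. If yes, perform step 506; otherwise, perform the detection process for a sensing node that does not report a fault (see the process shown in FIG. 6), that is, jump to step 601.
Step 506: Highlight the current child node and define its fault probability f1 = x.
In this iteration, the current child node is a child of the parent node of the target node in step 401, that is, a sibling of the target node. Therefore, when the current child node reports a fault, its fault probability is equal to that of the target node.
Step 507: After the child nodes of the non-sensing node have been traversed, determine the number of highlighted child nodes among the child nodes [0, 1, 2, …] and determine whether that number is ≥ 1. If it is ≥ 1, perform step 508; if it is < 1, that is, none of the child nodes is highlighted, the non-sensing node is not highlighted.
Step 508: Determine whether the number of highlighted child nodes among the child nodes [0, 1, 2, …] of the non-sensing node is ≥ 2. If it is < 2 (that is, = 1), perform step 509; if it is ≥ 2, perform step 510.
Step 509: Highlight the non-sensing node and define its fault probability f1 = x.
If the non-sensing node has one highlighted child node, the fault of that child node may have been caused by a fault of the non-sensing node, so the non-sensing node may also be faulty. Because the non-sensing node has no fault sensor, it cannot be clearly determined whether it is faulty, so its fault probability can be set equal to that of the child node.
Step 510: Highlight the non-sensing node and define its fault probability f1 = x + 1.
If the non-sensing node has at least two highlighted child nodes, the faults of those child nodes are very likely caused by a fault of the non-sensing node. Therefore, the non-sensing node may also be faulty, and its fault probability is higher than that of its child nodes.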
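Refer to Figure 6, which is a schematic flowchart of the detection flow for a sensor node that does not report a fault, provided by an embodiment of this application. As shown in Figure 6, the method includes the following steps:
Step 601: Find the child nodes [0, 1, …] of the non-reporting sensor node based on the component topology graph.
Step 602: Determine whether the number of child nodes of the sensor node is ≥1. If it is ≥1, that is, the sensor node has at least one child node, perform step 603; otherwise, that is, the sensor node has no child node, the sensor node is not highlighted.
Step 603: Traverse the child nodes [0, 1, …] of the sensor node one by one.
It should be understood that when the sensor node has at least two child nodes, the child traversal flow is executed multiple times, either in parallel by multiple threads or serially by a single thread. Once the last child node has been checked, regardless of whether that child went through the sensor-node detection flow (steps 603 to 605) or the sensorless-node detection flow, the flow jumps to step 607.
During the traversal, one child node can be selected as the current child node in the order of the child nodes' node identifiers.
Step 604: Determine whether the current child node is a sensor node. If it is, perform step 605; otherwise, perform the sensorless-node detection flow (see the method flow shown in Figure 5), that is, jump to step 501.
Step 605: Determine whether the current child node reports a fault. If it does, perform step 606; otherwise, take the child node as the new non-reporting sensor node and return to step 601.
Note that step 605 makes the detection iterative. For example, the child nodes of node mn include node mn0 and node mn1. Node mn0 is traversed first and checked for being a sensor node. If it is, the flow continues by checking whether node mn0 reports a fault (see step 605). If node mn0 does not report a fault, the flow returns to step 601 and looks up the child nodes of node mn0, for example node mn00 and node mn01, and then continues with the subsequent steps, such as checking the number of child nodes of node mn00, and so on. This is not repeated below.
It is worth noting that Figure 6 shows the steps of a single iteration; in other words, the current child node in step 606 is the same node as the current child node in step 605.
Step 606: Highlight the current child node and define its fault probability as f1=x.
See the explanation of step 506; it is not repeated here.
Step 607: After the child nodes of the sensor node have been traversed, determine how many of its child nodes [0, 1, …] are highlighted and whether that number is ≥1. If it is ≥1, perform step 608; if it is less than 1, that is, the number of highlighted child nodes is 0, the sensor node is not highlighted.
Step 608: Determine whether the number of highlighted child nodes of the sensor node is ≥2. If it is less than 2 (that is, equal to 1), perform step 609; if it is ≥2, perform step 610.
Step 609: Highlight the sensor node and define its fault probability as f1=x-1.
Because the sensor node has one highlighted child node, that is, one child node showing a fault symptom, and the fault of that child may have been caused by a fault of the sensor node, the sensor node can still be considered possibly faulty even though it has not reported a fault. Its fault probability, however, is lower than that of the reporting child node.
Step 610: Highlight the sensor node and define its fault probability as f1=x+1.
If the sensor node has at least two highlighted child nodes, the faults of those child nodes are very likely caused by a fault of the sensor node. The sensor node may therefore also be faulty, and its fault probability is higher than that of the child nodes.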
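The flows of Figures 4 to 6 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used by the BMC: the `Node` class, the function names `evaluate` and `trace`, the `visited` set, and the small example tree (R, A, B, S1 to S3, D1 to D3) are all hypothetical, and the reference value x is simply passed down unchanged when the flow recurses into child nodes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass(eq=False)
class Node:
    ident: str
    has_sensor: bool
    reporting: bool = False
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list, repr=False)

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


def evaluate(node: Node, x: int, f1: Dict[str, int], visited: Set[str]) -> None:
    """Downward check of one node (Figure 5 if sensorless, Figure 6 if it has a sensor)."""
    if node.ident in visited:          # already checked: do not traverse again
        return
    visited.add(node.ident)
    if node.has_sensor and node.reporting:
        f1[node.ident] = x             # steps 506 / 606
        return
    for child in node.children:        # steps 503 / 603, serial traversal
        evaluate(child, x, f1, visited)
    k = sum(1 for c in node.children if c.ident in f1)  # highlighted children
    if node.has_sensor:
        if k >= 2:
            f1[node.ident] = x + 1     # step 610
        elif k == 1:
            f1[node.ident] = x - 1     # step 609
    else:
        if k >= 2:
            f1[node.ident] = x + 1     # step 510
        elif k == 1 or not node.children:
            f1[node.ident] = x         # step 509 (also reached from step 502)


def trace(target: Node, x: int = 10) -> Dict[str, int]:
    """Upward trace from a reporting node (Figure 4); returns f1 per highlighted node."""
    f1: Dict[str, int] = {target.ident: x}
    visited: Set[str] = {target.ident}
    while target.parent is not None:
        parent = target.parent
        x = f1[target.ident]           # the target's f1 is the new reference value
        if parent.has_sensor and parent.reporting:
            f1[parent.ident] = x + 1   # step 406
            visited.add(parent.ident)
        else:
            evaluate(parent, x, f1, visited)
        if parent.ident not in f1:     # parent not highlighted: stop tracing
            break
        target = parent
    return f1


# Hypothetical tree: R (sensor) -> A, B (sensorless) -> S1..S3 (sensorless) -> D1..D3 (sensor)
root = Node("R", True)
a, b = root.add(Node("A", False)), root.add(Node("B", False))
s1, s2, s3 = a.add(Node("S1", False)), a.add(Node("S2", False)), b.add(Node("S3", False))
d1 = s1.add(Node("D1", True))
d2 = s2.add(Node("D2", True, reporting=True))
d3 = s3.add(Node("D3", True))
result = trace(d2, x=10)
```

With only D2 reporting, the sketch highlights D2, S2, and A with f1=10 and the non-reporting sensor node R with f1=9; B, S1, S3, D1, and D3 stay unhighlighted.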
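It is worth noting that the methods shown in Figures 4 to 6 may be executed repeatedly. For example, the BMC may detect that several nodes report faults, in which case each reporting node can be taken in turn as the target node and checked through the above flow. As another example, the flow may identify other highlighted nodes, and the BMC can take a highlighted node as a new target node and continue to trace the possibly faulty nodes related to it. However, because the connections between nodes are complex, if a node has already been checked and has a fault probability value when the flow is repeated, it does not need to be checked again. Specific examples are given below in connection with the embodiments.
In this way, starting from the first node that reports a fault, the BMC can detect one or more second nodes that may be faulty among the nodes associated with the first node. For ease of description, the first node and the one or more second nodes are all referred to as faulty nodes below.
Step 204: The BMC outputs the fault detection result.
In one implementation, the BMC can output the faulty nodes whose fault probability exceeds a preset threshold; that is, the fault detection result includes the faulty nodes whose fault probability exceeds the preset threshold, where the preset threshold is ≥0. Further, the BMC can sort the faulty nodes contained in the fault detection result and output the sorted faulty nodes. The sorted faulty nodes are referred to below as the fault sequence. The fault sequence can indicate a repair order to the user. Because the faults of some nodes may be caused by other nodes, repairing one node may clear the faults of other faulty nodes, so following the sequence improves repair efficiency.
For example, the BMC can sort the faulty nodes by their ranking variables to obtain the fault sequence. The ranking variables include but are not limited to the fault probability f1, a ranking variable f2 that indicates repair difficulty, and a ranking variable f3 that indicates the failure rate. The fault probability f1 of a faulty node is determined in step 203. The ranking variable f2 indicates how difficult it is to repair or replace the node; its value for each node can be preset and, in practice, can be determined from factors such as the installation position, size, working environment, cost, and repair price of the hardware corresponding to the node. The ranking variable f3 is the failure rate of the node, which may be an inherent failure rate obtained through experimental testing or a rate collected during operation. In the former case f3 is a preset value; in the latter case f3 may change over time. It should be noted that the ranking variables listed above are only examples and the embodiments of this application are not limited to them; any factor related to node faults or to repair difficulty can serve as a ranking variable.
As can be seen, each node can have one or more ranking variables, and the BMC can sort the faulty nodes according to the ranking variables of each faulty node. Several sorting methods are listed below.
Sorting method 1: sort by fault probability.
That is, the faulty nodes are sorted in descending order of their fault probability f1 to obtain the fault sequence.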
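Sorting method 2: weighted sorting.
For example, each ranking variable is assigned a preset weight, and the BMC can determine the composite fault value of each faulty node using Formula 1:
$y = f_1 w_1 + f_2 w_2 + \dots + f_i w_i$    (Formula 1)
where y is the composite fault value, $f_i$ is the i-th ranking variable, $w_i$ is the weight of ranking variable $f_i$, and i is a positive integer.
It should be understood that each faulty node can have one or more ranking variables, and the sets of ranking variables of different faulty nodes may be identical, different, or partially overlapping. For example, the ranking variables of node mn include f1 and f2, and those of node m include f1 and f3. This is not limited in the embodiments of this application.
The BMC uses Formula 1 to calculate the composite fault value of each faulty node and sorts the faulty nodes in descending order of composite fault value to obtain the fault sequence.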
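As an illustration of Formula 1, the following Python sketch computes the composite fault value and sorts by it. The weights, node names, and variable values are hypothetical and chosen only for the example; a ranking variable that a node does not have simply contributes nothing to its sum, which is one possible reading of the text rather than something it prescribes.

```python
import math
from typing import Dict, List

# Hypothetical preset weights w1, w2, w3 for f1, f2, f3.
WEIGHTS: Dict[str, float] = {"f1": 0.6, "f2": 0.3, "f3": 0.1}


def composite_value(variables: Dict[str, float],
                    weights: Dict[str, float] = WEIGHTS) -> float:
    """y = f1*w1 + f2*w2 + ... over the ranking variables the node actually has."""
    return sum(weights[name] * value for name, value in variables.items())


def rank_by_composite(nodes: Dict[str, Dict[str, float]]) -> List[str]:
    """Return node names in descending order of composite fault value."""
    return sorted(nodes, key=lambda name: composite_value(nodes[name]), reverse=True)


# Hypothetical faulty nodes with different sets of ranking variables.
example_nodes = {
    "a": {"f1": 11, "f2": 0.2},
    "b": {"f1": 10, "f3": 0.5},
    "c": {"f1": 10, "f2": 0.9},
}
ranking = rank_by_composite(example_nodes)
```

Because f1 takes much larger values than f2 and f3 in this example, the weights effectively let f1 dominate; in practice the variables would probably need to be on comparable scales for the weights to be meaningful.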
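Sorting method 3: priority sorting.
For example, each ranking variable is assigned a preset priority, or the variables are given a priority order. Going from high to low priority, the BMC first sorts the faulty nodes by the value of the highest-priority ranking variable. If several nodes have equal values, they are then sorted by the value of the next ranking variable in priority order, and so on. For example, if the priority order of f1, f2, and f3 is f1>f2>f3, the BMC first sorts the faulty nodes by their f1 values; if several nodes have the same f1 value, those nodes are then sorted by their f2 values, and so on until all nodes are sorted.
Assume, for example:
The ranking variables of node m include: f1 (value 0.8), f2 (value 0.6), f3 (value 0.2);
The ranking variables of node mn include: f1 (value 0.2), f2 (value 0.4);
The ranking variables of node mn0 include: f1 (value 0.6), f3 (value 0.1);
The ranking variables of node mn01 include: f1 (value 0.2), f2 (value 0.9).
The BMC first sorts by the value of f1, which has the highest priority, and determines m>mn0. Because mn and mn01 have the same f1 value, they are then sorted by the value of f2, which has the next priority, and the resulting fault sequence is m>mn0>mn01>mn. It should be noted that these values are only examples and do not imply that such values are logically possible.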
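The priority sort in this example can be reproduced with a tuple sort key in Python. This is a sketch only: the names `PRIORITY`, `nodes`, and `sort_key` are illustrative, and treating a missing ranking variable as 0.0 is an assumption that the text does not state.

```python
from typing import Dict, Tuple

PRIORITY = ("f1", "f2", "f3")  # priority order f1 > f2 > f3

# Values taken from the example above; variables a node lacks are omitted.
nodes: Dict[str, Dict[str, float]] = {
    "m": {"f1": 0.8, "f2": 0.6, "f3": 0.2},
    "mn": {"f1": 0.2, "f2": 0.4},
    "mn0": {"f1": 0.6, "f3": 0.1},
    "mn01": {"f1": 0.2, "f2": 0.9},
}


def sort_key(name: str) -> Tuple[float, ...]:
    """Compare by f1 first, then f2, then f3; a missing variable counts as 0.0 (assumption)."""
    return tuple(nodes[name].get(var, 0.0) for var in PRIORITY)


fault_sequence = sorted(nodes, key=sort_key, reverse=True)
```

Python compares tuples element by element, so ties on f1 fall through to f2 and then f3, which matches the "next priority" rule described above.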
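Sorting method 4: sorting based on a neural network model.
The BMC can use a neural network model to assist the decision on the fault sequence and to correct it through training. That is, the model can be used to determine the fault sequence, and it can also be used to correct a fault sequence determined as described above.
I. For ease of understanding, training correction is introduced first.
Continuing the example from sorting method 3, the fault sequence determined by the BMC is m>mn0>mn01>mn. This fault sequence can be output to the user as the fault detection result to guide the repair order. For example, the user first repairs node m according to the fault sequence. If the faults of the other nodes are cleared after node m is repaired, node m is confirmed as the cause of the fault. If they are not cleared, the user goes on to repair the next node in the sequence, mn0. Likewise, if the fault is cleared after node mn0 is repaired, node mn0 is confirmed as the cause; if not, the user continues with the next node, and so on.
The user can input the repair result into the neural network model, and the neural network model can correct the sorting algorithm through training based on that repair result. Figure 7 shows a schematic flowchart of this training correction, which includes:
Step 701: Select the first node in the fault sequence.
Step 702: Based on the repair result, determine whether the faults of the other nodes were cleared after this node was repaired. If they were, perform step 703; otherwise, perform step 704.
Step 703: Increase the confidence of this node by 1.
Step 704: Decrease the confidence of this node by 1.
Step 705: Determine whether this node is the last node in the fault sequence. If it is, end the flow; otherwise, perform step 706.
Step 706: Select the next node in the fault sequence in order, and return to step 702.
For example, for the fault sequence m>mn0>mn01>mn, the first node is node m. If the repair result indicates that node m is the cause of the fault, the confidence of node m in this scene is increased by 1 and the confidence of the other nodes is unchanged. If the repair result indicates that node mn0 is the cause, that is, the order of the fault sequence was wrong, the confidence of node m in this scene is decreased by 1 and the confidence of node mn0 is increased by 1. If the repair result indicates that node mn01 is the cause, the confidence of node m is decreased by 1, the confidence of node mn0 is decreased by 1, the confidence of node mn01 is increased by 1, and the confidence of the remaining nodes is unchanged, and so on.
A scene here consists of two conditions: 1) the node first detected as reporting a fault, that is, the reporting node detected in step 401; and 2) the fault sequence obtained from the fault detection triggered by that reporting node. For example, node mn01 fails and the fault sequence is m>mn0>mn01>mn; this is one complete scene. As another example, node mn0 fails and the fault sequence is also m>mn0>mn01>mn. Different reporting nodes may yield the same fault sequence, yet the sequences after training correction may differ, so the scene must include the faulty node that triggered the fault detection.
For any scene, the neural network model can determine the confidence of each node's position in the fault sequence. If, after training correction, the confidence of a node exceeds a first preset value, the node is moved forward; if it falls below a second preset value, the node is moved backward. For example, if repeated training correction shows that the confidence of node m is below the second preset value, node m is moved behind node mn0, giving the corrected fault sequence mn0>m>mn01>mn. It can be understood that the confidence value can represent the position of the node: the higher the confidence, the earlier the node appears in the fault sequence.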
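The confidence bookkeeping of Figure 7 and the position correction described above can be sketched as follows. The text attributes this to a neural network model without specifying it, so the sketch swaps in plain per-scene counters instead. The function names, the threshold values `low` and `high`, and the rule of moving a node by exactly one position are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Scene = Tuple[str, Tuple[str, ...]]  # (triggering node, fault sequence)

# confidence[scene][node] -> integer confidence, 0 by default
confidence: Dict[Scene, Dict[str, int]] = defaultdict(lambda: defaultdict(int))


def record_repair(trigger: str, sequence: Sequence[str], cause: str) -> Dict[str, int]:
    """Walk the sequence in repair order: -1 for each node that was repaired
    without clearing the fault, +1 for the node confirmed as the cause."""
    scene = (trigger, tuple(sequence))
    for node in sequence:
        if node == cause:
            confidence[scene][node] += 1
            break
        confidence[scene][node] -= 1
    return confidence[scene]


def corrected(sequence: Sequence[str], scores: Dict[str, int],
              low: int = -2, high: int = 2) -> List[str]:
    """Move a node one place back if its confidence is below `low`,
    one place forward if it is above `high` (hypothetical thresholds)."""
    seq = list(sequence)
    for node in sequence:
        i = seq.index(node)
        score = scores.get(node, 0)
        if low > score and len(seq) > i + 1:
            seq[i], seq[i + 1] = seq[i + 1], seq[i]
        elif score > high and i > 0:
            seq[i], seq[i - 1] = seq[i - 1], seq[i]
    return seq


sequence = ["m", "mn0", "mn01", "mn"]
scores = record_repair("mn01", sequence, cause="mn01")
new_sequence = corrected(sequence, {"m": -3})
```

Keying the counters by the whole scene keeps the statistics for "mn01 reported" separate from those for "mn0 reported", even when both scenes produce the same initial fault sequence.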
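II. The following describes how the neural network model assists the decision on the fault sequence.
The BMC can combine the fault sequence generated as described above with the neural network model to determine the fault sequence for the scene, that is, the fault sequence that is finally output to the user. For ease of description, the fault sequence generated by sorting methods 1 to 3 is called the first fault sequence, the fault sequence determined by the neural network model is called the second fault sequence, and the fault sequence finally output to the user is called the target fault sequence. If the first fault sequence and the second fault sequence are the same, the target fault sequence is either of them. If they differ, the target fault sequence is the second fault sequence.
In an optional implementation, the BMC can also use the neural network model alone to determine the target fault sequence. It should be noted that the neural network model can be deployed in the BMC or in another processor, for example an FPGA, in which case the BMC can communicate with that processor to obtain the fault sequence determined by the neural network model.
Step 204: The BMC outputs the fault detection result.
In one implementation, the BMC can present the fault detection result to the user through a graphical interface, to show the user which components need repair and in what order.
The following describes how the BMC generates the graphical interface.
In one implementation, the BMC can generate, from the component topology graph, an image that includes the component topology graph. This means that the image contains a control for each node in the component topology graph, with a one-to-one correspondence between controls and node identifiers. Using the node identifier of a faulty node in the fault detection result and this correspondence, the control of the faulty node can be located in the image and highlighted. It should be understood that highlighting is only one example; faulty and non-faulty nodes can also be distinguished in other ways, such as by text, different colors, or blinking, which is not limited in the embodiments of this application. If the BMC has already generated this image during the flows of Figures 4 to 6 and has highlighted the faulty nodes in it, the BMC can use that image directly.
The BMC can also post-process the image based on the fault sequence, for example by connecting the highlighted nodes into a fault path and by numbering the faulty nodes according to the fault sequence, where the number indicates the position of the node in the fault sequence. For example, in the fault sequence m>mn0>mn01>mn, node m is numbered 1, node mn0 is numbered 2, node mn01 is numbered 3, and so on.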
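The numbering step can be illustrated with a short Python sketch. The function name `number_nodes` is hypothetical, and how the numbers are then drawn next to the controls in the image is outside the scope of the sketch.

```python
from typing import Dict, Sequence


def number_nodes(fault_sequence: Sequence[str]) -> Dict[str, int]:
    """Map each faulty node identifier to its 1-based position in the fault sequence."""
    return {node: position for position, node in enumerate(fault_sequence, start=1)}


labels = number_nodes(["m", "mn0", "mn01", "mn"])
```

Because controls and node identifiers correspond one to one, such a mapping is enough to look up the number to display for each highlighted control.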
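Refer to Figure 8, which is a schematic diagram of a graphical interface provided by an embodiment of this application. As shown in Figure 8, the interface displays the component topology graph shown in Figure 3, together with the fault path based on that graph and the number of each faulty node.
In another implementation, the component topology graph in the interface can be replaced with a picture of the physical hardware corresponding to the component topology graph. Refer to Figure 9, which is a schematic diagram of another graphical interface provided by an embodiment of this application. As shown in Figure 9, the interface displays the fault detection result on a picture of the physical hardware of the computer device 10. As with the component topology graph, each control in the hardware picture represents the physical hardware of one component and is bound to the node identifier of that component, with a one-to-one correspondence between controls and node identifiers. Using the node identifier of a faulty node and this correspondence, the control in the hardware picture can be located. Figure 9 differs from Figure 8 in that the controls representing node identifiers are replaced with controls representing the physical hardware corresponding to those node identifiers.
It should be noted that: (1) Figures 8 and 9 are only examples. The graphical interface in the embodiments of this application may contain more or less information than Figure 8 or Figure 9; for example, it may also display the name, IP address, and fault time of a faulty node, which is not limited in the embodiments of this application. (2) Presenting the fault detection result through a graphical interface is only an example. The result can also be presented in other ways: Figure 10 shows the result presented as text, and video, animation, or voice can also be used. This is not limited in the embodiments of this application, and any way of presenting the fault detection result is applicable.
The BMC can also send the fault detection result to another device or component, such as the processor 110 or a display device with computing capability, which then generates the image representing the fault detection result in the same way as described for the BMC. This lowers the computing power required of the BMC. It should be understood that if the image is generated by a device other than the BMC, that device must have the same component topology graph as the BMC; the graph may be sent to the device by the BMC or obtained in another way, for example imported by the user, which is not limited here.
In this way, after detecting that the first node reports a fault, the BMC can use the component topology graph to detect second nodes that may be faulty among the nodes associated with the first node, and output the fault detection result to guide the user's repair work. Because a second node may or may not have a fault sensor, the technical solution of this application can improve repair efficiency without adding hardware cost, and it applies to a wider range of scenarios.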
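Next, the fault detection method provided by the embodiments of this application is illustrated using the computer device 10 shown in Figure 1A as an example.
First, the hardware connections between the components in the computer device 10 are introduced.
As described above, the processor 110, the memory 120, the external storage 130, and the fault detection apparatus 140 in the computer device 10 are connected through the bus 150. For the other components, see the descriptions above; only the bus 150 is introduced below:
The bus 150 includes but is not limited to a double data rate (DDR) bus, a peripheral component interconnect express (PCIe) bus, a serial attached SCSI (SAS) bus, and a serial advanced technology attachment (SATA) bus.
In terms of data transfer speed, the DDR bus is faster than the PCIe bus, and the PCIe bus is faster than the SAS and SATA buses. Typically, the processor 110 and the memory 120 are connected through a DDR bus. The processor 110 and the fault detection apparatus 140 can be connected through a PCIe bus. The processor 110 and the external storage 130 can be connected through a SATA bus or a SAS bus. In practice, the physical connections inside the computer device 10 may be more complex, as described in detail below.
Those skilled in the art know that a computer device integrates its components around the motherboard. Refer to Figure 11, which shows the physical connections of the computer device 10 shown in Figure 1A.
The motherboard, also called the mainboard, is the core of the computer hardware system, and the components in the computer device 10 are connected through it. In hardware terms, the motherboard is a printed circuit board (PCB) that carries a CPU socket, memory slots, and other slots (such as a graphics card slot). The processor 110 can be installed in the CPU socket and the memory 120 in a memory slot. Inside the motherboard, the slots are connected by buses (for example a DDR bus or a PCIe bus). For example, the CPU socket and the DDR slot can be connected through a DDR bus, which connects the processor 110 to the memory 120.
The motherboard can also have various communication interfaces, such as universal serial bus (USB) interfaces and PCIe interfaces. A USB interface can be used to attach devices that have a USB interface. A PCIe interface can be used to attach components that have a PCIe interface, such as a network card with a PCIe interface or a PCIe riser card (PCIe riser). A PCIe riser is an adapter for a PCIe interface on the motherboard. In hardware terms, a PCIe riser has two interfaces, both of which are PCIe interfaces: the front-end interface connects to the PCIe interface on the motherboard, and the back-end interface can connect to other components that have a PCIe interface, which provides the adapter function. Although both ends are PCIe interfaces, the back-end interface can be adapted to components with different interface forms or mounting methods, and there can be several back-end interfaces so that several PCIe components can be attached. Functionally, a PCIe riser only transfers data and has no data processing function, similar to a communication cable.
As mentioned above in the description of hard disks, hard disks read and write more slowly than memory. Besides the storage medium itself, another reason is that memory is attached directly to the processor 110, whereas a hard disk is usually attached to the processor 110 indirectly through a SAS bus or a SATA bus. A hard disk that has a non-volatile memory express (NVMe) interface can also be attached directly to the processor 110 through a PCIe bus, which improves its read/write speed, although its performance is still lower than that of memory.
For example, with indirect attachment, a hard disk usually relies on components such as a RAID (redundant arrays of independent disks) and a PCIe riser to reach the processor 110. The RAID has a protocol conversion function: for example, it has a SAS interface and a PCIe interface, receives SAS messages through the SAS interface and PCIe messages through the PCIe interface, and can convert between SAS messages and PCIe messages so that the devices on both sides can communicate. The following takes the SAS bus and HDDs as an example to describe how a hard disk is connected to the processor.
As shown in Figure 11, in real products, to make it easy to add or remove capacity, an HDD is usually inserted into a slot of an HDD backplane. Each slot of the HDD backplane (also called an interface connector (connector, CNN)) accepts one HDD, and the number of slots determines how many hard disks the external storage can hold. One end of a CNN connects to the HDD, and the other end connects to the SAS interface of the RAID through a SAS cable (SAS CABLE); that is, the CNN and the RAID are connected by the SAS CABLE. The PCIe interface of the RAID connects to the back-end interface of the PCIe riser, and the front-end interface of the PCIe riser connects to a PCIe interface built into the motherboard. This completes the connection between the HDD and the processor.
Those skilled in the art know that a SAS CABLE is usually of the 1*4 type, that is, one SAS CABLE can attach four HDDs to the RAID in parallel. Note that these four HDDs are independent of each other. Unlike the soldered traces on the motherboard, a SAS CABLE is a separate, replaceable cable, and damage to it can cause HDD faults. For a 1*4 SAS CABLE, whichever HDD's SAS connection line is damaged, the whole SAS CABLE must be replaced; therefore a SAS CABLE is one component, not four.
Because devices are becoming more and more complex, several communication protocols can coexist in one device, and components that use different communication protocols usually do not interfere with each other. Therefore, in this application, if a component topology graph is used to represent the physical connections between components, the components that make up the graph can be those connected by the same bus protocol. In other words, one component topology graph cannot contain two or more different kinds of bus. For example, A, B, C, and D are interconnected through a SAS bus, and E, F, and G are interconnected through a PCIe bus. Then A, B, C, and D belong to one component topology graph but not to the same graph as E, F, and G, which can form another component topology graph.
For the computer device 10 shown in Figure 11, the replaceable components that use the same bus protocol (such as the SAS bus) can be placed in one component topology graph, for example the HDDs, the HDD backplane, the SAS CABLEs, and the RAID. Refer to Figure 12, which is a component topology graph of the computer device 10 that describes the connections between the HDDs, the HDD backplane, the SAS CABLEs, and the RAID in the computer device 10. In Figure 12, each SAS CABLE is assumed to be of the 1*4 type and each RAID to have 8 SAS channels, so each RAID can connect at least two SAS CABLEs. Note that, for ease of understanding, the component topology graph in Figure 12 shows component names; in practice these can be node identifiers.
The fault detection method provided by the embodiments of this application is described below using the component topology graph architecture shown in Figure 12. Before the method is described, note that in Figure 12 the RAID and the HDDs are sensor nodes, and the remaining components are sensorless nodes.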
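For orientation, the part of the Figure 12 topology that the text relies on can be written down as a small Python data structure. Figure 12 itself is not reproduced here, so this sketch covers only what the surrounding text states (RAID1 with two 1*4 SAS CABLEs, CNN1 to CNN8, HDD1 to HDD8); the dictionary layout and the helper `has_sensor` are illustrative assumptions.

```python
from typing import Dict, List

# Parent -> children, following the description: RAID1 -> SAS CABLE -> CNN -> HDD.
children: Dict[str, List[str]] = {"RAID1": ["SAS CABLE1", "SAS CABLE2"]}
for cable in (1, 2):
    # A 1*4 cable serves four backplane connectors: CNN1-4 or CNN5-8.
    slots = [f"CNN{i}" for i in range(4 * cable - 3, 4 * cable + 1)]
    children[f"SAS CABLE{cable}"] = slots
    for slot in slots:
        children[slot] = [slot.replace("CNN", "HDD")]  # one HDD per connector

# Child -> parent lookup, used when tracing upward from a reporting node.
parent: Dict[str, str] = {c: p for p, cs in children.items() for c in cs}


def has_sensor(name: str) -> bool:
    """RAID and HDD nodes are sensor nodes; cables and connectors are sensorless."""
    return name.startswith(("RAID", "HDD"))
```

For example, `parent["HDD2"]` is `"CNN2"` and `children["SAS CABLE1"]` lists CNN1 to CNN4, which are the lookups used in step 403 and step 501 of the scenarios that follow.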
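Scenario 1: Assume HDD2 reports a fault.
In scenario 1, assume that the BMC has imported the component topology graph shown in Figure 12; every component topology graph mentioned below refers to that graph. The detection flow for scenario 1 is described below with reference to the methods shown in Figures 4 to 6:
1) Step 401: HDD2 is detected reporting a fault and is taken as the target node.
Step 402: Highlight HDD2 and record f1=x for HDD2; for example, assume x=10.
Step 403: Based on the component topology graph, determine the parent node of the target node (HDD2), which is CNN2 of the HDD backplane (referred to as CNN2 below). Every CNN mentioned below refers to a CNN on this HDD backplane.
Step 404: Determine whether the parent node CNN2 is a sensor node. CNN2 is a sensorless node, so perform step 501 (the sensorless-node detection flow).
Step 501: Based on the component topology graph, find the child node of CNN2, which is HDD2.
Step 502: Determine whether the number of child nodes of CNN2 is ≥1. CNN2 has one child node (HDD2), so perform step 503.
Step 503: Traverse HDD2, the child node of CNN2.
Step 504: The child node HDD2 is a sensor node; perform step 505.
Step 505: The child node HDD2 reports a fault; perform step 506.
Step 506: Highlight HDD2 and record f1=10 for HDD2.
It should be understood that, because nodes have many associations, a node may already have been highlighted during upward or downward tracing. A node that is already highlighted does not need to be highlighted again; that is, when nodes are traversed, a highlighted node does not need to be traversed again, and steps 504 to 506 above can be skipped. Specifically, in one possible implementation, the BMC traverses only nodes that are not highlighted. In another possible implementation, the BMC records the nodes that have been traversed (including nodes that were traversed but not highlighted) and does not traverse them again.
Step 507: After the child nodes of CNN2 have been traversed, determine the number of highlighted child nodes of CNN2, which is 1 (HDD2); perform step 509.
Step 509: Highlight CNN2 and define f1=10 for CNN2.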
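2) Next, the BMC takes CNN2 as the new target node and repeats the above flow, tracing upward to check whether the parent node of CNN2 is faulty, as follows:
Step 401: The BMC takes CNN2 as the target node.
It is worth noting that when CNN2 is the new target node and other possibly faulty nodes are being determined, the f1 value of CNN2 is used as the reference value. For example, if f1=10 for CNN2, then x=10, and an associated node of CNN2 with f1=x+1 has f1=11. Likewise, if f1=11 for CNN2, then x=11, and an associated node of CNN2 with f1=x+1 has f1=12.
Step 402: Highlight CNN2 and record its f1; the previous round of detection gave f1=10 for CNN2.
Step 403: Based on the component topology graph, find the parent node of CNN2, which is SAS CABLE1.
Step 404: Determine whether the parent node SAS CABLE1 is a sensor node. SAS CABLE1 is a sensorless node, so continue with step 501.
Step 501: Based on the component topology graph, find the child nodes of SAS CABLE1, which are CNN1, CNN2, CNN3, and CNN4.
Step 502: Determine whether the number of child nodes of SAS CABLE1 is ≥1. SAS CABLE1 has 4 child nodes, so perform step 503.
Step 503: Traverse CNN1, CNN2, CNN3, and CNN4 one by one.
Step 504: Select CNN1 first and determine whether it is a sensor node. CNN1 is a sensorless node, so return to step 501.
Step 501: Based on the component topology graph, find the child node of CNN1, which is HDD1.
Step 502: Determine whether the number of child nodes of CNN1 is ≥1. The number is 1, so perform step 503.
Step 503: Traverse HDD1.
Step 504: Determine whether HDD1 is a sensor node. HDD1 is a sensor node, so perform step 505.
Step 505: Check whether HDD1 reports a fault. According to the assumptions of scenario 1, HDD1 does not report a fault, so perform step 601 (the flow for a sensor node that does not report a fault).
Step 601: Based on the component topology graph, find the child nodes of HDD1.
Step 602: Determine whether the number of child nodes of HDD1 is ≥1. HDD1 has no child nodes, that is, the number is 0, so HDD1 is not highlighted.
After all child nodes of CNN1 have been traversed, jump to step 507 of the CNN1 iteration.
Step 507: Determine whether the number of highlighted child nodes of CNN1 is ≥1. CNN1 has no highlighted child node, so the number is 0 and CNN1 is not highlighted.
After CNN1 has been traversed, CNN2, CNN3, and CNN4 are traversed in order. CNN1, CNN3, and CNN4 are not highlighted, and flow 1) above shows that CNN2 is highlighted.
After all child nodes of SAS CABLE1 have been traversed, perform step 507 of the SAS CABLE1 iteration.
Step 507: Determine the number of highlighted child nodes of SAS CABLE1. Only CNN2 is highlighted, so the number is 1; after the check in step 508, perform step 509.
Step 509: Highlight SAS CABLE1 and define its fault probability as f1=10.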
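3) Next, the BMC takes SAS CABLE1 as the new target node and repeats the above flow, tracing upward to check whether the parent node of SAS CABLE1 is faulty, as follows:
Step 401: The BMC takes SAS CABLE1 as the target node.
Step 402: Highlight SAS CABLE1 and record its f1; the previous round of detection gave f1=10 for SAS CABLE1.
Step 403: Based on the component topology graph, find the parent node of SAS CABLE1, which is RAID1.
Step 404: Determine whether the parent node RAID1 is a sensor node. RAID1 is a sensor node, so continue with step 405.
Step 405: Determine whether RAID1 reports a fault. According to the assumptions of scenario 1, RAID1 does not report a fault, so perform step 601 (the flow for a sensor node that does not report a fault).
Step 601: Based on the component topology graph, find the child nodes of RAID1, which are SAS CABLE1 and SAS CABLE2.
Step 602: RAID1 has 2 child nodes; perform step 603.
Step 603: Traverse the child nodes of RAID1, SAS CABLE1 and SAS CABLE2, one by one.
SAS CABLE1 has already been highlighted, that is, traversed (see steps 501 to 509 above), so only SAS CABLE2 needs to be traversed here.
Step 604: Determine whether SAS CABLE2 is a sensor node. SAS CABLE2 is a sensorless node, so perform step 501 (the sensorless-node detection flow).
Step 501: Based on the component topology graph, find the child nodes of SAS CABLE2, which are CNN5 to CNN8.
Step 502: SAS CABLE2 has 4 child nodes; perform step 503.
Step 503: Traverse the child nodes of SAS CABLE2, CNN5 to CNN8, one by one.
Step 504: CNN5 comes first. CNN5 is a sensorless node, so return to step 501.
Step 501: Based on the component topology graph, find the child node of CNN5, which is HDD5.
Step 502: CNN5 has 1 child node; perform step 503.
Step 503: Traverse HDD5.
Step 504: HDD5 is a sensor node; perform step 505.
Step 505: According to the assumptions of scenario 1, HDD5 does not report a fault, so perform step 601 (the flow for a sensor node that does not report a fault).
Step 601: Based on the component topology graph, find the child nodes of HDD5.
Step 602: HDD5 has no child nodes, so HDD5 is not highlighted.
After the child nodes of CNN5 (that is, HDD5) have been traversed, return to step 507 of the CNN5 iteration.
Step 507: Determine the number of highlighted child nodes of CNN5. CNN5 has only one child node, HDD5, which is not highlighted, so the number of highlighted child nodes is 0 and CNN5 is not highlighted either.
After CNN5 has been traversed, CNN6, CNN7, and CNN8 are traversed in order in the same way as CNN5, which is not repeated here. According to the assumptions of scenario 1, HDD6 to HDD8 do not report faults, so CNN6, CNN7, and CNN8 are not highlighted.
After all child nodes of SAS CABLE2 have been traversed, return to step 507 of the SAS CABLE2 iteration.
Step 507: Determine the number of highlighted child nodes of SAS CABLE2. None of CNN5 to CNN8 is highlighted, so the number is 0 and SAS CABLE2 is not highlighted either.
After all child nodes of RAID1 (SAS CABLE1 and SAS CABLE2) have been traversed, return to step 607.
Step 607: Determine the number of highlighted child nodes of RAID1. SAS CABLE1 is highlighted and SAS CABLE2 is not, so the number is 1; perform step 608.
Step 608: The number of highlighted child nodes is less than 2, so perform step 609: highlight RAID1 and define its fault probability as f1=10-1=9.
From the above flow, when HDD2 is detected reporting a fault, the highlighted nodes obtained from the component topology graph are HDD2, CNN2, SAS CABLE1, and RAID1, with f1=10 for HDD2, f1=10 for CNN2, f1=10 for SAS CABLE1, and f1=9 for RAID1.
Sorted by f1 in descending order, the faulty nodes are: HDD2=CNN2=SAS CABLE1>RAID1.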
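Scenario 2: Assume HDD1 and HDD2 report faults.
In scenario 2, again assume that the BMC has imported the component topology graph shown in Figure 12. The detection flow for scenario 2 is described below with reference to the methods shown in Figures 4 to 6; for brevity, the individual steps are not listed:
(1) The BMC detects that HDD2 reports a fault, takes HDD2 as the target node, and records f1=10 for HDD2. Based on the component topology graph, it determines the parent node of HDD2, CNN2, and checks whether CNN2 is a sensor node. CNN2 is a sensorless node, so the flow jumps to the sensorless-node detection flow: based on the component topology graph, the child node of CNN2 is HDD2, which is already highlighted and need not be traversed again. After the child nodes of CNN2 have been traversed, the BMC checks whether the number of highlighted child nodes of CNN2 is ≥1. The number is 1 (HDD2), so CNN2 is highlighted and f1=10 is recorded for CNN2.
(2) The BMC takes CNN2 as the new target node and, based on the component topology graph, determines its parent node SAS CABLE1. SAS CABLE1 is a sensorless node, so the sensorless-node detection flow is performed: the child nodes of SAS CABLE1, CNN1 to CNN4, are found and traversed one by one. CNN1 comes first. Because CNN1 is a sensorless node, the sensorless-node detection flow is performed for it: its child node HDD1 is found, HDD1 is a sensor node, and the BMC checks whether HDD1 reports a fault. According to the assumptions of this scenario, HDD1 reports a fault, so HDD1 is highlighted. After the child nodes of CNN1 have been traversed, the BMC checks whether the number of highlighted child nodes of CNN1 is ≥1. The number is 1, so CNN1 is highlighted and f1=10 is recorded for CNN1.
After CNN1 has been traversed, and because CNN2 has already been traversed, the BMC continues with CNN3. CNN3 is a sensorless node, and its child node is HDD3. HDD3 is a sensor node and, according to the assumptions of scenario 2, does not report a fault, so HDD3 is not highlighted. The traversal of CNN3's child nodes is then complete, the number of highlighted child nodes of CNN3 is 0, and CNN3 is not highlighted either. The BMC continues with CNN4, which is likewise not highlighted.
After the BMC has traversed all child nodes of SAS CABLE1, CNN1 to CNN4, it determines the number of highlighted child nodes of SAS CABLE1. CNN1 and CNN2 are highlighted, so the number is 2; SAS CABLE1 is highlighted and f1=10+1=11 is recorded for SAS CABLE1.
(3) The BMC takes SAS CABLE1 as the target node, with f1=11. Based on the component topology graph, the parent node of SAS CABLE1 is RAID1. RAID1 is a sensor node, so the BMC checks whether RAID1 reports a fault. According to the assumptions of this scenario, RAID1 does not report a fault, so the flow for a sensor node that does not report a fault is performed: the child nodes of RAID1, SAS CABLE1 and SAS CABLE2, are found and traversed. SAS CABLE1 comes first but has already been traversed, so the BMC continues with SAS CABLE2. Scenario 2 also assumes that HDD5 to HDD8 do not report faults, so the traversal of SAS CABLE2 is the same as described for scenario 1 and is not repeated here. As that flow shows, SAS CABLE2 is not highlighted.
After the BMC has traversed SAS CABLE1 and SAS CABLE2, it determines the number of highlighted child nodes of RAID1. The number is 1 (SAS CABLE1), so RAID1 is highlighted and its fault probability is recorded as f1=11-1=10.
It should be noted that, because HDD1 and HDD2 belong to the same upstream branch and HDD1 was already checked while tracing upward from HDD2, the upward trace does not need to be repeated with HDD1 as the target node. If HDD1 and HDD2 were not associated, the BMC could take HDD1 as a new target node and trace upward to check whether the nodes related to HDD1 may be faulty, in the same way as for HDD2; see the detection flow above with HDD2 as the target node, which is not repeated here.
From the above flow, when HDD1 and HDD2 are detected reporting faults, the highlighted nodes obtained from the component topology graph shown in Figure 12 are HDD1, HDD2, CNN1, CNN2, SAS CABLE1, and RAID1, with f1=10 for HDD1, f1=10 for HDD2, f1=10 for CNN1, f1=10 for CNN2, f1=11 for SAS CABLE1, and f1=10 for RAID1. Sorted by f1 in descending order, the faulty nodes are: SAS CABLE1>HDD1=HDD2=CNN1=CNN2=RAID1.
Scenario 3: Assume HDD1, HDD2, and RAID1 report faults.
In scenario 3, the flow that checks CNN1, CNN2, and SAS CABLE1 starting from HDD1 and HDD2 is the same as described for scenario 2 and is not repeated here. The flow that takes SAS CABLE1 as the target node and checks RAID1 is as follows:
The BMC takes SAS CABLE1 as the target node; from the above flow, f1=11 for SAS CABLE1. Based on the component topology graph, the parent node of SAS CABLE1 is RAID1. RAID1 is a sensor node, so the BMC checks whether RAID1 reports a fault. According to the assumptions of scenario 3, RAID1 reports a fault, so RAID1 is highlighted and its f1 is recorded as f1=f1(SAS CABLE1)+1=11+1=12.
From the above flow, when HDD1, HDD2, and RAID1 are detected reporting faults, the highlighted nodes obtained from the component topology graph shown in Figure 12 are HDD1, HDD2, CNN1, CNN2, SAS CABLE1, and RAID1, with f1=10 for HDD1, f1=10 for HDD2, f1=10 for CNN1, f1=10 for CNN2, f1=11 for SAS CABLE1, and f1=12 for RAID1. Sorted by f1 in descending order, the faulty nodes are: RAID1>SAS CABLE1>HDD1=HDD2=CNN1=CNN2.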
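From the above flows, taking any one of the nodes associated with the reporting node as an example, if that node has at least one highlighted child node, the node is highlighted. The fault probability of each highlighted node is determined as follows:
1. If the number of highlighted child nodes of an upstream sensorless node is greater than 1, the sensorless node is more likely to be faulty than the target node.
2. If the number of highlighted child nodes of an upstream sensorless node equals 1, the sensorless node is as likely to be faulty as the target node.
3. If the number of highlighted child nodes of an upstream sensorless node is 0, the sensorless node is not highlighted.
4. If an upstream sensor node reports a fault, or does not report a fault but has more than 1 highlighted child node, the sensor node is more likely to be faulty than the target node.
5. If an upstream sensor node does not report a fault and has exactly 1 highlighted child node, the sensor node is less likely to be faulty than the target node.
6. If an upstream sensor node does not report a fault and has 0 highlighted child nodes, the sensor node is not highlighted.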
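The six rules can be condensed into one small Python function. This is a sketch: the function name `upstream_f1` and its parameters are hypothetical, and "higher" and "lower" are expressed as +1 and -1, which the description states is only one possible choice.

```python
from typing import Optional


def upstream_f1(has_sensor: bool, reports_fault: bool,
                highlighted_children: int, x: int) -> Optional[int]:
    """Return f1 for an upstream node given the target node's f1 (x),
    or None if the node is not highlighted."""
    if has_sensor and reports_fault:
        return x + 1                       # rule 4, first case
    if highlighted_children > 1:
        return x + 1                       # rule 1 and rule 4, second case
    if highlighted_children == 1:
        return x - 1 if has_sensor else x  # rule 5 / rule 2
    return None                            # rules 3 and 6
```

Applied to the scenarios above: SAS CABLE1 with one highlighted child and x=10 gives 10 (scenario 1), with two highlighted children gives 11 (scenario 2); RAID1 without a report and with x=10 gives 9 (scenario 1), and RAID1 reporting a fault with x=11 gives 12 (scenario 3).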
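Based on the same concept as the method embodiments, an embodiment of this application further provides a fault detection apparatus, which is configured to perform the method performed by the BMC in the method embodiments above. As shown in Figure 13, the fault detection apparatus 1300 includes an obtaining module 1301, a determining module 1302, and an output module 1303. Specifically, in the fault detection apparatus 1300, the modules are connected to each other through communication paths.
The obtaining module 1301 is configured to obtain a component topology graph, where the component topology graph describes the components in a computer device and the connections between the components. For the specific implementation, see the description of step 201 in Figure 2, which is not repeated here.
The determining module 1302 is configured to determine whether other components in the component topology graph that are connected to a first component reporting an error may be faulty. For the specific implementation, see the description of steps 202 and 203 in Figure 2, which is not repeated here.
The output module 1303 is configured to output a second component that may be faulty, where the second component is a subset of the other components and the first component. For the specific implementation, see the description of step 204 in Figure 2, which is not repeated here.
In a possible design, the component topology graph describes the hardware connections between components that use the same communication protocol.
In a possible design, the output module 1303 is specifically configured to output the second component through a graphical interface, where the graphical interface displays the component topology graph, the component topology graph includes multiple node identifiers that correspond one to one to the components in the computer device, and the node identifier corresponding to the second component is highlighted in the component topology graph; or the graphical interface displays a picture of the physical hardware of the components of the computer device, the hardware picture includes multiple controls that correspond one to one to the components in the computer device, each control displays the hardware of one component, and the control corresponding to the second component is highlighted in the hardware picture.
In a possible design, the second component is determined by a neural network model, where the neural network model is used to determine, according to the component reporting an error, whether other components connected to that component may be faulty, and the order of the components that may be faulty.
In a possible design, the other components include, in the component topology graph, the upstream components of the first component and the downstream components of the first component.
In a possible design, the determining module 1302 is specifically configured to: for any one of the other components, if the component has at least one next-level component that may be faulty, determine that the component may be faulty.
In a possible design, the number of second components is greater than 1;
the determining module 1302 is further configured to sort the multiple second components by their probability of being faulty; see the descriptions of Figures 4 to 7, which are not repeated here.
The output module 1303 is further configured to output the sorted second components.
In a possible design, for any component set among the multiple second components, the component set includes one parent component and one or more child components of the parent component;
the determining module 1302 is specifically configured to: if the parent component has no sensor and the number of the one or more child components is greater than 1, determine that the parent component is more likely to be faulty than the child components.
In a possible design, for any component set among the multiple second components, the component set includes one parent component and one or more child components of the parent component;
the determining module 1302 is specifically configured to: if the parent component has no sensor and the number of child components equals 1, determine that the parent component is as likely to be faulty as the child component.
In a possible design, for any component set among the multiple second components, the component set includes one parent component and one or more child components of the parent component;
the determining module 1302 is specifically configured to: if the parent component has a sensor and the sensor of the parent component reports an error, determine that the parent component is more likely to be faulty than the child components.
In a possible design, for any component set among the multiple second components, the component set includes one parent component and one or more child components of the parent component;
the determining module 1302 is specifically configured to: if the parent component has a sensor, the sensor of the parent component does not report an error, and the number of child components is greater than 1, determine that the parent component is more likely to be faulty than the child components.
In a possible design, for any component set among the multiple second components, the component set includes one parent component and one or more child components of the parent component;
the determining module 1302 is specifically configured to: if the parent component has a sensor, the sensor of the parent component does not report an error, and the number of child components equals 1, determine that the parent component is less likely to be faulty than the child component.
In a possible design, the output module 1303 is specifically configured to output the second component through a graphical interface, where the graphical interface further includes a number that indicates the order of the second component, and the number is located in a preset area.
In a possible design, the first component has a sensor, and the determining module 1302 is further configured to determine, according to the sensor, that the first component has failed.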
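The above embodiments can be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they can be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the flows or functions described in the embodiments of this application are produced in whole or in part. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions can be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions can be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (for example coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (for example infrared, radio, or microwave). The computer-readable storage medium can be any usable medium that a computer can access, or a data storage device such as a server or data center that integrates one or more usable media. The usable medium can be a magnetic medium (for example a floppy disk, hard disk, or magnetic tape), an optical medium (for example a DVD), or a semiconductor medium (for example a solid state disk (SSD)).
The various illustrative logic units and circuits described in the embodiments of this application can implement or operate the described functions through a general-purpose processor, a digital signal processor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination of these. The general-purpose processor can be a microprocessor; optionally, it can also be any conventional processor, controller, microcontroller, or state machine. A processor can also be implemented as a combination of computing devices, such as a digital signal processor and a microprocessor, multiple microprocessors, one or more microprocessors together with a digital signal processor core, or any other similar configuration.
The steps of the methods or algorithms described in the embodiments of this application can be embedded directly in hardware, in a software unit executed by a processor, or in a combination of the two. The software unit can be stored in RAM, flash memory, ROM, EPROM, EEPROM, a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium in the art. For example, the storage medium can be connected to the processor so that the processor can read information from and write information to the storage medium. Optionally, the storage medium can also be integrated into the processor. The processor and the storage medium can be arranged in an ASIC.
These computer program instructions can also be loaded onto a computer or other programmable data processing device, so that a series of operation steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
Although this application has been described with reference to specific features and embodiments, it is evident that various modifications and combinations can be made without departing from the spirit and scope of this application. Accordingly, this specification and the drawings are merely illustrative of this application as defined by the appended claims, and are deemed to cover any and all modifications, variations, combinations, or equivalents within the scope of this application. Obviously, those skilled in the art can make various changes and variations to this application without departing from its scope. If such modifications and variations fall within the scope of the claims of this application and their technical equivalents, this application is intended to include them as well.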

Claims (30)

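  1. A fault detection method, wherein the method is applied to a computer device and comprises:
    obtaining a component topology graph, wherein the component topology graph describes the components in the computer device and the connections between the components;
    determining whether other components in the component topology graph that are connected to a first component reporting an error may be faulty; and
    outputting a second component that may be faulty, wherein the second component is a subset of the other components and the first component.
  2. The method according to claim 1, wherein the component topology graph describes the hardware connections between components that use the same communication protocol.
  3. The method according to claim 1 or 2, wherein outputting the second component comprises: outputting the second component through a graphical interface;
    wherein the graphical interface displays the component topology graph, the component topology graph comprises multiple node identifiers, and the multiple node identifiers correspond one to one to the components in the computer device; the node identifier corresponding to the second component is highlighted in the component topology graph; or
    the graphical interface displays a picture of the physical hardware of the components of the computer device, the hardware picture comprises multiple controls, the multiple controls correspond one to one to the components in the computer device, and each control displays the hardware of one component; the control corresponding to the second component is highlighted in the hardware picture.
  4. The method according to any one of claims 1 to 3, wherein the second component is determined by a neural network model, and the neural network model is used to determine, according to a component reporting an error, whether other components connected to the component reporting an error may be faulty, and the order of the components that may be faulty.
  5. The method according to any one of claims 1 to 4, wherein the other components comprise, in the component topology graph, upstream components of the first component and downstream components of the first component.
  6. The method according to any one of claims 1 to 5, wherein determining whether other components in the component topology graph that are connected to the first component reporting an error may be faulty comprises:
    for any one of the other components, if the component has at least one next-level component that may be faulty, determining that the component may be faulty.
  7. The method according to any one of claims 1 to 6, wherein the number of second components is greater than 1, and outputting the second component specifically comprises:
    sorting the multiple second components by their probability of being faulty; and
    outputting the sorted second components.
  8. The method according to claim 7, wherein, for any component set among the multiple second components, the component set comprises one parent component and one or more child components of the parent component;
    sorting the multiple second components by their probability of being faulty comprises:
    if the parent component has no sensor and the number of the one or more child components is greater than 1, determining that the probability that the parent component is faulty is greater than the probability that the child components are faulty.
  9. The method according to claim 7, wherein, for any component set among the multiple second components, the component set comprises one parent component and one or more child components of the parent component;
    sorting the multiple second components by their probability of being faulty comprises:
    if the parent component has no sensor and the number of child components equals 1, determining that the probability that the parent component is faulty is the same as the probability that the child component is faulty.
  10. The method according to claim 7, wherein, for any component set among the multiple second components, the component set comprises one parent component and one or more child components of the parent component;
    sorting the multiple second components by their probability of being faulty comprises:
    if the parent component has a sensor and the sensor of the parent component reports an error, determining that the probability that the parent component is faulty is greater than the probability that the child components are faulty.
  11. The method according to claim 7, wherein, for any component set among the multiple second components, the component set comprises one parent component and one or more child components of the parent component;
    sorting the multiple second components by their probability of being faulty comprises:
    if the parent component has a sensor, the sensor of the parent component does not report an error, and the number of child components is greater than 1, determining that the probability that the parent component is faulty is greater than the probability that the child components are faulty.
  12. The method according to claim 7, wherein, for any component set among the multiple second components, the component set comprises one parent component and one or more child components of the parent component;
    sorting the multiple second components by their probability of being faulty comprises:
    if the parent component has a sensor, the sensor of the parent component does not report an error, and the number of child components equals 1, determining that the probability that the parent component is faulty is less than the probability that the child component is faulty.
  13. The method according to any one of claims 7 to 12, wherein outputting the second component comprises: outputting the second component through a graphical interface;
    wherein the graphical interface further comprises a number that indicates the order of the second component, and the number is located in a preset area.
  14. The method according to any one of claims 1 to 13, wherein the first component has a sensor, and the method further comprises:
    determining, according to the sensor, that the first component has failed.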
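  15. A fault detection apparatus, wherein the apparatus is applied to a computer device and comprises:
    an obtaining module, configured to obtain a component topology graph, wherein the component topology graph describes the components in the computer device and the connections between the components;
    a determining module, configured to determine whether other components in the component topology graph that are connected to a first component reporting an error may be faulty; and
    an output module, configured to output a second component that may be faulty, wherein the second component is a subset of the other components and the first component.
  16. The apparatus according to claim 15, wherein the component topology graph describes the hardware connections between components that use the same communication protocol.
  17. The apparatus according to claim 15 or 16, wherein the output module is specifically configured to output the second component through a graphical interface;
    wherein the graphical interface displays the component topology graph, the component topology graph comprises multiple node identifiers, and the multiple node identifiers correspond one to one to the components in the computer device; the node identifier corresponding to the second component is highlighted in the component topology graph; or
    the graphical interface displays a picture of the physical hardware of the components of the computer device, the hardware picture comprises multiple controls, the multiple controls correspond one to one to the components in the computer device, and each control displays the hardware of one component; the control corresponding to the second component is highlighted in the hardware picture.
  18. The apparatus according to any one of claims 15 to 17, wherein the second component is determined by a neural network model, and the neural network model is used to determine, according to a component reporting an error, whether other components connected to the component reporting an error may be faulty, and the order of the components that may be faulty.
  19. The apparatus according to any one of claims 15 to 18, wherein the other components comprise, in the component topology graph, upstream components of the first component and downstream components of the first component.
  20. The apparatus according to any one of claims 15 to 19, wherein the determining module is specifically configured to: for any one of the other components, if the component has at least one next-level component that may be faulty, determine that the component may be faulty.
  21. The apparatus according to any one of claims 15 to 20, wherein the number of second components is greater than 1;
    the determining module is further configured to sort the multiple second components by their probability of being faulty; and
    the output module is further configured to output the sorted second components.
  22. The apparatus according to claim 21, wherein, for any component set among the multiple second components, the component set comprises one parent component and one or more child components of the parent component;
    the determining module is specifically configured to: if the parent component has no sensor and the number of the one or more child components is greater than 1, determine that the probability that the parent component is faulty is greater than the probability that the child components are faulty.
  23. The apparatus according to claim 21, wherein, for any component set among the multiple second components, the component set comprises one parent component and one or more child components of the parent component;
    the determining module is specifically configured to: if the parent component has no sensor and the number of child components equals 1, determine that the probability that the parent component is faulty is the same as the probability that the child component is faulty.
  24. The apparatus according to claim 21, wherein, for any component set among the multiple second components, the component set comprises one parent component and one or more child components of the parent component;
    the determining module is specifically configured to: if the parent component has a sensor and the sensor of the parent component reports an error, determine that the probability that the parent component is faulty is greater than the probability that the child components are faulty.
  25. The apparatus according to claim 21, wherein, for any component set among the multiple second components, the component set comprises one parent component and one or more child components of the parent component;
    the determining module is specifically configured to: if the parent component has a sensor, the sensor of the parent component does not report an error, and the number of child components is greater than 1, determine that the probability that the parent component is faulty is greater than the probability that the child components are faulty.
  26. The apparatus according to claim 21, wherein, for any component set among the multiple second components, the component set comprises one parent component and one or more child components of the parent component;
    the determining module is specifically configured to: if the parent component has a sensor, the sensor of the parent component does not report an error, and the number of child components equals 1, determine that the probability that the parent component is faulty is less than the probability that the child component is faulty.
  27. The apparatus according to any one of claims 21 to 26, wherein the output module is specifically configured to output the second component through a graphical interface, the graphical interface further comprises a number that indicates the order of the second component, and the number is located in a preset area.
  28. The apparatus according to any one of claims 15 to 27, wherein the first component has a sensor, and the determining module is further configured to determine, according to the sensor, that the first component has failed.
  29. A computing device, wherein the computing device comprises a processor and a memory;
    the memory is configured to store computer program instructions; and
    the processor invokes the computer program instructions in the memory to perform the method according to any one of claims 1 to 14.
  30. A computer-readable storage medium, wherein, when the computer-readable storage medium is executed by a computing device, the computing device performs the method according to any one of claims 1 to 14.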
PCT/CN2022/092738 2021-06-30 2022-05-13 Fault detection method and apparatus (一种故障检测方法及装置) WO2023273637A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110732299.X 2021-06-30
CN202110732299.XA CN115542067A (zh) 2021-06-30 2021-06-30 一种故障检测方法及装置

Publications (1)

Publication Number Publication Date
WO2023273637A1 true WO2023273637A1 (zh) 2023-01-05

Family

ID=84692487

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/092738 WO2023273637A1 (zh) 2021-06-30 2022-05-13 一种故障检测方法及装置

Country Status (2)

Country Link
CN (1) CN115542067A (zh)
WO (1) WO2023273637A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116009480B (zh) * 2023-03-24 2023-06-09 中科航迈数控软件(深圳)有限公司 一种数控机床的故障监测方法、装置、设备及存储介质

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5771274A (en) * 1996-06-21 1998-06-23 Mci Communications Corporation Topology-based fault analysis in telecommunications networks
US20080059839A1 (en) * 2003-10-31 2008-03-06 Imclone Systems Incorporation Intelligent Integrated Diagnostics
US20110145647A1 (en) * 2008-08-04 2011-06-16 Youichi Hidaka Trouble analysis apparatus
CN104796273A (zh) * 2014-01-20 2015-07-22 中国移动通信集团山西有限公司 一种网络故障根源诊断的方法和装置
CN107633307A (zh) * 2017-09-08 2018-01-26 国家计算机网络与信息安全管理中心 供配电系统根源告警检测方法、装置、终端及计算机存储介质
CN108494591A (zh) * 2018-03-16 2018-09-04 北京京东金融科技控股有限公司 系统告警处理方法与装置
CN110493042A (zh) * 2019-08-16 2019-11-22 中国联合网络通信集团有限公司 故障诊断方法、装置及服务器
CN110716842A (zh) * 2019-10-09 2020-01-21 北京小米移动软件有限公司 集群故障检测方法和装置
CN111490897A (zh) * 2020-02-27 2020-08-04 华中科技大学 一种针对复杂网络的网络故障分析方法和系统

Also Published As

Publication number Publication date
CN115542067A (zh) 2022-12-30

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22831464

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE