CN118051366A - Fault processing method and computing device - Google Patents

Fault processing method and computing device

Info

Publication number
CN118051366A
Authority
CN
China
Prior art keywords
pcie
target
equipment
devices
target pcie
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410140337.6A
Other languages
Chinese (zh)
Inventor
邓奇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
XFusion Digital Technologies Co Ltd
Original Assignee
XFusion Digital Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by XFusion Digital Technologies Co Ltd filed Critical XFusion Digital Technologies Co Ltd
Priority to CN202410140337.6A priority Critical patent/CN118051366A/en
Publication of CN118051366A publication Critical patent/CN118051366A/en
Pending legal-status Critical Current

Landscapes

  • Debugging And Monitoring (AREA)

Abstract

The embodiments of the application disclose a fault processing method and a computing device, and belong to the technical field of computers. The method can reduce the risk that a fault of a PCIE device affects the normal operation of the operating system. The method comprises: sending a heartbeat detection request to a target PCIE device, where the heartbeat detection request is used to request the target PCIE device to read specified information of the target PCIE device, and the specified information is one or more items of four-tuple information; receiving a request response returned by the target PCIE device, where the request response includes the result of the target PCIE device reading the specified information; and, if the result of the target PCIE device reading the specified information indicates an abnormality, determining that the target PCIE device has failed and isolating the target PCIE device.

Description

Fault processing method and computing device
Technical Field
The embodiments of the application relate to the technical field of computers, and in particular to a fault processing method and a computing device.
Background
With the widespread use of Peripheral Component Interconnect Express (PCIe) devices in computing devices, a failure of one PCIe link may cause other PCIe devices to fail in addition to the failed PCIe device itself.
At present, after a PCIE device fails, it reports a downstream port containment (DPC) interrupt, and the computing device can handle the PCIE device that sent the interrupt only after receiving it. As a result, the faulty PCIE device may already have affected system operation before it is handled, which can in turn cause the system to fail.
Disclosure of Invention
The embodiments of the application provide a fault processing method and a computing device that allow a faulty PCIE device to be identified more promptly, thereby helping to ensure normal operation of the system.
In a first aspect, the present application provides a fault processing method, the method comprising: sending a heartbeat detection request to a target PCIE device, where the heartbeat detection request is used to request the target PCIE device to read specified information of the target PCIE device, and the specified information is one or more items of four-tuple information; receiving a request response returned by the target PCIE device, where the request response includes the result of the target PCIE device reading the specified information; and, if the result of the target PCIE device reading the specified information indicates an abnormality, determining that the target PCIE device has failed and isolating the target PCIE device.
It can be understood that the heartbeat detection request is sent to the target PCIE device actively. The target PCIE device reads the specified information in its four-tuple information according to the request and reports the result of the read. If the result indicates an abnormality, the target PCIE device is promptly determined to have failed and can be isolated, which avoids a system crash caused by the failure of the target PCIE device and improves system operation.
In one possible implementation, before sending the heartbeat detection request to the target PCIE device, the method further includes: acquiring the class identifier of each PCIE device, where PCIE devices of different types correspond to different class identifiers, and the PCIE devices of different types include PCIE Switch devices and PCIE endpoint (EP) devices; and determining the target PCIE device from the PCIE devices according to the class identifiers.
It can be understood that querying the class identifier of each PCIE device reveals the type of each PCIE device, that is, whether it is a PCIE Switch device or a PCIE EP device. The target PCIE device can then be selected from the PCIE devices according to the type of device that currently requires fault detection, so that the subsequent heartbeat detection can be performed on it.
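The selection step above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: the `DeviceKind` enum, the `PcieDevice` record, and the `select_targets` helper are hypothetical names, and the class identifier is modelled as a plain enum value rather than a value read from hardware.

```python
from dataclasses import dataclass
from enum import Enum


class DeviceKind(Enum):
    """Hypothetical class identifiers; real values would come from the platform."""
    SWITCH = "switch"
    ENDPOINT = "endpoint"


@dataclass(frozen=True)
class PcieDevice:
    name: str          # illustrative label, e.g. "switch1"
    kind: DeviceKind   # class identifier obtained for this device


def select_targets(devices, wanted_kind):
    """Return the devices whose class identifier matches the kind to be checked."""
    return [dev for dev in devices if dev.kind is wanted_kind]


# Made-up inventory used only for illustration.
devices = [
    PcieDevice("switch1", DeviceKind.SWITCH),
    PcieDevice("ep1", DeviceKind.ENDPOINT),
    PcieDevice("switch2", DeviceKind.SWITCH),
]
switch_targets = select_targets(devices, DeviceKind.SWITCH)
```

Here `switch_targets` holds only the two switch records, which would then receive heartbeat detection requests.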
In one possible implementation, if the target PCIE device includes a plurality of PCIE Switch devices, sending the heartbeat detection request to the target PCIE device includes: sending heartbeat detection requests to the plurality of PCIE Switch devices in parallel using multiple threads.
It can be appreciated that if the target PCIE device includes a plurality of PCIE Switch devices, heartbeat detection can be performed on all of them at the same time, so that the faulty PCIE Switch devices among them are identified in parallel, which improves the efficiency of locating faults in the PCIE links.
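As a rough sketch of the parallel probing described above, the following uses a thread pool from the Python standard library. The `send_heartbeat` and `probe_switches_in_parallel` helpers, the `read_specified_info` callable, and the simulated device data are all assumptions made for illustration; the application does not specify how the threads are organised.

```python
from concurrent.futures import ThreadPoolExecutor


def send_heartbeat(device_name, read_specified_info):
    """Send one heartbeat and return (device_name, healthy).

    read_specified_info is a hypothetical callable standing in for the device
    reading its own specified information; it returns None on failure.
    """
    try:
        result = read_specified_info(device_name)
    except OSError:
        result = None
    return device_name, result is not None


def probe_switches_in_parallel(switch_names, read_specified_info, max_workers=4):
    """Probe every switch on a worker thread and collect the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(send_heartbeat, name, read_specified_info)
            for name in switch_names
        ]
        return dict(future.result() for future in futures)


def fake_read(device_name):
    """Simulated read: 'switch2' is treated as failed for this example."""
    if device_name == "switch2":
        return None
    return {"device_id": 0x1234, "vendor_id": 0x5678}  # made-up values


status = probe_switches_in_parallel(["switch1", "switch2"], fake_read)
```

The returned mapping marks each switch as healthy or not, so the caller can decide which switches to isolate.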
In one possible implementation, sending a heartbeat detection request to the target PCIE device includes: sending the heartbeat detection request to the target PCIE device at a specified period. Determining that the target PCIE device has failed if the result of the target PCIE device reading the specified information indicates an abnormality includes: determining that the target PCIE device has failed if the number of consecutive times that the result of the same target PCIE device reading the specified information indicates an abnormality is greater than a specified threshold.
It can be understood that, because heartbeat detection requests are sent to the target PCIE device periodically, a request response is received for each request. The target PCIE device is determined to have failed only when the number of consecutive abnormal read results exceeds the specified threshold, which improves the accuracy of fault determination.
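The consecutive-abnormality rule can be illustrated with a small counter. This is only a sketch: the `ConsecutiveFailureTracker` class and the threshold value of 2 are hypothetical, chosen to show that a normal result resets the streak and that a device is declared failed only when the streak is strictly greater than the threshold.

```python
from collections import defaultdict


class ConsecutiveFailureTracker:
    """Counts consecutive abnormal heartbeat results per device."""

    def __init__(self, threshold):
        self.threshold = threshold
        self._streak = defaultdict(int)

    def record(self, device_name, abnormal):
        """Record one result; return True once the device is deemed failed."""
        if abnormal:
            self._streak[device_name] += 1
        else:
            self._streak[device_name] = 0  # a normal result resets the streak
        return self._streak[device_name] > self.threshold


tracker = ConsecutiveFailureTracker(threshold=2)  # illustrative threshold
results = [True, True, False, True, True, True]   # made-up heartbeat outcomes
verdicts = [tracker.record("switch1", abnormal) for abnormal in results]
```

With this input, only the final result pushes the streak to three, which exceeds the threshold of two.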
In one possible implementation, if the target PCIE device includes a PCIE Switch device, isolating the target PCIE device includes: isolating, as a batch, the target PCIE device and each PCIE device connected to a downstream port of the target PCIE device.
It can be understood that a PCIE Switch device has downstream ports to which other PCIE devices are connected. After a PCIE Switch device is determined to have failed, the PCIE Switch device and the PCIE devices connected to its downstream ports can therefore be isolated as a batch, so that the system can continue to operate normally.
In one possible implementation, isolating, as a batch, the target PCIE device and each PCIE device connected to a downstream port of the target PCIE device includes: determining, according to a correspondence, each PCIE device connected to a downstream port of the target PCIE device, where the correspondence indicates the PCIE devices connected to the downstream ports of each PCIE Switch device; and isolating, as a batch, the target PCIE device and each PCIE device connected to a downstream port of the target PCIE device.
It can be understood that the correspondence identifies the PCIE devices connected to the downstream ports of each PCIE Switch device. Once a faulty PCIE Switch device has been identified, the PCIE devices connected to its downstream ports can be determined quickly and accurately, which makes the isolation more effective.
In one possible implementation, before determining each PCIE device connected to a downstream port of the target PCIE device according to the correspondence, the method further includes: determining the downstream ports of each PCIE Switch device; determining the PCIE device connected to each downstream port; generating the correspondence according to the PCIE devices connected to the downstream ports of each PCIE Switch device; and writing the correspondence into the memory.
It can be understood that, before heartbeat detection is performed, the downstream ports of each PCIE Switch device can be obtained in advance and the PCIE devices connected to each downstream port can be determined. The correspondence between each PCIE Switch device and the PCIE devices connected to its downstream ports is then generated and written into the memory. When a PCIE Switch device is later determined to have failed, the other PCIE devices that need to be isolated along with it can be determined accurately and quickly from the correspondence.
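One way to picture the correspondence and the batch it yields is a dictionary built once and then walked when a switch fails. The sketch below is illustrative only: `build_correspondence`, `devices_to_isolate`, and the device names are hypothetical, and the link list mirrors the example tree described later for FIG. 2 (switch 1 feeding switch 2 and EP 3, switch 2 feeding EP 1 and EP 2).

```python
def build_correspondence(links):
    """Map each switch to the devices attached to its downstream ports.

    links is a hypothetical list of (switch_name, downstream_device) pairs.
    """
    table = {}
    for switch_name, child in links:
        table.setdefault(switch_name, []).append(child)
    return table


def devices_to_isolate(failed_switch, table):
    """Return the failed switch plus every device reachable below it."""
    batch, pending = [], [failed_switch]
    while pending:
        current = pending.pop(0)
        batch.append(current)
        pending.extend(table.get(current, []))  # nested switches are expanded too
    return batch


links = [
    ("switch1", "switch2"),
    ("switch1", "ep3"),
    ("switch2", "ep1"),
    ("switch2", "ep2"),
]
correspondence = build_correspondence(links)  # would be kept in memory
```

Because the table is prepared before heartbeat detection starts, looking up the batch for a failed switch needs no further bus enumeration.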
In one possible implementation, if the target PCIE device further includes a PCIE endpoint (EP) device, sending a heartbeat detection request to the target PCIE device includes: sending a heartbeat detection request to a first PCIE EP device, where the first PCIE EP device is a PCIE EP device other than the PCIE EP devices connected to the downstream ports of the PCIE Switch device.
It can be understood that, in addition to sending heartbeat detection requests to PCIE Switch devices to determine whether they have failed, heartbeat detection requests can also be sent to the PCIE EP devices other than those connected to the downstream ports of the PCIE Switch device. This makes fault detection of the PCIE links complete while avoiding repeated fault detection on the same PCIE device.
In one possible implementation, the four-tuple information includes a device identifier, a vendor identifier, a subsystem identifier, and a subsystem vendor identifier.
It can be understood that the four-tuple information is the set of identifiers each PCIE device uses to indicate its identity, and a PCIE device that is operating normally can read its own four-tuple information. Therefore, if the result of a PCIE device reading one or more items of its four-tuple information is abnormal, the PCIE device can be determined to have failed.
In a second aspect, an embodiment of the present application provides a fault handling apparatus, where the fault handling apparatus is configured to perform any one of the fault handling methods provided in the first aspect.
In a possible implementation, the functional modules of the fault handling apparatus may be divided according to the method provided in the first aspect. For example, one functional module may be defined for each function, or two or more functions may be integrated into one processing module. The fault handling apparatus may be divided by function into a sending module, an obtaining module, a processing module, and the like. For the technical solutions executed by each of these functional modules and their beneficial effects, refer to the technical solutions provided by the first aspect or its corresponding possible implementations, which are not described herein again.
In a third aspect, embodiments of the present application provide a computing device comprising a processor and a memory for storing computer program instructions that are loaded and executed by the processor to cause the computing device to implement a fault handling method as described in the above aspects.
In a fourth aspect, embodiments of the present application provide a computer readable storage medium having stored therein at least one computer program instruction that is loaded and executed by a processor to implement the fault handling method as described in the above aspects.
In a fifth aspect, embodiments of the present application provide a computer program product comprising computer instructions stored in a computer readable storage medium. The processor of the computing device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computing device to perform the fault handling methods provided in the various alternative implementations of the above aspects.
For a detailed description of the second to fifth aspects of the present application and their various implementations, refer to the detailed description of the first aspect and its various implementations. Likewise, for the advantages of the second to fifth aspects and their various implementations, refer to the analysis of the advantages of the first aspect and its various implementations, which is not repeated here.
These and other aspects of the application will be more readily apparent from the following description.
Drawings
FIG. 1 is a topology block diagram of a PCIE bus, according to an exemplary embodiment;
FIG. 2 is a schematic diagram of a PCIE tree structure involved in the embodiment shown in FIG. 1;
FIG. 3 is a schematic diagram of a fault handling system involved in the embodiment of FIG. 1;
FIG. 4 is a flow diagram illustrating a fault handling method according to an exemplary embodiment;
FIG. 5 is a schematic flow chart of PCIE link failure handling involved in the embodiment shown in FIG. 4;
FIG. 6 is a schematic diagram of a PCIe link failure scenario involved in the embodiment shown in FIG. 4;
FIG. 7 is a schematic structural diagram of a fault handling apparatus according to an exemplary embodiment of the present application;
FIG. 8 is a schematic diagram of a computing device provided in an exemplary embodiment of the application.
Detailed Description
In embodiments of the application, words such as "exemplary" or "such as" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "e.g." in an embodiment should not be taken as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete fashion.
In embodiments of the present application, the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the embodiments of the present application, unless otherwise indicated, the meaning of "a plurality" is two or more.
First, some concepts involved in the fault processing method and computing device provided by the embodiments of the application are explained:
PCIE device: a PCIE device is a hardware device connected through a slot on the PCIe bus. PCIE devices may include the following three types: root complex (RC) devices, PCIe switch (PCIE Switch) devices, and PCIe endpoint (PCIe EP) devices.
The RC device parses and generates PCIe packets: it receives an IO instruction from the CPU and generates the corresponding PCIE packet, or it receives a PCIE TLP packet from a PCIE device and passes the parsed data to the CPU or the memory.
A PCIE Switch device can be used to expand the PCIE bus. Because the PCIE bus uses point-to-point connections, each end of a PCIE link can be connected to only one device. If more PCIE devices need to be attached, a PCIE Switch device can be used; a PCIE Switch device may have one upstream port and a plurality of downstream ports.
PCIe EP devices may be leaf nodes of a PCIE tree structure. PCIe EP devices may include network cards, graphics processing units (GPUs), sound cards, graphics cards, smart network cards with independent operating systems, smart GPUs, and the like.
Then, an application scenario of the embodiment of the present application is exemplarily described.
FIG. 1 illustrates a topology diagram of a PCIE bus according to an exemplary embodiment of the present application. As shown in FIG. 1, the computing device 10 includes a CPU and PCIE devices of various types. The CPU may be connected to an RC device; the RC device may be connected directly to a PCIE EP device or to a PCIE Switch device; and a PCIE Switch device may be connected to a plurality of PCIE EP devices or to other PCIE Switch devices.
FIG. 2 is a schematic diagram of a PCIE tree structure according to an embodiment of the present application. As shown in FIG. 2, a PCIE Switch device may include an upstream port and downstream ports. For example, PCIE Switch device 1 is connected through its upstream port, over the upstream port link, to a downstream port of another PCIE device; PCIE Switch device 1 is connected to the upstream port of PCIE Switch device 2 through downstream port 1, and to the upstream port of PCIe EP device 3 through downstream port 2. The upstream port link of PCIE Switch device 2 is the same PCIe link as downstream port link 1 of PCIE Switch device 1. PCIE Switch device 2 may be connected to PCIe EP device 1 and PCIe EP device 2 through its downstream ports. A PCIE tree structure is thereby formed.
After a PCIE link in a PCIE tree structure fails, not only may the PCIE devices on the failed link be unable to work normally, but PCIE devices on other PCIE links may also fail, which may degrade service execution and may even cause the system to fail.
In some embodiments, when a PCIE device in the computing device fails, a DPC interrupt may be sent to the CPU. The DPC mechanism lets the operating system or the DPC driver handle the DPC interrupt, so that the fault generated by the PCIE device does not propagate and affect other PCIE devices.
However, for the CPU to identify a failed PCIE device in this way, both the PCIE device and the operating system must support DPC. On the one hand, because DPC is a mechanism for detecting and recovering from faults on the downstream port of a PCIE device, if the fault in the environment is an upstream port abnormality, the DPC mechanism alone cannot detect it and cannot isolate the faulty PCIE devices as a batch. On the other hand, by the time the PCIE device reports the DPC interrupt to the CPU, the PCIE device has already failed, possibly for some time; passively reporting the PCIE state in this way introduces a lag. Because both fault detection and isolation lag behind the fault, the operating system may encounter operation errors.
In view of this, the following embodiments of the present application provide a fault processing method in which the computing device actively sends a heartbeat detection request to a PCIE device through the CPU. The PCIE device reads one or more items of its four-tuple information according to the received heartbeat detection request and feeds the read result back to the CPU. If the result indicates that the PCIE device is faulty, the PCIE device can be isolated. Because the CPU actively sends the heartbeat detection request and the PCIE device reads its four-tuple information in response, an abnormally operating PCIE device is revealed by the result of that read. The computing device thus actively detects whether a PCIE device has failed instead of waiting for the PCIE device to fail and then generate and report an interrupt. This avoids the lag in fault determination, keeps fault handling efficient, allows the faulty PCIE device and the PCIE devices affected by the fault to be isolated in time, and reduces the risk that a PCIE device fault affects the normal operation of the operating system.
FIG. 3 is a schematic structural diagram of a fault handling system according to an embodiment of the present application. As shown in FIG. 3, the fault handling system may run on a computing device 10, which may include a CPU, memory, and different types of PCIE devices, including RC devices, PCIE Switch devices, and PCIe EP devices. Each type of PCIE device includes registers that store its PCIE configuration space.
The PCIE configuration space may be of two types, type0 and type1. The PCIE configuration space in the registers of a PCIe EP device may be of type0; the PCIE configuration space in the registers of an RC device or a PCIE Switch device may be of type1.
A type0 PCIE configuration space includes a device identifier (Device ID), a vendor identifier (Vendor ID), a subsystem vendor identifier (Subsystem Vendor ID), and a subsystem identifier (Subsystem ID); a type1 PCIE configuration space may include a device identifier (Device ID) and a vendor identifier (Vendor ID).
When the PCIE devices operate normally, each PCIE device can obtain from its own PCIE configuration space the four-tuple information stored there, which may include a Device ID, a Vendor ID, a Subsystem Vendor ID, and a Subsystem ID.
The four-tuple information of a PCIE device may be used to identify the PCIE device. The Device ID is a code that identifies the device itself and distinguishes different models of the same type of device. The Vendor ID is a code that identifies the technology vendor and is unique to that vendor. The Subsystem Vendor ID identifies the secondary manufacturer and is likewise unique to that manufacturer. The Subsystem ID is a secondary code for a device manufactured under authorization: it is the code assigned to the device by the secondary manufacturer that builds it, rather than the code assigned by the original manufacturer.
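To make the layout concrete, the sketch below extracts the identifiers from a raw configuration-space header. The byte offsets follow the standard PCI configuration header (Vendor ID at 0x00, Device ID at 0x02, header type at 0x0E, and, for type0 headers, Subsystem Vendor ID at 0x2C and Subsystem ID at 0x2E). The `parse_four_tuple` helper and the sample values are hypothetical, and how the raw bytes are obtained on a given platform is outside the scope of this illustration.

```python
import struct

VENDOR_ID_OFFSET = 0x00            # Device ID follows at 0x02
HEADER_TYPE_OFFSET = 0x0E          # low 7 bits select type0 or type1 layout
SUBSYSTEM_VENDOR_ID_OFFSET = 0x2C  # Subsystem ID follows at 0x2E (type0 only)


def parse_four_tuple(config):
    """Extract the identifiers present in a configuration-space header."""
    vendor_id, device_id = struct.unpack_from("<HH", config, VENDOR_ID_OFFSET)
    header_type = config[HEADER_TYPE_OFFSET] & 0x7F
    identity = {"vendor_id": vendor_id, "device_id": device_id}
    if header_type == 0:
        # Only type0 headers (endpoints) carry the subsystem fields here.
        sub_vendor, sub_id = struct.unpack_from(
            "<HH", config, SUBSYSTEM_VENDOR_ID_OFFSET
        )
        identity["subsystem_vendor_id"] = sub_vendor
        identity["subsystem_id"] = sub_id
    return identity


# Build a made-up 64-byte type0 header for illustration.
sample = bytearray(64)
struct.pack_into("<HH", sample, 0x00, 0x1234, 0xABCD)
struct.pack_into("<HH", sample, 0x2C, 0x1234, 0x0001)
four_tuple = parse_four_tuple(bytes(sample))
```

A type1 header passed to the same helper yields only the vendor and device identifiers, matching the distinction drawn above between the two configuration-space types.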
Because the four-tuple information indicates the identity of a PCIE device, the PCIE device can operate normally only after its four-tuple information has been correctly identified. That is, by identifying the four-tuple information, the CPU can guide the operating system to detect the state of the PCIE device, discover the PCIE device, install the appropriate version of its driver, and so on, which ensures normal and safe operation of the PCIE device in the system platform. Consequently, when a PCIE device has failed, the CPU cannot read normal four-tuple information from it.
Optionally, the CPU may send a heartbeat detection request to each PCIE device. After receiving the heartbeat detection request, the PCIE device obtains the specified information from its PCIE configuration space according to the request, where the specified information may be one or more of the Device ID, Vendor ID, Subsystem Vendor ID, and Subsystem ID, and returns the result to the CPU. If the result indicates an abnormality, the corresponding PCIE device can be determined to be faulty and can then be isolated.
Fig. 4 is a flow chart illustrating a fault handling method according to an exemplary embodiment of the present application. The fault handling method may be applied to a computing device, for example, the computing device 10 shown in fig. 1. The fault processing method comprises the following steps:
S101, sending a heartbeat detection request to the target PCIE device.
In the embodiment of the application, the computing device sends a heartbeat detection request to the target PCIE device through the CPU.
The heartbeat detection request may be used to request the target PCIE device to read the specified information of the target PCIE device, and the specified information may be one or more items of the four-tuple information.
Optionally, the four-tuple information may include a Device ID, a Vendor ID, a Subsystem Vendor ID, and a Subsystem ID.
The four-tuple information of a PCIE device is stored in the PCIE configuration space of that device, and the four-tuple information stored in the PCIE configuration spaces of different types of PCIE devices may differ. For example, the PCIE configuration spaces of RC devices and PCIE Switch devices include a Device ID and a Vendor ID; the PCIE configuration space of a PCIe EP device includes a Device ID, a Vendor ID, a Subsystem Vendor ID, and a Subsystem ID.
That is, if the target PCIE device includes an RC device or a PCIE Switch device, the specified information may include one or more of the Device ID and the Vendor ID; if the target PCIE device includes a PCIE EP device, the specified information may include one or more of the Device ID, Vendor ID, Subsystem Vendor ID, and Subsystem ID.
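The dependence of the specified information on the device type, and a possible abnormality check, might look like the sketch below. The names `FIELDS_BY_KIND`, `specified_fields`, and `read_is_abnormal` are hypothetical. The abnormality criterion is also an assumption: the text does not define what an abnormal result looks like, so this sketch treats a missing field or an all-ones value (0xFFFF, the value a configuration read commonly returns when no device responds) as abnormal.

```python
# Fields available per device type, following the description above.
FIELDS_BY_KIND = {
    "rc": ("device_id", "vendor_id"),
    "switch": ("device_id", "vendor_id"),
    "endpoint": ("device_id", "vendor_id", "subsystem_vendor_id", "subsystem_id"),
}

ALL_ONES = 0xFFFF  # assumed marker of a failed read, not taken from the text


def specified_fields(kind, requested=None):
    """Return the fields a heartbeat may ask this kind of device to read."""
    available = FIELDS_BY_KIND[kind]
    if requested is None:
        return available
    return tuple(field for field in requested if field in available)


def read_is_abnormal(kind, response, requested=None):
    """Flag a response in which any specified field is missing or all ones."""
    for field in specified_fields(kind, requested):
        value = response.get(field)
        if value is None or value == ALL_ONES:
            return True
    return False


healthy_switch = {"device_id": 0x8796, "vendor_id": 0x10B5}  # made-up values
dead_endpoint = {"device_id": 0xFFFF, "vendor_id": 0xFFFF}
```

A heartbeat may specify only a subset of the fields, in which case only that subset is checked.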
In one possible implementation, the computing device may send, by the CPU, a heartbeat detection request to the target PCIE device at a specified period.
A timer task may be set in the computing device so that the heartbeat detection request is sent to the target PCIE device at the specified period.
That is, the computing device may use the timer task to trigger the sending of heartbeat detection requests to the target PCIE device at regular intervals, so that fault detection is performed on the target PCIE device periodically.
For example, the timer task may instruct the CPU to send a heartbeat detection request to the target PCIE device every 0.5 s, so that every 0.5 s the target PCIE device reads the specified information in the four-tuple information stored in its PCIE configuration space.
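A timer task of this kind can be approximated with a simple loop that waits one period between requests. The sketch is deliberately minimal and makes several assumptions: `run_heartbeat_loop` is a hypothetical helper, the loop runs for a fixed number of rounds so that the example terminates, and the `sleep` function is injectable so the 0.5 s period from the example above can be exercised without actually waiting.

```python
import time


def run_heartbeat_loop(send_heartbeat, period_s=0.5, rounds=3, sleep=time.sleep):
    """Call send_heartbeat once per period and collect the results."""
    results = []
    for round_index in range(rounds):
        results.append(send_heartbeat())
        if round_index < rounds - 1:
            sleep(period_s)  # wait one period before the next request
    return results


# Record the requested delays instead of sleeping, for illustration only.
recorded_delays = []
outcomes = run_heartbeat_loop(
    lambda: "ok", period_s=0.5, rounds=3, sleep=recorded_delays.append
)
```

A real timer task would typically run until stopped rather than for a fixed number of rounds.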
In one possible scenario, the target PCIE device may include an RC device. Because every PCIE Switch device and PCIE EP device in the computing device is connected in the downlink of the RC device, the RC device may be checked first in order to save fault detection time and cost; that is, the computing device may first send a heartbeat detection request to the RC device through the CPU.
For example, as shown in fig. 4, if the computing device first sends a heartbeat detection request to the RC device through the CPU, it can determine whether the RC device has failed. If the RC device is determined to have failed, the PCIE devices connected in the downlink of the RC device can be isolated directly. Because the PCIE devices connected in the downlink of the RC device (PCIE Switch device 1, PCIE Switch device 2, PCIe EP device 1 and PCIe EP device 2 connected to the downstream ports of PCIE Switch device 2, PCIe EP device 3 connected to a downstream port of PCIE Switch device 1, and PCIe EP device 4 connected to a downstream port of the RC device) are all of the PCIE devices in the computing device, no further fault detection is needed. If the RC device is determined not to have failed, the computing device may continue by sending heartbeat detection requests to each PCIE Switch device to complete fault detection for each PCIE Switch device.
In one possible scenario, the target PCIE device may include a PCIE Switch device. Because a PCIE Switch device has one upstream port and a plurality of downstream ports, once a PCIE Switch device is detected to be faulty, it can be determined that both its upstream port link and its downstream port links are faulty.
If the target PCIE device includes PCIE Switch devices and the computing device contains a plurality of PCIE Switch devices, the computing device may send heartbeat detection requests to the plurality of PCIE Switch devices in parallel using multiple CPU threads, so that the PCIE Switch devices are checked for faults in parallel and fault detection time is saved.
For example, as shown in fig. 4, if the computing device determines that the RC device has not failed, it sends heartbeat detection requests to each PCIE Switch device (PCIE Switch device 1 and PCIE Switch device 2) and can determine whether each of them has failed. If PCIE Switch device 1 is determined to have failed, PCIE Switch device 1 and the PCIE devices connected in its downlinks can be isolated directly. The PCIE devices connected in the downlinks of PCIE Switch device 1 include PCIE Switch device 2 connected to a downstream port of PCIE Switch device 1, PCIe EP device 1 and PCIe EP device 2 connected to the downstream ports of PCIE Switch device 2, and PCIe EP device 3 connected to a downstream port of PCIE Switch device 1. After the computing device has isolated PCIE Switch device 1 and the PCIE devices connected in its downlinks as a batch, it may continue by sending a heartbeat detection request to PCIe EP device 4, which has been neither checked nor isolated, to complete fault detection for every PCIE device in the computing device.
In another possible case, the target PCIE device may further include a PCIE EP device, where the PCIE EP device may be a PCIE EP device directly connected to the RC device or a PCIE EP device connected to a downstream port of a PCIE Switch device.
Since PCIe EP devices have one upstream port and no downstream port, if the PCIe EP device fails, no downstream device is affected in PCIe link, and since processing is already performed on PCIe EP devices connected to downstream ports of the failed PCIE SWITCH device during the process of performing the failure heartbeat detection of PCIE SWITCH devices, after the failure detection of PCIE SWITCH devices, PCIe EP devices that are not isolated are determined, where PCIe EP devices that are not isolated may include PCIe EP devices connected to downstream ports of PCIE SWITCH devices that are not detected to be failed, and PCIe EP devices directly connected to RC devices, and heartbeat detection requests are sent to the PCIe EP devices.
That is, the computing device may first send heartbeat detection requests to the plurality of PCIE Switch devices in parallel through multithreading of the CPU and determine from the request responses whether each PCIE Switch device has failed. Because PCIE EP devices may be connected to the downstream ports of a PCIE Switch device, the computing device may perform batch isolation processing on each PCIE Switch device determined to have failed and on each PCIE EP device connected to the downstream ports of that PCIE Switch device. Then, the computing device may send heartbeat detection requests to the plurality of non-isolated PCIE EP devices through multithreading of the CPU, so as to determine whether any of the non-isolated PCIE EP devices has failed and to isolate the failed PCIE EP devices.
For example, as shown in fig. 4, the computing device may first send heartbeat detection requests to PCIE Switch device 1 and PCIE Switch device 2 in parallel and determine from the request responses whether PCIE Switch device 1 and PCIE Switch device 2 have failed. If it is determined that PCIE Switch device 1 has not failed and PCIE Switch device 2 has failed, the computing device may perform batch isolation on PCIE Switch device 2 and on PCIE EP device 1 and PCIE EP device 2 connected to the downstream ports of PCIE Switch device 2. Then, the computing device may send heartbeat detection requests to the non-isolated PCIE EP devices, namely PCIE EP device 3 and PCIE EP device 4, and determine whether PCIE EP device 3 and PCIE EP device 4 have failed. Similarly, if it is determined that PCIE EP device 3 has failed, PCIE EP device 3 is isolated; if it is determined that PCIE EP device 4 has failed, PCIE EP device 4 is isolated.
If the target PCIE device further includes a PCIE EP device, a heartbeat detection request is sent to the first PCIE EP device, where the first PCIE EP device may be any PCIE EP device other than those marked for isolation processing.
For example, as shown in fig. 4, if the computing device first sends a heartbeat detection request to each PCIE Switch device and the RC device and determines that PCIE Switch device 1 has failed, the computing device may perform isolation processing on PCIE Switch device 1 and each PCIE device connected to the downstream ports of PCIE Switch device 1. That is, the computing device may isolate PCIE Switch device 1, PCIE Switch device 2, PCIE EP device 1, PCIE EP device 2, and PCIE EP device 3. Because the computing device has not yet detected whether the remaining PCIE EP devices have failed, it may then send a heartbeat detection request to the first PCIE EP device (for example, PCIE EP device 4) that has not undergone isolation processing, so as to complete fault detection for all PCIE devices in the computing device. Similarly, if the target PCIE device includes PCIE EP devices and the computing device includes a plurality of PCIE EP devices that are not isolated, the computing device may send heartbeat detection requests to the plurality of PCIE EP devices in parallel through multithreading of the CPU. Fault detection of the PCIE EP devices is thus performed in parallel, which shortens the time required for fault detection.
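As an illustrative, non-limiting sketch, the two-stage order described above (Switch devices probed in parallel, batch isolation of failed Switch devices and their downstream devices, then parallel probing of the EP devices that remain) might be expressed in Python as follows. The topology table, the device names, and the `probe` callback are hypothetical stand-ins modelled on the description of fig. 4; a thread pool is used here only as one possible form of multithreading.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical topology modelled on the description of fig. 4:
# each Switch device maps to the devices attached to its downstream ports.
DOWNSTREAM = {
    "switch1": ["switch2", "ep3"],
    "switch2": ["ep1", "ep2"],
}
DIRECT_EPS = ["ep4"]  # EP devices attached directly to the RC device


def subtree(device):
    """Return the device plus everything reachable through its downstream ports."""
    found = [device]
    for child in DOWNSTREAM.get(device, []):
        found.extend(subtree(child))
    return found


def detect_and_isolate(probe):
    """Run both stages; probe(name) returns True when the heartbeat is abnormal."""
    isolated = set()
    switches = list(DOWNSTREAM)
    # Stage 1: probe every Switch device in parallel.
    with ThreadPoolExecutor() as pool:
        switch_failed = dict(zip(switches, pool.map(probe, switches)))
    for name, failed in switch_failed.items():
        if failed:
            isolated.update(subtree(name))  # batch isolation
    # Stage 2: probe only the EP devices that stage 1 did not isolate.
    all_eps = [d for kids in DOWNSTREAM.values() for d in kids
               if d not in DOWNSTREAM] + DIRECT_EPS
    remaining = [ep for ep in all_eps if ep not in isolated]
    with ThreadPoolExecutor() as pool:
        ep_failed = dict(zip(remaining, pool.map(probe, remaining)))
    isolated.update(ep for ep, failed in ep_failed.items() if failed)
    return isolated
```

In this sketch, a failure of `switch2` isolates `switch2`, `ep1`, and `ep2` without probing those two EP devices individually, which mirrors the batch isolation described above.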
S102, receiving a request response returned by the target PCIE device.
In the embodiment of the application, the computing device can receive the request response returned by each target PCIE device through the CPU.
The request response may include a result that the target PCIE device reads the specified information.
For example, if the target PCIE device includes a PCIE Switch device or an RC device, the specified information may include one or more of a Device ID and a Vendor ID, and the request response returned by the PCIE Switch device or the RC device may be a specific value of the specified information, or may be a null value or an abnormal value. If the target PCIE device includes a PCIE EP device, the specified information may include one or more of a Device ID, a Vendor ID, a Subsystem Vendor ID, and a Subsystem ID, and the request response returned by the PCIE EP device may be a specific value of the specified information, or may be a null value or an abnormal value.
When the request response returned by a PCIE Switch device or an RC device is a specific value of the specified information, this indicates that the PCIE Switch device or RC device read the specified information successfully, that is, the device has not failed. When the request response returned by a PCIE Switch device or an RC device is a null value or an abnormal value, this indicates that the device failed to read the specified information, that is, the heartbeat detection of the device indicates an abnormality and the device may have failed. Likewise, when the request response returned by a PCIE EP device is a specific value of the specified information, this indicates that the PCIE EP device read the specified information successfully, that is, the PCIE EP device has not failed; when the request response returned by a PCIE EP device is a null value or an abnormal value, this indicates that the PCIE EP device failed to read the specified information, that is, the heartbeat detection of the PCIE EP device indicates an abnormality and the PCIE EP device may have failed.
For example, when the request response that the computing device receives through the CPU from a PCIE Switch device is 0xffff, it may be determined that this heartbeat detection of the PCIE Switch device indicates an abnormality, which means that the PCIE Switch device may have failed.
In one possible implementation manner, the computing device sends, through the CPU, heartbeat detection requests to the target PCIE device according to a specified period, and if the number of consecutive times that the result of reading the specified information by the same target PCIE device indicates an abnormality is greater than a specified threshold, it is determined that the target PCIE device has failed.
That is, suppose the computing device sends a heartbeat detection request to a PCIE Switch device every 0.5 s through the CPU. If the result read by the PCIE Switch device in response to the first heartbeat detection request indicates an abnormality, the PCIE Switch device goes on to receive the second heartbeat detection request and reads the specified information again; if that result also indicates an abnormality, the PCIE Switch device receives the third heartbeat detection request and reads the specified information once more; if the result still indicates an abnormality, it may be determined that the PCIE Switch device has failed.
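As a minimal, non-authoritative sketch of the consecutive-abnormality rule, the following Python class counts abnormal heartbeat results in a row and reports a failure once the count is greater than a threshold. The class name and the default threshold of 2 (chosen here so that a third consecutive abnormal result triggers the decision) are assumptions made for illustration; 0xFFFF and a null response are the abnormal results named in the text.

```python
ABNORMAL_VALUE = 0xFFFF  # example abnormal value given in the description


class HeartbeatCounter:
    """Tracks consecutive abnormal heartbeat results for one device (hypothetical helper)."""

    def __init__(self, threshold=2):
        self.threshold = threshold  # assumed value; the text only says "specified threshold"
        self.consecutive = 0

    def record(self, response):
        """Feed one response; return True once the device is judged to have failed."""
        if response is None or response == ABNORMAL_VALUE:
            self.consecutive += 1
        else:
            self.consecutive = 0  # a normal read breaks the run of abnormal results
        return self.consecutive > self.threshold
```

Resetting the counter on a normal read is what makes the count "continuous": isolated transient read errors never accumulate into a failure decision.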
S103, if the result of the target PCIE device reading the specified information indicates abnormality, determining that the target PCIE device fails, and performing isolation processing on the target PCIE device.
In the embodiment of the application, if the result of reading the specified information, as obtained by the computing device through the CPU from the target PCIE device, indicates an abnormality, it may be determined that the target PCIE device has failed, and the target PCIE device is isolated.
In one possible implementation manner, if the number of times that the result of reading the specified information by the same target PCIE device continuously indicates the abnormality is greater than the specified threshold, it may be determined that the target PCIE device fails.
If the target PCIE device includes a PCIE Switch device or an RC device, the computing device may perform, through the CPU, batch isolation processing on the target PCIE device and each PCIE device connected to a downstream port of the target PCIE device. If the target PCIE device includes a PCIE EP device, the computing device may perform isolation processing on the target PCIE device through the CPU.
In one possible case, the computing device may obtain a correspondence through the CPU and determine, according to the correspondence, each PCIE device connected to a downstream port of the target PCIE device. In this way, once a failed PCIE Switch device has been determined, the PCIE devices connected to its downstream ports can be determined, which facilitates batch isolation.
The above correspondence may be used to indicate the PCIE devices connected to the downstream ports of each PCIE Switch device. The correspondence may be generated as follows: after the computing device is powered on and starts running, it obtains the class identifier of each PCIE device, where PCIE devices of different types correspond to different class identifiers, and the PCIE devices of different types may include PCIE Switch devices or PCIE endpoint (EP) devices. The computing device may then determine, through the CPU, the target PCIE device from the PCIE devices according to the class identifiers. That is, after the PCIE Switch devices are determined from the PCIE devices, the downstream ports of each PCIE Switch device and the PCIE device connected to each downstream port may be determined; the correspondence is then generated from the PCIE devices connected to the downstream ports of each PCIE Switch device and written into the memory, so that the CPU can retrieve it from the memory.
Fig. 5 is a schematic flow chart of PCIE link failure processing according to an embodiment of the present application. As shown in fig. 5, the steps of the computing device for performing fault handling on the PCIE link are as follows.
S11, the computing device determines the PCIE Switch devices.
After the computing device is powered on and starts running, it may obtain, through the CPU, the class identifier (ClassID) of each PCIE device; for example, the ClassID of each PCIE Switch device is 0x0604. The type of each PCIE device may then be determined from its ClassID, so as to identify the PCIE Switch devices.
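By way of a hedged illustration, selecting the PCIE Switch devices by ClassID could look like the Python sketch below. It assumes that the raw configuration-space bytes of each device are available and uses the standard PCI header layout, in which the sub-class byte is at offset 0x0A and the base-class byte at offset 0x0B; the function names and the device names in the usage are hypothetical.

```python
SWITCH_CLASS_ID = 0x0604  # ClassID the description gives for PCIE Switch devices


def class_id(config_space: bytes) -> int:
    """Combine base class (offset 0x0B) and sub-class (offset 0x0A) into a 16-bit ClassID."""
    return (config_space[0x0B] << 8) | config_space[0x0A]


def select_switches(class_ids: dict) -> list:
    """Return the names of devices whose ClassID marks them as PCIE Switch devices."""
    return [name for name, cid in class_ids.items() if cid == SWITCH_CLASS_ID]
```

For instance, `select_switches({"sw1": 0x0604, "ep1": 0x0108})` keeps only `sw1`; the value 0x0108 is used here merely as an example of a non-Switch ClassID.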
S12, the computing device creates upstream port objects.
The computing device identifies, according to the ClassID, each PCIE device whose type is PCIE Switch device, obtains the upstream port of each PCIE Switch device, and creates an object for each upstream port, that is, assigns and records an identifier for each upstream port.
For example, taking fig. 4 as an example, the computing device may determine that it includes PCIE Switch device 1 and PCIE Switch device 2, and may assign a to the upstream port of PCIE Switch device 1 and b to the upstream port of PCIE Switch device 2.
S13, the computing device creates a downstream port object list.
The computing device may traverse the downstream ports of each PCIE Switch device and create an object list.
Because one PCIE Switch device may have multiple downstream ports with multiple PCIE devices attached under them, the upstream port object and the downstream port object list need to be associated in a dictionary that forms a one-to-many correspondence, that is, the upstream port of one PCIE Switch device corresponds to the PCIE devices attached under its multiple downstream ports.
For example, taking fig. 4 as an example, the computing device may traverse the downstream ports of PCIE Switch device 1 and PCIE Switch device 2, assign m and n to the downstream ports of PCIE Switch device 1, and assign x and y to the downstream ports of PCIE Switch device 2. The correspondence may then record that upstream port a corresponds to downstream ports m and n, and that upstream port b corresponds to downstream ports x and y.
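The one-to-many dictionary described in S12 and S13 can be sketched in Python as below. This is only an illustration: the function name and the input format (a list of upstream/downstream port pairs) are assumptions, while the port labels a, b, m, n, x, and y follow the example in the text.

```python
from collections import defaultdict


def build_correspondence(links):
    """Group downstream ports under the upstream port of the Switch device they belong to.

    links is a hypothetical list of (upstream_port, downstream_port) pairs.
    """
    table = defaultdict(list)
    for upstream, downstream in links:
        table[upstream].append(downstream)
    return dict(table)


correspondence = build_correspondence(
    [("a", "m"), ("a", "n"), ("b", "x"), ("b", "y")]
)
```

With this input the resulting dictionary maps `a` to `["m", "n"]` and `b` to `["x", "y"]`, so a single lookup by upstream port yields every downstream port that must be handled together.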
S14, the computing device starts a heartbeat detection timer task.
The computing device may start a timer task so that the detection task runs at a fixed polling interval, that is, heartbeat detection requests are sent to the PCIE Switch devices periodically.
For example, taking fig. 4 as an example, the computing device may send heartbeat detection requests to upstream port a of PCIE Switch device 1 and upstream port b of PCIE Switch device 2 every 0.5 s according to the timer task.
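One possible way to realise such a timer task, shown purely as a sketch, is a background thread that wakes at a fixed interval. The function name and the `send_requests` callback are hypothetical; the 0.5 s default follows the example above.

```python
import threading


def start_heartbeat_task(send_requests, interval=0.5):
    """Call send_requests() every `interval` seconds until the returned event is set."""
    stop = threading.Event()

    def run():
        # Event.wait returns False on timeout, so the body runs once per interval
        # and the loop exits as soon as stop.set() is called.
        while not stop.wait(interval):
            send_requests()

    threading.Thread(target=run, daemon=True).start()
    return stop
```

The caller keeps the returned event and calls `stop.set()` to end polling, for example when the computing device shuts down.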
S15, the computing device determines whether the heartbeat of each upstream port is abnormal.
The computing device may detect whether the heartbeat of an upstream port that received the heartbeat detection request is abnormal by checking whether the PCIE Switch device where the upstream port is located reads the vendorID/DeviceID in its configuration space abnormally (for example, an abnormal value of vendorID/DeviceID is 0xffff). If the PCIE Switch device reads the vendorID/DeviceID in the configuration space abnormally, the cumulative heartbeat abnormality count of the upstream port of the PCIE Switch device is increased by 1; if the abnormalities are consecutive and the count exceeds 3, the upstream port link may be marked as faulty. Otherwise, it is determined that the upstream port link has not failed.
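To make the vendorID/DeviceID check concrete, the following Python sketch decodes the two fields from raw configuration-space bytes and flags the abnormal value. It relies on the standard PCI header layout (Vendor ID at offset 0x00 and Device ID at offset 0x02, both 16-bit little-endian); how the bytes are obtained is left open, and the function names are hypothetical.

```python
import struct

ABNORMAL_ID = 0xFFFF  # abnormal value named in the description


def read_ids(config_space: bytes):
    """Decode (vendor_id, device_id) from the first four configuration-space bytes."""
    vendor_id, device_id = struct.unpack_from("<HH", config_space, 0)
    return vendor_id, device_id


def heartbeat_abnormal(config_space) -> bool:
    """Treat a missing or short read, or an ID of 0xFFFF, as an abnormal heartbeat."""
    if config_space is None or len(config_space) < 4:
        return True
    vendor_id, device_id = read_ids(config_space)
    return vendor_id == ABNORMAL_ID or device_id == ABNORMAL_ID
```

Treating a missing read as abnormal corresponds to the null-value case described for the request response; a caller would feed the boolean result into its consecutive-abnormality count.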
S16, the computing device isolates all devices connected to the downstream ports corresponding to an upstream port marked as faulty.
According to the correspondence recorded in S13, the computing device may obtain all downstream ports corresponding to the upstream port marked as faulty and the PCIE devices attached under each of those downstream ports, and may perform a batch isolation operation by marking all of those PCIE devices.
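The batch marking in S16 might be sketched as follows, under the assumption that two lookup tables are available: the upstream-to-downstream port correspondence from S13 and a table of the devices attached under each downstream port. Both table shapes, the function name, and the device names in the usage note are hypothetical.

```python
def mark_isolated(failed_upstream_ports, correspondence, attached):
    """Return the set of devices to isolate for the given faulty upstream ports.

    correspondence: upstream port -> list of downstream ports (as recorded in S13)
    attached:       downstream port -> list of devices attached under that port
    """
    isolated = set()
    for upstream in failed_upstream_ports:
        for downstream in correspondence.get(upstream, []):
            isolated.update(attached.get(downstream, []))
    return isolated
```

For example, with `correspondence = {"b": ["x", "y"]}` and `attached = {"x": ["ep1"], "y": ["ep2"]}`, marking upstream port `b` as faulty yields `{"ep1", "ep2"}` in one pass, without probing either device.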
S17, the computing device determines whether the heartbeat of each downstream port is abnormal.
By traversing the downstream port list, the computing device may detect whether the vendorID/DeviceID of each PCIE EP device attached under the PCIE Switch devices is abnormal (for example, an abnormal value of vendorID/DeviceID is 0xffff). If a PCIE EP device reads the vendorID/DeviceID in its configuration space abnormally, the cumulative heartbeat abnormality count is increased by 1; if the abnormalities are consecutive and the count exceeds 3, the corresponding downstream port link of the PCIE Switch device is marked as faulty.
S18, the computing device isolates all devices connected to the downstream ports marked as faulty.
Isolation of the devices connected to all downstream ports marked as faulty may be achieved by removing all PCIE EP devices attached under those downstream ports.
Then, the computing device may further determine, according to the ClassID and according to whether a PCIE EP device appears in the correspondence, the first PCIE EP device, that is, a PCIE EP device directly connected to a downstream port of the RC device, and may determine the link state of the first PCIE EP device directly through heartbeat detection; details are not repeated here.
The above-mentioned process of performing PCIE link failure detection may include the following three cases of fault detection and isolation processing.
In the first case, the computing device may detect whether the upstream port link or a downstream port link of a PCIE Switch device is abnormal by sending heartbeat detection requests to the upstream port or the downstream ports of the PCIE Switch device according to steps S11 to S18 shown in fig. 5. When an upstream port link abnormality of the PCIE Switch device is detected, batch isolation processing is performed on the PCIE Switch device and each PCIE device in its downstream port links. For example, as shown in fig. 6, if the computing device performs heartbeat detection on PCIE Switch device 2 and determines that the heartbeat is abnormal and that the consecutive heartbeat abnormality count exceeds a specified threshold, it may determine that PCIE Switch device 2 has failed, that is, that its upstream port link is abnormal, and batch isolation processing may be performed on PCIE Switch device 2 and each PCIE device attached under it (for example, FPGA, GPU, and other devices). Alternatively, the computing device performs heartbeat detection on each PCIE device attached under PCIE Switch device 1; if it determines that the heartbeat of the FPGA is abnormal and that the consecutive heartbeat abnormality count exceeds a specified threshold, it may determine that the FPGA has failed, that is, that the downstream port link is abnormal, and isolation processing may be performed on the FPGA.
In the second case, the computing device may send a heartbeat detection request directly to a PCIE EP device, so as to detect whether the link of the end device, that is, the PCIE EP device link, is abnormal. When the PCIE EP device link is detected to be abnormal, isolation processing is performed on the PCIE EP device. For example, as shown in fig. 6, the computing device may perform heartbeat detection on the RAID card, that is, send a heartbeat detection request to the RAID card; if it determines that the heartbeat of the RAID card is abnormal and that the consecutive heartbeat abnormality count exceeds a specified threshold, it may determine that the RAID card has failed and may directly remove or isolate the RAID card.
In the third case, when a PCIE device in the computing device fails, a DPC interrupt may be actively sent to the CPU. By monitoring DPC interrupts in real time and using the DPC mechanism, the operating system or the DPC driver can handle the DPC interrupt, which prevents the fault generated by the PCIE device from propagating to and affecting other PCIE devices. Loading the DPC driver and monitoring DPC faults of PCIE devices in real time can serve as a fallback scheme: after the computing device completes PCIE link fault detection according to the first and second cases, it monitors in real time, according to the third case, whether any PCIE device reports a DPC interrupt, in order to guard against missed or erroneous detections, so that PCIE link faults are handled in time and the normal operation of the system is not affected.
In summary, the computing device may actively send a heartbeat detection request to a PCIE device through the CPU so that the PCIE device reads the four-tuple information; if the PCIE device is operating abnormally, the result of reading the four-tuple information in response to the heartbeat detection request indicates an abnormality. The CPU thus actively detects whether the PCIE device has failed, which avoids the delay in fault determination that arises when the CPU waits for the PCIE device to generate and report an interrupt after a fault has occurred. This ensures the efficiency of fault processing and allows the faulty PCIE device and the PCIE devices related to the fault to be isolated in time, reducing the risk that a PCIE device fault affects the normal operation of the operating system.
The foregoing description of the embodiments of the present application has been presented primarily in terms of methods. It will be appreciated that the fault handling means, in order to achieve the above described functions, comprise at least one of a hardware structure and a software module for performing the respective functions. Those of skill in the art will readily appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as hardware or combinations of hardware and computer software. Whether a function is implemented as hardware or computer software driven hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The embodiment of the application can divide the functional units of the fault processing device according to the method example, for example, each functional unit can be divided corresponding to each function, or two or more functions can be integrated in one processing unit. The integrated units may be implemented in hardware or in software functional units. It should be noted that, in the embodiment of the present application, the division of the units is schematic, which is merely a logic function division, and other division manners may be implemented in actual practice.
Fig. 7 is a schematic diagram illustrating a configuration of a fault handling apparatus 200 according to an exemplary embodiment of the present application. The fault handling apparatus 200 is applied to a computing device, and the fault handling apparatus 200 includes:
a sending module 410, configured to send a heartbeat detection request to a target PCIE device; the heartbeat detection request is used for requesting the target PCIE device to read the specified information of the target PCIE device; the specified information is one or more of four-tuple information;
a receiving module 420, configured to receive a request response returned by the target PCIE device; the request response comprises a result of the target PCIE device reading the specified information;
and a processing module 430, configured to determine that the target PCIE device fails if the result of the target PCIE device reading the specified information indicates abnormality, and to perform isolation processing on the target PCIE device.
In one possible implementation, the apparatus further includes: an acquisition module, configured to acquire the class identifier of each PCIE device before the heartbeat detection request is sent to the target PCIE device, where PCIE devices of different types correspond to different class identifiers and the PCIE devices of different types comprise PCIE Switch devices or PCIE endpoint (EP) devices; and to determine the target PCIE device from the PCIE devices according to the class identifiers.
In a possible implementation manner, if the target PCIE device includes a plurality of PCIE Switch devices, the sending module 410 is further configured to send the heartbeat detection request to the plurality of PCIE Switch devices in parallel through multithreading.
In a possible implementation manner, the sending module 410 is further configured to send the heartbeat detection request to the target PCIE device according to a specified period;
the processing module 430 is further configured to determine that the target PCIE device fails if the number of times that the result of reading the specified information by the same target PCIE device continuously indicates abnormality is greater than a specified threshold.
In a possible implementation manner, if the target PCIE device includes a PCIE Switch device, the processing module 430 is further configured to perform batch isolation processing on the target PCIE device and each PCIE device connected to a downstream port of the target PCIE device.
In a possible implementation manner, the processing module 430 is further configured to determine, according to a correspondence, each PCIE device connected to a downstream port of the target PCIE device; the correspondence is used for indicating the PCIE devices connected to the downstream ports of each PCIE Switch device;
and to perform batch isolation processing on the target PCIE device and each PCIE device connected to the downstream port of the target PCIE device.
In one possible implementation, the apparatus further includes: a generating module, configured to determine the downstream ports of each PCIE Switch device; determine the PCIE device connected to each downstream port; generate the correspondence according to the PCIE devices connected to the downstream ports of each PCIE Switch device; and write the correspondence into a memory.
In one possible implementation manner, if the target PCIE device further includes a PCIE endpoint (EP) device, the sending module 410 is further configured to send a heartbeat detection request to the first PCIE EP device; the first PCIE EP device is a PCIE EP device other than the PCIE EP devices connected to the downstream ports of the PCIE Switch devices.
In one possible implementation, the four-tuple information includes a device identity, a vendor identity, a subsystem identity, or a subsystem vendor identity.
For a specific description of the above alternative modes, reference may be made to the foregoing method embodiments, and details are not repeated here. In addition, the explanation and the description of the beneficial effects of any fault handling apparatus provided above may refer to the corresponding method embodiments described above, and will not be repeated.
As an example, in connection with fig. 3, the functions implemented by part or all of the transmitting module 410, the receiving module 420, and the processing module 430 in the fault handling device may be implemented by the CPU in fig. 3 executing program code.
Fig. 8 illustrates a schematic diagram of a computing device 1100 provided by an exemplary embodiment of the application. The computing device 1100 may be an electronic device such as a server, smart phone, tablet, electronic book, portable personal computer, smart wearable device, or the like. Computing device 1100 of the present application may include one or more of the following components: a processor 1110 and a memory 1120.
Processor 1110 may include one or more processing cores. The processor 1110 connects the various parts of the terminal through various interfaces and lines, and performs the various functions of the terminal and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 1120 and by invoking data stored in the memory 1120. Optionally, the processor 1110 may be implemented in at least one hardware form of a digital signal processor (DSP), a field-programmable gate array (FPGA), or a programmable logic array (PLA). The processor 1110 may integrate one or a combination of a central processing unit (CPU), a graphics processing unit (GPU), a modem, and the like. The CPU mainly handles the operating system, user interface, application programs, and the like; the modem handles wireless communications. It will be appreciated that the modem may also not be integrated into the processor 1110 and may instead be implemented separately by a communication chip.
The memory 1120 may include random access memory (RAM) or read-only memory (ROM). Optionally, the memory 1120 includes a non-transitory computer-readable storage medium. Memory 1120 may be used to store instructions, programs, code, code sets, or instruction sets. The memory 1120 may include a program storage area and a data storage area. The program storage area may store instructions for implementing an operating system, which may be an Android system (including a system further developed based on the Android system), an iOS system developed by Apple Inc. (including a system further developed based on the iOS system), or another system; instructions for implementing at least one function (such as a touch function, a sound playing function, or an image playing function); instructions for implementing the above-described method embodiments; and the like. The data storage area may store data created by the terminal in use (such as a phonebook, audio-video data, and chat-record data).
In addition, those skilled in the art will appreciate that the structure of the computing device 1100 illustrated in the above-described figures does not constitute a limitation of the computing device 1100, and the computing device 1100 may include more or fewer components than illustrated, may combine certain components, or may have a different arrangement of components. For example, the terminal further includes components such as a radio frequency circuit, a camera component, a sensor, an audio circuit, a wireless fidelity (WiFi) component, a power supply, and a Bluetooth component, which are not described herein.
Embodiments of the present application also provide a computer readable storage medium having stored therein at least one computer instruction that is loaded and executed by a processor to implement the PCIe device management method as described in the above embodiments. For the explanation of the relevant content and the description of the beneficial effects in any of the above-mentioned computer-readable storage media, reference may be made to the above-mentioned corresponding embodiments, and the description thereof will not be repeated here.
The embodiment of the application also provides a chip. The chip has integrated therein a control circuit and one or more ports for implementing the functions of the fault handling apparatus described above. Optionally, for the functions supported by the chip, reference may be made to the description above, and details are not repeated herein. Those of ordinary skill in the art will appreciate that all or a portion of the steps of the above-described embodiments may be implemented by a program instructing the associated hardware. The program may be stored in a computer-readable storage medium. The above-mentioned storage medium may be a read-only memory, a random access memory, or the like. The processing unit or processor may be a central processing unit, a general-purpose processor, an application-specific integrated circuit (ASIC), a microprocessor (digital signal processor, DSP), a field-programmable gate array (FPGA) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof.
Embodiments of the present application also provide a computer program product comprising instructions which, when run on a computer, cause the computer to perform any of the methods of the above embodiments. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions in accordance with embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (e.g., coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (e.g., infrared, radio, or microwave). The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., SSD), etc.
It should be noted that the above-mentioned devices for storing computer instructions or computer programs, such as, but not limited to, the above-mentioned memories, computer-readable storage media, communication chips, and the like, provided by the embodiments of the present application all have non-volatility. Those skilled in the art will appreciate that in one or more of the examples described above, the functions described in the embodiments of the present application may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored on or transmitted over as one or more instructions or code on a computer-readable storage medium. Computer-readable storage media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage media may be any available media that can be accessed by a general purpose or special purpose computer.
The foregoing description of the preferred embodiments is not intended to limit the present application; the scope of the application is defined by the appended claims.

Claims (10)

1. A fault handling method, the method comprising:
sending a heartbeat detection request to a target PCIE device, wherein the heartbeat detection request is used for requesting the target PCIE device to read specified information of the target PCIE device, and the specified information is one or more items of four-tuple information;
receiving a request response returned by the target PCIE device, wherein the request response comprises a result of the target PCIE device reading the specified information; and
if the result of the target PCIE device reading the specified information indicates an abnormality, determining that the target PCIE device has failed, and performing isolation processing on the target PCIE device.
2. The method of claim 1, wherein before the sending a heartbeat detection request to a target PCIE device, the method further comprises:
acquiring a class identifier of each PCIE device, wherein PCIE devices of different types correspond to different class identifiers, and the PCIE devices of different types comprise PCIE Switch devices or PCIE endpoint (EP) devices; and
determining the target PCIE device from among the PCIE devices according to the class identifiers.
3. The method of claim 1 or 2, wherein if the target PCIE device includes a plurality of PCIE Switch devices, the sending a heartbeat detection request to a target PCIE device comprises:
sending the heartbeat detection request to the plurality of PCIE Switch devices in parallel through multiple threads.
4. The method of any of claims 1 to 3, wherein the sending a heartbeat detection request to a target PCIE device comprises:
sending the heartbeat detection request to the target PCIE device according to a specified period; and
the determining that the target PCIE device has failed if the result of the target PCIE device reading the specified information indicates an abnormality comprises:
if the number of consecutive times that the result of the same target PCIE device reading the specified information indicates an abnormality is greater than a specified threshold, determining that the target PCIE device has failed.
5. The method of any of claims 1 to 4, wherein if the target PCIE device includes a PCIE Switch device, the performing isolation processing on the target PCIE device comprises:
performing batch isolation processing on the target PCIE device and each PCIE device connected to a downstream port of the target PCIE device.
6. The method of claim 5, wherein the performing batch isolation processing on the target PCIE device and each PCIE device connected to a downstream port of the target PCIE device comprises:
determining, according to a correspondence, each PCIE device connected to a downstream port of the target PCIE device, wherein the correspondence indicates the PCIE devices respectively connected to the downstream ports of each PCIE Switch device; and
performing batch isolation processing on the target PCIE device and each PCIE device connected to a downstream port of the target PCIE device.
7. The method of claim 6, wherein before the determining, according to the correspondence, each PCIE device connected to a downstream port of the target PCIE device, the method further comprises:
determining the downstream ports of each PCIE Switch device;
determining the PCIE device connected to each downstream port;
generating the correspondence according to the PCIE devices connected to the downstream ports of each PCIE Switch device; and
writing the correspondence into a memory.
8. The method of any one of claims 1 to 7, wherein if the target PCIE device further includes a PCIE endpoint (EP) device, the sending a heartbeat detection request to a target PCIE device comprises:
sending a heartbeat detection request to a first PCIE EP device, wherein the first PCIE EP device is a PCIE EP device other than the PCIE EP devices connected to the downstream ports of the PCIE Switch devices.
9. The method of any of claims 1 to 8, wherein the four-tuple information comprises a device identity, a vendor identity, a subsystem identity, or a subsystem vendor identity.
10. A computing device, the computing device comprising: a processor and a memory for storing instructions executable by the processor; the processor is configured to execute the instructions to cause the computing device to perform the fault handling method of any of claims 1-9.
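
As an illustration only, the flow recited in claims 1 and 3 to 7 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of the application: the names `HeartbeatMonitor`, `probe`, `expected`, and `downstream` are hypothetical, the heartbeat is modeled as an injected callable that returns the four-tuple fields read by the device (or `None` when the read fails), and "abnormality" is assumed to mean a failed read or a mismatch with the expected values. Isolation is modeled simply as recording the device in a set.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set

# Hypothetical types: four-tuple fields keyed by name, and a heartbeat probe
# that returns what the device read back, or None if the read failed.
FourTuple = Dict[str, int]
Probe = Callable[[str], Optional[FourTuple]]


@dataclass
class HeartbeatMonitor:
    probe: Probe                       # stands in for the heartbeat request/response
    expected: Dict[str, FourTuple]     # device -> expected four-tuple values (assumption)
    downstream: Dict[str, List[str]]   # switch -> devices on its downstream ports (claims 6, 7)
    threshold: int = 3                 # specified threshold of claim 4
    failures: Dict[str, int] = field(default_factory=dict)
    isolated: Set[str] = field(default_factory=set)

    def is_abnormal(self, dev: str) -> bool:
        """Assumed abnormality test: failed read or unexpected values."""
        result = self.probe(dev)
        return result is None or result != self.expected[dev]

    def isolate(self, dev: str) -> None:
        """Isolate the device and, for a switch, everything below it (claim 5)."""
        self.isolated.add(dev)
        self.isolated.update(self.downstream.get(dev, []))

    def poll_once(self, dev: str) -> bool:
        """Run one heartbeat period; return True if the device was isolated now."""
        if dev in self.isolated:
            return False
        if self.is_abnormal(dev):
            self.failures[dev] = self.failures.get(dev, 0) + 1
        else:
            self.failures[dev] = 0     # only consecutive abnormal results count
        if self.failures[dev] > self.threshold:
            self.isolate(dev)
            return True
        return False

    def poll_in_parallel(self, devices: List[str]) -> Dict[str, bool]:
        """Probe several switches concurrently, one thread per device (claim 3)."""
        with ThreadPoolExecutor(max_workers=max(1, len(devices))) as pool:
            return dict(zip(devices, pool.map(self.poll_once, devices)))
```

With `threshold=2`, a device is isolated on its third consecutive abnormal result, because claim 4 requires the count to be greater than the threshold. Injecting `probe` keeps the sketch independent of how the heartbeat request is actually delivered, which the claims leave open.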
CN202410140337.6A 2024-01-31 2024-01-31 Fault processing method and computing device Pending CN118051366A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410140337.6A CN118051366A (en) 2024-01-31 2024-01-31 Fault processing method and computing device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410140337.6A CN118051366A (en) 2024-01-31 2024-01-31 Fault processing method and computing device

Publications (1)

Publication Number Publication Date
CN118051366A (en) 2024-05-17

Family

ID=91053139

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410140337.6A Pending CN118051366A (en) 2024-01-31 2024-01-31 Fault processing method and computing device

Country Status (1)

Country Link
CN (1) CN118051366A (en)

Similar Documents

Publication Publication Date Title
JP4641546B2 (en) Method and system for handling input / output (I / O) errors
US8875154B2 (en) Interface specific and parallel IPMI message handling at baseboard management controller
US20160283305A1 (en) Input/output control device, information processing apparatus, and control method of the input/output control device
US11068337B2 (en) Data processing apparatus that disconnects control circuit from error detection circuit and diagnosis method
CN105849702A (en) Cluster system, server device, cluster system management method, and computer-readable recording medium
CN110896362B (en) Fault detection method and device
US8819218B2 (en) Apparatus, system, and method for device level enablement of a communications protocol
CN114826962A (en) Link fault detection method, device, equipment and machine readable storage medium
US20160197994A1 (en) Storage array confirmation of use of a path
CN114880266B (en) Fault processing method and device, computer equipment and storage medium
CN117271234A (en) Fault diagnosis method and device, storage medium and electronic device
CN115454896A (en) SMBUS-based SSD MCTP control message verification method and device, computer equipment and storage medium
CN112148537A (en) Bus monitoring device and method, storage medium, and electronic device
CN115037653B (en) Service flow monitoring method, device, electronic equipment and storage medium
CN118051366A (en) Fault processing method and computing device
CN107818061B (en) Data bus and management bus for associated peripheral devices
CN116126613A (en) Position detection method and device of PCIe (peripheral component interconnect express) equipment, electronic equipment and storage medium
CN112214437B (en) Storage device, communication method and device and computer readable storage medium
CN112804115B (en) Method, device and equipment for detecting abnormity of virtual network function
CN113434324A (en) Abnormal information acquisition method, system, device and storage medium
CN109086179B (en) Processing method and device under program exception condition
CN115712493A (en) Request processing method, device and system
US20140059389A1 (en) Computer and memory inspection method
CN107451035B (en) Error state data providing method for computer device
CN116185678A (en) Fault log recording method, system method and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination