CN117931536A

CN117931536A - Fault processing method, device, electronic equipment and medium

Info

Publication number: CN117931536A
Application number: CN202211267853.2A
Authority: CN
Inventors: 笪禹; 张海强; 刘立超; 张永肃; 佘开锐; 张宇; 王剑
Original assignee: Beijing Youzhuju Network Technology Co Ltd
Current assignee: Beijing Youzhuju Network Technology Co Ltd
Priority date: 2022-10-17
Filing date: 2022-10-17
Publication date: 2024-04-26

Abstract

Embodiments of the present disclosure relate to fault handling methods, apparatus, electronic devices, and media. The method includes detecting, at a hardware acceleration device, a first failure of the hardware acceleration device. The method also includes detecting, at a host device communicatively coupled to the hardware acceleration device, a second failure of the hardware acceleration device. The method also includes detecting, at a baseboard management controller communicatively coupled to the hardware acceleration device and the host device, a third failure of the host device and the hardware acceleration device. The method further includes determining one of the hardware acceleration device, the host device, and the baseboard management controller to perform a repair operation based on the failure type of the first failure, the second failure, or the third failure. In this way, a hierarchical fault discovery and handling mechanism improves the coverage of fault handling for the hardware acceleration device, so that faults of the hardware acceleration device can be discovered in time and handled effectively.

Description

Fault processing method, device, electronic equipment and medium

Technical Field

Embodiments of the present disclosure relate to the field of computer technology and, more particularly, relate to fault handling methods, apparatus, electronic devices, computer readable storage media, and computer program products.

Background

In recent years, hardware acceleration devices have been widely used. A hardware acceleration device performs hardware acceleration for a specific processing flow, such as training or inference of an artificial intelligence (AI) model or processing of network data packets, so as to improve the processing capacity of the system.

As the software and hardware of hardware acceleration devices become increasingly complex, various system faults may occur. The faults are of many kinds, and the way they are handled differs between application scenarios, which poses challenges to the use of hardware acceleration devices.

Disclosure of Invention

In view of this, embodiments of the present disclosure provide a fault handling scheme for a hardware acceleration device.

According to a first aspect of the present disclosure, a fault handling method is provided. The method includes detecting, at a hardware acceleration device, a first failure of the hardware acceleration device. The method also includes detecting, at a host device communicatively coupled to the hardware acceleration device, a second failure of the hardware acceleration device. The method also includes detecting, at a baseboard management controller communicatively coupled to the hardware acceleration device and the host device, a third failure of the host device and the hardware acceleration device. The method further includes determining one of the hardware acceleration device, the host device, and the baseboard management controller to perform a repair operation based on the failure type of the first failure, the second failure, or the third failure.

According to a second aspect of the present disclosure, there is also provided a fault handling apparatus. The apparatus includes a first failure detection unit configured to detect a first failure of a hardware acceleration device at the hardware acceleration device. The apparatus further includes a second failure detection unit configured to detect a second failure of the hardware acceleration device at a host device communicatively connected to the hardware acceleration device. The apparatus further includes a third failure detection unit configured to detect a third failure of the host device and the hardware acceleration device at a baseboard management controller communicatively connected to the hardware acceleration device and the host device. The apparatus further includes a fault handling unit configured to determine one of the hardware acceleration device, the host device, and the baseboard management controller to perform a repair operation based on a fault type of the first fault, the second fault, or the third fault.

According to a third aspect of the present disclosure there is provided an electronic device comprising at least one processing unit and at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, the instructions when executed by the at least one processing unit causing the electronic device to perform a method according to the first aspect of the present disclosure.

According to a fourth aspect of the present disclosure, there is provided a computer readable storage medium comprising machine executable instructions which, when executed by a device, cause the device to perform a method according to the first aspect of the present disclosure.

According to a fifth aspect of the present disclosure, there is provided a computer program product comprising machine executable instructions which, when executed by a device, cause the device to perform a method according to the first aspect of the present disclosure.

This Summary section is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the disclosure, nor is it intended to be used to limit the scope of the disclosure.

Drawings

The foregoing and other objects, features and advantages of the disclosure will be apparent from the following more particular descriptions of exemplary embodiments of the disclosure as illustrated in the accompanying drawings wherein like reference numbers generally represent like parts throughout the exemplary embodiments of the disclosure.

FIG. 1 illustrates a schematic diagram of an example environment in which various embodiments of the present disclosure may be implemented;

FIG. 2 shows a schematic flow chart diagram of a fault handling method according to an embodiment of the present disclosure;

FIG. 3 shows a schematic block diagram of a fault handling system according to an embodiment of the present disclosure;

FIG. 4 shows a schematic flow chart of a process of a hardware acceleration device handling a failure according to an embodiment of the disclosure;

FIG. 5 shows a schematic block diagram of a process for fault analysis according to an embodiment of the present disclosure;

FIG. 6 shows a schematic block diagram of a fault handling apparatus according to an embodiment of the present disclosure; and

FIG. 7 shows a schematic block diagram of an example device that may be used to implement embodiments of the present disclosure.

Detailed Description

It will be appreciated that the data (including but not limited to the data itself, the acquisition or use of the data) involved in the present technical solution should comply with the corresponding legal regulations and the requirements of the relevant regulations.

Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are illustrated in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

The term "comprising" and variations thereof as used herein are open ended, i.e., "including but not limited to". The term "or" means "and/or" unless specifically stated otherwise. The term "based on" means "based at least in part on". The terms "one example embodiment" and "one embodiment" mean "at least one example embodiment". The term "another embodiment" means "at least one additional embodiment". The terms "first," "second," and the like may refer to different or the same objects. Other explicit and implicit definitions may also be included below.

It is noted that the numbers or values used herein are for ease of understanding the technology of the present disclosure, and are not limiting the scope of the present disclosure.

A hardware acceleration device performs hardware acceleration for certain processing tasks and thereby improves the processing capacity of the system. The hardware acceleration device may be coupled to the host device through an interface such as peripheral component interconnect express (PCIe), receiving tasks from the host device and processing them. However, as hardware acceleration devices become increasingly complex in hardware and software, they are subject to a wide variety of faults, including faults of business software, operating systems, hardware, transmission links, and the like. Conventional fault detection schemes have difficulty discovering such a wide range of faults in time. In addition, fault handling differs between application scenarios, which makes the faults difficult to resolve effectively. A method for intelligently detecting and handling faults of the hardware acceleration device is therefore needed, so that such faults can be discovered in time and handled effectively.

Embodiments of the present disclosure provide a fault handling method for a hardware acceleration device, which is capable of monitoring various software and hardware faults and of performing hierarchical fault detection and handling through cooperation of the hardware acceleration device, a host device, and a Baseboard Management Controller (BMC) that are communicatively connected to each other. The method includes detecting a first fault at the hardware acceleration device, which may be a fault generated inside the hardware acceleration device. The method further includes detecting, at the host device, a second fault of the hardware acceleration device that cannot be detected by the hardware acceleration device itself but can be detected by the host device. The method further includes detecting, at the baseboard management controller, a third fault of the host device and the hardware acceleration device. The third fault cannot be detected by the host device or the hardware acceleration device, but can be detected by the baseboard management controller. The method further includes determining one of the hardware acceleration device, the host device, and the baseboard management controller to perform a repair operation based on the fault type of the first fault, the second fault, or the third fault. In some embodiments, the hardware acceleration device may repair the first fault and, if it cannot, request the host device to handle it. The host device may handle the second fault as well as any first fault that the hardware acceleration device cannot repair and, if it cannot repair them, request the baseboard management controller to handle them. The baseboard management controller can handle the third fault and can also handle any first fault and second fault that the hardware acceleration device and the host device cannot repair. In this way, embodiments of the present disclosure implement layered fault detection and handling for the hardware acceleration device and improve the coverage of fault handling.

Implementation details of embodiments of the present disclosure are described in detail below with reference to fig. 1 through 7.

FIG. 1 illustrates a schematic diagram of an example environment 100 in which various embodiments of the present disclosure may be implemented. It should be understood that the example environment 100 illustrated in fig. 1 is only exemplary and should not be construed as limiting the functionality and scope of the implementations described in this disclosure. As shown in fig. 1, environment 100 includes an electronic device 101 and a failure analysis platform 102. The electronic device 101 may be any device having computing capabilities, such as a desktop computer, a server, a mainframe, a distributed computing system, and so forth. The electronic device 101 includes a hardware acceleration device 110, a host device 120, and a baseboard management controller 130. The hardware acceleration device 110 and the host device 120 may be installed in slots of a motherboard of the electronic device 101. The baseboard management controller 130 may be a processing unit or a microcontroller implemented on a motherboard of the electronic device 101.

Hardware acceleration device 110 may include, for example, an acceleration card based on peripheral component interconnect express (PCIe) or other communication protocols. PCIe is a high-speed serial computer expansion bus standard that uses point-to-point, dual-simplex, high-bandwidth serial links; each connected device is allocated dedicated link bandwidth rather than sharing bus bandwidth, and the standard supports functions such as hot plug and error reporting. Hardware acceleration device 110 may include, but is not limited to, a PCIe network card, a graphics processing unit (GPU) card, Non-Volatile Memory Express (NVMe) storage, a smart network card, and the like. The hardware acceleration device 110 takes on computing tasks offloaded from the host device 120 in the electronic device 101, such as network packet processing, storage, and parallel computing, to provide hardware acceleration functionality. To deal with possible faults, the hardware acceleration device 110 may have fault detection and processing capabilities for fault monitoring, information extraction, and fault handling of the various software and hardware modules in the device. For faults that it cannot handle, the hardware acceleration device 110 may notify the host device 120 to handle them.

Host device 120 may include one or more processing units (e.g., a central processing unit (CPU) or processing core), random access memory (RAM), and persistent storage (e.g., a disk or a solid-state drive (SSD)). The host device 120 may perform any type of computing task and, in some cases, issue specific computing tasks to the hardware acceleration device 110 for processing, thereby implementing hardware acceleration. The host device 120 has fault detection and processing capabilities for performing device-level fault detection and processing for the hardware acceleration device 110; for a fault that it cannot repair, it may request the baseboard management controller 130 to handle the fault.

The baseboard management controller 130 communicates with and controls other software and hardware components inside the electronic device 101 through various interfaces, such as the Intelligent Platform Management Bus (IPMB) and the System Management Bus (SMBus). The baseboard management controller 130 has fault detection and processing capabilities for performing overall fault detection and processing for the host device 120 and the hardware acceleration device 110.

The failure analysis platform 102 may be a server, or a cloud service running on a server, that is communicatively coupled with the electronic device 101 (e.g., via the internet or a local area network). The failure analysis platform 102 may be configured to obtain from the electronic device 101 the fault information and the environment information at the time the fault occurred (e.g., software and hardware information of the hardware acceleration device 110 and the host device 120, and computing task information), analyze this information, generate an analysis result, and issue the result to the electronic device 101. The failure analysis platform 102 may use techniques such as machine learning to discover correlations between faults and environment information in order to adjust the fault handling strategy and repair methods. The failure analysis platform 102 may also indicate operating environments that are recommended for the hardware acceleration device or that should be avoided, so as to reduce the probability of subsequent faults.
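As a rough illustration of the kind of correlation the failure analysis platform 102 might look for, the following Python sketch counts how often each fault type co-occurs with a given value of an environment attribute. The record layout, the field names (`fault_type`, `env`), and the sample values are hypothetical; the disclosure does not specify a data format, and it mentions machine learning rather than the simple counting used here.

```python
from collections import Counter

def co_occurrence(records, attribute):
    """Count (fault type, attribute value) pairs across fault records."""
    counts = Counter()
    for rec in records:
        value = rec["env"].get(attribute)
        if value is not None:
            counts[(rec["fault_type"], value)] += 1
    return counts

# Hypothetical records as they might be uploaded by the electronic device.
records = [
    {"fault_type": "watchdog", "env": {"load": "high"}},
    {"fault_type": "watchdog", "env": {"load": "high"}},
    {"fault_type": "ecc", "env": {"load": "low"}},
]
counts = co_occurrence(records, "load")
```

A pair that dominates the counts, such as watchdog faults under high load in this sample, would be a candidate for a policy adjustment such as reducing load after a restart.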

It should be understood that the environment 100 shown in fig. 1 is only one example in which embodiments of the present disclosure may be implemented and is not intended to limit the scope of the present disclosure. Embodiments of the present disclosure are equally applicable to other systems or architectures.

Fig. 2 shows a schematic flow diagram of a fault handling method 200 according to an embodiment of the present disclosure. The method 200 may be implemented by the electronic device 101 in fig. 1. It should be appreciated that method 200 may also include additional acts not shown and/or may omit acts shown, and that the acts in method 200 may be implemented in a different order, the scope of the disclosure being not limited in this respect.

At block 210, the electronic device detects a first failure of the hardware acceleration device at the hardware acceleration device. The first fault includes a fault generated and detected inside the hardware acceleration device 110. Fig. 3 shows a schematic block diagram of a fault handling system 300 implemented within an electronic device 101, in which an exemplary architecture of a hardware acceleration device is shown, according to an embodiment of the disclosure. For ease of illustration, the method 200 is described in connection with FIG. 3.

As shown in fig. 3, the hardware acceleration device 110 includes a hardware engine 111, a processing unit 113, a flash memory 115, and a memory 117. The hardware engine 111 may be, for example, a parallel computing unit having multiple processing cores for processing computing tasks offloaded from the host device 120, such as model training and inference tasks or network packet processing tasks. The processing unit 113 runs the business software 114 and the operating system 116, wherein the business software 114 interfaces with the host device 120, assigns computing tasks to the hardware engine 111 for execution, and returns execution results to the host device 120. Flash memory 115 is a persistent storage device that stores program code and configuration parameters for hardware acceleration device 110. Memory 117 is random access memory (RAM) for storing code and data at run time. As shown, the processing unit 113 of the hardware acceleration device 110 further runs a fault detection processing module 112 for detecting a first fault generated by the above-mentioned components of the hardware acceleration device 110.

The first failure may include a failure of the hardware engine 111. Upon failure of the hardware engine 111, the failure may be reported to the fault detection processing module 112 by interrupt. In addition, the fault detection processing module 112 may periodically check the engine task processing status reported by the software and, if it finds that the hardware engine 111 has been unable to complete a processing task for a long time, determine that the hardware engine 111 is in a fault state. The first failure may also include a failure of a storage device in the hardware acceleration device 110. Such failures may include, for example, bad blocks in flash memory 115 and uncorrectable Error Correction Code (ECC) errors that may occur in memory 117, in the caches of processing unit 113, and in other RAM. The first failure may also include a failure of the operating system 116. When the operating system 116 fails, a fault notification is triggered and sent to the fault detection processing module 112. The first failure may also include a failure of the business software 114. The failure of the business software 114 may be captured by the operating system 116, which then notifies the fault detection processing module 112. The first failure may also include a watchdog fault. In particular, the fault detection processing module 112 may have a watchdog function that triggers a fault when the watchdog times out. It should be appreciated that the first failure may also include other types of faults generated within the hardware acceleration device 110.
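The periodic check for a hardware engine that cannot complete a task for a long time can be sketched as follows. This is a minimal illustration only: the class name `EngineStallDetector`, the timeout value, and the injectable clock are assumptions, and the disclosure does not state how "a long time" is measured.

```python
import time

class EngineStallDetector:
    """Flags tasks that have been running longer than a timeout (hypothetical)."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.started = {}  # task id -> start time

    def task_started(self, task_id):
        self.started[task_id] = self.clock()

    def task_finished(self, task_id):
        self.started.pop(task_id, None)

    def stalled_tasks(self):
        """Return the tasks whose run time exceeds the timeout."""
        now = self.clock()
        return [t for t, s in self.started.items() if now - s > self.timeout_s]

# Usage with a fake clock so the example does not need to sleep.
now = [0.0]
detector = EngineStallDetector(timeout_s=5.0, clock=lambda: now[0])
detector.task_started("job-1")
now[0] = 6.0
detector.task_started("job-2")
stalled = detector.stalled_tasks()
```

In this sketch a non-empty result from `stalled_tasks()` would correspond to determining that the hardware engine is in a fault state.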

At block 220, the electronic device detects a second failure of the hardware acceleration device at a host device communicatively coupled to the hardware acceleration device. Unlike the first failure, the second failure generally cannot be detected by the fault detection processing module 112 inside the hardware acceleration device 110, even though the hardware acceleration device 110 may be degraded or even shut down by such a failure. In some embodiments, host device 120 may perform survival monitoring of hardware acceleration device 110. In particular, host device 120 may run a fault detection processing module 122, which may periodically send a survival monitoring message to the fault detection processing module 112 of hardware acceleration device 110 via link 12 (e.g., a PCIe link). When the hardware acceleration device 110 is operating normally, the fault detection processing module 112 should return a corresponding acknowledgement response. However, if the number of lost survival monitoring messages (i.e., messages for which no response is received) reaches a threshold, the host device 120 may determine that a second failure is detected.
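The loss-count rule described above can be sketched in a few lines of Python. The class name `SurvivalMonitor` is hypothetical, and the sketch assumes that the count covers consecutive losses and resets whenever a response arrives; the disclosure only states that the number of losses reaches a threshold.

```python
class SurvivalMonitor:
    """Counts unanswered survival monitoring messages (hypothetical sketch)."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.losses = 0

    def record(self, response_received):
        """Record one probe result; return True once a fault is declared."""
        if response_received:
            self.losses = 0  # assumption: any response clears the count
        else:
            self.losses += 1
        return self.losses >= self.threshold

monitor = SurvivalMonitor(threshold=3)
results = [monitor.record(r) for r in [True, False, False, False]]
```

Resetting on every response avoids declaring a fault because of isolated losses spread over a long period; a cumulative count would be an equally valid reading of the text.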

At block 230, the electronic device detects a third failure of the host device and the hardware acceleration device at a baseboard management controller communicatively coupled to the hardware acceleration device and the host device. The third failure is generally undetectable by the hardware acceleration device 110 and the host device 120, but the hardware acceleration device 110 or the host device 120 may be degraded or even shut down due to such failures.

For example, the communication components (e.g., a PCIe controller or link-related devices or firmware) of the host device 120 and the hardware acceleration device 110 may cause the communication link between the two to fail. In this case, the link 12 between the host device 120 and the hardware acceleration device 110 is unavailable, and if the host device 120 and the hardware acceleration device 110 initiate access to each other, the bus or processor may fail and be unable to recover. Fault detection and handling may then be performed by the baseboard management controller 130. In some embodiments, baseboard management controller 130 may initiate survival monitoring of the hardware acceleration device 110 and the host device 120 via links 11 and 13, respectively. Specifically, the baseboard management controller 130 may run a fault detection processing module 132, which periodically sends a survival monitoring message to the hardware acceleration device 110 via the link 11 (e.g., SMBus) and periodically sends a survival monitoring message to the host device 120 via the link 13 (e.g., SMBus). If the number of losses in the survival monitoring of either of the two reaches a threshold, the baseboard management controller 130 can determine that a third failure is detected. The fault detection processing module 132 may notify the hardware acceleration device 110 and the host device 120, via links 11 and 13 respectively, that a third failure was detected, and may collect the operating environment information of the hardware acceleration device 110 and the host device 120.
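The "either of the two" condition can be illustrated with a short Python sketch. The function name, the target labels, and the way loss counts are kept are assumptions for illustration only.

```python
def third_fault_targets(loss_counts, threshold):
    """Return the monitored targets whose loss count reached the threshold."""
    return sorted(t for t, n in loss_counts.items() if n >= threshold)

# Hypothetical loss counts gathered over the two out-of-band links.
loss_counts = {"hardware_acceleration_device": 3, "host_device": 0}
failed = third_fault_targets(loss_counts, threshold=3)
third_fault_detected = bool(failed)
```

Keeping the list of failed targets, rather than only a boolean, lets the controller record which side stopped responding when it collects environment information.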

At block 240, the electronic device determines one of a hardware acceleration device, a host device, and a baseboard management controller to perform a repair operation based on the failure type of the first failure, the second failure, or the third failure. The electronic device 101 may handle the failure based on a fault handling policy. Specifically, the policy may be used to determine which faults the hardware acceleration device 110, the host device 120, and the baseboard management controller 130 are each capable of repairing. The policy also specifies repair operations for specific fault types, including, but not limited to, resetting, reconfiguring parameters, adjusting the operating environment, repairing components, replacing components, and the like. The above-mentioned faults may be handled by the hardware acceleration device 110, the host device 120, or the baseboard management controller 130 in a hierarchical manner, wherein the hardware acceleration device 110 is responsible for a first level of fault handling, the host device 120 is responsible for a second level of fault handling, and the baseboard management controller 130 is responsible for a third level of fault handling.
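One possible shape for such a policy is a lookup table from fault type to the level that repairs it and the repair operation. The Python sketch below is an assumption-laden illustration: the fault type names, operation names, and table contents are hypothetical, and the rule that a fault is never handled below the level that detected it is a simplification introduced here.

```python
LEVELS = ["hardware_acceleration_device", "host_device", "bmc"]

# Hypothetical policy: fault type -> (lowest level able to repair, repair operation)
POLICY = {
    "hardware_engine": ("hardware_acceleration_device", "reset_engine"),
    "watchdog": ("host_device", "restart_device"),
    "survival_loss": ("host_device", "hot_restart_device"),
    "link_failure": ("bmc", "reset_device_and_host"),
}

def select_handler(fault_type, detected_at):
    """Pick the level that performs the repair, never below the detecting level."""
    owner, operation = POLICY[fault_type]
    level = max(LEVELS.index(owner), LEVELS.index(detected_at))
    return LEVELS[level], operation

engine = select_handler("hardware_engine", "hardware_acceleration_device")
watchdog = select_handler("watchdog", "hardware_acceleration_device")
link = select_handler("link_failure", "bmc")
```

Because the table is plain data, an analysis result issued by a platform could in principle replace entries without changing the selection logic.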

At a first processing level, the hardware acceleration device 110 processes a first failure. Fig. 4 shows a schematic flow diagram of a process 400 for the hardware acceleration device 110 to handle a failure according to an embodiment of the disclosure. It should be appreciated that process 400 may also include additional acts not shown and/or may omit acts shown, and that the acts in process 400 may be implemented in a different order, the scope of the disclosure being not limited in this respect.

At block 410, the hardware acceleration device 110 determines whether a first failure is detected. For example, when the fault detection processing module 112 actively detects a first failure or is notified by other components that a first failure has occurred, it is determined that the first failure has been detected.

At block 420, the hardware acceleration device 110 obtains fault information and environment information for the first fault. The fault information and environment information may be subsequently retrieved by the host device 120 and submitted to the fault analysis platform 102 for use in improving the operating environment or configuration parameters of the hardware acceleration device 110 to reduce the likelihood of a fault occurring.

For hardware engine faults, the fault detection processing module 112 may extract the registers of the hardware engine 111 and the runtime information of the software as fault information, and may also extract the task model and software configuration context being processed when the fault occurred, which can help improve the reliability of subsequent hardware designs. For failures of storage devices (e.g., memory 117, flash memory 115, etc.) in the hardware acceleration device 110, the fault detection processing module 112 may extract, in addition to information about the failure itself, operational state information of the hardware device, such as voltage, power consumption, ambient temperature, and system load. For operating system failures and business software failures, the fault detection processing module 112 obtains the fault site information as well as the operating modes or parameters of the operating system and business software when the failure occurred. For a system watchdog fault, the fault detection processing module 112 extracts fault information from the watchdog interrupt, together with the software and hardware information of the hardware acceleration device 110 at the time the watchdog fault occurred.
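The per-type collection described above can be sketched as a mapping from fault type to the environment fields worth capturing. The field names, the `FaultRecord` structure, and the `collect` helper are hypothetical and only mirror the categories named in the text.

```python
from dataclasses import dataclass, field

# Hypothetical mapping of fault type to the environment fields to capture.
ENV_FIELDS = {
    "hardware_engine": ["engine_registers", "task_model", "software_config"],
    "storage": ["voltage", "power", "temperature", "system_load"],
    "operating_system": ["fault_site", "os_mode"],
    "business_software": ["fault_site", "software_params"],
    "watchdog": ["watchdog_interrupt", "hw_sw_state"],
}

@dataclass
class FaultRecord:
    fault_type: str
    fault_info: dict
    environment: dict = field(default_factory=dict)

def collect(fault_type, fault_info, snapshot):
    """Build a record keeping only the snapshot fields relevant to this fault type."""
    wanted = ENV_FIELDS.get(fault_type, [])
    env = {k: snapshot[k] for k in wanted if k in snapshot}
    return FaultRecord(fault_type, fault_info, env)

record = collect(
    "storage",
    {"error": "uncorrectable_ecc"},
    {"voltage": 1.1, "temperature": 71, "task_model": "example-model"},
)
```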

The fault detection processing module 112 may store the fault information and the environment information locally, for example in the memory 117 or the flash memory 115. It may save the various types of fault information in the memory 117 and, additionally or alternatively, in the flash memory 115, so that the information is not lost if the host device 120 fails to extract it because of a link problem or for other reasons.
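The dual storage can be illustrated with a small Python sketch in which an ordinary file stands in for the flash memory 115. The class name `FaultStore`, the JSON Lines format, and the file path are assumptions; a real device would use its own flash layout.

```python
import json
import os
import tempfile

class FaultStore:
    """Keeps fault records in RAM and appends them to a file standing in for flash."""

    def __init__(self, flash_path):
        self.in_memory = []
        self.flash_path = flash_path

    def save(self, record):
        self.in_memory.append(record)
        with open(self.flash_path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")

    def load_from_flash(self):
        """Read back the persisted records, e.g. after a restart."""
        with open(self.flash_path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]

path = os.path.join(tempfile.mkdtemp(), "faults.jsonl")
store = FaultStore(path)
store.save({"fault_type": "watchdog", "load": "high"})
# A fresh object models a restarted device that lost its RAM contents.
recovered = FaultStore(path).load_from_flash()
```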

At block 430, the hardware acceleration device 110 notifies the host device 120. The failure detection processing module 112 may send a message to the failure detection processing module 122 of the host device 120 informing the host device 120 to obtain the failure information and the environmental information about the first failure.

At block 440, the hardware acceleration device 110 determines whether it is capable of repairing the first failure. If so, the process 400 proceeds to block 450 to process the first failure according to the fault handling policy. If not, the process 400 proceeds to block 460 to send a fault notification to the host device 120, requesting the host device 120 to handle the first failure. An exemplary fault handling policy is described next.
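The sequence of blocks 420 to 460 can be sketched as a single function, assuming the first failure has already been detected at block 410. The function name, the callback parameters, and the return strings are hypothetical and stand in for whatever mechanisms an implementation provides.

```python
def handle_first_fault(fault, can_repair, repair, notify_host, request_host):
    """Follow blocks 420 to 460: collect, notify, then repair locally or escalate."""
    record = {"fault_type": fault["type"], "env": fault.get("env", {})}  # block 420
    notify_host(record)                                                   # block 430
    if can_repair(fault["type"]):                                         # block 440
        repair(fault["type"])                                             # block 450
        return "repaired_locally"
    request_host(record)                                                  # block 460
    return "escalated_to_host"

log = []
repairable = {"hardware_engine"}  # assumption: only engine faults are repaired locally

def run(fault_type):
    return handle_first_fault(
        {"type": fault_type},
        can_repair=lambda t: t in repairable,
        repair=lambda t: log.append(("repair", t)),
        notify_host=lambda r: log.append(("notify", r["fault_type"])),
        request_host=lambda r: log.append(("request", r["fault_type"])),
    )

engine_result = run("hardware_engine")
watchdog_result = run("watchdog")
```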

In the case where the failure type of the first failure is a hardware engine failure, the failure detection processing module 112 may process the failure by resetting the hardware engine 111.

For storage device failures, the fault detection processing module 112 may perform fault isolation for uncorrectable ECC errors of the memory 117 (e.g., Double Data Rate (DDR) memory). If the faulty memory location is used by the operating system 116, the operating system 116 needs to be notified to perform page fault handling. If the faulty memory location is in the device memory pool managed by the host device 120, the host device 120 needs to be notified at block 460, so that the fault detection processing module 122 of the host device 120 isolates the memory and the host device 120 subsequently avoids allocating that region when allocating device memory. Additionally, a failure of memory 117 may affect the processor cache, in which case the erroneous information in the processor environment may be discarded. For a failure of flash memory 115, for example when the fault detection processing module 112 finds that the number of bad blocks exceeds a threshold, the host device 120 may be notified, and the host device 120 may report to an administrator for replacement or repair.
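The memory isolation performed on the host side can be pictured as an allocator that never hands out an isolated region. The sketch below is hypothetical: it models the device memory pool as numbered pages, and the class name `DeviceMemoryPool` and its methods are not taken from the disclosure.

```python
class DeviceMemoryPool:
    """Page allocator that skips pages isolated after uncorrectable ECC errors."""

    def __init__(self, num_pages):
        self.free = set(range(num_pages))
        self.isolated = set()

    def isolate(self, page):
        """Mark a page as faulty so that it is never allocated again."""
        self.isolated.add(page)
        self.free.discard(page)

    def allocate(self):
        if not self.free:
            raise MemoryError("no usable pages left")
        page = min(self.free)
        self.free.remove(page)
        return page

    def release(self, page):
        # An isolated page stays out of the free set even after release.
        if page not in self.isolated:
            self.free.add(page)

pool = DeviceMemoryPool(num_pages=4)
pool.isolate(1)
allocated = [pool.allocate() for _ in range(3)]
```

Checking the isolated set on release covers the case where a page is found faulty while it is still in use.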

For a failure of the operating system 116, the fault detection processing module 112 may save the fault site information and trigger a restart of the operating system. In some embodiments, the fault detection processing module 112 may register a fault notification with the operating system 116, so that the notification is triggered in the operating system's fault handling flow; the fault detection processing module 112 then collects and saves fault information according to the fault handling policy and notifies the fault detection processing module 122 of the host device 120.

For faults of the business software 114, the fault detection processing module 112 may extract the environment information at the time the business software failed and notify the fault detection processing module 122 of the host device 120 to restart the business software.

For watchdog faults, the fault detection processing module 112 may report the fault type to the fault detection processing module 122 of the host device 120. The hardware acceleration device 110 is then restarted by the host device 120. At restart, the configuration of the hardware acceleration device 110 may be adjusted or its load reduced to avoid a recurrence of the fault.

Next, at the second processing level, the host device 120 processes the first failure notified by the hardware acceleration device 110 and the second failure detected by the host device 120 itself. As mentioned above, these are failures that the hardware acceleration device 110 cannot repair, and the host device 120 may process them according to the corresponding fault handling policy, e.g., determine whether it can handle the failure itself and which operation to use to repair it. For example, the host device 120 may determine whether to reset the hardware acceleration device 110 and determine the parameter adjustment after the reset. If the host device 120 is also unable to handle the failure, the baseboard management controller 130 may be further notified to handle it at the third processing level.

The fault handling policy may specify that the default repair operation for the second fault is to restart hardware acceleration device 110. In some embodiments, a hot restart of hardware acceleration device 110 is initiated at host device 120. Here, a hot restart refers to a process in which the link controller (e.g., PCIe controller) and the link-related devices are not reset and reconfigured during the restart of the hardware acceleration device 110, so that the hardware acceleration device 110 may communicate with the host device 120 immediately after the restart. The baseboard management controller 130 may also be notified if the host device 120 cannot repair the second failure, so that processing is performed at the third processing level.

At the third processing level, the baseboard management controller 130 processes the first and second faults notified from the host device 120 and the third fault detected by itself. As mentioned above, these faults are faults that the hardware acceleration device 110 and the host device 120 cannot repair, and the baseboard management controller 130 may process the reported faults according to a fault handling policy, for example, decide whether to reset the hardware acceleration device 110 and the host device 120, and adjust parameters after the reset.

For a third failure, such as a failure of link 12 (e.g., a PCIe link) between host device 120 and hardware acceleration device 110 caused by a link controller (e.g., a PCIe controller) or link-related devices/firmware, baseboard management controller 130 may initiate a reset operation to hardware acceleration device 110 and host device 120 over separate links 11 and 13. Links 11 and 13 may be different from link 12, e.g., SMBus-based links. The baseboard management controller 130 can also specify parameter adjustments after the hardware acceleration device 110 and the host device 120 are reset.
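The three processing levels described above can be sketched as a simple escalation loop. This is only an illustrative sketch, not the implementation of the disclosure: the fault type names, the `ENTRY_LEVEL` table, and the `try_repair` callback are hypothetical and stand in for whatever the fault handling policy specifies.

```python
from enum import Enum
from typing import Callable, Dict, List


class Level(Enum):
    ACCELERATOR = 1  # hardware acceleration device 110 (first level)
    HOST = 2         # host device 120 (second level)
    BMC = 3          # baseboard management controller 130 (third level)


# Hypothetical policy table: the level at which each fault type is first handled.
ENTRY_LEVEL: Dict[str, Level] = {
    "hardware_engine": Level.ACCELERATOR,  # a first fault
    "heartbeat_lost": Level.HOST,          # a second fault
    "link_failure": Level.BMC,             # a third fault
}


def handle_fault(fault_type: str,
                 try_repair: Callable[[Level, str], bool]) -> List[Level]:
    """Try each level from the entry level upward and return the levels tried.

    try_repair(level, fault_type) is an assumed callback that returns True
    when the given level repaired the fault.
    """
    start = ENTRY_LEVEL.get(fault_type, Level.ACCELERATOR)
    tried: List[Level] = []
    for level in Level:  # Enum iteration follows definition order
        if level.value < start.value:
            continue
        tried.append(level)
        if try_repair(level, fault_type):
            break  # repaired at this level, so no further escalation
    return tried
```

For instance, a hardware engine fault that the accelerator cannot repair but the host can would visit the first two levels and stop, while a link failure would go straight to the baseboard management controller.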

Fig. 5 shows a schematic block diagram of a process 500 for fault analysis according to an embodiment of the disclosure. Process 500 may be implemented by electronic device 101. More specifically, process 500 may be implemented by host device 120 in electronic device 101. It should be appreciated that process 500 may also include additional acts not shown and/or may omit acts shown, and that the acts in process 500 may be implemented in a different order, the scope of the disclosure being not limited in this respect.

At block 510, the host device obtains failure information for a first failure and first environmental information for the hardware acceleration device when the first failure occurred.

As mentioned above, in the case where the first failure is a hardware engine failure, the failure detection processing module 112 extracts failure information including some runtime information of the registers and software of the hardware engine 111. The fault detection processing module 112 may also obtain environmental information including the task model being processed at the time of the fault and the software configuration context. In the case where the first failure is a failure of the storage device, the failure detection processing module 112 acquires information of the failure itself and operation state information of the current hardware device, such as voltage, power consumption, ambient temperature, system load, and the like, as the first environmental information. For operating system faults and business software faults, the fault detection processing module 112 obtains fault field information of the operating system, fault field information of the business software, software and hardware configuration parameters of the hardware acceleration device 110 when the faults occur, and the like as first environment information. For a watchdog fault, the fault detection processing module 112 extracts fault information from the watchdog interrupt and extracts system state information of a fault scenario as first environment information. The above information is stored in the flash memory 115 or the memory 117 of the hardware acceleration device 110.

In response to receiving the notification from the hardware acceleration device 110 (e.g., block 430 of fig. 4), the fault detection processing module 122 of the host device 120 may obtain the fault information and the first environmental information via the link 12.

At block 520, the host device obtains second environmental information for the host device and the hardware acceleration device when the second failure occurs. To obtain the second environmental information, the failure detection processing module 122 of the host device 120 initiates a software and hardware health check for the host device 120 and the hardware acceleration device 110 and records these check information in response to detecting the second failure. The host device 120 may also obtain the task model and the environmental information of the software configuration context that the host device 120 assigned to the hardware acceleration device 110 when the second failure occurred.

At block 530, the host device obtains third environmental information for the host device and the hardware acceleration device when a third failure occurs. In response to being notified by the fault detection processing module 132 of the baseboard management controller 130 that the third failure is detected, the fault detection processing module 122 of the host device 120 initiates health checks for its own hardware and software and records the check information. In some cases, the third failure may be a link failure between the hardware acceleration device 110 and the host device 120. In that case, the fault detection processing module 122 may also acquire check information of the hardware acceleration device 110 via the baseboard management controller 130. In other words, the hardware acceleration device 110 acquires its own software and hardware health check information and running environment information in response to being notified of the third failure, and transmits this information to the baseboard management controller 130 via the link 11. The host device 120 acquires the above-described environmental information of the hardware acceleration device from the baseboard management controller 130 via the link 13.

At block 540, the host device transmits the fault information, the first environmental information, the second environmental information, and the third environmental information to the fault analysis platform. The fault detection processing module 122 of the host device 120 may report the obtained fault-related information to the fault analysis platform 102 via any type of communication link. In general, the fault analysis platform 102 may be configured to intelligently analyze fault-related information and provide updateable fault handling policies, recommended operating environments, and evasive operating environments.
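As a rough illustration of blocks 510 to 540, the record that the host device reports could be modeled as a single structure holding the fault information and the three sets of environmental information. The class name `FaultReport`, its field names, and the JSON encoding are assumptions made for this sketch; the disclosure does not prescribe a data format or transport.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict


@dataclass
class FaultReport:
    """Hypothetical container for the information sent to the fault analysis platform."""
    fault_info: Dict[str, Any]                                 # block 510
    first_env: Dict[str, Any] = field(default_factory=dict)    # block 510
    second_env: Dict[str, Any] = field(default_factory=dict)   # block 520
    third_env: Dict[str, Any] = field(default_factory=dict)    # block 530

    def to_json(self) -> str:
        # Serialize with sorted keys so identical reports produce identical text.
        return json.dumps(asdict(self), sort_keys=True)


# Example: a storage device fault with operating state captured as first_env.
report = FaultReport(
    fault_info={"type": "storage"},
    first_env={"temperature_c": 71, "system_load": 0.93},
)
payload = report.to_json()  # block 540 would send this over any communication link
```

Environmental information that was not collected (for example, when no third failure occurred) simply remains an empty mapping in this sketch.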

In some embodiments, the fault analysis platform 102 may perform data statistics and analysis calculations based on the received fault-related information, and update the fault handling policies of the electronic device 101 using techniques such as machine learning. The updated fault handling policies may be sent to the electronic device 101 and deployed to the fault detection processing module 112 of the hardware acceleration device 110, the fault detection processing module 122 of the host device 120, and the fault detection processing module 132 of the baseboard management controller 130, respectively.

The fault analysis platform 102 may also analyze the distribution of software and hardware faults based on the received fault-related information in order to improve subsequent software design and hardware selection. The fault analysis platform 102 may also analyze software and hardware states and system configuration parameters when a fault occurs, so that similar fault scenarios can subsequently be avoided. The fault analysis platform 102 may also analyze the software fault location to identify configurations or patches that may repair or circumvent the fault. That is, by analyzing the fault-related information, the fault analysis platform 102 may generate recommended and/or evasive operating environments for the host device 120 and the hardware acceleration device 110 and transmit them to the electronic device 101, thereby reducing the probability of occurrence of subsequent faults.

For example, for hardware engine failures, the failure analysis platform 102 may analyze the type and frequency of failure occurrences, as well as the task model and software configuration context being processed at the time of the failure occurrence, thereby helping to improve subsequent hardware design reliability. The fault analysis platform 102 may also analyze whether the fault can be circumvented and which software configuration mode can reduce the probability of the fault, thereby informing the host device 120 to select a lower fault rate mode of operation during subsequent operations.
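One simple way to picture the selection of a lower fault rate mode is to count faults per software configuration mode and divide by the number of runs in that mode. The sketch below assumes the platform also knows how many runs used each mode; the function name, the mode labels, and that run-count input are hypothetical and are not stated in the disclosure.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def lowest_failure_rate_mode(failures: Iterable[Tuple[str, str]],
                             runs_per_mode: Dict[str, int]) -> str:
    """Return the configuration mode with the lowest observed fault rate.

    failures: (fault_type, config_mode) pairs taken from fault reports.
    runs_per_mode: assumed total number of runs observed for each mode.
    """
    # Only modes with at least one recorded run have a defined rate.
    candidates = {mode: runs for mode, runs in runs_per_mode.items() if runs > 0}
    if not candidates:
        raise ValueError("no configuration mode has recorded runs")
    failures_per_mode = Counter(mode for _fault_type, mode in failures)
    return min(candidates, key=lambda mode: failures_per_mode[mode] / candidates[mode])
```

With two faults in 10 runs of one mode and one fault in 100 runs of another, the second mode would be recommended to the host device.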

As another example, for storage device failures, failure analysis platform 102 may perform intelligent analysis based on operational state information of hardware devices, such as voltage, power consumption, ambient temperature, system load, and the like. Thus, the electronic device 101 may find an optimal hardware state by adjusting parameters of device operating frequency, ambient temperature, workload, etc.

For operating system failures, the fault analysis platform 102 may look for potential system problems and configuration problems and give suggested configuration patterns. If the failure is a system error, the host device 120 may be recommended to issue a patch file to the hardware acceleration device 110 for repair. The host device 120 may perform a reset of the hardware acceleration device 110, and perform system configuration and install patches as suggested.

For business software faults, the fault analysis platform 102 can find the software fault location, analyze the running mode of the software when the software is faulty, and try to avoid re-faults as much as possible by adjusting the working mode and parameters of the software.

For the watchdog fault, the fault analysis platform 102 may analyze the system state of the watchdog fault scenario, analyze the software working mode and the system load with problems, and avoid triggering subsequent similar faults by adjusting the configuration and reducing the load.

The fault handling method of the embodiment of the present disclosure is described above with reference to fig. 1 to 5. Compared with the traditional method, the embodiment of the disclosure improves the coverage of fault processing through a hierarchical fault discovery and processing mechanism, so that faults of hardware acceleration equipment can be discovered and effectively processed in time. In some embodiments, further improvements in subsequent hardware and software design are provided through intelligent analysis to reduce the probability of occurrence of subsequent failures.

Fig. 6 shows a schematic block diagram of a fault handling apparatus 600 according to an embodiment of the disclosure. The apparatus 600 may be arranged at the electronic device 101. As shown, the apparatus 600 includes a first fault detection unit 610, a second fault detection unit 620, a third fault detection unit 630, and a fault handling unit 640. The first fault detection unit 610 is configured to detect a first failure of the hardware acceleration device at the hardware acceleration device. The second fault detection unit 620 is configured to detect a second failure of the hardware acceleration device at a host device communicatively connected to the hardware acceleration device. The third fault detection unit 630 is configured to detect a third failure of the host device and the hardware acceleration device at a baseboard management controller communicatively connected to the hardware acceleration device and the host device. The fault handling unit 640 is configured to determine one of the hardware acceleration device, the host device, and the baseboard management controller to perform a repair operation based on the fault type of the first fault, the second fault, or the third fault.

In some embodiments, the first fault detection unit 610 may be configured to detect at least one of: failure of a hardware engine in a hardware acceleration device, failure of a storage device in the hardware acceleration device, failure of an operating system running on the hardware acceleration device, failure of business software running on the hardware acceleration device, watchdog failure of the hardware acceleration device.

In some embodiments, the fault handling unit 640 may be further configured to: repair the first fault using the hardware acceleration device in response to determining, based on the fault type of the first fault, that the first fault can be repaired at the hardware acceleration device; and, in response to determining that the first fault cannot be repaired at the hardware acceleration device, notify the host device to cause the host device to process the first fault.

In some embodiments, the second fault detection unit 620 may be configured to: perform, at the host device, survival monitoring for the hardware acceleration device; and determine that a second fault is detected if the number of losses of the survival monitoring reaches a threshold.
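The survival monitoring described above can be illustrated with a small loss counter. This is a minimal sketch under stated assumptions: the class name `SurvivalMonitor` and the default threshold of 3 are hypothetical, and resetting the counter after a successful probe (so that only consecutive losses count) is an assumption, since the disclosure only says that the number of losses reaches a threshold.

```python
class SurvivalMonitor:
    """Counts lost survival probes sent by the host to the acceleration device."""

    def __init__(self, loss_threshold: int = 3) -> None:
        if loss_threshold < 1:
            raise ValueError("loss_threshold must be at least 1")
        self.loss_threshold = loss_threshold
        self.lost = 0

    def record(self, reply_received: bool) -> bool:
        """Record one probe result; return True when a second fault is declared."""
        if reply_received:
            self.lost = 0   # assumption: a reply clears the loss count
        else:
            self.lost += 1
        return self.lost >= self.loss_threshold
```

A host-side loop would call `record` once per probe interval and, when it returns True, start the repair operation chosen by the fault handling policy.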

In some embodiments, the fault handling unit 640 may be configured to: in response to detecting that the second fault is a fault detected through the survival monitoring, initiate a hot restart of the hardware acceleration device at the host device.

In some embodiments, the third fault detection unit 630 may be configured to detect a failure of a first communication link between the host device and the hardware acceleration device.

In some embodiments, the fault handling unit 640 may be configured to: in response to detecting the failure of the first communication link, initiate, at the baseboard management controller, a reset operation to the host device and the hardware acceleration device via a second communication link, the second communication link being different from the first communication link.

In some embodiments, the apparatus 600 may further include an information collection unit. The information collection unit is configured to: acquire fault information of a first fault and first environment information of the hardware acceleration device when the first fault occurs; acquire second environment information of the host device and the hardware acceleration device when a second fault occurs; acquire third environment information of the host device and the hardware acceleration device when a third fault occurs; and send the fault information, the first environment information, the second environment information, and the third environment information to a fault analysis platform.

In some embodiments, the apparatus 600 may further comprise a receiving unit. The receiving unit is configured to: receive a recommended operating environment and/or an evasive operating environment for the host device and the hardware acceleration device from the fault analysis platform; and receive an updated fault handling policy from the fault analysis platform.

In some embodiments, the hardware acceleration device may include a peripheral component interconnect express (PCIe) acceleration device including a hardware engine and configured to perform computing tasks from the host device using the hardware engine.

Fig. 7 shows a schematic block diagram of an example device 700 that may be used to implement embodiments of the present disclosure. For example, a fault handling method, a fault handling apparatus, and an electronic device according to embodiments of the present disclosure may be implemented by the device 700. As shown, the device 700 includes a Central Processing Unit (CPU) 701 that can perform various suitable actions and processes in accordance with computer program instructions stored in a Read Only Memory (ROM) 702 or loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 may also be stored. The CPU 701, ROM 702, and RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.

Various components in device 700 are connected to I/O interface 705. Types of I/O interfaces include, but are not limited to, Peripheral Component Interconnect Express (PCIe), Universal Serial Bus (USB), High-Definition Multimedia Interface (HDMI), Serial Attached SCSI (SAS), and the like. The components based on the I/O interface 705 may include, but are not limited to: an input-output unit 706 such as a keyboard, a mouse, a display, a speaker, and the like; a hardware acceleration device 707, such as a PCIe network card, a graphics processing unit (GPU) card, Non-Volatile Memory Express (NVMe) storage, a smart network card, etc.; a storage unit 708 such as a magnetic disk, an optical disk, or the like; and a communication unit 709 such as a network adapter, modem, wireless communication transceiver, or the like. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.

The various procedures and processes described above, such as processes 200, 400, and/or 500, may be performed by a processing unit in device 700, such as the CPU 701, and/or other processing units (e.g., a microprocessor on a motherboard of device 700, a processing unit in hardware acceleration device 707, etc.). For example, in some embodiments, the processes 200, 400, and/or 500 may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 700 via ROM 702 and/or communication unit 709. When the computer program is loaded into RAM 703 and executed by the CPU 701, one or more actions of processes 200, 400, and/or 500 described above may be performed.

The present disclosure may be methods, apparatus, systems, and/or computer program products. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for performing aspects of the present disclosure.

The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: portable computer disks, hard disks, Random Access Memory (RAM), Read-Only Memory (ROM), erasable programmable read-only memory (EPROM or flash memory), Static Random Access Memory (SRAM), portable compact disk read-only memory (CD-ROM), Digital Versatile Disks (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in a groove having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media, as used herein, are not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., optical pulses through fiber optic cables), or electrical signals transmitted through wires.

The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.

The computer program instructions for performing the operations of the present disclosure may be assembly instructions, Instruction Set Architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++, or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present disclosure are implemented by personalizing electronic circuitry, such as programmable logic circuitry, Field Programmable Gate Arrays (FPGAs), or Programmable Logic Arrays (PLAs), with state information of the computer readable program instructions, the electronic circuitry being able to execute the computer readable program instructions.

Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.

These computer readable program instructions may be provided to a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.

The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The embodiments of the present disclosure have been described above. The foregoing description is illustrative, not exhaustive, and is not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims

1. A fault handling method, comprising:

detecting, at a hardware acceleration device, a first failure of the hardware acceleration device;

detecting a second failure of the hardware acceleration device at a host device communicatively coupled to the hardware acceleration device;

detecting a third failure of the host device and the hardware acceleration device at a baseboard management controller communicatively coupled to the hardware acceleration device and the host device; and

determining one of the hardware acceleration device, the host device, and the baseboard management controller to perform a repair operation based on a failure type of the first failure, the second failure, or the third failure.

2. The method of claim 1, wherein detecting the first failure of the hardware acceleration device comprises detecting at least one of:

failure of a hardware engine in the hardware acceleration device, failure of a storage device in the hardware acceleration device, failure of an operating system running on the hardware acceleration device, failure of business software running on the hardware acceleration device, or watchdog failure of the hardware acceleration device.

3. The method of claim 1 or 2, wherein determining one of the hardware acceleration device, the host device, and the baseboard management controller to perform a repair operation comprises:

repairing the first fault using the hardware acceleration device in response to determining that the first fault can be repaired at the hardware acceleration device based on a fault type of the first fault; and

in response to determining that the first fault cannot be repaired at the hardware acceleration device, notifying the host device to cause the host device to process the first fault.

4. The method of claim 1, wherein detecting the second failure of the hardware acceleration device comprises:

performing, at the host device, survival monitoring for the hardware acceleration device; and

determining that the second fault is detected in the event that the number of losses of the survival monitoring reaches a threshold.

5. The method of claim 4, wherein determining one of the hardware acceleration device, the host device, and the baseboard management controller to perform a repair operation comprises:

in response to detecting that the second fault is a fault detected through the survival monitoring, initiating a warm restart of the hardware acceleration device at the host device.

6. The method of claim 1, wherein detecting a third failure of the host device and the hardware acceleration device comprises:

detecting a failure of a first communication link between the host device and the hardware acceleration device.

7. The method of claim 6, wherein determining one of the hardware acceleration device, the host device, and the baseboard management controller to perform a repair operation comprises:

in response to detecting the failure of the first communication link, initiating, at the baseboard management controller, a reset operation to the host device and the hardware acceleration device via a second communication link, the second communication link being different from the first communication link.

8. The method of claim 1, the method further comprising:

acquiring fault information of the first fault and first environment information of the hardware acceleration device when the first fault occurs;

acquiring second environment information of the host device and the hardware acceleration device when the second fault occurs;

acquiring third environment information of the host device and the hardware acceleration device when the third fault occurs; and

sending the fault information, the first environment information, the second environment information, and the third environment information to a fault analysis platform.

9. The method of claim 8, further comprising:

receiving a recommended operating environment and/or an evasive operating environment for the host device and the hardware acceleration device from the fault analysis platform; and

receiving a fault handling policy from the fault analysis platform.

10. The method of claim 1 or 2, wherein the hardware acceleration device comprises a peripheral component interconnect express (PCIe) acceleration device comprising a hardware engine and configured to execute computing tasks from the host device using the hardware engine.

11. A fault handling apparatus comprising:

a first failure detection unit configured to detect, at a hardware acceleration device, a first failure of the hardware acceleration device;

a second failure detection unit configured to detect a second failure of the hardware acceleration device at a host device communicatively connected to the hardware acceleration device;

a third failure detection unit configured to detect a third failure of the host device and the hardware acceleration device at a baseboard management controller communicatively connected to the hardware acceleration device and the host device; and

a fault handling unit configured to determine one of the hardware acceleration device, the host device, and the baseboard management controller to perform a repair operation based on a fault type of the first fault, the second fault, or the third fault.

12. An electronic device, comprising:

at least one processing unit; and

at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, the instructions, when executed by the at least one processing unit, causing the electronic device to perform the method of any one of claims 1 to 10.

13. A computer readable storage medium comprising machine executable instructions which, when executed by a device, cause the device to perform the method of any one of claims 1 to 10.

14. A computer program product comprising machine executable instructions which, when executed by a device, cause the device to perform the method of any one of claims 1 to 10.