CN115454705A

CN115454705A - Fault processing method, related device, computer device, medium, and program

Info

Publication number: CN115454705A
Application number: CN202211212545.XA
Authority: CN
Inventors: 赵建平; 孙路遥
Original assignee: Shenzhen Xingyun Zhilian Technology Co ltd
Current assignee: Shenzhen Xingyun Zhilian Technology Co ltd
Priority date: 2022-07-01
Filing date: 2022-07-01
Publication date: 2022-12-09
Also published as: CN114880266B; CN114880266A

Abstract

The application relates to the field of electric digital data processing in the Internet industry, and discloses a fault processing method, a fault processing apparatus, a computer device and a storage medium. The method comprises: in response to detecting a fault of an embedded processor in a data processor, entering a proxy-response mode, wherein the proxy-response mode comprises sending a hot-unplug interrupt signal to a computer host to isolate the computer host from the faulty embedded processor, and the hot-unplug interrupt signal is used for indicating that the embedded processor performs a hot-unplug operation; and in response to detecting that the embedded processor has been repaired, sending a hot-plug-in signal to the computer host and exiting the proxy-response mode to complete fault recovery, wherein the hot-plug-in signal is used for indicating that the embedded processor performs a hot-plug-in operation. By implementing the embodiments of the application, fault isolation can be achieved effectively without restarting the computer host, and the impact on the computer host is minimized, so that normal operation of the computer host can be ensured.

Description

Fault processing method, related apparatus, computer device, medium, and program

Technical Field

The present application relates to the field of electrical digital data processing in the internet industry, and in particular, to a method, related apparatus, computer device, storage medium, and program for fault handling.

Background

With the rapid development of data centers, communication capacity and computing capacity have become two important and complementary development directions in data center infrastructure. If a data center focuses only on improving computing power while the communication infrastructure fails to keep up, the overall system performance of the data center remains limited and its real potential cannot be realized. To cope with increasingly large and complex data volumes, Data Processing Units (DPUs) have emerged.

The data processor is positioned as a co-processing unit and embodies the idea of separating the data plane from the control plane. It works in cooperation with a Central Processing Unit (CPU): the CPU is responsible for general control, while the data processor is dedicated to data processing. That is, the data processor may offload data processing/pre-processing from the central processor while distributing computing power closer to where the data is generated, thereby reducing traffic. Because the data processor needs to move computation close to the data, higher demands are placed on its reliability and availability. In order to meet the network's requirement for transmitting large volumes of data, the data processor needs to use a Peripheral Component Interconnect Express (PCIe) interface for data transmission. The data processor is therefore typically exposed to PCIe device failures, and the data processor system is required to assist the PCIe device in recovering from them.

Currently, a data processor emulates a PCIe device on the Embedded Central Processing Unit (ECPU) side, and PCIe-related transaction layer packets (TLPs) from the host are forwarded to the embedded processor for processing. This means that when the PCIe emulation program of the embedded processor or the system itself fails, two situations may occur: first, the host's user services are affected, and the host may even hang; second, restoring the data processor requires restarting the host, which interrupts all running programs. Therefore, how to isolate faults so that the user's service processing is not affected is a problem to be solved by those skilled in the art.

Disclosure of Invention

The embodiment of the application provides a fault processing method, a related device, computer equipment, a storage medium and a program, which can effectively realize fault isolation, do not need to restart a computer host, and can reduce the influence on the computer host to the maximum extent, thereby ensuring the normal operation of the computer host.

In a first aspect, an embodiment of the present application provides a method for fault handling, which is applied to a programmable logic device, where:

in response to detecting that an embedded processor in a data processor has failed, the programmable logic device enters a proxy-response mode, wherein the proxy-response mode comprises sending a hot-unplug interrupt signal to a computer host to isolate the computer host from the faulty embedded processor, and the hot-unplug interrupt signal is used for indicating that the embedded processor performs a hot-unplug operation;

and in response to detecting that the embedded processor has been repaired, the programmable logic device sends a hot-plug-in signal to the computer host and exits the proxy-response mode to complete fault recovery, wherein the hot-plug-in signal is used for indicating that the embedded processor performs a hot-plug-in operation.

In a second aspect, an embodiment of the present application provides an apparatus for fault handling, which is applied to a programmable logic device, where:

a fault isolation unit, configured to cause the programmable logic device to enter a proxy-response mode in response to detecting that an embedded processor in a data processor has failed, wherein the proxy-response mode comprises sending a hot-unplug interrupt signal to a computer host to isolate the computer host from the faulty embedded processor, and the hot-unplug interrupt signal is used for indicating that the embedded processor performs a hot-unplug operation;

and a fault recovery unit, configured to cause the programmable logic device to send a hot-plug-in signal to the computer host and exit the proxy-response mode to complete fault recovery in response to detecting that the embedded processor has been repaired, wherein the hot-plug-in signal is used for indicating that the embedded processor performs a hot-plug-in operation.

In a third aspect, this application provides a computer device comprising a processor, a memory and a communication interface, wherein the memory stores a computer program configured to be executed by the processor, and the computer program comprises instructions for some or all of the steps as described in the first aspect of this application.

In a fourth aspect, embodiments of the present application provide a computer-readable storage medium storing a computer program, where the computer program causes a computer to perform some or all of the steps described in the first aspect of the embodiments of the present application.

The embodiment of the application has the following beneficial effects:

By adopting the fault processing method, the fault processing apparatus, the computer device and the storage medium, after a fault of the embedded processor in the data processor is detected, the programmable logic device can enter the proxy-response mode and send a hot-unplug interrupt signal to the computer host. The hot-unplug interrupt signal indicates that the embedded processor performs a hot-unplug operation, so that communication between the computer host and the embedded processor is disconnected and fault isolation is completed simply and efficiently. This avoids the problem in the conventional technical solution that, once the embedded processor fails, the computer host must be restarted and all running programs are interrupted; the impact on the computer host is thereby minimized and its normal operation is ensured. After it is detected that the embedded processor has been repaired, the programmable logic device sends a hot-plug-in signal to the computer host, the hot-plug-in signal indicating that the embedded processor performs a hot-plug-in operation, and exits the proxy-response mode, so that the computer host and the embedded processor communicate again, fault recovery is completed quickly, and fault processing efficiency is improved. In addition, the embodiments of the application do not rely on external tools such as a BMC system or a management and control platform, which reduces external dependencies and improves reliability.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the embodiments or the prior art descriptions will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings can be obtained by those skilled in the art without creative efforts. Wherein:

fig. 1 is a schematic diagram of a system architecture according to an embodiment of the present application;

fig. 2 is a schematic flowchart of a method for fault handling according to an embodiment of the present application;

fig. 3 is a schematic structural diagram of a fault handling apparatus according to an embodiment of the present application;

fig. 4 is a schematic structural diagram of a computer device according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The terms "first," "second," "third," and "fourth," etc. in the description and claims of this application and the accompanying drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements but may optionally include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.

It should also be understood that the term "and/or" herein merely describes an association between associated objects, meaning that three relationships may exist. For example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the former and latter associated objects are in an "or" relationship.

For ease of understanding, the following first presents several basic concepts related to the embodiments of the present application.

A Data Processing Unit (DPU) belongs to a large class of newly developed dedicated processors. It is the third major computing chip in the data center scenario, after the Central Processing Unit (CPU) and the Graphics Processing Unit (GPU), and provides a computing engine for high-bandwidth, low-latency, data-intensive computing scenarios. The data processor has three main characteristics: offloading, acceleration and isolation. Accordingly, its three main application scenarios are networking, storage and security. In terms of offloading, the data processor may act as an offload engine for the central processor, releasing the computing power of the central processor to upper-layer applications; for example, the data processor may offload data center network services (virtual switching, virtual routing, etc.), data center storage services, data center security services (firewalls, encryption/decryption, etc.), and so on. In terms of acceleration, the data processor will become a sandbox for algorithm acceleration and the most flexible accelerator carrier. The data processor is not a fixed-function Application Specific Integrated Circuit (ASIC). As standards organizations such as CXL (Compute Express Link) promote coherent data access protocols among central processors, graphics processors and data processors, the barriers to programming data processors will be further removed; combined with Field Programmable Gate Arrays (FPGAs) and the like, customizable hardware will have more room to develop, implementing software functions in hardware will become the norm, and the potential of heterogeneous computing will be fully realized as various data processors become widespread. In terms of isolation, the data processor will become a new data gateway, raising security and privacy to a new level. The asymmetric encryption algorithm SM2, the hash algorithm SM3, the symmetric block cipher algorithm SM4, and the like can be implemented in hardened form in the data processor.

Peripheral Component Interconnect Express (PCIe) is a high-speed serial, point-to-point, dual-channel, high-bandwidth interconnect. Each connected device is allocated its own dedicated channel bandwidth and does not share bus bandwidth. It defines slots and connectors of multiple widths: x1, x4, x8, x12, x16, and x32. Typically, low-speed peripherals (e.g., Wi-Fi cards) use a single-lane (x1) link, while graphics adapters use a much faster, wider x16 link.

A Field Programmable Gate Array (FPGA) is a product developed further on the basis of programmable devices such as Programmable Array Logic (PAL), and effectively addresses the limited number of gate circuits in those earlier devices. The basic structure of the FPGA comprises programmable input/output units, configurable logic blocks, a digital clock management module, wiring resources, embedded dedicated hard cores, underlying embedded functional units and the like. The FPGA features abundant wiring resources, reprogrammability, a high integration level and low investment cost, and is widely applied in the field of digital circuit design.

A Complex Programmable Logic Device (CPLD) adopts programming technologies such as Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory, and static random access memory (static RAM, SRAM), thereby forming a programmable logic device with high density, high speed, and low power consumption. The basic design method is to generate the corresponding target file on an integrated development software platform by means of schematic diagrams, a hardware description language or similar methods, and then to transfer the code to the target chip through a download cable, thereby realizing the designed digital system.

Referring to fig. 1, fig. 1 is a schematic diagram of a system architecture according to an embodiment of the present disclosure. As shown in fig. 1, the system architecture may include a computer host 100 and a data processor 200. The computer host 100 may include a computer device having multiple architectures such as an Intel x86 architecture CPU, an ARM architecture CPU, and an MIPS architecture CPU as a computing core, including but not limited to application forms such as an industrial computer, a server, a vehicle-mounted computer, and a mobile workstation.

In an embodiment of the present application, the data processor 200 may include a programmable logic device 201 and an embedded processor 202. Data processor 200 may use a PCIe standard protocol for data transfer, and thus, data processor 200 may be a PCIe device. The programmable logic device 201 may be an FPGA, a System On Chip (SOC), an ASIC, or the like, may also be a multi-core processor, or may also be another programmable logic device, which is not limited in this embodiment of the present invention. Programmable logic device 201 can communicate with computer host 100 through a PCIe interface. The programmable logic device 201 may communicate with the embedded processor 202 through a first interface. The first interface may be a PCIe interface, a Common Flash Interface (CFI) or a Serial Peripheral Interface (SPI), a peripheral component interconnect standard (PCI) interface, a local bus (LocalBus) interface, or the like, which is not limited in this embodiment of the present application.

The programmable logic device 201 may include a status register, which may be used to record the operating status of the embedded processor 202 or other devices in the data processor 200, and the programmable logic device 201 may determine whether the embedded processor 202 has a fault by reading the information of the status register. Optionally, a first-in first-out (FIFO) channel may be established between the computer host 100, the programmable logic device 201 and the embedded processor 202, so that the computer host 100, the programmable logic device 201 and the embedded processor 202 may perform data interaction through the FIFO channel to improve the data transmission speed.

The data processor 200 may further comprise a complex programmable logic device, not shown in fig. 1, which may communicate with the programmable logic device 201 via the first interface, and may also communicate with the embedded processor 202 via the first interface. That is, the complex programmable logic device may communicate with the programmable logic device 201 and the embedded processor 202, respectively. The data processor 200 may also comprise a memory (e.g. a Flash memory), not shown in fig. 1, which may be used to store the programs that need to be run, and which may communicate with the programmable logic device 201 via the CFI interface. In addition, the data processor 200 may further include a PCIe switch, a Graphics Processing Unit (GPU), a Digital Signal Processor (DSP), a disk array (RAID), and the like, which are not shown in fig. 1; the embodiment of the present application is not limited thereto.

With the rapid development of data centers, communication capacity and computing capacity have become two complementary and important development directions of data center infrastructure. If a data center focuses only on improving computing power while the communication infrastructure fails to keep up, the overall system performance of the data center remains limited and its real potential cannot be realized. To cope with increasingly large and complex data volumes, Data Processing Units (DPUs) have emerged.

Currently, a data processor emulates a PCIe device on the Embedded Central Processing Unit (ECPU) side, and PCIe-related transaction layer packets (TLPs) from the host are all forwarded to the embedded processor for processing. This means that when the PCIe emulation program of the embedded processor or the system itself fails, two situations may occur: first, the host's user services are affected, and the host may even hang; second, restoring the data processor requires restarting the host, which interrupts all running programs. Therefore, how to isolate faults so that the user's service processing is not affected is a problem to be solved by those skilled in the art.

In order to solve the above problem, embodiments of the present application provide a method for fault handling, which can effectively implement fault isolation, and does not need to restart a host computer, and can reduce the influence on the host computer to the maximum extent, thereby ensuring normal operation of the host computer.

Referring to fig. 2, fig. 2 is a schematic flowchart illustrating a method for fault handling according to an embodiment of the present disclosure. It is understood that the method may be used in the system architecture shown in fig. 1, and may be specifically executed by the programmable logic device shown in fig. 1, and the method may include the following steps S201 to S202, where:

Step S201: in response to detecting a failure of an embedded processor in the data processor, the programmable logic device enters a proxy-response mode, which includes sending a hot-unplug interrupt signal to the host computer to isolate the host computer from the faulty embedded processor.

As the number of PCIe devices attached to the host computer increases, so does the probability that a PCIe device fails. A PCIe device fault may affect the normal operation of the host computer and, in serious cases, may even cause the host computer to hang. Handling failed PCIe devices is therefore an important part of keeping the host computer running normally. Since the data processor may use the PCIe standard protocol for data transmission, the data processor may be a PCIe device. As shown in fig. 1, the data processor may include a programmable logic device and an embedded processor. The programmable logic device may be an FPGA, an SOC, an ASIC, or the like, may also be a multi-core processor, or may also be another programmable logic device, which is not limited in this embodiment of the present application. The data processor may emulate a PCIe device on the embedded processor side, and the embedded processor may be configured to process the PCIe-related TLP packets that the host computer forwards to it. This means that when the PCIe emulation program of the embedded processor or the system itself fails, the failure of the embedded processor may affect the normal operation of the host computer.

To prevent a failure of the embedded processor from affecting the normal operation of the host computer, the failed embedded processor needs to be isolated from the host computer. In the embodiment of the application, the programmable logic device can be used to isolate the failed embedded processor from the host computer. Specifically, after detecting a failure of the embedded processor, the programmable logic device enters a proxy-response mode to complete the fault isolation. One possible implementation of entering the proxy-response mode is as follows: the programmable logic device sends a hot-unplug interrupt signal to the host computer, where the hot-unplug interrupt signal is used for indicating that the embedded processor performs a hot-unplug operation. When the host computer receives the hot-unplug interrupt signal, it considers the embedded processor to have been hot-unplugged and no longer performs service interaction with it, so that communication between the host computer and the embedded processor is disconnected and fault isolation is completed.

It can be seen that after a fault of the embedded processor is detected, the programmable logic device enters the proxy-response mode and sends a hot-unplug interrupt signal to the host computer, thereby isolating the host computer from the faulty embedded processor. This isolation is thorough and prevents the fault from spreading to the whole host computer. In addition, it does not require restarting the host computer and is simple and efficient, so the impact on the host computer is minimized and the other services of the host computer can continue to run normally without interference.
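The isolation step S201, together with the recovery step of the first aspect, can be pictured as a two-state machine inside the programmable logic device. The following is only a minimal sketch under stated assumptions: the class, method and signal names (`ProgrammableLogicDevice`, `on_fault_detected`, `HOT_UNPLUG_INTERRUPT`, `HOT_PLUG_IN`) are hypothetical, and the list `signals_to_host` merely stands in for whatever the device actually raises over the PCIe link.

```python
from enum import Enum, auto


class Mode(Enum):
    NORMAL = auto()
    PROXY = auto()  # mode entered after a fault; the device shields the host


class ProgrammableLogicDevice:
    """Toy model of the isolation/recovery flow; all names are hypothetical."""

    def __init__(self):
        self.mode = Mode.NORMAL
        self.signals_to_host = []  # stand-in for signals sent to the host

    def on_fault_detected(self):
        # Step S201: change mode and tell the host the processor was hot-unplugged.
        if self.mode is Mode.NORMAL:
            self.mode = Mode.PROXY
            self.signals_to_host.append("HOT_UNPLUG_INTERRUPT")

    def on_repair_detected(self):
        # Recovery: tell the host the processor is plugged in again, then
        # return to normal operation.
        if self.mode is Mode.PROXY:
            self.signals_to_host.append("HOT_PLUG_IN")
            self.mode = Mode.NORMAL


pld = ProgrammableLogicDevice()
pld.on_fault_detected()
pld.on_repair_detected()
print(pld.signals_to_host)  # ['HOT_UNPLUG_INTERRUPT', 'HOT_PLUG_IN']
```

The guards on `self.mode` make repeated fault or repair notifications harmless, which is one plausible design choice; the source does not say how duplicate detections are handled.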

In a possible implementation manner, before step S201, a fault detection stage may be further included, and specifically, the following steps may be included:

in response to reading a preset flag bit from the computer host, the programmable logic device determines that the embedded processor is faulty, and the preset flag bit is used for indicating that the embedded processor is faulty.

In the embodiment of the application, FIFO channels can be established among the computer host, the programmable logic device and the embedded processor, and the three can perform data interaction through the FIFO channels. Specifically, the host computer may send a heartbeat packet to the embedded processor through the FIFO channel at a predetermined interval, and after receiving the heartbeat packet, the embedded processor replies with a heartbeat packet to the host computer to maintain mutual communication between the host computer and the embedded processor. If the host computer receives the heartbeat packet sent by the embedded processor within the preset time length, the embedded processor can be judged to be in a normal working state, that is, the embedded processor has not failed. If the host computer has not received the heartbeat packet from the embedded processor when the preset time length is exceeded, the embedded processor can be judged to have failed, that is, the embedded processor may have hung.
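The host-side heartbeat check described above reduces to comparing the time since the last reply against the preset time length. A minimal sketch follows, assuming time is passed in explicitly as seconds; the class name `HeartbeatMonitor` and the 3-second timeout are illustrative assumptions, not values from the application.

```python
class HeartbeatMonitor:
    """Host-side view: report a fault if no reply arrives within timeout_s."""

    def __init__(self, timeout_s, now=0.0):
        self.timeout_s = timeout_s
        self.last_reply_at = now  # time of the most recent heartbeat reply

    def on_reply(self, now):
        # Called whenever a heartbeat reply arrives from the embedded processor.
        self.last_reply_at = now

    def is_faulty(self, now):
        # Fault once the preset time length has been exceeded without a reply.
        return (now - self.last_reply_at) > self.timeout_s


monitor = HeartbeatMonitor(timeout_s=3.0)
monitor.on_reply(now=1.0)
print(monitor.is_faulty(now=3.5))  # False: 2.5 s since the last reply
print(monitor.is_faulty(now=4.5))  # True: 3.5 s since the last reply
```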

In a possible embodiment, the host computer may be provided with a preset program, and after detecting that the embedded processor fails, a preset flag bit may be set in the preset program, where the preset flag bit is used to indicate that the embedded processor fails. After detecting the preset flag, the programmable logic device may determine that the embedded processor has a fault, and enter a fault isolation stage, for example, step S201 may be executed to complete fault isolation between the host computer and the embedded processor.

In one possible embodiment, the host computer may be provided with a first status register, which may be used to store the operating status of the embedded processor. The state value of the first status register may be set to: "00" indicates that the embedded processor is working properly and "10" indicates that the embedded processor is malfunctioning. The preset flag bit may be an indicator for indicating that the embedded processor fails, and in this embodiment, the preset flag bit may be understood as a status value "10" of the first status register. When the host computer detects that the embedded processor works normally, the state value of the first state register keeps '00' unchanged. The host computer may modify the state value of the first status register from "00" to "10" after detecting that the embedded processor has failed, i.e. indicating that the embedded processor has failed. The programmable logic device can read the state value of the first state register in a polling mode, and when the state value of the first state register is read to be 10, namely when a preset flag bit is read, the embedded processor is determined to be in fault. After detecting the embedded processor is faulty, the programmable logic device enters a fault isolation stage, for example, step S201 may be executed to complete fault isolation between the computer host and the embedded processor.
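The polling of the first status register can be sketched as follows. The state values "00" and "10" come from the description above; the function name `poll_until_fault`, the `read_register` callable and the bounded number of polls are assumptions made only so that the example is self-contained and terminates.

```python
STATE_NORMAL = 0b00  # embedded processor working normally
STATE_FAULT = 0b10   # preset flag bit: embedded processor has failed


def poll_until_fault(read_register, max_polls):
    """Return the index of the poll that first read the fault value, else None."""
    for i in range(max_polls):
        if read_register() == STATE_FAULT:
            return i
    return None


# Simulated register reads: normal twice, then the host sets the fault value.
samples = iter([STATE_NORMAL, STATE_NORMAL, STATE_FAULT, STATE_FAULT])
print(poll_until_fault(lambda: next(samples), max_polls=4))  # 2
```

In a real device the programmable logic device would presumably poll indefinitely at some interval and then proceed to step S201; the source does not specify the polling interval.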

It can be seen that FIFO channels are established among the computer host, the programmable logic device and the embedded processor, and the computer host, the programmable logic device and the embedded processor can perform data interaction through the FIFO channels. After the host computer detects that the embedded processor fails, a preset flag bit can be set for indicating that the embedded processor fails, and after the programmable logic device reads the preset flag bit, the embedded processor can be determined to fail. The fault detection mode has the advantages of small burden of the programmable logic device, simplicity and high efficiency.

In a possible embodiment, the specific implementation manner of the fault detection can also be realized through steps A1-A2:

step A1: the programmable logic device acquires register information of the status register, and the register information is used for recording the running state of the embedded processor.

Step A2: and the programmable logic device judges whether the embedded processor has a fault according to the register information.

In the embodiment of the application, the programmable logic device can be provided with a status register, and the status register is used for storing the running state of the embedded processor. The register information stored by the status register is used for recording the running state of the embedded processor.

Taking a programmable logic device that includes one 32-bit status register as an example, the meaning of each bit of the status register is shown in Table 1:

TABLE 1

Bit0	Bit1	Bit2-Bit3	Bit4-Bit8	Bit9-Bit31
First flag bit	Second flag bit	Reserved	Fault information code	Reserved

The bits are described in detail as follows:

Bit0: the first flag bit, used by the programmable logic device to judge whether the embedded processor is working normally. When the embedded processor is working normally, the first flag bit in the status register of the programmable logic device is set within a preset time. Here, "set" may be understood as setting the corresponding flag to either 0 or 1.

Bit1: the second flag bit, used for storing the running state of the embedded processor fed back by the complex programmable logic device; the programmable logic device can store the information fed back by the complex programmable logic device in Bit1.

Bit2-Bit3: reserved bits.

Bit4-Bit8: if the programmable logic device detects that the embedded processor has failed, the corresponding fault information code can be stored in Bit4-Bit8 according to the fault source and the fault type of the embedded processor.

Bit9-Bit31: reserved bits.
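The field layout of Table 1 can be decoded with shifts and masks. The sketch below assumes that Bit0 is the least significant bit of the 32-bit value, which the source does not state; the function name `decode_status` and the example fault code `0x0A` are likewise hypothetical.

```python
def decode_status(reg):
    """Split a 32-bit status register value into the fields of Table 1."""
    reg &= 0xFFFFFFFF  # keep only 32 bits
    return {
        "first_flag": reg & 0x1,          # Bit0
        "second_flag": (reg >> 1) & 0x1,  # Bit1
        "fault_code": (reg >> 4) & 0x1F,  # Bit4-Bit8 (5 bits)
    }


# Example value: fault code 0x0A in Bit4-Bit8, Bit0 = 1, Bit1 = 0.
value = (0x0A << 4) | 0b01
print(decode_status(value))
# {'first_flag': 1, 'second_flag': 0, 'fault_code': 10}
```

The reserved ranges (Bit2-Bit3 and Bit9-Bit31) are deliberately ignored by the decoder.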

After the register information of the status register is obtained, whether the embedded processor fails or not may be determined according to the register information, and the specific implementation manner may refer to the following description, which is not described herein again.

It can be seen that register information is acquired from the status register and used to judge whether the embedded processor has failed. Self-detection of the embedded processor is thus realized through the status register provided by the programmable logic device; the approach is simple and efficient, few devices participate in the detection, and the accuracy of fault detection is high.

In a possible implementation manner, step A2 may specifically include the following steps:

the programmable logic device reads a state value of a first flag bit in the status register from the register information according to a preset period; the programmable logic device judges whether the first flag bit has been set according to the read state value of the first flag bit; and in response to the first flag bit not having been set by the preset time, the programmable logic device determines that the embedded processor has failed.

In the embodiment of the application, the self-detection of the embedded processor can be realized through a status register provided by the programmable logic device. Specifically, the operating state of the embedded processor may be associated with a first flag bit of the status register (e.g., Bit0 shown in Table 1). The programmable logic device periodically (e.g., every 1 minute) sends a discovery request to the embedded processor. If the embedded processor that receives the discovery request is working normally, it replies with a discovery response to the programmable logic device, that is, it responds to the discovery request sent by the programmable logic device, and at this time the first flag bit can be set. Here, "set" may be understood as setting the corresponding flag to either 0 or 1. For example, when the embedded processor responds to a discovery request sent by the programmable logic device, if the state value of the first flag bit in the previous cycle is "0", the state value of the first flag bit is set to "1"; or if the state value of the first flag bit in the previous cycle is "1", the state value of the first flag bit is set to "0". That is, if the embedded processor is operating properly, the state value of the first flag bit of the status register in the programmable logic device changes periodically. If the embedded processor fails, it cannot respond to the discovery request sent by the programmable logic device, and the first flag bit cannot be set. For example, when the embedded processor cannot respond to a discovery request sent by the programmable logic device, if the state value of the first flag bit in the previous cycle is "0", the state value of the first flag bit remains "0"; or if the state value of the first flag bit in the previous cycle is "1", the state value of the first flag bit remains "1". That is, if the embedded processor fails, the state value of the first flag bit of the status register in the programmable logic device no longer changes periodically. Therefore, in the embodiment of the present application, the programmable logic device may read the state value of the first flag bit in the status register from the register information according to the preset period, then judge whether the first flag bit has been set according to the read state value, and determine that the embedded processor has failed if the first flag bit has not been set when the preset time is reached. The preset period and the preset time may be determined according to the actual application scenario, which is not limited in the embodiment of the present application.

It can be seen that the programmable logic device may read the state value of the first flag bit in the status register from the register information according to a preset period, judge whether the first flag bit is set according to the read state value, and determine that the embedded processor has failed if the first flag bit is not set when the preset time is reached. Because this self-detection of the embedded processor is realized through the status register provided by the programmable logic device, the approach is simple and efficient, few devices participate in the detection, and the accuracy of fault detection is high.
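The periodic check of the first flag bit can be sketched as follows. This is only a minimal illustration in Python, not the logic of the programmable logic device itself: the class name `HeartbeatMonitor`, the parameter `timeout_polls` (standing in for the preset time, expressed as a number of polling periods), and the choice of Bit0 as the first flag bit are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HeartbeatMonitor:
    """Declares a fault when the first flag bit stops toggling between polls."""

    timeout_polls: int = 3             # assumed: polls without a change before a fault is declared
    last_value: Optional[int] = None   # flag value seen on the previous poll
    stale_polls: int = 0               # consecutive polls in which the flag did not change

    def poll(self, status_register: int) -> bool:
        """Sample Bit0 once per preset period; return True once a fault is declared."""
        value = status_register & 0x1
        if self.last_value is None or value != self.last_value:
            self.stale_polls = 0       # the flag was set (toggled) since the last poll
        else:
            self.stale_polls += 1      # no discovery response arrived in this period
        self.last_value = value
        return self.stale_polls >= self.timeout_polls
```

With `timeout_polls=2`, a register sequence whose Bit0 alternates never reports a fault, while a sequence whose Bit0 stays at the same value is reported as a fault on the second unchanged poll.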

In a possible embodiment, step A2 may further include the steps of:

the programmable logic device reads a state value of a second flag bit in the status register from the register information, where the state value of the second flag bit is associated with a first signal fed back by a complex programmable logic device, and the complex programmable logic device is used for detecting the running state of the embedded processor and feeding back the first signal to the programmable logic device according to the running state of the embedded processor; and in response to reading that the state value of the second flag bit is a preset value, the programmable logic device determines that the embedded processor has failed.

In the embodiment of the application, the programmable logic device can detect whether the embedded processor has failed through the complex programmable logic device connected to the embedded processor. The complex programmable logic device can communicate with both the embedded processor and the programmable logic device, and a watchdog module can be arranged in the complex programmable logic device to monitor the running condition of the embedded processor. When the embedded processor works normally, it sends a feedback signal to the timer of the watchdog module within a preset time (for example, 1 second or 0.5 second) to clear the timer, thereby feeding the watchdog. If the embedded processor does not feed the timer within the preset time, the complex programmable logic device can feed back a first signal to the programmable logic device to indicate that the embedded processor did not respond within the preset time. When the programmable logic device receives the first signal, it may modify the state value of the bit in the status register that corresponds to the complex programmable logic device (e.g., Bit1 shown in Table 1) to a preset value to indicate that the embedded processor has failed.

For example, the state value of the second flag bit (see Bit1 shown in Table 1) in the status register may be defined as follows: "00" indicates that the embedded processor is working properly, and "10" indicates that the embedded processor has failed. The preset value may be an identifier indicating that the embedded processor has failed; in this embodiment, the preset value may be understood as the state value "10" of the second flag bit. As long as the programmable logic device has not received the first signal, which reports that the embedded processor has failed, the state value of the second flag bit remains "00". After receiving the first signal, the programmable logic device may modify the state value of the second flag bit from "00" to "10", indicating that the embedded processor has failed. When the programmable logic device reads that the state value of the second flag bit is "10", that is, when the preset value is read, it determines that the embedded processor has failed. In a possible embodiment, the second flag bit and the first flag bit may also be the same flag bit, that is, the same bit implements the functions of both the first flag bit and the second flag bit. For example, Bit1 may be used to store the running state of the embedded processor fed back by the complex programmable logic device, and may also be used to store whether the embedded processor is working normally. After detecting that the embedded processor has failed, the programmable logic device enters a fault isolation stage; for example, step S201 may be executed to complete fault isolation between the computer host and the embedded processor.

It can be seen that the complex programmable logic device performs fault detection on the embedded processor, and the detection result of the complex programmable logic device is associated with the status register of the programmable logic device. If the state value of the second flag bit read from the status register is the preset value, it is determined that the embedded processor has failed. Using the complex programmable logic device for fault detection therefore extends the means of detecting faults of the embedded processor, keeps the burden on the programmable logic device relatively small, and can improve the accuracy of fault detection.
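The watchdog path can be illustrated with the following minimal Python sketch. It rests on assumptions that are not in the source: the names `Watchdog`, `SECOND_FLAG_MASK` and `apply_first_signal` are hypothetical, time is passed in explicitly as seconds, and the state value "10" is read as the two low bits of the status register with Bit1 set.

```python
class Watchdog:
    """Timer in the complex programmable logic device that the embedded processor must feed."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s   # preset time, e.g. 1 second or 0.5 second
        self.last_feed = 0.0         # time of the most recent feed

    def feed(self, now: float) -> None:
        """Called when the embedded processor sends its feedback signal (clears the timer)."""
        self.last_feed = now

    def expired(self, now: float) -> bool:
        """True when no feed arrived within the preset time; this stands for the first signal."""
        return now - self.last_feed > self.timeout_s


SECOND_FLAG_MASK = 0b10  # assumed: Bit1 of the status register, so "10" means Bit1 is set


def apply_first_signal(status_register: int, first_signal: bool) -> int:
    """Return the status register after the programmable logic device handles the first signal."""
    if first_signal:
        return status_register | SECOND_FLAG_MASK
    return status_register


def embedded_processor_failed(status_register: int) -> bool:
    """The fault is determined by reading the preset value of the second flag bit."""
    return bool(status_register & SECOND_FLAG_MASK)
```

In this sketch the fault latches: once Bit1 is set it stays set until some recovery step clears it, which the source describes separately.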

In one possible implementation, the proxy mode further includes the programmable logic device sending a first processing layer data packet for error reporting to the host computer, so as to isolate the host computer from the fault of the embedded processor.

In this embodiment, when the host computer performs service interaction with the embedded processor, the host computer may first send the service data to the programmable logic device, and the programmable logic device then forwards the service data to the embedded processor for processing. If the embedded processor fails, the service data cannot be processed. In that case, after the host computer sends the service data to the programmable logic device, the programmable logic device can enter the proxy mode and notify the host computer that the embedded processor has failed, so that the host computer is isolated from the fault of the embedded processor. Specifically, the programmable logic device may send a first processing layer data packet to the computer host, where the first processing layer data packet reports an error to the computer host to indicate that the embedded processor has failed, thereby implementing fault isolation between the computer host and the embedded processor.

It can be seen that after the embedded processor is detected to have failed, the programmable logic device, acting in the proxy mode, sends the first processing layer data packet for error reporting to the computer host, so that fault isolation between the computer host and the embedded processor can be realized. This fault isolation method does not require restarting the host computer, reduces the influence on the host computer, and ensures that other services of the host computer can run normally without interference.

In one possible implementation, the programmable logic device sending a first processing layer packet for error reporting to the host computer may include the steps of:

acquiring register information of a status register, wherein the register information is used for recording the running status of the embedded processor; reading the fault source and the fault type of the embedded processor from the register information; generating a first processing layer data packet for error reporting according to a fault source and a fault type; and sending the first processing layer data packet to the computer host.

In the PCIe standard protocol, faults occurring in a PCIe device are divided by severity into correctable errors (CE), non-fatal uncorrectable errors (NFE), and fatal uncorrectable errors (FE). Correctable errors may be automatically identified and corrected or recovered by hardware; non-fatal uncorrectable errors are typically handled directly by device driver software; fatal uncorrectable errors are typically handled by system software and usually require operations such as a reset. The fault source may be understood as the particular device that has failed.

In this embodiment, a plurality of embedded processors may exist. Taking an embedded processor A and an embedded processor B as an example, a one-to-one correspondence may be established between each of the embedded processor A and the embedded processor B and a respective bit of the status register in the programmable logic device. As shown in Table 1, the operating state of the embedded processor A may be stored in Bit4, and the operating state of the embedded processor B may be stored in Bit5. Accordingly, the state values of Bit4 and Bit5 in the status register may be defined as follows: "00" indicates normal operation, "01" indicates a correctable error, "10" indicates a non-fatal uncorrectable error, and "11" indicates a fatal uncorrectable error. For example, when the programmable logic device reads that the state value of Bit4 is "00", it can determine that the embedded processor A is operating normally. When the programmable logic device reads that the state value of Bit5 is "01", it can determine that the fault source is the embedded processor B and the fault type is a correctable error; or when the programmable logic device reads that the state value of Bit5 is "10", it can determine that the fault source is the embedded processor B and the fault type is a non-fatal uncorrectable error. A first processing layer data packet can then be generated according to the fault source and the fault type in the register information, and the first processing layer data packet is fed back to the host computer for error reporting, so that fault isolation is completed.
The first processing layer data packet may further include a PCIe device identifier of the failed embedded processor, where the PCIe device identifier identifies a PCIe device in a PCIe bus system. Specifically, the PCIe device identifier may be the bus/device/function (BDF) number of the PCIe device or address information of the PCIe device, so that the failed embedded processor can be located quickly.

Therefore, the programmable logic device reads the fault source and the fault type of the embedded processor from the register information, generates the first processing layer data packet according to the fault source and the fault type, and feeds the first processing layer data packet back to the computer host, so that the failed embedded processor can be located quickly and the accuracy of fault location can be improved.
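Reading the fault source and fault type from the register information can be sketched as below. This Python example is only an assumption-laden illustration: because each state value has two binary digits, the sketch gives each embedded processor a two-bit field (`FIELD_SHIFT`), which is a hypothetical layout rather than the layout of Table 1. The BDF strings in `DEVICE_BDF` are placeholders, and `ErrorReport` stands in for the content of the first processing layer data packet rather than its real wire format.

```python
from enum import IntEnum
from typing import List, NamedTuple


class FaultType(IntEnum):
    NORMAL = 0b00        # normal operation
    CORRECTABLE = 0b01   # correctable error (CE)
    NON_FATAL = 0b10     # non-fatal uncorrectable error (NFE)
    FATAL = 0b11         # fatal uncorrectable error (FE)


# Assumed layout: one two-bit field per embedded processor (not taken from Table 1).
FIELD_SHIFT = {"A": 4, "B": 6}
# Placeholder PCIe device identifiers, used only for illustration.
DEVICE_BDF = {"A": "03:00.0", "B": "03:00.1"}


class ErrorReport(NamedTuple):
    source: str            # which embedded processor failed
    fault_type: FaultType  # severity class read from the status register
    bdf: str               # PCIe device identifier carried for fast location


def decode_faults(status_register: int) -> List[ErrorReport]:
    """Return one error report per embedded processor whose field is not NORMAL."""
    reports = []
    for name, shift in FIELD_SHIFT.items():
        fault_type = FaultType((status_register >> shift) & 0b11)
        if fault_type is not FaultType.NORMAL:
            reports.append(ErrorReport(name, fault_type, DEVICE_BDF[name]))
    return reports
```

For example, a register whose field for processor B holds "10" while the field for processor A holds "00" yields a single report naming B with a non-fatal uncorrectable error.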

Step S202: in response to detecting that the repair of the embedded processor is completed, the programmable logic device sends a hot plug signal to the computer host and exits the proxy mode to complete fault recovery.

In this embodiment of the present application, the fault may be repaired by self-repair or by a restart of the failed embedded processor, which is not limited in the embodiment of the present application. Once the embedded processor can work normally again, it sends a recovery request to the programmable logic device, where the recovery request indicates that fault recovery of the embedded processor is completed. After receiving the recovery request, the programmable logic device determines from it that the repair of the embedded processor is completed, and may send a hot plug signal to the host computer, where the hot plug signal is used to indicate that the embedded processor executes a hot plug operation. After receiving the hot plug signal, the host computer considers that the embedded processor has been hot-plugged back in, and the host computer can perform normal service interaction with the embedded processor. At this point, the programmable logic device exits the proxy mode, the computer host can communicate with the embedded processor again, and fault recovery is completed.

As can be seen from the method shown in fig. 2, after detecting that the embedded processor in the data processor has failed, the programmable logic device may enter the proxy mode and complete fault isolation simply and efficiently by sending a hot plug interrupt signal to the host computer, where the hot plug interrupt signal is used to indicate that the embedded processor executes a hot plug operation, so as to disconnect the host computer from the embedded processor. This avoids the problem in the traditional technical scheme that, once the embedded processor fails, the host computer must be restarted and all running programs are interrupted, so the influence on the host computer is minimized and normal operation of the host computer is ensured. After detecting that the repair of the embedded processor is completed, the programmable logic device sends a hot plug signal to the computer host, where the hot plug signal is used to indicate that the embedded processor executes a hot plug operation, and exits the proxy mode, so that the computer host and the embedded processor communicate again, fault recovery is completed quickly, and fault processing efficiency is improved. In addition, the embodiment of the application does not require external tools such as a BMC system or a management and control platform to participate, which reduces external dependence and improves reliability.
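The overall flow of steps S201 and S202 can be summarized as a small two-state machine. The Python sketch below is only illustrative and makes several assumptions: the class `FaultHandler`, the `Mode` enum, the message strings, and the `send_to_host` callback are hypothetical stand-ins for the signals and data packets described above, not an actual hardware interface.

```python
from enum import Enum, auto
from typing import Callable


class Mode(Enum):
    NORMAL = auto()  # service data is forwarded to the embedded processor
    PROXY = auto()   # the programmable logic device answers the host on its behalf


class FaultHandler:
    """Sketch of the programmable logic device's fault isolation and recovery flow."""

    def __init__(self, send_to_host: Callable[[str], None]) -> None:
        self.mode = Mode.NORMAL
        self.send_to_host = send_to_host  # assumed channel towards the host computer

    def on_fault_detected(self) -> None:
        """Step S201: enter the proxy mode and send the hot plug interrupt signal once."""
        if self.mode is Mode.NORMAL:
            self.mode = Mode.PROXY
            self.send_to_host("HOT_PLUG_INTERRUPT")

    def on_host_request(self) -> bool:
        """Return True if the request may be forwarded; report an error while in proxy mode."""
        if self.mode is Mode.PROXY:
            self.send_to_host("ERROR_REPORT_PACKET")
            return False
        return True

    def on_recovery_request(self) -> None:
        """Step S202: send the hot plug signal and exit the proxy mode."""
        if self.mode is Mode.PROXY:
            self.send_to_host("HOT_PLUG")
            self.mode = Mode.NORMAL
```

Guarding each transition on the current mode keeps repeated fault detections or stray recovery requests from sending duplicate signals to the host computer.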

The method of the embodiments of the present application is explained in detail above, and the apparatus of the embodiments of the present application is provided below.

Referring to fig. 3, fig. 3 is a schematic structural diagram of a fault handling apparatus according to an embodiment of the present disclosure. The device is applied to a programmable logic device. As shown in fig. 3, the apparatus 300 for fault handling includes a fault isolation unit 301 and a fault recovery unit 302, and the detailed description of each unit is as follows:

a fault isolation unit 301, configured to enter a proxy mode in response to detecting that an embedded processor in a data processor fails, where the proxy mode includes sending a hot plug interrupt signal to a host computer to isolate the host computer from the fault of the embedded processor, and the hot plug interrupt signal is used to indicate that the embedded processor executes a hot plug operation;

a fault recovery unit 302, configured to send a hot-plug signal to the host computer in response to detecting that the repair of the embedded processor is completed, and exit from the proxy mode to complete fault recovery, where the hot-plug signal is used to indicate that the embedded processor executes a hot-plug operation.

In a possible implementation manner, the apparatus 300 for fault handling may further include a fault detection unit not shown in fig. 3, where the fault detection unit may be specifically configured to obtain register information of the status register, where the register information is used to record an operating state of the embedded processor, and to judge whether the embedded processor fails according to the register information.

In a possible implementation, the apparatus 300 for fault handling may further include a fault detection unit not shown in fig. 3, and the fault detection unit may be specifically configured to read the state value of the first flag bit in the status register from the register information according to a preset period; judge whether the first flag bit is set according to the read state value of the first flag bit; and determine that the embedded processor fails in response to the first flag bit not being set when the preset time is reached.

In a possible implementation, the apparatus 300 for fault handling may further include a fault detection unit not shown in fig. 3, and the fault detection unit may be specifically configured to read a state value of a second flag bit in the status register from the register information, where the state value of the second flag bit is associated with a first signal fed back by a complex programmable logic device, and the complex programmable logic device is configured to detect an operating state of the embedded processor and feed back the first signal to the programmable logic device according to the operating state of the embedded processor; and determine that the embedded processor fails in response to reading that the state value of the second flag bit is a preset value.

In a possible implementation, the fault isolation unit 301 is further configured to send a first processing layer data packet for error reporting to the host computer, so as to isolate the host computer from the fault of the embedded processor.

In a possible implementation manner, the fault isolation unit 301 is specifically configured to obtain register information of the status register, where the register information is used to record an operating state of the embedded processor; reading the fault source and the fault type of the embedded processor from the register information; generating a first processing layer data packet for error reporting according to the fault source and the fault type; and sending the first processing layer data packet to the host computer.

In a possible implementation manner, there is a first-in-first-out (FIFO) queue channel between the computer host, the programmable logic device, and the embedded processor, and the apparatus 300 for fault handling may further include a fault detection unit, not shown in fig. 3, which is specifically configured to determine that the embedded processor fails in response to reading a preset flag bit from the computer host, where the preset flag bit is used to indicate that the embedded processor fails.

It should be noted that the implementation of each unit may also correspond to the corresponding description of the method embodiment shown in fig. 2.

Referring to fig. 4, fig. 4 is a schematic structural diagram of a computer device according to an embodiment of the present disclosure. As shown in fig. 4, the computer device 400 comprises a processor 401, a memory 402 and a communication interface 403, wherein the memory 402 stores a computer program 404. The processor 401, memory 402, communication interface 403, and computer program 404 may be connected by a bus 405.

When the computer device is a programmable logic device, the computer program 404 is configured to execute the following steps:

in response to detecting that an embedded processor in a data processor fails, entering a proxy mode, wherein the proxy mode comprises sending a hot plug interrupt signal to a computer host to isolate the computer host from the fault of the embedded processor, and the hot plug interrupt signal is used for indicating the embedded processor to execute a hot plug operation;

and in response to detecting that the repair of the embedded processor is completed, sending a hot plug signal to the host computer, and exiting the proxy mode to complete fault recovery, wherein the hot plug signal is used for indicating the embedded processor to execute a hot plug operation.

In one possible implementation, the programmable logic device includes a status register, and before the programmable logic device enters the proxy mode in response to detecting a failure of an embedded processor in the data processor, the computer program 404 is further configured to execute the following steps:

acquiring register information of the status register, wherein the register information is used for recording the running status of the embedded processor;

and judging whether the embedded processor fails according to the register information.

In a possible implementation manner, in the aspect of determining whether the embedded processor fails according to the register information, the computer program 404 is specifically configured to execute the following steps:

reading a state value of a first flag bit in the status register from the register information according to a preset period;

judging whether the first flag bit is set or not according to the read state value of the first flag bit;

and determining that the embedded processor fails in response to the first flag bit not being set when the preset time is reached.

In a possible implementation manner, in the aspect of determining whether the embedded processor fails according to the register information, the computer program 404 is specifically configured to execute the following steps:

reading a state value of a second flag bit in the status register from the register information, wherein the state value of the second flag bit is associated with a first signal fed back by a complex programmable logic device, and the complex programmable logic device is used for detecting the running state of the embedded processor and feeding back the first signal to the programmable logic device according to the running state of the embedded processor;

and determining that the embedded processor fails in response to reading that the state value of the second flag bit is a preset value.

In one possible embodiment, the proxy mode further comprises:

sending a first processing layer data packet for error reporting to the computer host, so as to isolate the computer host from the fault of the embedded processor.

In one possible implementation, the programmable logic device includes a status register, and the computer program 404 is specifically configured to execute the following steps:

reading the fault source and the fault type of the embedded processor from the register information;

generating a first processing layer data packet for error reporting according to the fault source and the fault type;

and sending the first processing layer data packet to the computer host.

In one possible implementation, there is a first-in-first-out (FIFO) queue channel between the host computer, the programmable logic device, and the embedded processor, and the computer program 404 is further configured to execute the following step before the programmable logic device enters the proxy mode in response to detecting a failure of the embedded processor in the data processor:

and determining that the embedded processor fails in response to reading a preset flag bit from the computer host, wherein the preset flag bit is used for indicating that the embedded processor fails.

Those skilled in the art will appreciate that only one memory and processor are shown in fig. 4 for ease of illustration. In an actual terminal or server, there may be multiple processors and memories. The memory 402 may also be referred to as a storage medium or a storage device, and the like, which is not limited in this application.

It should be understood that, in the embodiment of the present application, the processor 401 may be a central processing unit (CPU), or may be another general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.

It will also be appreciated that the memory 402 referred to in this embodiment of the application may be either volatile memory or non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which acts as an external cache. By way of example, but not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM).

It should be noted that when the processor 401 is a general purpose processor, a DSP, an ASIC, an FPGA or other programmable logic device, discrete gate or transistor logic device, or discrete hardware component, the memory (memory module) is integrated into the processor.

It should be noted that the memory 402 described herein is intended to comprise, without being limited to, these and any other suitable types of memory.

The bus 405 may include a power bus, a control bus, a status signal bus, and the like, in addition to a data bus. For clarity of illustration, however, the various buses are all labeled as the bus in the figure.

In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or by instructions in the form of software. The steps of a method disclosed in connection with the embodiments of the present application may be directly performed by a hardware processor, or performed by a combination of hardware and software modules in a processor. The software module may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in a memory, and a processor reads information in the memory and completes the steps of the method in combination with its hardware. To avoid repetition, details are not described here.

In the embodiments of the present application, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.

Those of ordinary skill in the art will appreciate that the various Illustrative Logical Blocks (ILBs) and steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative: the division of the units is only a division by logical function, and other divisions are possible in practice. For example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the shown or discussed mutual coupling, direct coupling, or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.

In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, the embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, digital subscriber line) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk), among others.

Embodiments of the present application also provide a computer-readable storage medium, which stores a computer program, where the computer program is executed by a processor to implement part or all of the steps of any one of the fault handling methods as described in the above method embodiments.

Embodiments of the present application further provide a computer program product, the computer program product comprising a non-transitory computer-readable storage medium storing a computer program, the computer program being operable to cause a computer to perform some or all of the steps of any one of the fault handling methods as described in the above method embodiments.

The above description covers only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any change or substitution that a person skilled in the art can readily conceive of within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

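1. A method for fault handling, applied to a programmable logic device, characterized by comprising the following steps:

in response to detecting that an embedded processor in a data processor fails, the programmable logic device enters a proxy mode, wherein the proxy mode comprises sending a hot plug interrupt signal to a host computer to isolate the host computer from the fault of the embedded processor, and the hot plug interrupt signal is used for indicating that the embedded processor executes a hot plug operation;

and in response to detecting that the repair of the embedded processor is completed, the programmable logic device sends a hot plug signal to the host computer and exits the proxy mode to complete fault recovery, wherein the hot plug signal is used for indicating that the embedded processor executes a hot plug operation.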

2. The method of claim 1, wherein the programmable logic device includes a status register, and wherein before the programmable logic device enters the proxy mode in response to detecting a failure of an embedded processor in the data processor, the method further comprises:

the programmable logic device acquires register information of the status register, wherein the register information is used for recording the running status of the embedded processor;

and the programmable logic device judges whether the embedded processor has a fault according to the register information.

3. The method of claim 2, wherein the determining, by the programmable logic device according to the register information, whether the embedded processor has failed comprises:

the programmable logic device reads, at a preset period, a state value of a first flag bit in the status register from the register information;

the programmable logic device determines, according to the read state value of the first flag bit, whether the first flag bit is set;

and in response to the first flag bit not being set by a preset time, the programmable logic device determines that the embedded processor has failed.
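
As an illustration only, the periodic check of the first flag bit recited in claim 3 can be sketched as a heartbeat-style timeout. This is a minimal sketch and not the claimed implementation: the bit position `FIRST_FLAG_MASK`, the function names, and the use of a list of sampled register values in place of real hardware reads are all assumptions introduced here.

```python
from typing import Iterable

# Hypothetical position of the first flag bit in the status register.
FIRST_FLAG_MASK = 0x1


def first_flag_set(register_value: int, mask: int = FIRST_FLAG_MASK) -> bool:
    """Return True if the first flag bit is set in the sampled register value."""
    return bool(register_value & mask)


def detect_fault_by_timeout(samples: Iterable[int],
                            period_ms: int,
                            timeout_ms: int) -> bool:
    """Simulate one read of the status register per preset period.

    Returns True (fault) if the first flag bit has not been set by the time
    the preset timeout elapses, and False if the bit is seen set first.
    If the samples run out before the timeout, no fault is determined.
    """
    elapsed_ms = 0
    for value in samples:
        elapsed_ms += period_ms
        if first_flag_set(value):
            return False  # embedded processor reported in time
        if elapsed_ms >= timeout_ms:
            return True   # preset time reached without the flag being set
    return False
```

For example, with a 10 ms period and a 30 ms preset time, three consecutive reads with the bit clear would be treated as a fault, whereas a single read with the bit set within that window would not.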

4. The method of claim 2, wherein the determining, by the programmable logic device according to the register information, whether the embedded processor has failed comprises:

the programmable logic device reads a state value of a second flag bit in the status register from the register information, wherein the state value of the second flag bit is associated with a first signal fed back by a complex programmable logic device, and the complex programmable logic device is used for detecting the running state of the embedded processor and feeding back the first signal to the programmable logic device according to the running state of the embedded processor;

and in response to the read state value of the second flag bit being a preset value, the programmable logic device determines that the embedded processor has failed.
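
As an illustration only, the relationship in claim 4 between the first signal fed back by the complex programmable logic device and the second flag bit can be sketched as follows. The bit position `SECOND_FLAG_SHIFT`, the preset value `FAULT_PRESET_VALUE`, and both function names are assumptions introduced for this sketch; the claim does not specify them.

```python
# Hypothetical position of the second flag bit in the status register.
SECOND_FLAG_SHIFT = 1
# Hypothetical preset value of the second flag bit that indicates a fault.
FAULT_PRESET_VALUE = 1


def latch_first_signal(register_value: int, first_signal: int) -> int:
    """Copy the one-bit first signal fed back by the CPLD into the second flag bit."""
    cleared = register_value & ~(1 << SECOND_FLAG_SHIFT)
    return cleared | ((first_signal & 1) << SECOND_FLAG_SHIFT)


def fault_by_second_flag(register_value: int) -> bool:
    """Return True if the second flag bit equals the preset fault value."""
    return ((register_value >> SECOND_FLAG_SHIFT) & 1) == FAULT_PRESET_VALUE
```

In this sketch the other bits of the register are left untouched when the first signal is latched, so the check of the second flag bit is independent of the first flag bit.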

5. The method of any one of claims 1 to 4, wherein the response mode further comprises:

the programmable logic device sends a first processing layer data packet for error reporting to the host computer, so as to isolate the host computer from the fault of the embedded processor.

6. The method of claim 5, wherein the programmable logic device includes a status register, and wherein the sending, by the programmable logic device, of the first processing layer data packet for error reporting to the host computer comprises:

the programmable logic device reads a fault source and a fault type of the embedded processor from register information of the status register;

the programmable logic device generates the first processing layer data packet for error reporting according to the fault source and the fault type;

and the programmable logic device sends the first processing layer data packet to the host computer.
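
As an illustration only, the three steps of claim 6 (reading the fault source and fault type, generating the packet, and sending it) can be sketched as below. The register layout, the packet tag `ERROR_REPORT_TAG`, the three-byte packet format, and all function names are hypothetical; the claim does not define the format of the first processing layer data packet.

```python
import struct
from typing import Callable, Tuple

# Hypothetical register layout: fault source in bits 8-15, fault type in bits 16-23.
FAULT_SOURCE_SHIFT = 8
FAULT_TYPE_SHIFT = 16
# Hypothetical one-byte tag marking a packet as an error report.
ERROR_REPORT_TAG = 0xE0


def read_fault_fields(register_info: int) -> Tuple[int, int]:
    """Extract (fault_source, fault_type) from the register information."""
    fault_source = (register_info >> FAULT_SOURCE_SHIFT) & 0xFF
    fault_type = (register_info >> FAULT_TYPE_SHIFT) & 0xFF
    return fault_source, fault_type


def build_error_packet(fault_source: int, fault_type: int) -> bytes:
    """Pack the tag, fault source, and fault type into a three-byte packet."""
    return struct.pack(">BBB", ERROR_REPORT_TAG, fault_source, fault_type)


def report_fault(register_info: int, send_to_host: Callable[[bytes], None]) -> None:
    """Read the fault fields, build the packet, and hand it to the host link."""
    fault_source, fault_type = read_fault_fields(register_info)
    send_to_host(build_error_packet(fault_source, fault_type))
```

Passing the sender as a callable keeps the sketch independent of any particular bus or link between the programmable logic device and the host computer.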

7. The method of any one of claims 1 to 4, wherein a first-in-first-out queue path exists between the host computer, the programmable logic device, and the embedded processor, and wherein, before the programmable logic device enters the response mode in response to detecting that the embedded processor in the data processor has failed, the method further comprises:

in response to reading a preset flag bit from the host computer, the programmable logic device determines that the embedded processor has failed, wherein the preset flag bit is used for indicating that the embedded processor has failed.

8. A fault handling apparatus for a programmable logic device, comprising:

and a fault recovery unit, configured such that, in response to detecting that the embedded processor has been repaired, the programmable logic device sends a hot plug signal to the host computer and exits the response mode to complete fault recovery, wherein the hot plug signal is used for instructing the embedded processor to execute a hot plug operation.
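
As an illustration only, the entry into and exit from the mode used for fault isolation (referred to in this sketch as the response mode) can be pictured as a two-state machine. The class name, the mode names, and the string labels standing in for the hot plug interrupt signal and the hot plug signal are assumptions introduced here, not part of the claimed apparatus.

```python
from enum import Enum
from typing import Callable


class Mode(Enum):
    NORMAL = "normal"
    RESPONSE = "response"


class FaultHandler:
    """Two-state sketch of fault isolation and fault recovery."""

    def __init__(self, send_to_host: Callable[[str], None]) -> None:
        self.mode = Mode.NORMAL
        self._send_to_host = send_to_host

    def on_fault_detected(self) -> None:
        # Enter the response mode once and notify the host computer.
        if self.mode is Mode.NORMAL:
            self.mode = Mode.RESPONSE
            self._send_to_host("HOT_PLUG_INTERRUPT")  # hypothetical label

    def on_repair_detected(self) -> None:
        # Notify the host computer, then leave the response mode.
        if self.mode is Mode.RESPONSE:
            self._send_to_host("HOT_PLUG")  # hypothetical label
            self.mode = Mode.NORMAL
```

Repeated fault notifications while already in the response mode are ignored in this sketch, so the host computer receives one signal per fault and one per recovery.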

9. A computer device, comprising a processor, a memory and a communication interface, wherein the memory stores a computer program configured to be executed by the processor, the computer program comprising instructions for carrying out the steps of the method according to any one of claims 1 to 7.

10. A computer-readable storage medium or a computer program product, characterized in that the computer-readable storage medium stores a computer program that causes a computer to implement the method of any one of claims 1 to 7, or the computer program product is configured to implement the method of any one of claims 1 to 7.