WO2022155919A1

WO2022155919A1 - Fault handling method and apparatus, and system

Info

Publication number: WO2022155919A1
Application number: PCT/CN2021/073396
Authority: WO
Inventors: 胡成; 董钰山
Original assignee: 华为技术有限公司
Priority date: 2021-01-22
Filing date: 2021-01-22
Publication date: 2022-07-28
Also published as: CN116724297A

Abstract

A fault handling method and apparatus, and a system, which are used to maintain high reliability of a peripheral component interconnect express (PCIe) system. The method comprises: a central processing unit acquiring abnormal interrupt information corresponding to an abnormal PCIe port; according to the abnormal interrupt information, determining the type of fault corresponding to the abnormal PCIe port; and when it is determined that the fault is a recoverable fault, resetting the abnormal PCIe port and a communications link of the abnormal PCIe port. By means of the fault handling solution, the communication service capability of an abnormal PCIe port can be recovered in a timely manner, without affecting other PCIe ports in a PCIe system, thereby being beneficial to maintaining the high reliability of the PCIe system.

Description

A fault handling method, device and system

Technical field

The present application relates to the field of communication technologies, and in particular, to a fault handling method, device, and system.

Background

Peripheral component interconnect express (PCIe) is a high-speed short-distance communication interface. The communication interface can quickly read and write memory, and can support ultra-high-bandwidth communication. It has been widely used in various fields such as network, communication, storage, industrial and consumer electronic products.

The main components of a PCIe system include a root complex (RC), a switch node (switch), and an end node (endpoint, EP). The root complex manages all buses and all nodes in the PCIe system, and is the bridge for communication between nodes in the PCIe system. A root complex may contain multiple PCIe ports, through which it is connected to multiple nodes, such as multiple end nodes or multiple switch nodes. A switch node can be used to connect the root complex and other switch nodes, or to connect the root complex and end nodes, and is a data forwarding node in the PCIe system. An end node is an end device, such as a peripheral device, that receives or sends data.

In the existing solution, when an uncorrectable error occurs on a PCIe port, in order to maintain the availability of the PCIe port, the entire PCIe system will recover the PCIe port by restarting. However, this means that all PCIe ports of the entire PCIe system are in an unavailable state during the restart period, which obviously reduces the reliability of the PCIe system.

SUMMARY OF THE INVENTION

The present application provides a fault handling method, device and system to solve the technical problem of low reliability of the PCIe system caused by restarting the entire PCIe system to restore an abnormal PCIe port in the prior art.

In a first aspect, the present application provides a fault handling method applicable to a central processing unit, where the central processing unit can be directly or indirectly connected to each PCIe port in a PCIe system. The method includes: after receiving the abnormal interrupt information reported by an abnormal PCIe port, the central processing unit first determines the fault type corresponding to the abnormal PCIe port according to the abnormal interrupt information and, when the fault type is determined to be a recoverable fault, resets the abnormal PCIe port and the communication link of the abnormal PCIe port. The communication link of the abnormal PCIe port is used to connect the abnormal PCIe port and a PCIe node. In the above design, resetting the recoverable abnormal PCIe port and its communication link restores the availability of the abnormal PCIe port in time and maintains its communication service capability, without restarting the entire PCIe system. Because recovering the abnormal PCIe port does not affect the other PCIe ports in the PCIe system, the solution helps maintain the high reliability of the PCIe system.

In a possible design, the abnormal PCIe port is provided in the root complex, and the PCIe node can be any node connected to the root complex, such as an end node, a switch node, or a bridge node. With this design, the solution can detect link abnormalities not only of end nodes but also of other types of nodes (such as switch nodes or bridge nodes). Handling these various types of link abnormalities helps maintain the availability of the PCIe system to the greatest extent possible.

In a possible design, the central processing unit may reset the abnormal PCIe port by resetting the media access control (MAC) layer logic of the PCIe core where the abnormal PCIe port is located. In this design, only the MAC layer logic related to the port is reset, and other logic unrelated to the port is left untouched, so the abnormality of the abnormal PCIe port can be recovered in a targeted manner. This improves the efficiency of the PCIe port reset and, on that basis, saves the processing resources of the central processing unit and the PCIe core.

In a possible design, the central processing unit can reset the communication link of the abnormal PCIe port by resetting the serializer/deserializer (SerDes) link parameters corresponding to the PCIe core where the abnormal PCIe port is located. In this way, when the surrounding environment changes, the design can recalibrate the current SerDes link parameters of the abnormal PCIe port and restore the communication quality of the abnormal PCIe port by adjusting them to values suitable for the current environment.

In a possible design, before resetting the serializer/deserializer (SerDes) link parameters corresponding to the PCIe core where the abnormal PCIe port is located, the central processing unit can also disconnect the communication link between the abnormal PCIe port and the PCIe node mounted on the abnormal PCIe port; after resetting those SerDes link parameters, it rebuilds the communication link between the abnormal PCIe port and the PCIe node mounted on the abnormal PCIe port. In this design, removing each node mounted on the abnormal PCIe port before the reset decouples the abnormal PCIe port from the other nodes in the PCIe core, which helps realize an independent reset of the abnormal PCIe port.
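The ordering described above (disconnect, reset, rebuild) can be sketched as follows. This is only an illustrative model, not the application's implementation: the class `PortModel`, its methods, and the function `recover_port` are hypothetical names, and placing the MAC reset before the SerDes parameter reset is an assumption taken from the second aspect described later.

```python
class PortModel:
    """Toy stand-in for an abnormal PCIe port; it only records what was done."""

    def __init__(self, port_id, mounted_nodes):
        self.port_id = port_id
        self.mounted_nodes = list(mounted_nodes)
        self.linked = set(mounted_nodes)  # nodes whose link is currently up
        self.log = []

    def disconnect_nodes(self):
        # Decouple the port from every node mounted on it.
        self.linked.clear()
        self.log.append("disconnect")

    def reset_mac(self):
        # Placeholder for resetting the port-related MAC layer logic.
        self.log.append("reset MAC")

    def reset_serdes_params(self):
        # The text requires the nodes to be disconnected before this step.
        if self.linked:
            raise RuntimeError("nodes must be disconnected before the SerDes reset")
        self.log.append("reset SerDes")

    def rebuild_links(self):
        # Re-establish the links to the nodes that were mounted on the port.
        self.linked = set(self.mounted_nodes)
        self.log.append("rebuild")


def recover_port(port):
    """Run the hypothetical recovery sequence and return the recorded steps."""
    port.disconnect_nodes()
    port.reset_mac()
    port.reset_serdes_params()
    port.rebuild_links()
    return port.log
```

Calling `recover_port(PortModel("RP1", ["switch node 1"]))` records the four steps in order and leaves the mounted node linked again; no other port object is touched, which mirrors the independent reset described above.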

In a possible design, if the central processing unit determines that the fault type corresponding to the abnormal PCIe port is an unrecoverable fault, it can disable the abnormal PCIe port and the communication link of the abnormal PCIe port. This saves the resources of the PCIe core and, as far as possible, prevents the unrecoverable fault of the abnormal PCIe port from spreading to the entire PCIe core and causing the services of the entire PCIe core to fail.

In one possible design, recoverable faults may include one or more of the following: a data link layer packet transmission timeout error, a transaction layer packet too-many-retries error when writing the configuration space, a two-bit data error, and an advanced extensible interface (AXI) bus response error. In this way, the design can repair various types of errors related to PCIe ports: data link layer packet transmission timeout errors and too-many-retries errors when writing the configuration space, which relate to communication with PCIe nodes; two-bit data errors, which relate to the port's own data storage; and AXI bus response errors, which relate to transfers with the central processing unit. This helps maintain PCIe port availability more fully.
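The decision between resetting and disabling a port can be illustrated with a minimal sketch. The enum `FaultType`, the set `RECOVERABLE`, and the function `handle_fault` are hypothetical names chosen for illustration; the returned strings merely describe the actions named in the text and do not touch any hardware.

```python
from enum import Enum, auto


class FaultType(Enum):
    DLLP_TIMEOUT = auto()              # data link layer packet transmission timeout
    CFG_WRITE_RETRY_EXCEEDED = auto()  # too many retries writing configuration space
    TWO_BIT_DATA_ERROR = auto()        # two-bit data error
    AXI_RESPONSE_ERROR = auto()        # AXI bus response error
    OTHER = auto()                     # any fault not listed as recoverable


# The four recoverable fault types listed in the text.
RECOVERABLE = frozenset({
    FaultType.DLLP_TIMEOUT,
    FaultType.CFG_WRITE_RETRY_EXCEEDED,
    FaultType.TWO_BIT_DATA_ERROR,
    FaultType.AXI_RESPONSE_ERROR,
})


def handle_fault(port_id, fault_type):
    """Return the actions to take for one abnormal port (descriptions only)."""
    if fault_type in RECOVERABLE:
        return [f"reset port {port_id}", f"reset link of port {port_id}"]
    # Unrecoverable: disable the port and its link so the fault cannot spread.
    return [f"disable port {port_id}", f"disable link of port {port_id}"]
```

In both branches only the abnormal port and its own link appear in the returned actions, which is the property the design relies on to leave the other ports in service.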

In a possible design, the abnormal interrupt information corresponding to the abnormal PCIe port may include one or more of the following: the identifier of the abnormal PCIe port, the fault type corresponding to the abnormal PCIe port, the identifier of the PCIe core where the abnormal PCIe port is located, and the identifier of the central processing unit (CPU) connected to the PCIe core. In this way, by analyzing the abnormal interrupt information corresponding to the abnormal PCIe port, the central processing unit can obtain features related to the current abnormality and thereby determine the fault type corresponding to the abnormal PCIe port.

In a possible design, the central processing unit may obtain the abnormal interrupt information corresponding to the abnormal PCIe port from a preset work queue, where the preset work queue stores the abnormal interrupt information corresponding to the abnormal PCIe ports in each PCIe core. In this way, the design can centrally manage the port exceptions occurring in each PCIe core through the preset work queue, effectively improving the flexibility of recovering abnormal PCIe ports in each PCIe core.
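A preset work queue of this kind can be sketched with Python's standard `queue.Queue`. The record type `AbnormalInterruptInfo` and the functions `report` and `drain` are hypothetical; the four fields follow the contents listed in the preceding design, and a first-in-first-out order is an assumption, since the text does not specify the queue discipline.

```python
from dataclasses import dataclass
from queue import Empty, Queue


@dataclass(frozen=True)
class AbnormalInterruptInfo:
    port_id: str     # identifier of the abnormal PCIe port
    fault_type: str  # fault type reported for the port
    core_id: int     # identifier of the PCIe core containing the port
    cpu_id: int      # identifier of the CPU connected to that core


def report(work_queue, info):
    """Add one abnormal-interrupt record to the shared work queue."""
    work_queue.put(info)


def drain(work_queue):
    """Take every pending record off the queue, oldest first."""
    handled = []
    while True:
        try:
            info = work_queue.get_nowait()
        except Empty:
            break
        handled.append((info.core_id, info.port_id, info.fault_type))
        work_queue.task_done()
    return handled
```

Records from different PCIe cores share one queue here, so a single consumer sees all port exceptions in arrival order.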

In a second aspect, the present application provides a fault handling device, including a processor and a memory, where a computer program is stored in the memory. In implementation, by calling the computer program stored in the memory, the processor can perform the following operations: obtain abnormal interrupt information corresponding to an abnormal peripheral component interconnect express (PCIe) port, determine the fault type corresponding to the abnormal PCIe port according to the abnormal interrupt information, and, if the fault type corresponding to the abnormal PCIe port is a recoverable fault, reset the abnormal PCIe port and the communication link of the abnormal PCIe port. The communication link of the abnormal PCIe port is used to connect the abnormal PCIe port and a PCIe node.

In one possible design, a PCIe node may be one or more of an end node, a switch node, or a bridge node.

In one possible design, the fault handling device may also include an advanced configuration and power management interface (ACPI). By calling ACPI, the processor can reset the media access control (MAC) layer logic of the PCIe core where the abnormal PCIe port is located.

In a possible design, the fault handling device may further include serializer/deserializer (SerDes) firmware. After resetting, by calling ACPI, the MAC layer logic of the PCIe core where the abnormal PCIe port is located, the processor may also call the SerDes firmware, and by calling the SerDes firmware it can reset the SerDes link parameters corresponding to the PCIe core where the abnormal PCIe port is located.

In a possible design, the fault handling device may further include a PCIe driver. By calling the PCIe driver, the processor can disconnect the communication link between the abnormal PCIe port and the PCIe node mounted on the abnormal PCIe port, and then call ACPI. After resetting, by calling the SerDes firmware, the SerDes link parameters corresponding to the PCIe core where the abnormal PCIe port is located, the processor returns to the PCIe driver, and through the PCIe driver it can rebuild the communication link between the abnormal PCIe port and the PCIe node mounted on the abnormal PCIe port.

In one possible design, the memory can include public and private registers:

In one case, the PCIe driver and ACPI can be stored in public registers, and the SerDes firmware can be stored in private registers, so as to keep the link reset method in the SerDes firmware private and effectively protect the implementation logic of the SerDes link parameter reset;

In another case, the PCIe driver can be stored in public registers, and ACPI and the SerDes firmware can be stored in private registers, so as to keep both the port reset method in ACPI and the link reset method in the SerDes firmware private, effectively protecting the overall logic of the fault handling.

In a possible design, the processor can also perform the following operation by calling the computer program stored in the memory: in the case that the fault type corresponding to the abnormal PCIe port is an unrecoverable fault, disable the abnormal PCIe port and the communication link of the abnormal PCIe port.

In one possible design, recoverable faults may include one or more of the following: a data link layer packet transmission timeout error, a transaction layer packet too-many-retries error when writing the configuration space, a two-bit data error, and an advanced extensible interface (AXI) bus response error.

In a possible design, the abnormal interrupt information corresponding to the abnormal PCIe port may include one or more of the following: the identifier of the abnormal PCIe port, the fault type corresponding to the abnormal PCIe port, the identifier of the PCIe core where the abnormal PCIe port is located, and the identifier of the central processing unit (CPU) connected to the PCIe core.

In a possible design, the fault handling device may further include a communication interface, and the processor specifically performs the following operations by calling the computer program stored in the memory: the processor receives the abnormal interrupt information corresponding to the abnormal PCIe port through the communication interface, and The abnormal interrupt information corresponding to the abnormal PCIe port is added to the preset work queue, and the abnormal interrupt information corresponding to the abnormal PCIe port is obtained from the preset work queue. The preset work queue is used to store abnormal interrupt information corresponding to abnormal PCIe ports in each PCIe core.

In a third aspect, the present application provides a fault handling apparatus, the apparatus including a module, a unit or a circuit for performing any one of the possible design methods of any of the above aspects. These modules, units or circuits can be implemented by hardware, or by executing corresponding software by hardware.

In a fourth aspect, the present application provides a chip, which may include a processor and a communication interface, where the processor is configured to read an instruction through the communication interface, so as to execute the fault handling method according to any one of the above first aspects.

In a fifth aspect, the present application provides a fault handling system, including a central processing unit and a peripheral component interconnect express (PCIe) core. The PCIe core includes a root complex and at least one PCIe node, the central processing unit is connected to the root complex, the root complex includes at least one PCIe port, and the root complex is connected to the at least one PCIe node through the at least one PCIe port. The root complex can be used to generate abnormal interrupt information corresponding to an abnormal PCIe port among the at least one PCIe port and report it to the central processing unit, and the central processing unit can be used to perform fault handling on the abnormal PCIe port according to the fault handling method of any one of the above first aspects.

In a sixth aspect, the present application provides a computer-readable storage medium that stores program code; when the program code is run on a computer, the computer is caused to perform the fault handling method described in any one of the above first aspects.

In a seventh aspect, the present application provides a computer program product, including computer program code, which, when the computer program code is run on a computer, causes the computer to execute the fault handling method according to any one of the above-mentioned first aspect.

For the beneficial effects corresponding to any one of the above-mentioned second aspect to the seventh aspect of the present application, reference may be made to the beneficial effect described in any one of the above-mentioned first aspect, which will not be repeated here.

Description of drawings

FIG. 1 exemplarily shows a schematic diagram of a system architecture to which an embodiment of the present application is applicable;

FIG. 2 exemplarily shows a schematic flowchart of a fault handling method provided by an embodiment of the present application;

FIG. 3 exemplarily shows a schematic flowchart corresponding to a reset method provided by an embodiment of the present application;

FIG. 4 exemplarily shows a schematic diagram of a software and hardware architecture of a fault processing logic provided by an embodiment of the present application;

FIG. 5 exemplarily shows a schematic flowchart of another fault processing method provided by an embodiment of the present application;

FIG. 6 exemplarily shows a schematic structural diagram of a fault processing apparatus provided by an embodiment of the present application;

FIG. 7 exemplarily shows a schematic structural diagram of another fault processing apparatus provided by an embodiment of the present application.

Detailed description

The fault handling method disclosed in this application can be applied to an electronic device that communicates based on a PCIe system. In some embodiments of the present application, the fault handling device may be an electronic device or an independent unit. When the fault handling device is an independent unit, the unit can be embedded in the electronic device and can perform fault handling on the PCIe ports of the electronic device, so as to maintain the high reliability of the PCIe system. In other embodiments of the present application, the fault handling device may also be a unit packaged inside the electronic device, used to implement the fault handling function for the PCIe ports of the electronic device. The electronic device may be a server, a memory, a test instrument, or a portable electronic device that also provides functions such as a personal digital assistant and/or a music player, such as a mobile phone, a tablet computer, a wearable device with wireless communication capability (such as a smart watch), or in-vehicle equipment. Exemplary embodiments of portable electronic devices include, but are not limited to, portable electronic devices running any of various operating systems, such as laptops or desktop computers with touch-sensitive surfaces (e.g., touch panels).

The present application will be described in further detail below with reference to the accompanying drawings. It should be noted that, in the description of the present application, "at least one" refers to one or more, and "a plurality of" refers to two or more. In view of this, in the embodiments of the present application, "a plurality of" may also be understood as "at least two". "And/or" describes the association relationship of the associated objects and means that three kinds of relationships are possible; for example, "A and/or B" can mean that A exists alone, that A and B exist at the same time, or that B exists alone. In addition, the character "/", unless otherwise specified, generally indicates that the associated objects are in an "or" relationship.

FIG. 1 exemplarily shows a schematic diagram of a system architecture to which the embodiments of the present application are applicable. As shown in FIG. 1, the system architecture includes at least one central processing unit (CPU) and at least one PCIe core, and the central processing units and PCIe cores may correspond one-to-one; for example, in FIG. 1, CPU 1 corresponds to PCIe core 1 and CPU 2 corresponds to PCIe core 2. A PCIe core is also known as a PCIe system. Each PCIe core may include one root complex and at least one end node, and may also include at least one switch node and at least one bridge node. The root complex is used to initialize the system and configure the communication links between the nodes when the PCIe core is constructed, so as to connect the CPU corresponding to the PCIe core with one or more of the switch nodes, end nodes and bridge nodes in the PCIe core. The central processing unit corresponding to the PCIe core can communicate with each node in the PCIe core by connecting to the root complex in the PCIe core. A switch node connects the upstream root complex with one or more of the downstream end nodes, switch nodes or bridge nodes, and is used to route data of the upstream root complex to one or more downstream nodes, to route data of each downstream node to the unique upstream root complex, or to flexibly route data of one downstream node to another downstream node in a point-to-point manner. A bridge node is used to realize the communication connection between the PCIe core and other PCIe cores, or cores adopting other bus standards such as PCI, through non-transparent bridges (NTB) set in the different bus systems. An end node is usually located in a terminal application (APP), and is responsible for connecting the terminal APP with other nodes in the PCIe core and completing PCIe-based transaction transmission. In general, there are more end nodes in a PCIe core than nodes of other types.

The following takes PCIe core 1 shown in Figure 1 as an example to further introduce the node connection method in each PCIe core:

The PCIe core 1 includes a root complex 1, a switch node 1, a bridge node 1 and four end nodes, namely end node 1, end node 2, end node 3, and end node 4. Switch node 1, end node 1 and bridge node 1 are downstream nodes of root complex 1 (root complex 1 is the upstream node of switch node 1, end node 1 and bridge node 1), while end node 2, end node 3 and end node 4 are downstream nodes of switch node 1 (switch node 1 is the upstream node of end node 2, end node 3 and end node 4). Upstream nodes and their corresponding downstream nodes can be connected by a PCI bus (see the thick black lines shown in FIG. 1). The root complex 1 may also contain one or more PCIe ports, such as root ports (RP) RP1, RP2, and RP3. The root complex 1 can connect to the downstream switch node 1, end node 1 and bridge node 1 through RP1, RP2 and RP3 respectively. In this way, the root complex 1 can implement data routing with switch node 1 and its downstream end node 2, end node 3 and end node 4 through RP1, data routing with end node 1 through RP2, and data routing with a node in PCIe core 2 (e.g., bridge node 2) through RP3.
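The connection relationships in PCIe core 1 can be written down as a small adjacency table, which makes it easy to see which nodes sit behind a given root port. The dictionary `TOPOLOGY` and the function `downstream` are hypothetical names used only for this sketch; the node names are taken from the description of FIG. 1 above.

```python
# Each key lists the nodes directly below it in PCIe core 1.
TOPOLOGY = {
    "root complex 1": ["RP1", "RP2", "RP3"],
    "RP1": ["switch node 1"],
    "RP2": ["end node 1"],
    "RP3": ["bridge node 1"],
    "switch node 1": ["end node 2", "end node 3", "end node 4"],
}


def downstream(node):
    """Return every node reachable below `node`, in depth-first order."""
    result = []
    stack = list(reversed(TOPOLOGY.get(node, [])))
    while stack:
        current = stack.pop()
        result.append(current)
        # Push children in reverse so they are visited in listed order.
        stack.extend(reversed(TOPOLOGY.get(current, [])))
    return result
```

For example, `downstream("RP1")` yields switch node 1 followed by end nodes 2, 3 and 4, while `downstream("RP2")` yields only end node 1, so a fault confined to RP2 involves a different set of nodes than one on RP1.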

It should be noted that the above content is only an exemplary introduction. The PCIe core may also include more or fewer nodes than those shown in FIG. 1, such as a greater number of switch nodes, end nodes or bridge nodes, or may include types of nodes other than root complexes, switch nodes, end nodes, and bridge nodes. The root complex and the central processing unit can be connected one-to-one in the manner shown in FIG. 1, or can be connected in a one-to-many or many-to-one manner. For example, one root complex can be connected to at least two central processing units, or one central processing unit can be connected to at least two root complexes. In addition, the central processing unit and the PCIe core can be deployed in the same physical entity or in different physical entities, or the central processing unit and some of the nodes of the PCIe core can be deployed in one physical entity while the other nodes of the PCIe core are deployed in another physical entity, which is not specifically limited.

As shown in FIG. 1, the root complex realizes the communication connection with other nodes in the PCIe core through the PCIe ports set inside it. In this case, the services between the root complex and the terminal APP are actually distributed across, and processed on, the communication links corresponding to the PCIe ports. Therefore, ensuring a normal connection between each PCIe port and each node is crucial to maintaining the service processing capability of the entire PCIe core. At this stage, when a connection problem between a PCIe port and a downstream end node is detected, the usual practice is to restart the entire PCIe core. However, this makes all PCIe ports in the PCIe core unavailable, which not only fails to restore the services of the PCIe port with the connection problem in time, but also affects the services of the other PCIe ports in the PCIe core, reducing the reliability of the entire PCIe core and even of the entire system. To solve this problem, in an optional implementation, the PCIe core is not restarted and only the PCIe port that has a connection problem with the downstream end node is disabled. However, this makes the disabled PCIe port and all downstream nodes mounted on it unavailable: although the services of the other PCIe ports are not affected, the services of the disabled PCIe port cannot be recovered for a long time. In addition, the above two methods can only deal with the connection failure when the downstream node is an end node, and cannot deal with the connection failure when the downstream node is another type of node (such as a switch node or a bridge node), so the generality of the fault handling is poor.

In view of this, the present application provides a fault handling method for quickly recovering the services of an abnormal PCIe port without affecting the services of other PCIe ports, and further for handling connection failures of more types of downstream nodes.

The specific implementation process of the fault handling solution in the present application is described below through specific embodiments.

[Embodiment 1]

FIG. 2 exemplarily shows a schematic flowchart of a fault processing method provided by an embodiment of the present application, and the method is applicable to the central processing unit in FIG. 1 , such as the central processing unit 1 or the central processing unit 2 shown in FIG. 1 . As shown in Figure 2, the method includes:

Step 201, the central processing unit acquires abnormal interrupt information corresponding to the abnormal PCIe port.

In an optional implementation manner, the root complex can detect, in real time or periodically, the service processing status between each internal PCIe port and the downstream nodes mounted on that PCIe port. When the service processing between a PCIe port and its downstream node is abnormal, in order to prevent the service accuracy of the PCIe core from being affected by continuing to execute the abnormal current service, the root complex can first interrupt the current service of the abnormal PCIe port, then generate abnormal interrupt information corresponding to the abnormal PCIe port according to the information related to the fault and the current service, and finally report it to the connected central processing unit. The abnormal interrupt information corresponding to the abnormal PCIe port may include the fault type, the identifier of the abnormal PCIe port, and the identifier of the PCIe core where the abnormal PCIe port is located, and may also include one or more of the type of the current service, the service processing progress, or contact information.

In the above embodiment, each interface inside the PCIe core is a PCIe interface, whereas the interface between the central processing unit and the root complex is not a PCIe interface and does not belong to the PCIe core. In this case, the root complex can transmit the abnormal interrupt information to the central processing unit through a non-PCIe bus, for example through the file transfer protocol (FTP).

In this embodiment of the present application, the fault type of the abnormal PCIe port may include one or more of the following:

Error type 1: Two-bit data error

An error correcting code (ECC) can also be stored in the root complex. ECC can correct 1-bit errors (which are correctable errors (CE)) and detect 2-bit errors (which are uncorrectable errors (UE)). That is, data with only one erroneous bit can be corrected into correct data, while data with two erroneous bits can be detected but cannot be corrected. In implementation, when the root complex detects that the service processing of a PCIe port is abnormal, it can first call the ECC to preprocess the error. If only 1 bit of data is in error, the root complex can locate the erroneous bit by itself, correct it directly into the correct data, and then continue to execute the current service of the PCIe port. In this case, the PCIe port is still a normal PCIe port. However, if 2 bits of data are in error, the root complex can only detect the error and cannot correct it by itself. The PCIe port in this case is an abnormal PCIe port: the root complex can first suspend the current service of the PCIe port, then assemble the corresponding abnormal interrupt information based on the position of the 2-bit error located by the ECC and report it to the central processing unit. In this case, the root complex may label the fault type in the generated abnormal interrupt information as "two-bit data error".
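The "correct one bit, detect two bits" behaviour can be illustrated with an extended Hamming (8,4) code, one common single-error-correcting, double-error-detecting construction. The text does not say which ECC the root complex uses, so this choice, and the function names `encode` and `decode`, are assumptions made purely for illustration. Note that this particular code reports that a two-bit error occurred without recovering the positions of the two bits.

```python
def encode(data_bits):
    """Encode 4 data bits into an 8-bit codeword; index 0 is the overall parity."""
    code = [0] * 8
    code[3], code[5], code[6], code[7] = data_bits
    # Each parity bit covers the positions whose index has that bit set.
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (status, data_bits); data_bits is None for a two-bit error."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall_odd = sum(code) % 2 == 1
    if syndrome == 0 and not overall_odd:
        status, fixed = "ok", list(code)
    elif overall_odd:
        # Odd parity: treat as a single flipped bit and correct it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        fixed = list(code)
        fixed[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits are wrong.
        return "two-bit data error", None
    return status, [fixed[3], fixed[5], fixed[6], fixed[7]]
```

Flipping one bit of an encoded word yields the status "corrected" together with the original data, while flipping two bits yields "two-bit data error", which corresponds to the case in which the port is treated as abnormal.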

Error type 2: Data link layer packet transmission timeout error

The root complex can be composed of an application layer, a transaction layer (TL), a data link layer (DLL) and a physical layer. When transaction processing is performed between the central processing unit and the terminal APP, the application layer first initiates a transaction transmission request to the transaction layer, and the transaction layer generates the corresponding transaction layer packet (TLP) and sends it to the data link layer. The data link layer adds a sequence number and a link cyclic redundancy check (LCRC) code to the TLP to generate the corresponding data link layer packet (DLLP) and sends it to the physical layer, and the transaction is transmitted on the PCIe link corresponding to the PCIe port in the physical layer. After the data link layer sends the DLLP, it waits for the physical layer to return response information indicating successful transmission. Only when the response information is received within a preset time period does the data link layer confirm that the transaction between the data link layer and the physical layer has been received correctly. According to this transaction processing flow, in implementation, if the data link layer in the root complex has not received the successful-transmission response returned by the physical layer for more than the preset time period after sending the DLLP, the root complex can confirm that there is a problem in the transmission between the data link layer and the physical layer. The problem may be an uncorrectable error caused by network delay. In this case, the PCIe port is an abnormal PCIe port. The root complex may generate abnormal interrupt information corresponding to the PCIe port, and may mark the fault type as "data link layer packet transmission timeout error".
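The framing and timeout check described above can be sketched as follows. This is a simplified model under stated assumptions: the 12-bit sequence number field, the use of `zlib.crc32` as a stand-in for the LCRC, and the function names `frame_tlp`, `check_frame` and `ack_status` are all illustrative choices and do not reproduce the actual PCIe wire format.

```python
import struct
import zlib


def frame_tlp(seq, tlp):
    """Prefix a sequence number and append a CRC-32 (stand-in for the LCRC)."""
    body = struct.pack(">H", seq & 0x0FFF) + tlp
    return body + struct.pack(">I", zlib.crc32(body))


def check_frame(frame):
    """Return True if the trailing CRC matches the sequence number and payload."""
    body = frame[:-4]
    (crc,) = struct.unpack(">I", frame[-4:])
    return zlib.crc32(body) == crc


def ack_status(sent_at, ack_at, timeout):
    """Classify one transmission; ack_at is None if no response ever arrived."""
    if ack_at is None or ack_at - sent_at > timeout:
        return "data link layer packet transmission timeout error"
    return "ok"
```

With a timeout of 5 time units, a response arriving 3 units after sending is classified as "ok", whereas a missing response or one arriving after 9 units is classified as the timeout error that marks the port as abnormal.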

Error type 3: Too many retries to write configuration space error

A node in a PCIe core (such as a root complex, switch node, or end node) can support up to 8 functions, such as audio and video. When a node supports multiple functions at the same time, each function of the node has its own configuration space, which stores the relevant information of that function. The configuration space may be an independent storage unit in the node; for example, the size of the configuration space may be 256k. Nodes other than the root complex can only see the information in their own configuration space, whereas the root complex has permission to read and write the configuration space of every node. For example, the root complex can read the configuration space of any node through a transaction layer packet to determine the functions supported by the node, or can write the configuration space of any node through a transaction layer packet to complete the initialization and functional configuration of the node. However, if the node being written is not ready to respond to the root complex's request to write the configuration space, it returns to the root complex a transaction layer response packet whose status is "configuration retry status (CRS)". This indicates that the root complex failed to write the node's configuration space. While the number of failed writes does not exceed a preset number, the write failure is a correctable error and the root complex can retry the write. Once the preset number of transaction layer response packets in the CRS state has been received, the root complex has failed to complete the write within the allowed number of retries, and an uncorrectable error has occurred at the PCIe port on the PCIe link where the abnormal node is located. The PCIe port in this case is an abnormal PCIe port. The root complex can generate corresponding abnormal interrupt information for the PCIe port, and can mark the fault type as "too many retries to write configuration space error".
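The retry rule for configuration writes can be illustrated with a minimal Python sketch. The names `MAX_CRS_RETRIES`, `write_config_with_retry` and `try_write` are hypothetical, and the threshold of 3 is an arbitrary assumption, since the application only speaks of a "preset number".

```python
from typing import Callable

# Hypothetical threshold; the application only refers to a "preset number".
MAX_CRS_RETRIES = 3


def write_config_with_retry(try_write: Callable[[], str],
                            max_retries: int = MAX_CRS_RETRIES) -> str:
    """Retry a configuration write until it succeeds or too many CRS replies arrive.

    try_write() models one write request and returns "OK" or "CRS".
    Returns "success", or "uncorrectable" once max_retries CRS replies were seen.
    """
    crs_count = 0
    while crs_count < max_retries:
        if try_write() == "OK":
            return "success"
        crs_count += 1  # still a correctable error: count it and retry
    return "uncorrectable"  # the port is now treated as an abnormal PCIe port
```

In this sketch, a return value of "uncorrectable" corresponds to the point at which the root complex would generate abnormal interrupt information for the port.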

Error type 4: AXI bus response error

The root complex in the PCIe core is connected to the corresponding central processing unit through an advanced eXtensible interface (AXI) bus, which means that the root complex and the central processing unit must send and receive data in the manner specified by the AXI bus protocol. In this case, after the root complex receives, through the AXI bus, a data processing request issued by the central processing unit for a certain PCIe port, if it cannot return a response to the central processing unit within the time specified by the AXI bus protocol, or the returned response does not conform to the format specified by the AXI bus protocol, the root complex can determine that there is a problem with the PCIe port. The problem may be an uncorrectable error caused by an abnormality of the PCIe port itself, or an uncorrectable error caused by an abnormality of the transmission link between the PCIe port and the downstream node. In this case, the PCIe port is an abnormal PCIe port. The root complex can generate corresponding abnormal interrupt information for the PCIe port, and can mark the fault type as "AXI bus response error". It should be understood that the above only takes detection of AXI bus response errors by the root complex as an example. In other examples, AXI bus response errors can also be detected by the central processing unit itself without being reported by the root complex.

It should be noted that the above only exemplarily lists several possible fault types, and the fault type of an abnormal PCIe port may also be any other port-level uncorrectable error. In addition, the fault of the abnormal PCIe port may be caused by the hardware fault of the PCIe port and the downstream node, or may be caused by the software fault of the PCIe port and the downstream node, which is not specifically limited in this application.

Step 202, the central processing unit determines the fault type according to the abnormal interrupt information corresponding to the abnormal PCIe port: if the fault type is an unrecoverable fault, step 203 is performed; if the fault type is a recoverable fault, step 204 is performed.

In the embodiment of the present application, an unrecoverable fault refers to a fault that cannot be repaired directly or indirectly by the central processing unit and can only be resolved through debugging by specialized personnel, such as hardware damage or a recording medium defect. A recoverable fault refers to a fault that can be repaired directly or indirectly by the central processing unit, such as a fault that can be repaired by a reset, upgrade, update, patch download or restart.

Step 203, the CPU disables the abnormal PCIe port and the corresponding communication link.

In the embodiment of the present application, the central processing unit may further store a recoverable fault record table, which records the recoverable fault types that the central processing unit can recover directly or indirectly. The recoverable faults in the recoverable fault record table may include, for example, one or more of the two-bit data errors described above, data link layer packet transmission timeout errors, too many retries to write configuration space errors, AXI bus response errors, or other recoverable port-level errors. These recoverable fault types can be preset in the recoverable fault record table by R&D personnel based on experience, can be learned by the central processing unit while executing services and stored in the recoverable fault record table in real time, or can be obtained by the central processing unit from information exchanged with other central processing units or network devices, which is not specifically limited.

In implementation, the central processing unit may first obtain the fault type of the abnormal PCIe port from the abnormal interrupt information, and then match the fault type of the abnormal PCIe port against the recoverable fault types in the recoverable fault record table. If none of the recoverable fault types in the table matches the fault type of the abnormal PCIe port, the central processing unit may classify the fault of the abnormal PCIe port as an unrecoverable fault. Since unrecoverable faults cannot be recovered by computer programs or operating techniques, nor corrected by error checking codes or other techniques, once the central processing unit detects an unrecoverable fault on a PCIe port, it can directly disable the PCIe port. This saves the resources of the PCIe core and prevents the unrecoverable fault of the abnormal PCIe port from spreading to the entire PCIe core and causing the services of the entire PCIe core to fail.
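The table lookup that decides between step 203 and step 204 can be sketched as follows. This is only an illustration: the table contents are the fault types named in the text, while `RECOVERABLE_FAULT_TABLE`, `classify_fault`, `next_step` and the dictionary layout of the interrupt information are hypothetical.

```python
from enum import Enum
from typing import Dict


class FaultClass(Enum):
    RECOVERABLE = "recoverable"
    UNRECOVERABLE = "unrecoverable"


# Hypothetical contents of the recoverable fault record table,
# using the fault types named in the description.
RECOVERABLE_FAULT_TABLE = frozenset({
    "two-bit data error",
    "data link layer packet transmission timeout error",
    "too many retries to write configuration space error",
    "AXI bus response error",
})


def classify_fault(interrupt_info: Dict[str, str]) -> FaultClass:
    """Match the reported fault type against the recoverable fault record table."""
    if interrupt_info["fault_type"] in RECOVERABLE_FAULT_TABLE:
        return FaultClass.RECOVERABLE
    return FaultClass.UNRECOVERABLE


def next_step(interrupt_info: Dict[str, str]) -> int:
    """Step 202 branch: 204 (reset) if recoverable, otherwise 203 (disable)."""
    if classify_fault(interrupt_info) is FaultClass.RECOVERABLE:
        return 204
    return 203
```

A fault type that is absent from the table falls through to step 203, which matches the rule that any unmatched fault is treated as unrecoverable.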

In the embodiment of the present application, the central processing unit may disable the abnormal PCIe port and the corresponding PCIe link in various ways, for example:

In one way, the central processing unit can call up the configuration interface of the basic input output system (BIOS) through hot keys or instructions, select the abnormal PCIe port on the configuration interface and issue a disable command. The disable command causes the root complex, by means of an instruction or by writing the configuration space, to disable the port function of the abnormal PCIe port and the node functions of all nodes mounted on the abnormal PCIe port;

In another way, the central processing unit can call up the onboard setting interface of the RAM chip (such as a complementary metal oxide semiconductor (CMOS) chip) on the motherboard, and set the slot of the abnormal PCIe port to "Disabled" in the onboard setting interface, so that the root complex disables the abnormal PCIe port, thereby unloading all nodes mounted on the abnormal PCIe port.

Exemplarily, when the abnormal interrupt information also includes contact information, after disabling the PCIe port and the corresponding PCIe link, the central processing unit can also generate a corresponding alarm message and push the alarm message to the user according to the contact information in the abnormal interrupt information, so that the user can learn of and repair the faulty PCIe port in time and restore the services of the PCIe port as soon as possible.

Step 204, the CPU resets the abnormal PCIe port and the corresponding communication link.

In the embodiment of the present application, the central processing unit may also store the reset method of each recoverable fault recorded in the recoverable fault record table, and the reset method of any recoverable fault may be one or more of a reset, upgrade, update, patch download or restart. In implementation, if the recoverable fault record table contains a target recoverable fault type matching the fault type of the abnormal PCIe port, the central processing unit may classify the fault of the abnormal PCIe port as a recoverable fault, and may use the reset method corresponding to the target recoverable fault type to reset the abnormal PCIe port and the communication link corresponding to the abnormal PCIe port. In this way, the central processing unit can not only restore the services of the abnormal PCIe port as soon as possible through the reset, but also maintain the reliability of the PCIe core without affecting the services of other PCIe ports.

It should be noted that, in general, only the root complex in the PCIe core can directly write the configuration space of the ports and nodes in the PCIe core, while the central processing unit cannot. Therefore, in order to reset the abnormal PCIe port and the corresponding communication link, this method can grant the central processing unit, in advance, the permission to write the configuration space of the ports and nodes in the PCIe core, so that the central processing unit completes the reset by writing the configuration space directly. The reset can also be accomplished indirectly, with the central processing unit sending a corresponding instruction to the root complex to drive the root complex to write the configuration space, which is not specifically limited.

A specific implementation of resetting an abnormal PCIe port and the corresponding communication link is described below, taking the case in which the reset method is a reset as an example.

FIG. 3 exemplarily shows a schematic flowchart corresponding to a reset method provided by an embodiment of the present application, and the method is applicable to a central processing unit, such as the central processing unit 1 or the central processing unit 2 shown in FIG. 1 . In this example, it is assumed that the CPU indirectly writes the configuration space of ports and nodes in the PCIe core. As shown in Figure 3, the method includes:

Step 301, the CPU deactivates the abnormal PCIe port.

In the above step 301, the central processing unit may send a deactivation instruction for the abnormal PCIe port to the root complex, so as to instruct the root complex not to continue the current service of the abnormal PCIe port. The root complex can also record the service processing progress of the abnormal PCIe port before the deactivation after deactivating the abnormal PCIe port, so as to continue to execute the service after recovery. By deactivating the abnormal PCIe port before resetting the abnormal PCIe port, it can avoid affecting the services executed during the process of resetting the abnormal PCIe port, and ensure the accuracy of service execution before and after resetting the abnormal PCIe port.

Step 302, the central processing unit disconnects the communication link between the abnormal PCIe port and the node mounted on the abnormal PCIe port.

In the above step 302, the central processing unit may send a removal instruction for the abnormal PCIe port to the root complex, so as to drive the root complex to write the configuration space of each node mounted on the abnormal PCIe port, remove the nodes mounted on the abnormal PCIe port from the current communication link, and disconnect the connection between the abnormal PCIe port and each node mounted on it. By removing each node mounted on the abnormal PCIe port before resetting the abnormal PCIe port, the abnormal PCIe port can be decoupled from other nodes in the PCIe core, which helps to reset the abnormal PCIe port independently.

Step 303, the CPU fails the abnormal PCIe port.

In the above step 303, the central processing unit may send an invalidation instruction for the abnormal PCIe port to the root complex, so as to drive the root complex to write the configuration space of the abnormal PCIe port and switch the abnormal PCIe port from the enabled state to the disabled state. When the abnormal PCIe port is in the enabled state, it is able to process the data sent by the upstream root complex or the data reported by the downstream node. When the abnormal PCIe port is in the disabled state, it is not able to process such data.

Step 304, the central processing unit resets the media access control (media access control, MAC) logic of the PCIe core.

In the above step 304, the central processing unit may send a MAC reset command for the abnormal PCIe port to the root complex, so as to drive the root complex to initialize the MAC logic of the entire PCIe core and restore the MAC layer communication mechanism of the abnormal PCIe port. By resetting only the MAC layer logic related to the port, without resetting other logic unrelated to the port, the abnormality of the abnormal PCIe port can be recovered in a targeted manner, which improves the efficiency of the PCIe port reset and saves processing resources of the central processing unit and the PCIe core.

Step 305, the CPU resets the SerDes link parameter corresponding to the PCIe core.

In the embodiment of the present application, each communication link in the PCIe core is managed by a serializer/deserializer (SerDes). The SerDes is preset with default SerDes link parameters, according to which it converts parallel transmit data into serial transmit data, or converts serial receive data into parallel receive data. However, when the surrounding environment changes, the default SerDes link parameters may no longer be suitable for the PCIe core, resulting in an abnormality in the communication link where some PCIe ports in the PCIe core are located. In this case, the SerDes link parameters in the SerDes need to be adjusted to suit the current environment.

In implementation, the central processing unit may send a link reset command for the abnormal PCIe port to the root complex, so as to drive the root complex to adaptively calibrate the SerDes link parameters corresponding to the abnormal PCIe port. The calibrated SerDes link parameters may be calculated by obtaining the current environmental parameters (such as temperature) and substituting them into a preset formula, may be obtained by closed-loop feedback adjustment based on the observed effect of each adjustment until the result converges, or may be randomly selected, which is not specifically limited.
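Of the calibration options mentioned, the closed-loop feedback variant can be sketched as follows. This is a generic step-halving search offered purely as an illustration, not the calibration algorithm of the application; the single scalar parameter, the `measure_error` callback and all default values are assumptions.

```python
from typing import Callable


def calibrate(measure_error: Callable[[float], float],
              start: float,
              step: float = 1.0,
              tolerance: float = 1e-3,
              max_iters: int = 100) -> float:
    """Closed-loop adjustment of one hypothetical SerDes link parameter.

    measure_error(value) stands for applying `value` and observing the
    resulting link quality (lower is better). The parameter moves in the
    direction that lowers the error; when neither direction helps, the step
    is halved. The loop stops when the step falls below `tolerance`.
    """
    value = start
    best = measure_error(value)
    for _ in range(max_iters):
        if step < tolerance:
            break  # converged: further adjustments would be negligible
        improved = False
        for candidate in (value + step, value - step):
            error = measure_error(candidate)
            if error < best:
                value, best = candidate, error
                improved = True
                break
        if not improved:
            step /= 2
    return value
```

In practice `measure_error` would be replaced by a real link-quality measurement, and a real SerDes typically has several interacting parameters rather than one.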

Step 306, the central processing unit rebuilds the communication link between the abnormal PCIe port and the node mounted on the abnormal PCIe port.

In the above step 306, the central processing unit may send to the root complex a link rebuild instruction for the entire PCIe core, so that the root complex rebuilds the entire topology of the PCIe core. Alternatively, it may send to the root complex a link rebuild instruction only for the abnormal PCIe port, so that the root complex directly detects the abnormal PCIe port and all nodes mounted on it and adds them to the existing topology of the PCIe core.

In an optional implementation, continuing to refer to FIG. 1, assuming that the links of the entire PCIe core 1 are to be rebuilt, the root complex 1 can sequentially traverse the bus paths of its internal PCIe ports RP1, RP2, and RP3:

Root complex 1 traverses the downstream bus of RP1, finds switch node 1, and assigns a bus address to switch node 1 (which may include the bus number, device number, and function number (bus, device and function number, BDF), etc.); following the depth-first rule, root complex 1 continues to traverse the nodes connected to the downstream bus of switch node 1, finds end node 2, and assigns a bus address to end node 2; since end node 2 does not have a downstream bus, root complex 1 continues to traverse the nodes connected to the downstream bus of switch node 1, finds end node 3, and assigns a bus address to end node 3; since end node 3 does not have a downstream bus, root complex 1 continues to traverse the nodes connected to the downstream bus of switch node 1, finds end node 4, and assigns a bus address to end node 4; so far, the traversal of the bus link where RP1 is located is completed;

Root complex 1 traverses the downstream bus of RP2, finds end node 1, and assigns a bus address to end node 1; so far, the traversal of the bus link where RP2 is located is completed;

Root complex 1 traverses the downstream bus of RP3, finds bridge node 1, and assigns a bus address to bridge node 1; root complex 1 also records the bridge address of bridge node 1, so as to establish an association with the topology obtained by traversing PCIe core 2; so far, the traversal of the bus link where RP3 is located is completed.

Through the above process, the root complex 1 can allocate bus addresses to all nodes in the PCIe core 1, and can construct the topology of the entire PCIe core 1.

It should be noted that only the nodes mounted on the abnormal PCIe port, which were removed earlier, are absent from the topology corresponding to the PCIe core, while the nodes mounted on the other PCIe ports are still present in it. Therefore, even if the links are rebuilt for the entire PCIe core, the nodes mounted on non-abnormal PCIe ports that the root complex traverses do not need to be re-added to the topology of the PCIe core. Instead, only the nodes mounted on the abnormal PCIe port, which are not in the current topology, are added, so that the communication link between the abnormal PCIe port and the nodes mounted on it is rebuilt by supplementing the missing links.
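The depth-first traversal and the "add only what is missing" rule can be sketched together. The topology below follows the PCIe core 1 example described above; the function names are hypothetical, and the bus address is simplified to a plain integer rather than a real BDF.

```python
from typing import Dict, List

# Topology of PCIe core 1 as described in the traversal example above.
TOPOLOGY: Dict[str, List[str]] = {
    "RP1": ["switch node 1"],
    "switch node 1": ["end node 2", "end node 3", "end node 4"],
    "RP2": ["end node 1"],
    "RP3": ["bridge node 1"],
}


def _free_address(known: Dict[str, int]) -> int:
    """Return the lowest integer not yet used as an address (simplified BDF)."""
    used = set(known.values())
    address = 0
    while address in used:
        address += 1
    return address


def enumerate_core(ports: List[str],
                   topology: Dict[str, List[str]],
                   known: Dict[str, int]) -> List[str]:
    """Depth-first walk below each port; address only nodes missing from `known`.

    Returns the nodes that were newly added, in discovery order.
    """
    added: List[str] = []
    for port in ports:
        stack = list(reversed(topology.get(port, [])))
        while stack:
            node = stack.pop()
            if node not in known:
                known[node] = _free_address(known)
                added.append(node)
            # Visit children before siblings (depth-first order).
            stack.extend(reversed(topology.get(node, [])))
    return added
```

Calling `enumerate_core(["RP1", "RP2", "RP3"], TOPOLOGY, known)` a second time, after the nodes below RP1 have been removed from `known`, adds back only those nodes and leaves the entries for RP2 and RP3 untouched.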

Step 307, the CPU activates the abnormal PCIe port.

In the above step 307, the central processing unit may send a validation instruction for the abnormal PCIe port to the root complex, so as to drive the root complex to write the configuration space of the abnormal PCIe port and switch the abnormal PCIe port from the disabled state to the enabled state, thereby restoring the abnormal PCIe port's ability to process the data sent by the upstream root complex or the data reported by the downstream node.

Step 308, the CPU enables the abnormal PCIe port.

In the above step 308, the central processing unit may send an enable instruction for the abnormal PCIe port to the root complex, so as to instruct the root complex to resume the service processing of the abnormal PCIe port. The root complex can re-execute the current service of the abnormal PCIe port from the beginning, or can continue executing the current service from the service processing progress recorded at deactivation, which saves processing resources of the PCIe core and improves service processing efficiency. Alternatively, the root complex may skip the current service and directly execute a new service when one subsequently arrives, which is not specifically limited.
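The order of steps 301 to 308 can be summarized in a short sketch in which the central processing unit drives the root complex indirectly. The command names and the `RootComplexStub` class are hypothetical stand-ins; the application does not define an instruction format.

```python
from typing import List, Tuple

# Step number and a hypothetical command name for each instruction sent
# by the central processing unit to the root complex.
RESET_SEQUENCE: List[Tuple[int, str]] = [
    (301, "deactivate"),       # stop the current service on the port
    (302, "remove_nodes"),     # disconnect nodes mounted on the port
    (303, "invalidate"),       # switch the port to the disabled state
    (304, "reset_mac"),        # reset the MAC logic
    (305, "reset_serdes"),     # recalibrate the SerDes link parameters
    (306, "rebuild_link"),     # re-enumerate nodes below the port
    (307, "validate"),         # switch the port back to the enabled state
    (308, "resume_service"),   # resume service processing on the port
]


class RootComplexStub:
    """Stand-in for the root complex; it only records the commands it receives."""

    def __init__(self) -> None:
        self.log: List[Tuple[str, str]] = []

    def execute(self, command: str, port: str) -> bool:
        self.log.append((command, port))
        return True  # a real root complex would report success or failure


def reset_port(root_complex: RootComplexStub, port: str) -> bool:
    """Issue the reset instructions in order; stop at the first failure."""
    for _step, command in RESET_SEQUENCE:
        if not root_complex.execute(command, port):
            return False
    return True
```

Stopping at the first failed instruction is a design choice of this sketch; the application does not say how a failure in the middle of the sequence is handled.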

In the above Embodiment 1, by resetting the recoverable abnormal PCIe port and the corresponding communication link, the availability of the abnormal PCIe port can be restored in time and the communication service capability of the PCIe port maintained, without restarting the entire PCIe core. Handling the abnormal PCIe port does not affect the service processing of other PCIe ports, which helps to maintain the high reliability of the PCIe system. Further, this solution can handle any recoverable fault of any PCIe node (such as an end node, switch node or bridge node) in the PCIe core, not just faults of end nodes or of one particular type, so the method can effectively detect and repair various faults in the PCIe core, further improving the reliability of the entire PCIe core.

It should be noted that, the above-mentioned first embodiment actually takes the central processing unit as the execution body as an example to introduce the specific implementation process of the fault handling, which is only an optional implementation manner. In an alternative embodiment, the fault handling scheme can also be executed directly by the root complex. In this embodiment, if the root complex detects that an internal PCIe port is abnormal, the PCIe port and the corresponding communication link can be directly reset without reporting to the central processing unit for processing. Although this implementation increases the working pressure of the root complex, it can save the communication overhead between the PCIe core and the central processing unit, and can handle faults faster.

In this embodiment of the present application, the central processing unit may run a Linux operating system. The virtual memory space in the Linux operating system is divided into kernel space and user space. Kernel space is where the kernel of the Linux operating system runs, while user space is where user programs run. Kernel space and user space are isolated from each other, so even if a user program crashes, the kernel space is not affected.

Based on the Linux operating system, FIG. 4 exemplarily shows a schematic diagram of the software and hardware architecture of the fault processing logic provided by an embodiment of the present application. As shown in FIG. 4, in hardware the fault processing solution can be carried out through interaction between a central processing unit provided in the chip hardware and the root complex. The chip hardware can also be connected to other peripherals, such as memory, input and output devices, or drive devices. The overall software logic for fault handling is encapsulated in the central processing unit and involves RAS firmware, a PCIe driver, the advanced configuration and power management interface (ACPI) and SerDes firmware. The programs of the RAS firmware and the SerDes firmware can be pre-written into the memory; the RAS firmware is mainly used for processing interrupts, and the SerDes firmware is mainly used for rebuilding links. The central processing unit may execute the method logic corresponding to the RAS firmware or the SerDes firmware by calling the RAS firmware or the SerDes firmware in the memory. ACPI defines various working interfaces between the operating system, BIOS and system hardware. ACPI can be implemented in the BIOS or system hardware and can be invoked or triggered by the operating system. The PCIe driver is located in the kernel space of the Linux system, and is used to manage the enabling or disabling of each port in the PCIe core and the connection relationship between each port and each node. The PCIe driver can be open-sourced to the community.

It can be understood that the memory in this embodiment of the present application may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (electrically EPROM, EEPROM) or flash memory. Volatile memory may be random access memory (RAM), which acts as an external cache. By way of example and not limitation, many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (synchronous DRAM, SDRAM), double data rate synchronous dynamic random access memory (double data rate SDRAM, DDR SDRAM), enhanced synchronous dynamic random access memory (enhanced SDRAM, ESDRAM), synchronous link dynamic random access memory (synchlink DRAM, SLDRAM) and direct memory bus random access memory (direct rambus RAM, DR RAM). It should be noted that the memory of the systems and methods described herein is intended to include, but not be limited to, these and any other suitable types of memory.

It should be understood that FIG. 4 is only an exemplary introduction, and in other examples, the central processing unit and the root complex may also be provided in different chip hardware. In addition, the memory and the central processing unit may be located in the same physical entity (eg, chip hardware), or may be located in different physical entities.

The specific implementation of the fault handling solution is further introduced in Embodiment 2 from the perspective of the execution of the above-mentioned software and hardware architecture.

[Example 2]

FIG. 5 exemplarily shows a schematic flowchart of another fault handling method provided by an embodiment of the present application. The method is applicable to chip hardware, RAS firmware, ACPI, a PCIe driver, and SerDes firmware. The chip hardware is provided with a central processing unit and an abnormal PCIe port, the PCIe driver is located in the kernel space, and the kernel space interacts with the user space. As shown in FIG. 5, the method includes:

Step 501, when an abnormal PCIe port fails, trigger a RAS interrupt.

In the above step 501, when the fault of the abnormal PCIe port is a correctable fault, the fault can be repaired by the root complex before a RAS interrupt is triggered. However, when the fault of the abnormal PCIe port is an uncorrectable fault, the root complex cannot repair it, so the fault triggers the root complex to interrupt the current service of the abnormal PCIe port and report to the central processing unit the relevant information of the current RAS interrupt, such as the identification of the abnormal PCIe port (core port ID, such as a number), the fault type (error type), and the identification of the PCIe core where the abnormal PCIe port is located (core ID, such as a number).

Step 502, the RAS firmware calls ACPI to generate a corresponding ACPI interrupt event according to the RAS interrupt and reports it to the PCIe driver.

In the above step 502, the central processing unit calls the RAS firmware to process the RAS interrupt, generates the corresponding ACPI interrupt event according to the relevant information of the RAS interrupt and the relevant information (such as the identification) of the central processing unit, and reports it to the PCIe driver in the kernel space.

Step 503, the PCIe driver generates fault information corresponding to the abnormal PCIe port according to the ACPI interrupt event, and adds the fault information to a work queue (i.e., a preset work queue).

In the above step 503, the PCIe driver responds to the ACPI interrupt event related to the PCIe core. It first extracts the identification of the abnormal PCIe port, the fault type and the identification of the PCIe core where the abnormal PCIe port is located from the ACPI interrupt event, then assembles this information, together with the identification of the chip in which the PCIe core is located (socket ID, such as a number), into fault information according to the set message structure, and adds the fault information to the work queue. The work queue is located in the kernel space and is used to store each piece of fault information that occurs in the PCIe core. In implementation, the central processing unit may set a corresponding work queue for each PCIe core in the kernel space, in which case the PCIe driver in any central processing unit adds the fault information that occurs in the PCIe core connected to that central processing unit to the work queue corresponding to that PCIe core. Alternatively, the central processing unit may set one shared work queue for all PCIe cores in the kernel space, in which case the PCIe driver in each central processing unit adds the fault information that occurs in the PCIe cores connected to it to the same work queue. The central processing unit may also set one work queue for some PCIe cores and another work queue for other PCIe cores in the kernel space, and so on.
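The assembly of fault information and the per-core work queue can be sketched in Python. The field names follow the identifiers mentioned in the text (socket ID, core ID, core port ID, error type), but the `FaultInfo` layout, the dictionary form of the ACPI event and the helper names are assumptions; a real driver would use kernel data structures rather than Python objects.

```python
from collections import deque
from dataclasses import dataclass
from typing import Any, Deque, Dict


@dataclass(frozen=True)
class FaultInfo:
    socket_id: int    # chip in which the PCIe core is located
    core_id: int      # PCIe core containing the abnormal port
    port_id: int      # abnormal PCIe port
    error_type: str   # fault type reported with the interrupt


def build_fault_info(acpi_event: Dict[str, Any], socket_id: int) -> FaultInfo:
    """Extract the identifiers from an ACPI event and add the socket ID."""
    return FaultInfo(
        socket_id=socket_id,
        core_id=int(acpi_event["core_id"]),
        port_id=int(acpi_event["core_port_id"]),
        error_type=str(acpi_event["error_type"]),
    )


# One work queue per PCIe core (the first of the options described above).
work_queues: Dict[int, Deque[FaultInfo]] = {}


def enqueue(info: FaultInfo) -> None:
    """Append the fault information to the queue of its PCIe core."""
    work_queues.setdefault(info.core_id, deque()).append(info)
```

Using a single shared queue instead would only require replacing the `core_id` key with a constant.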

Step 504, the PCIe driver takes out a piece of fault information from the work queue and determines whether the fault type in the fault information is a recoverable fault; if not, step 505 is executed; if yes, step 506 is executed.

In the above step 504, the PCIe driver can take out the fault information from the work queue in first-in-first-out order, so as to recover the PCIe ports in order of fault time from earliest to latest; or it can take out the fault information in first-in-last-out order, so as to recover the most recent PCIe port fault first and the earlier PCIe port faults afterwards; or it can take out the fault information in order of fault severity from most to least severe, which is not specifically limited.
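The three retrieval orders can be compared in a small sketch. The `(port, severity)` tuple, the policy names and the numeric severity scale (larger means more severe) are assumptions made for illustration only.

```python
from typing import List, Tuple

# (port, severity); a larger severity value means a more severe fault.
Fault = Tuple[str, int]


def take_next(queue: List[Fault], policy: str) -> Fault:
    """Remove and return the next fault according to the chosen policy."""
    if not queue:
        raise IndexError("work queue is empty")
    if policy == "fifo":       # earliest fault first
        return queue.pop(0)
    if policy == "lifo":       # most recent fault first
        return queue.pop()
    if policy == "severity":   # most severe fault first, earliest wins ties
        index = max(range(len(queue)), key=lambda i: queue[i][1])
        return queue.pop(index)
    raise ValueError(f"unknown policy: {policy}")
```

For a queue `[("RP1", 1), ("RP2", 3), ("RP3", 2)]`, the three policies select RP1, RP3 and RP2 respectively.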

Exemplarily, when the central processing unit includes multiple central processing unit cores, the PCIe driver may also call the multiple central processing unit cores to jointly process the pieces of fault information. For example, the PCIe driver can pre-allocate the pieces of fault information to the multiple central processing unit cores in a balanced manner, can call the most idle central processing unit core to process a pending piece of fault information, or can allocate a pending piece of fault information to the central processing unit core best suited to processing the current service, which is not limited.
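The "most idle core" option can be sketched as a simple least-loaded assignment. The load metric (number of pending faults per core) and the function names are assumptions; the application does not specify how idleness is measured.

```python
from typing import Dict, List


def pick_most_idle(loads: Dict[int, int]) -> int:
    """Return the core with the fewest pending faults (lowest ID wins ties)."""
    return min(loads, key=lambda core: (loads[core], core))


def dispatch(faults: List[str], loads: Dict[int, int]) -> Dict[str, int]:
    """Assign each fault to the currently most idle core and update its load."""
    assignment: Dict[str, int] = {}
    for fault in faults:
        core = pick_most_idle(loads)
        loads[core] += 1
        assignment[fault] = core
    return assignment
```

Because the load is updated after every assignment, a burst of faults is spread across cores rather than piled onto the one that was idle at the start.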

Step 505, the PCIe driver pushes the fault information to the user space.

In the above step 505, when the PCIe driver cannot recover the fault of the abnormal PCIe port by itself, pushing the fault to the user notifies the user in time, which facilitates early manual maintenance and prevents the PCIe port from remaining unavailable for a long time.

Exemplarily, each time the PCIe driver adds a piece of fault information to the work queue, it can also push a log message to the user space, so that the user can know the abnormal situation of the entire PCIe core and the current working pressure of the central processing unit in time.

Step 506, the PCIe driver in conjunction with the ACPI and SerDes firmware executes the overall logic of fault handling, and the execution process includes the following steps:

Step 5061, the PCIe driver calls the PCIe public interface to disable the abnormal PCIe port;

Step 5062, the PCIe driver calls the PCIe public interface to remove all nodes mounted on the abnormal PCIe port;

In the above steps 5061 and 5062, the PCIe public interface may be located in a public register, and the methods in the public register may be open sourced to the community and visible to other users. The implementation manner of the PCIe public interface may refer to the existing logic, which will not be repeated here.

Step 5063, the PCIe driver invokes the port reset method in the ACPI interface.

In the above step 5063, the port reset method is written in the programming language specified by the ACPI interface and added to the ACPI interface logic as an interface. In this way, the PCIe driver can execute the corresponding port reset logic by calling the interface name corresponding to the port reset method. The interface name of the port reset method can be user-defined, for example, RP reset.

Step 5064, the PCIe driver first resets the MAC logic of the PCIe core according to the port reset method;

Step 5065, the PCIe driver resets the SerDes link parameters corresponding to the PCIe core by calling the SerDes firmware according to the port reset method;

Exemplarily, the port reset method may be located in the public register of the chip, and the PCIe driver directly invokes both the PCIe public interface and the port reset method in the public register of the chip through in-chip messages, thereby reducing message transmission between chips. Alternatively, the port reset method can be located in the private register of the chip. After the PCIe driver calls the PCIe public interface in the public register of the chip, it then calls the port reset method in the private register of the chip through in-chip messages, so as to protect the privacy of the port reset method.

Step 5066, the PCIe driver resets the SerDes link parameters corresponding to the PCIe core according to the SerDes firmware;

In the above step 5066, the SerDes firmware can be written in any programming language, such as C++ or Python. When the SerDes link is reset, the SerDes link set in the chip receives the reset command issued by the PCIe driver, calibrates the parameters of the SerDes link according to the current environment, and then adaptively recalibrates the parameters according to the observed effect of the calibrated parameters, so as to restore the SerDes link corresponding to the abnormal PCIe port in the PCIe core to the best possible state.

Illustratively, the SerDes firmware may be located in a private register of the chip. Because the PCIe driver resets the SerDes link by calling the SerDes firmware from within the port reset method, the SerDes link reset method provided by the SerDes firmware remains invisible to the outside world even if the port reset method is open-sourced to the community, which helps ensure the security of the SerDes link reset method to the greatest extent.

Step 5067, the PCIe driver determines that the SerDes firmware invocation is completed, and returns to the port reset method;

Step 5068, the PCIe driver determines that the port reset method call is completed, and returns to the PCIe public interface;

Step 5069, the PCIe driver rebuilds the topology of the PCIe core by enumerating and traversing.
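The nested call-and-return order of steps 5064 to 5069 can be illustrated with a minimal Python sketch. It only models the ordering of the steps; the class `PcieCore`, the function names, and the log strings are hypothetical names chosen for illustration, and no real hardware, ACPI, or firmware interface is involved.

```python
from dataclasses import dataclass, field


@dataclass
class PcieCore:
    """Hypothetical stand-in for one PCIe core; it only records actions."""
    name: str
    log: list = field(default_factory=list)


def serdes_firmware_reset(core: PcieCore) -> None:
    # Steps 5065-5066: the SerDes firmware resets the link parameters.
    core.log.append("serdes: link parameters reset")


def port_reset_method(core: PcieCore) -> None:
    # Step 5064: the MAC logic of the PCIe core is reset first.
    core.log.append("mac: logic reset")
    # Step 5065: the port reset method calls the SerDes firmware.
    serdes_firmware_reset(core)
    # Step 5067: the firmware call has completed; control is back here.
    core.log.append("port reset method: serdes firmware returned")


def pcie_public_interface(core: PcieCore) -> None:
    # The driver enters the port reset method through the public interface.
    port_reset_method(core)
    # Step 5068: the port reset method has completed.
    core.log.append("public interface: port reset method returned")
    # Step 5069: the topology is rebuilt by enumeration and traversal.
    core.log.append("driver: topology re-enumerated")
```

Calling `pcie_public_interface(PcieCore("core0"))` leaves the five actions in the log in the order in which the steps are listed above, which shows that only the reset core is touched.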

In the second embodiment above, chip hardware, chip firmware (including RAS firmware and SerDes firmware), the PCIe driver, and ACPI cooperate so that the central processing unit can complete fault detection, reset, and service recovery of the PCIe port through a combination of software and hardware. With this method, recovering from a fault on one PCIe port does not affect the normal services of other PCIe ports. Moreover, even if the PCIe fault recovery driver is open-sourced to the community, this does not expose the SerDes firmware stored in the chip's private register, or the port reset method and the SerDes firmware when both are stored there. Protecting the port reset logic as far as possible reduces the probability that the reset process of an abnormal PCIe port is disturbed from outside and improves reset accuracy.

According to the foregoing method, FIG. 6 is a schematic structural diagram of a fault processing apparatus 600 provided by an embodiment of the present application, and the fault processing apparatus 600 may be a chip or a circuit, such as a chip or a circuit that may be provided in a central processing unit. The fault processing device 600 may correspond to the central processing unit in the above method. The fault handling apparatus 600 may implement any one or more of the corresponding method steps as shown in FIG. 2 to FIG. 5 . As shown in FIG. 6 , the fault handling device 600 may include a monitoring circuit 601 and a processing circuit 602. Further, the fault processing device 600 may further include a bus system, and the monitoring circuit 601 and the processing circuit 602 may be connected through the bus system. Moreover, the monitoring circuit 601 can also be connected to each PCIe port in the PCIe core through the bus system, and the processing circuit 602 can also be connected to the root complex in the PCIe core through the bus system.

In this embodiment of the present application, the monitoring circuit 601 may receive the abnormal interrupt information reported by the abnormal PCIe port and send it to the processing circuit 602. Correspondingly, the processing circuit 602 may first determine the fault type corresponding to the abnormal PCIe port according to the abnormal interrupt information and, when that fault type is a recoverable fault, reset the abnormal PCIe port and the communication link of the abnormal PCIe port, so as to restore the connection between the abnormal PCIe port and the PCIe node.
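The decision made by the processing circuit 602 can be sketched as follows. The four recoverable fault types are those listed in the claims, and the disable branch for unrecoverable faults also follows the claims; the enum `FaultType`, the dictionary keys `port_id` and `fault_type`, and the returned strings are hypothetical and serve only to illustrate the branching.

```python
from enum import Enum


class FaultType(Enum):
    DLLP_TX_TIMEOUT = "data link layer packet transmission timeout"
    TLP_CFG_WRITE_RETRIES = "too many retries of a TLP configuration write"
    TWO_BIT_DATA_ERROR = "two-bit data error"
    AXI_BUS_RESPONSE_ERROR = "AXI bus response error"
    OTHER = "any other fault (treated here as unrecoverable)"


# Fault types the claims list as recoverable.
RECOVERABLE = frozenset({
    FaultType.DLLP_TX_TIMEOUT,
    FaultType.TLP_CFG_WRITE_RETRIES,
    FaultType.TWO_BIT_DATA_ERROR,
    FaultType.AXI_BUS_RESPONSE_ERROR,
})


def handle_abnormal_interrupt(info: dict) -> str:
    """Return the action chosen for one piece of abnormal interrupt information."""
    port = info["port_id"]
    if info["fault_type"] in RECOVERABLE:
        # Recoverable: reset the port and its communication link.
        return f"reset port {port} and its link"
    # Unrecoverable: disable the port and its communication link.
    return f"disable port {port} and its link"
```

For example, `handle_abnormal_interrupt({"port_id": 3, "fault_type": FaultType.TWO_BIT_DATA_ERROR})` selects the reset branch, while a fault of type `FaultType.OTHER` selects the disable branch.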

For concepts, explanations, detailed descriptions, and other steps related to the technical solutions provided by the embodiments of the present application and involved in the fault processing apparatus 600, refer to the descriptions of the foregoing methods or other embodiments; they are not repeated here.

According to the foregoing method, FIG. 7 is a schematic structural diagram of another fault processing apparatus 700 provided by an embodiment of the present application. The fault processing apparatus 700 may be a chip or a circuit, such as a chip or a circuit that may be provided in a central processing unit. The fault processing device 700 may correspond to the central processing unit in the above method. The fault handling apparatus 700 may implement any one or more of the corresponding method steps shown in FIG. 2 to FIG. 5 . As shown in FIG. 7 , the fault processing apparatus 700 may include a communication interface 701 , a determination unit 702 and a processing unit 703 .

In this embodiment of the present application, the communication interface 701 may be a receiving unit or a receiver when receiving information, and the receiving unit or receiver may be a radio frequency circuit. In a specific implementation, the communication interface 701 may receive the abnormal interrupt information reported by the abnormal PCIe port, and the determining unit 702 may determine the fault type corresponding to the abnormal PCIe port according to the abnormal interrupt information. When the fault type corresponding to the abnormal PCIe port is a recoverable fault, the processing unit 703 may reset the abnormal PCIe port and the communication link of the abnormal PCIe port, so as to restore the connection between the abnormal PCIe port and the PCIe node.
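One way to picture the cooperation of the three units is a small Python sketch in which received abnormal interrupt information is buffered in a work queue, as recited in the claims, before it is examined. The class `FaultHandlingApparatus`, its method names, the dictionary keys, and the caller-supplied `is_recoverable` predicate are all hypothetical; this is a sketch under those assumptions, not the apparatus itself.

```python
from collections import deque


class FaultHandlingApparatus:
    """Hypothetical model of apparatus 700 with a preset work queue."""

    def __init__(self, is_recoverable):
        self._queue = deque()                  # preset work queue (FIFO)
        self._is_recoverable = is_recoverable  # predicate used by unit 702
        self.reset_ports = []                  # (core_id, port_id) pairs reset

    def receive(self, info: dict) -> None:
        # Communication interface 701: store the reported information.
        self._queue.append(info)

    def process_pending(self) -> None:
        # Take each entry from the queue in arrival order.
        while self._queue:
            info = self._queue.popleft()
            # Determining unit 702: decide whether the fault is recoverable.
            if self._is_recoverable(info["fault_type"]):
                # Processing unit 703: reset only the affected port.
                self.reset_ports.append((info["core_id"], info["port_id"]))
```

A caller would construct the object with its own predicate, call `receive` once per reported interrupt, and then call `process_pending`; only ports whose fault the predicate accepts end up in `reset_ports`.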

For concepts, explanations, detailed descriptions, and other steps related to the technical solutions provided by the embodiments of the present application and involved in the fault processing apparatus 700, refer to the descriptions of the foregoing methods or other embodiments; they are not repeated here.

It can be understood that, the functions of each unit in the above-mentioned fault processing apparatus 700 may refer to the implementation of the corresponding method embodiments, which will not be repeated here.

It should be understood that the division of the units of the above fault processing apparatus 700 is only a division of logical functions, and in actual implementation, all or part of them may be integrated into one physical entity, or may be physically separated. In this embodiment of the present application, the communication interface 701 may be implemented by the monitoring circuit 601 in the foregoing FIG. 6 , and the determining unit 702 and the processing unit 703 may be implemented by the processing circuit 602 in the foregoing FIG. 6 .

According to the method provided by the embodiment of the present application, the present application further provides a fault processing system, where the fault processing system includes the central processing unit and the PCIe core described in any of the foregoing contents. The PCIe core includes a root complex and at least one PCIe node, and the root complex is connected to a downstream PCIe node through at least one PCIe port set inside. The central processing unit may execute the method of any one of the embodiments shown in FIG. 1 to FIG. 5 to implement fault processing for an abnormal PCIe port in the at least one PCIe port.

According to the method provided by the embodiment of the present application, the present application also provides a computer program product. The computer program product includes computer program code which, when run on a computer, causes the computer to execute the method of any one of the embodiments shown in FIG. 1 to FIG. 5.

According to the method provided by the embodiment of the present application, the present application further provides a computer-readable storage medium. The computer-readable storage medium stores program code which, when run on a computer, causes the computer to execute the method of any one of the embodiments shown in FIG. 1 to FIG. 5.

The terms "component", "module", "system" and the like are used in this specification to refer to a computer-related entity, hardware, firmware, a combination of hardware and software, software, or software in execution. For example, a component may be, but is not limited to, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a computing device and the computing device may be components. One or more components may reside within a process and/or thread of execution, and a component may be localized on one computer and/or distributed between two or more computers. In addition, these components can execute from various computer-readable media having various data structures stored thereon. A component may, for example, communicate through local and/or remote processes based on a signal having one or more data packets (e.g., data from one component interacting with another component in a local system, in a distributed system, and/or across a network such as the Internet that interacts with other systems by way of the signal).

Those of ordinary skill in the art will appreciate that the various illustrative logical blocks and steps described in connection with the embodiments disclosed herein can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functionality using different methods for each particular application, but such implementations should not be considered beyond the scope of this application.

Those skilled in the art can clearly understand that, for the convenience and brevity of description, the specific working process of the above-described systems, devices and units may refer to the corresponding processes in the foregoing method embodiments, which will not be repeated here.

In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are only illustrative. The division of the units is only a division of logical functions, and there may be other division methods in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.

The units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.

In addition, each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.

The functions, if implemented in the form of software functional units and sold or used as independent products, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the various embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

The above are only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive of within the technical scope disclosed in the present application should be covered within the protection scope of this application. Therefore, the protection scope of the present application should be subject to the protection scope of the claims.

Claims

A fault handling method, characterized in that the method comprises:

Obtain the abnormal interrupt information corresponding to the abnormal peripheral component interconnect express (PCIe) port;

Determine the fault type corresponding to the abnormal PCIe port according to the abnormal interrupt information;

In the case that the fault type corresponding to the abnormal PCIe port is a recoverable fault, resetting the abnormal PCIe port and the communication link of the abnormal PCIe port;

Wherein, the communication link of the abnormal PCIe port is used to connect the abnormal PCIe port and the PCIe node.
The method of claim 1, wherein the PCIe node is an end node, a switch node or a bridge node.
The method according to claim 1 or 2, wherein the resetting the abnormal PCIe port comprises:

Reset the medium access control (MAC) layer logic of the PCIe core where the abnormal PCIe port is located.
The method according to any one of claims 1 to 3, wherein the resetting the communication link of the abnormal PCIe port comprises:

Reset the serializer/deserializer (SerDes) link parameter corresponding to the PCIe core where the abnormal PCIe port is located.
The method of claim 4, wherein:

Before resetting the serializer/deserializer (SerDes) link parameter corresponding to the PCIe core where the abnormal PCIe port is located, the method further includes:

Disconnecting the communication link between the abnormal PCIe port and the PCIe node mounted on the abnormal PCIe port;

After resetting the serializer/deserializer (SerDes) link parameter corresponding to the PCIe core where the abnormal PCIe port is located, the method further includes:

Rebuild the communication link between the abnormal PCIe port and the PCIe node mounted on the abnormal PCIe port.
The method according to any one of claims 1 to 5, wherein the method further comprises:

In the case that the fault type corresponding to the abnormal PCIe port is an unrecoverable fault, the abnormal PCIe port and the communication link of the abnormal PCIe port are disabled.
The method according to any one of claims 1 to 6, wherein the recoverable fault includes one or more of the following:

A data link layer packet transmission timeout error, an error of too many retries when a transaction layer packet writes the configuration space, a two-bit data error, and an advanced extensible interface (AXI) bus response error.
The method according to any one of claims 1 to 7, wherein the abnormal interrupt information corresponding to the abnormal PCIe port includes one or more of the following contents:

The identifier of the abnormal PCIe port, the fault type corresponding to the abnormal PCIe port, the identifier of the PCIe core where the abnormal PCIe port is located, and the identifier of the central processing unit CPU to which the PCIe core is connected.
The method according to any one of claims 1 to 8, wherein the acquiring abnormal interrupt information corresponding to the abnormal peripheral component interconnect express (PCIe) port comprises:

Obtain the abnormal interrupt information corresponding to the abnormal PCIe port from the preset work queue;

The preset work queue is used for storing abnormal interrupt information corresponding to abnormal PCIe ports in each PCIe core.
A fault handling device, characterized in that it comprises a processor and a memory, wherein a computer program is stored in the memory;

The processor performs the following operations by calling the computer program stored in the memory:

Obtain the abnormal interrupt information corresponding to the abnormal peripheral component interconnect express (PCIe) port;

Determine the fault type corresponding to the abnormal PCIe port according to the abnormal interrupt information;

In the case that the fault type corresponding to the abnormal PCIe port is a recoverable fault, resetting the abnormal PCIe port and the communication link of the abnormal PCIe port;

Wherein, the communication link of the abnormal PCIe port is used to connect the abnormal PCIe port and the PCIe node.
The apparatus of claim 10, wherein the PCIe node is an end node, a switch node or a bridge node.
The device according to claim 10 or 11, wherein the fault handling device further comprises an advanced configuration and power interface (ACPI);

The processor specifically performs the following operations by calling the computer program stored in the memory:

By calling the ACPI, execute: reset the MAC logic of the medium access control layer of the PCIe core where the abnormal PCIe port is located.
The device of claim 12, wherein the fault handling device further comprises serializer/deserializer (SerDes) firmware;

The processor also performs the following operations by calling the computer program stored in the memory:

By calling the ACPI, execute: after resetting the medium access control layer MAC logic of the PCIe core where the abnormal PCIe port is located, calling the SerDes firmware;

By calling the SerDes firmware, execute: reset the SerDes link parameter corresponding to the PCIe core where the abnormal PCIe port is located.
The apparatus of claim 13, wherein the fault handling apparatus further comprises a PCIe driver;

The processor also performs the following operations by calling the computer program stored in the memory:

By calling the PCIe driver, execute: disconnect the communication link between the abnormal PCIe port and the PCIe node mounted on the abnormal PCIe port, and call the ACPI;

By calling the SerDes firmware, execute: after resetting the SerDes link parameter corresponding to the PCIe core where the abnormal PCIe port is located, return to calling the PCIe driver;

After returning to call the PCIe driver, execute: rebuild the communication link between the abnormal PCIe port and the PCIe node mounted on the abnormal PCIe port.
The apparatus of claim 14, wherein the memory includes a public register and a private register;

The PCIe driver and the ACPI are stored in the public register, and the SerDes firmware is stored in the private register; or,

The PCIe driver is stored in the public register, and the ACPI and SerDes firmware are stored in the private register.
The apparatus according to any one of claims 10 to 15, wherein the processor further performs the following operations by calling the computer program stored in the memory:

In the case that the fault type corresponding to the abnormal PCIe port is an unrecoverable fault, the abnormal PCIe port and the communication link of the abnormal PCIe port are disabled.
The apparatus according to any one of claims 10 to 16, wherein the recoverable fault includes one or more of the following:

A data link layer packet transmission timeout error, an error of too many retries when a transaction layer packet writes the configuration space, a two-bit data error, and an advanced extensible interface (AXI) bus response error.
The apparatus according to any one of claims 10 to 17, wherein the abnormal interrupt information corresponding to the abnormal PCIe port includes one or more of the following:

The identifier of the abnormal PCIe port, the fault type corresponding to the abnormal PCIe port, the identifier of the PCIe core where the abnormal PCIe port is located, and the identifier of the central processing unit CPU to which the PCIe core is connected.
The apparatus of any one of claims 10 to 18, further comprising a communication interface;

The processor specifically performs the following operations by calling the computer program stored in the memory:

receiving the abnormal interrupt information corresponding to the abnormal PCIe port through the communication interface;

adding the abnormal interrupt information corresponding to the abnormal PCIe port to a preset work queue; the preset work queue is used to store abnormal interrupt information corresponding to the abnormal PCIe port in each PCIe core;

Acquire abnormal interrupt information corresponding to the abnormal PCIe port from the preset work queue.
A fault handling system, characterized in that it includes a central processing unit and a peripheral component interconnect express (PCIe) core, wherein the PCIe core includes a root complex and at least one PCIe node, and the central processing unit is connected to the root complex; the root complex includes at least one PCIe port, and the root complex is connected to the at least one PCIe node through the at least one PCIe port;

The root complex is used to generate abnormal interrupt information corresponding to an abnormal PCIe port in the at least one PCIe port and report it to the central processing unit;

The central processing unit is configured to perform fault processing on the abnormal PCIe port according to the fault processing method according to any one of claims 1 to 9.
A computer-readable storage medium, characterized in that the computer-readable storage medium stores program code which, when executed on a computer, causes the computer to execute the method according to any one of claims 1 to 9.
A computer program product, characterized in that it includes computer program code, which, when executed on a computer, causes the computer to perform the method according to any one of claims 1 to 9.