WO2022155919A1 - Fault handling method and apparatus, and system - Google Patents

Fault handling method and apparatus, and system

Info

Publication number
WO2022155919A1
Authority
WO
WIPO (PCT)
Prior art keywords
pcie
abnormal
port
pcie port
fault
Prior art date
Application number
PCT/CN2021/073396
Other languages
French (fr)
Chinese (zh)
Inventor
胡成
董钰山
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to PCT/CN2021/073396 (WO2022155919A1)
Priority to CN202180090841.4A (CN116724297A)
Publication of WO2022155919A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/07: Responding to the occurrence of a fault, e.g. fault tolerance

Definitions

  • the present application relates to the field of communication technologies, and in particular, to a fault handling method, device, and system.
  • Peripheral component interconnect express (PCIe) is a communication interface that can read and write memory quickly and can support ultra-high-bandwidth communication. It has been widely used in fields such as networking, communication, storage, and industrial and consumer electronics.
  • the main components of a PCIe system include a root complex (RC), a switch node (switch), and an end node (endpoint, EP).
  • the root complex is used to manage all buses and all nodes in the PCIe system, and is a bridge for communication between nodes in the PCIe system.
  • a root complex may contain multiple PCIe ports, and the root complex is respectively connected to multiple nodes, such as multiple end nodes or multiple switching nodes, through multiple PCIe ports.
  • a switch node can be used to connect the root complex and other switch nodes, or to connect the root complex and end nodes, and is a data forwarding node in the PCIe system.
  • An end node is an end device, such as a peripheral device, for receiving data or sending data.
  • the present application provides a fault handling method, device and system to solve the technical problem of low reliability of the PCIe system caused by restarting the entire PCIe system to restore an abnormal PCIe port in the prior art.
  • the present application provides a fault handling method, which is applicable to a central processing unit, and the central processing unit can be directly or indirectly connected to each PCIe port in a PCIe system.
  • the method includes: after receiving the abnormal interrupt information reported by an abnormal PCIe port, the central processing unit first determines the fault type corresponding to the abnormal PCIe port according to the abnormal interrupt information and, when the fault type is determined to be a recoverable fault, resets the abnormal PCIe port and the communication link of the abnormal PCIe port.
  • the communication link of the abnormal PCIe port is used to connect the abnormal PCIe port and the PCIe node.
  • the abnormal PCIe port is provided in the root complex, and the PCIe node can be any node connected to the root complex, such as an end node, a switch node, or a bridge node.
  • the scheme can detect link abnormalities not only of end nodes but also of other types of nodes (such as switch nodes or bridge nodes).
  • handling these various types of link abnormality helps maintain the availability of the PCIe system to the greatest extent possible.
  • the central processing unit may reset the abnormal PCIe port by resetting the media access control (MAC) layer logic of the PCIe core where the abnormal PCIe port is located.
  • the abnormality of the abnormal PCIe port can be recovered in a targeted manner, which improves the efficiency of the PCIe port reset.
  • the processing resources of the central processing unit and PCIe core are saved.
  • the central processing unit may reset the communication link of the abnormal PCIe port by resetting the serializer/deserializer (SerDes) link parameters corresponding to the PCIe core where the abnormal PCIe port is located.
  • the design can calibrate the current SerDes link parameters of the abnormal PCIe port when the surrounding environment changes, and restore the communication quality of the abnormal PCIe port by adjusting them to parameters suitable for the current environment.
  • before resetting the SerDes link parameters corresponding to the PCIe core where the abnormal PCIe port is located, the CPU can also disconnect the communication link between the abnormal PCIe port and the PCIe node mounted on the abnormal PCIe port; after resetting those SerDes link parameters, the CPU rebuilds the communication link between the abnormal PCIe port and the PCIe node mounted on the abnormal PCIe port.
  • the abnormal PCIe port can be decoupled from other nodes in the PCIe core, which helps the abnormal PCIe port to be reset independently.
  • when the CPU determines that the fault type corresponding to the abnormal PCIe port is an unrecoverable fault, it can disable the abnormal PCIe port and the communication link of the abnormal PCIe port, so as to save the resources of the PCIe core and to keep the unrecoverable fault of the abnormal PCIe port from spreading to the entire PCIe core and causing service failure of the entire PCIe core.
  • recoverable faults may include one or more of the following: a data link layer packet transmission timeout error, a transaction layer packet write configuration space too many retries error, a two-bit data error, and an advanced extensible interface (AXI) bus response error.
  • the design can repair various types of errors related to PCIe ports: data link layer packet transmission timeout errors and transaction layer packet write configuration space too many retries errors, which relate to communication with PCIe nodes; two-bit data errors, which relate to the port's own data storage; and AXI bus response errors, which relate to transfers with the CPU. This helps maintain PCIe port availability more fully.
  • the abnormal interrupt information corresponding to the abnormal PCIe port may include one or more of the following: the identifier of the abnormal PCIe port, the fault type corresponding to the abnormal PCIe port, the identifier of the PCIe core where the abnormal PCIe port is located, and the identifier of the central processing unit (CPU) connected to the PCIe core.
  • the central processing unit can obtain some features related to the current abnormality, so as to determine the fault type corresponding to the abnormal PCIe port.
  • the central processing unit may obtain the abnormal interrupt information corresponding to the abnormal PCIe port from a preset work queue, where the preset work queue stores the abnormal interrupt information corresponding to the abnormal PCIe ports in each PCIe core.
  • the design can centrally manage port exceptions occurring in each PCIe core through a preset work queue, effectively improving the flexibility of recovery of abnormal PCIe ports in each PCIe core.
  • the present application provides a fault handling device, including a processor and a memory, and a computer program is stored in the memory.
  • the processor can perform the following operations: obtain abnormal interrupt information corresponding to an abnormal peripheral component interconnect express (PCIe) port, determine the fault type corresponding to the abnormal PCIe port according to the abnormal interrupt information, and, if the fault type corresponding to the abnormal PCIe port is a recoverable fault, reset the abnormal PCIe port and the communication link of the abnormal PCIe port.
  • the communication link of the abnormal PCIe port is used to connect the abnormal PCIe port and the PCIe node.
  • a PCIe node may be one or more of an end node, a switch node, or a bridge node.
  • the fault handling device may also include an advanced configuration and power interface (ACPI).
  • the processor can reset the media access control (MAC) layer logic of the PCIe core where the abnormal PCIe port is located by calling the ACPI.
  • the fault handling device may further include serializer/deserializer (SerDes) firmware.
  • after resetting the MAC layer logic of the PCIe core where the abnormal PCIe port is located, the processor may also call the SerDes firmware and, by calling the SerDes firmware, reset the SerDes link parameters corresponding to the PCIe core where the abnormal PCIe port is located.
  • the fault handling device may further include a PCIe driver.
  • by calling the PCIe driver, the processor can disconnect the communication link between the abnormal PCIe port and the PCIe node mounted on the abnormal PCIe port, and then call the ACPI.
  • after the SerDes link parameters corresponding to the PCIe core where the abnormal PCIe port is located have been reset by calling the SerDes firmware, control returns to the PCIe driver, and by returning to the PCIe driver the communication link between the abnormal PCIe port and the PCIe node mounted on the abnormal PCIe port can be rebuilt.
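The call sequence described above (the PCIe driver disconnects the link, the ACPI resets the MAC logic, the SerDes firmware resets the link parameters, and the PCIe driver rebuilds the link) can be sketched as follows. This is only an illustration of the ordering: every function name, the `log` list, and the identifiers passed in are hypothetical stand-ins, and no real driver, ACPI, or firmware interface is called.

```python
def driver_disconnect_link(port_id, log):
    # Stand-in for the PCIe driver tearing down the link to the mounted node.
    log.append(("pcie_driver", "disconnect_link", port_id))


def acpi_reset_mac(core_id, log):
    # Stand-in for the ACPI method that resets the MAC logic of the PCIe core.
    log.append(("acpi", "reset_mac", core_id))


def serdes_reset_link_params(core_id, log):
    # Stand-in for the SerDes firmware resetting the link parameters.
    log.append(("serdes_firmware", "reset_link_params", core_id))


def driver_rebuild_link(port_id, log):
    # Stand-in for the PCIe driver re-establishing the link after the reset.
    log.append(("pcie_driver", "rebuild_link", port_id))


def reset_abnormal_port(core_id, port_id):
    """Run the four reset steps in order and return the recorded call log."""
    log = []
    driver_disconnect_link(port_id, log)
    acpi_reset_mac(core_id, log)
    serdes_reset_link_params(core_id, log)
    driver_rebuild_link(port_id, log)
    return log
```

Recording the calls in a list makes the ordering easy to inspect; in a real system each stand-in would be replaced by the corresponding component.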
  • the memory can include public registers and private registers, in either of the following arrangements:
  • the PCIe driver and the ACPI can be stored in the public registers and the SerDes firmware in the private registers, which keeps the link reset method in the SerDes firmware private and effectively protects the implementation logic of the SerDes link parameter reset; or
  • the PCIe driver can be stored in the public registers and the ACPI and the SerDes firmware in the private registers, which keeps both the port reset method in the ACPI and the link reset method in the SerDes firmware private and effectively protects the overall logic of fault handling.
  • the processor can also perform the following operation by calling the computer program stored in the memory: if the fault type corresponding to the abnormal PCIe port is an unrecoverable fault, disable the abnormal PCIe port and the communication link of the abnormal PCIe port.
  • recoverable faults may include one or more of the following: a data link layer packet transmission timeout error, a transaction layer packet write configuration space too many retries error, a two-bit data error, and an advanced extensible interface (AXI) bus response error.
  • the abnormal interrupt information corresponding to the abnormal PCIe port may include one or more of the following: the identifier of the abnormal PCIe port, the fault type corresponding to the abnormal PCIe port, the identifier of the PCIe core where the abnormal PCIe port is located, and the identifier of the central processing unit (CPU) connected to the PCIe core.
  • the fault handling device may further include a communication interface, and the processor specifically performs the following operations by calling the computer program stored in the memory: receive the abnormal interrupt information corresponding to the abnormal PCIe port through the communication interface, add it to the preset work queue, and obtain the abnormal interrupt information corresponding to the abnormal PCIe port from the preset work queue.
  • the preset work queue is used to store abnormal interrupt information corresponding to abnormal PCIe ports in each PCIe core.
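The preset work queue can be pictured as a simple first-in, first-out buffer shared by all PCIe cores. The sketch below is a minimal illustration under that assumption; the class name, its methods, and the dictionary keys are hypothetical, and the text does not specify the queue's ordering or implementation.

```python
from collections import deque


class AbnormalInterruptQueue:
    """FIFO work queue holding abnormal interrupt information (hypothetical helper)."""

    def __init__(self):
        self._items = deque()

    def report(self, info):
        # Called when abnormal interrupt information arrives from any PCIe core.
        self._items.append(info)

    def process_all(self, handler):
        # Take entries in arrival order and pass each one to the handler.
        results = []
        while self._items:
            results.append(handler(self._items.popleft()))
        return results
```

Keeping one queue for every core lets a single handler loop deal with port exceptions from all cores in the order they were reported.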
  • the present application provides a fault handling apparatus, the apparatus including a module, a unit or a circuit for performing any one of the possible design methods of any of the above aspects.
  • modules, units or circuits can be implemented by hardware, or by executing corresponding software by hardware.
  • the present application provides a chip, which may include a processor and a communication interface, where the processor is configured to read an instruction through the communication interface, so as to execute the fault handling method according to any one of the above first aspects.
  • the present application provides a fault handling system, including a central processing unit and a peripheral component interconnect express (PCIe) core. The PCIe core includes a root complex and at least one PCIe node, the central processing unit is connected to the root complex, the root complex includes at least one PCIe port, and the root complex is connected to the at least one PCIe node through the at least one PCIe port.
  • the root complex can be used to generate abnormal interrupt information corresponding to an abnormal PCIe port among the at least one PCIe port and report it to the central processing unit, and the central processing unit can be used to handle the fault of the abnormal PCIe port according to the fault handling method of any one of the above first aspects.
  • the present application provides a computer-readable storage medium storing program code which, when run on a computer, causes the computer to perform the fault handling method according to any one of the above first aspects.
  • the present application provides a computer program product, including computer program code, which, when the computer program code is run on a computer, causes the computer to execute the fault handling method according to any one of the above-mentioned first aspect.
  • FIG. 1 exemplarily shows a schematic diagram of a system architecture to which an embodiment of the present application is applicable
  • FIG. 2 exemplarily shows a schematic flowchart of a fault handling method provided by an embodiment of the present application
  • FIG. 3 exemplarily shows a schematic flowchart corresponding to a reset method provided by an embodiment of the present application
  • FIG. 4 exemplarily shows a schematic diagram of a software and hardware architecture of a fault processing logic provided by an embodiment of the present application
  • FIG. 5 exemplarily shows a schematic flowchart of another fault processing method provided by an embodiment of the present application
  • FIG. 6 exemplarily shows a schematic structural diagram of a fault processing apparatus provided by an embodiment of the present application
  • FIG. 7 exemplarily shows a schematic structural diagram of another fault processing apparatus provided by an embodiment of the present application.
  • the fault handling method disclosed in this application can be applied to an electronic device that communicates based on a PCIe system.
  • the fault handling device may be an electronic device or an independent unit.
  • when the fault handling device is an independent unit, the unit can be embedded in the electronic device and can perform fault handling on the PCIe port of the electronic device, so as to maintain the high reliability of the PCIe system.
  • the fault processing apparatus may also be a unit packaged inside the electronic device, and is used to implement the fault processing function of the PCIe port of the electronic device.
  • the electronic device may be a server, memory, test instrument, or a portable electronic device containing functions such as personal digital assistants and/or music players, such as mobile phones, tablet computers, wearable devices with wireless communication capabilities (such as smart watches), or in-vehicle equipment, etc.
  • portable electronic devices include, but are not limited to, portable electronic devices carrying any of various operating systems, such as laptops or desktop computers with touch-sensitive surfaces (e.g., touch panels).
  • FIG. 1 exemplarily shows a schematic diagram of a system architecture to which the embodiments of the present application are applicable.
  • the system architecture includes at least one central processing unit (CPU) and at least one PCIe core, and the central processing units and the PCIe cores may correspond one-to-one. As shown in FIG. 1, CPU 1 corresponds to PCIe core 1 and CPU 2 corresponds to PCIe core 2.
  • PCIe cores are also known as PCIe systems.
  • Each PCIe core may include one root complex and at least one end node, and may also include at least one switch node and at least one bridge node.
  • the root complex is used to initialize the system and configure the communication links between the nodes when the PCIe core is constructed, so as to connect the CPU corresponding to the PCIe core to one or more of the switch nodes, end nodes, and bridge nodes in the PCIe core.
  • the central processing unit corresponding to the PCIe core can communicate with each node in the PCIe core by connecting to the root complex in the PCIe core.
  • the switch node connects the upstream root complex to one or more of the downstream end nodes, switch nodes, or bridge nodes. It is used to route the data of the upstream root complex to one or more downstream nodes, to route the data of each downstream node to the single upstream root complex, or to route the data of one downstream node flexibly to another downstream node in a point-to-point manner.
  • the bridge node is used to provide a communication connection between the PCIe core and another PCIe core, or a system adopting another bus standard such as PCI, through a non-transparent bridge (NTB) set between the different bus systems.
  • the end node is usually located in a terminal application (APP), and is responsible for connecting the terminal APP with other nodes in the PCIe core and completing PCIe-based transaction transmission.
  • the following takes PCIe core 1 shown in FIG. 1 as an example to further describe how the nodes in each PCIe core are connected:
  • the PCIe core 1 includes a root complex 1, a switch node 1, a bridge node 1, and four end nodes, namely end node 1, end node 2, end node 3, and end node 4.
  • switch node 1, end node 1 and bridge node 1 belong to the downstream nodes of root complex 1 (root complex 1 belongs to the upstream nodes of switch node 1, end node 1 and bridge node 1), while end node 2, end node 3 and end node 4 belong to the downstream node of switch node 1 (switch node 1 belongs to the upstream node of end node 2, end node 3 and end node 4).
  • Upstream nodes and corresponding downstream nodes can be connected by a PCI bus (see the thick black line shown in Figure 1).
  • the root complex 1 may also contain one or more PCIe ports, such as root port (RP) 1, RP2, and RP3.
  • the root complex 1 can connect to the downstream switch node 1, end node 1, and bridge node 1 through RP1, RP2, and RP3 respectively. In this way, the root complex 1 can route data to and from switch node 1 and the end nodes downstream of switch node 1 through RP1, to and from end node 1 through RP2, and to and from a node in PCIe core 2 (e.g., bridge node 2) through RP3.
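The topology of PCIe core 1 described above can be written down as two small lookup tables, which makes it easy to see which nodes sit behind each root port (and would therefore be affected if that port were reset or disabled). This is a sketch for illustration only; the table names and the `nodes_behind_port` helper are hypothetical, and the node names simply follow FIG. 1 as described in the text.

```python
# Root ports of root complex 1 and the node mounted directly on each port.
ROOT_PORTS = {
    "RP1": "switch node 1",
    "RP2": "end node 1",
    "RP3": "bridge node 1",
}

# Direct downstream neighbours of each non-leaf node inside PCIe core 1.
DOWNSTREAM = {
    "switch node 1": ["end node 2", "end node 3", "end node 4"],
}


def nodes_behind_port(port):
    """Return every node in PCIe core 1 reachable through the given root port."""
    found = []
    pending = [ROOT_PORTS[port]]
    while pending:
        node = pending.pop(0)          # breadth-first walk of the downstream tree
        found.append(node)
        pending.extend(DOWNSTREAM.get(node, []))
    return found
```

For example, `nodes_behind_port("RP1")` lists switch node 1 followed by end nodes 2, 3, and 4, while `nodes_behind_port("RP2")` lists only end node 1.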
  • the PCIe core may also include more or fewer nodes than those shown in FIG. 1, such as a greater number of switch nodes, end nodes, or bridge nodes, or may include types of nodes other than root complexes, switch nodes, end nodes, and bridge nodes.
  • the root complex and the central processing unit can be connected one-to-one in the manner shown in Figure 1, or can be connected in a one-to-many or many-to-one manner.
  • a root complex can also be connected to at least two central processing units, respectively.
  • one central processing unit can also be connected to at least two root complexes, etc., respectively.
  • the central processing unit and the PCIe core can be deployed in the same physical entity, or they can be deployed in different physical entities, or the central processing unit and some of the nodes of the PCIe core can be deployed in one physical entity while the other nodes of the PCIe core are deployed in another physical entity, which is not specifically limited.
  • the root complex realizes the communication connection with other nodes in the PCIe core through each PCIe port set internally.
  • the services between the root complex and the terminal APP are actually distributed across, and processed on, the communication links corresponding to the PCIe ports. Therefore, keeping the connection between each PCIe port and each node normal is crucial to maintaining the service processing capability of the entire PCIe core.
  • when a PCIe port has a connection problem, the usual practice is to restart the entire PCIe core.
  • this method makes all PCIe ports in the PCIe core unavailable, which not only cannot restore the services of the PCIe port with the connection problem but also affects the services of the other PCIe ports in the PCIe core, reducing the reliability of the entire PCIe core and even of the entire system.
  • alternatively, the PCIe core may not be restarted, and only the PCIe port that has a connection problem with the downstream end node is disabled.
  • this method makes the disabled PCIe port and all downstream nodes mounted on that PCIe port unavailable. Although the services of other PCIe ports are not affected, the services of the disabled PCIe port cannot be recovered for a long time.
  • the above two methods can only deal with a connection failure when the downstream node is an end node, and cannot deal with a connection failure when the downstream node is another type of node (such as a switch node or a bridge node), so the generality of fault handling is poor.
  • the present application provides a fault handling method for quickly recovering the services of an abnormal PCIe port without affecting the services of other PCIe ports, and further for handling connection failures of more types of downstream nodes.
  • FIG. 2 exemplarily shows a schematic flowchart of a fault processing method provided by an embodiment of the present application, and the method is applicable to the central processing unit in FIG. 1 , such as the central processing unit 1 or the central processing unit 2 shown in FIG. 1 .
  • the method includes:
  • Step 201 the central processing unit acquires abnormal interrupt information corresponding to the abnormal PCIe port.
  • the root complex can detect, in real time or periodically, the service processing status between each internal PCIe port and the downstream nodes mounted on each PCIe port.
  • when the root complex detects an abnormal PCIe port, it can first interrupt the current service of the abnormal PCIe port, then generate the abnormal interrupt information corresponding to the abnormal PCIe port from the information related to the current fault, and finally report it to the connected central processing unit.
  • the abnormal interrupt information corresponding to the abnormal PCIe port may include the fault type, the identifier of the abnormal PCIe port, and the identifier of the PCIe core where the abnormal PCIe port is located, and may also include one or more of the type of the current service, the progress of service processing, or contact information.
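The contents of the abnormal interrupt information listed above map naturally onto a small record type with three core fields and a few optional ones. The sketch below is only one possible layout: the class name, field names, field types, and the `build_interrupt_info` helper are assumptions made for illustration, not a format given in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AbnormalInterruptInfo:
    # Fields the text lists as the main contents of the record.
    fault_type: str
    port_id: str
    core_id: int
    # Fields the text lists as optional additions.
    service_type: Optional[str] = None
    service_progress: Optional[str] = None
    contact: Optional[str] = None


def build_interrupt_info(fault_type, port_id, core_id, **optional):
    """Assemble the record that the root complex would report to the CPU."""
    return AbnormalInterruptInfo(fault_type, port_id, core_id, **optional)
```

A record for a two-bit data error on RP1 of core 1 could then be built with `build_interrupt_info("two-bit data error", "RP1", 1)`.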
  • each interface in the PCIe core is a PCIe interface, whereas the interface between the central processing unit and the root complex is not PCIe and does not belong to the PCIe core.
  • the root complex can transmit the abnormal interrupt information to the central processing unit through a non-PCIe bus, for example by using the file transfer protocol (FTP).
  • the fault type of the abnormal PCIe port may include one or more of the following:
  • ECC: error correcting code
  • CE: correctable error
  • UE: uncorrectable error
  • data with only one bit error can be corrected into correct data, and data with two bit errors can be detected but cannot be corrected.
  • when the root complex detects that the service processing of a PCIe port is abnormal, it can first call the ECC to preprocess the error. If only 1 bit of data is wrong, the ECC locates the wrong bit by itself and directly corrects it to the correct data, and the current service of the PCIe port then continues. In this case, the PCIe port is still a normal PCIe port.
  • if 2 bits of data are wrong, the root complex can only detect which two bits are wrong but cannot correct them by itself.
  • the PCIe port in this case is an abnormal PCIe port. The root complex can first suspend the current service of the PCIe port, then assemble the corresponding abnormal interrupt information based on the position of the 2-bit error located by the ECC, and report it to the central processing unit.
  • the root complex may label the fault type in the generated abnormal interrupt information as "two-bit data error".
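The difference between a correctable one-bit error and an uncorrectable two-bit error can be illustrated with a textbook extended Hamming (8,4) code, a common single-error-correcting, double-error-detecting scheme. This is a generic sketch, not the ECC actually used by the root complex, which the text does not specify; the function names are hypothetical. Note that this particular code only reports that two bits are wrong and does not identify their positions.

```python
def encode(data):
    """Encode 4 data bits into an 8-bit extended Hamming codeword."""
    code = [0] * 8
    code[3], code[5], code[6], code[7] = data
    # Parity bits at positions 1, 2 and 4 of the classic Hamming(7,4) layout.
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    # Position 0 holds an overall parity bit covering positions 1..7.
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (status, data); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position
    overall_parity = sum(code) % 2
    if overall_parity == 1:
        # Odd number of flipped bits: treat it as one error at `syndrome`
        # (position 0 when the overall parity bit itself flipped) and fix it.
        code[syndrome] ^= 1
        status = "corrected"
    elif syndrome != 0:
        # Overall parity is even but the syndrome is non-zero: two bits flipped.
        return "uncorrectable", None
    else:
        status = "ok"
    return status, [code[3], code[5], code[6], code[7]]
```

In terms of the text, a `"corrected"` result corresponds to a correctable error after which the port stays normal, while an `"uncorrectable"` result corresponds to the case that is reported as a "two-bit data error".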
  • the root complex can be composed of an application layer, a transaction layer (TL), a data link layer (DLL), and a physical layer.
  • when a transaction is processed between the central processing unit and the terminal APP, the application layer first initiates a transaction transmission request to the transaction layer, and the transaction layer generates the corresponding transaction layer packet (TLP) and sends it to the data link layer.
  • the data link layer adds a sequence number and a link cyclic redundancy check (LCRC) code to the TLP to generate the corresponding data link layer packet (DLLP) and sends it to the physical layer, and the transaction is transmitted on the PCIe link corresponding to the PCIe port in the physical layer.
  • after the data link layer sends the DLLP, it waits for the physical layer to return response information indicating successful transmission. Only when the response information is received within a preset time period does the data link layer confirm that the transaction was received correctly. Based on this transaction processing flow, if the data link layer in the root complex has not received the response information of successful transmission from the physical layer within the preset time period after sending the DLLP, the root complex can confirm that there is a problem in the transmission between the data link layer and the physical layer. The problem may be an uncorrectable error caused by network delay. In this case, the PCIe port is an abnormal PCIe port. The root complex may generate the abnormal interrupt information corresponding to the PCIe port and may mark the fault type as "data link layer packet transmission timeout error".
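The timeout rule described above reduces to a comparison between the time the packet was sent, the time a response arrived (if any), and the preset period. A minimal sketch follows, assuming timestamps in seconds; the function name and its parameters are hypothetical.

```python
FAULT_DLLP_TIMEOUT = "data link layer packet transmission timeout error"


def check_response(sent_at, response_at, timeout):
    """Return None if the response arrived in time, otherwise the fault type.

    Times are in seconds; response_at is None when no response arrived at all.
    """
    if response_at is not None and response_at - sent_at <= timeout:
        return None
    return FAULT_DLLP_TIMEOUT
```

A late response and a missing response are treated the same way here, since in both cases no response was received within the preset period.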
  • a node in a PCIe core can support up to 8 functions, such as audio, video, and more.
  • each function of the node has its own configuration space, and the relevant information of the function is stored in the configuration space.
  • the configuration space may be an independent storage unit in the node, for example, the size of the configuration space may be 256k.
  • Other nodes except the root complex can only see the relevant information of their own configuration space, and the root complex has the permission to read and write the configuration space of each node.
  • the root complex can read the information in the configuration space of any node through a transaction layer packet to determine the functions supported by the node, or can write the configuration space of any node through a transaction layer packet to complete the initialization and function configuration of the node. However, if the node being written to is not ready to respond to the root complex's request to write the configuration space, the node returns to the root complex a transaction layer response packet whose status is "configuration retry status (CRS)". This indicates that the root complex failed to write to the node's configuration space. When the number of write failures does not exceed a preset number of times, the write failure is a correctable error, and the root complex can continue to retry the write.
  • when the number of write failures exceeds the preset number of times, the write failure is an uncorrectable error, and the PCIe port in this case is an abnormal PCIe port.
  • the root complex can generate the corresponding abnormal interrupt information for the PCIe port, and can mark the fault type as "transaction layer packet write configuration space too many retries error".
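The retry rule for configuration space writes can be sketched as a loop that counts failed attempts and gives up once the count exceeds the preset number. The fault name follows the list of recoverable faults given earlier in the document; `write_once` is a hypothetical callable standing in for one write attempt, assumed to return `"OK"` or `"CRS"`.

```python
FAULT_TOO_MANY_RETRIES = (
    "transaction layer packet write configuration space too many retries error"
)


def write_config_space(write_once, max_failures):
    """Retry a configuration write until it succeeds or fails too often.

    Returns (fault_type_or_None, number_of_failed_attempts).
    """
    failures = 0
    while True:
        if write_once() == "OK":
            return None, failures          # written successfully
        failures += 1                      # node answered with CRS
        if failures > max_failures:
            # More failures than the preset number: no longer correctable.
            return FAULT_TOO_MANY_RETRIES, failures
```

With `max_failures` set to 3, a node that answers CRS twice and then accepts the write produces no fault, while a node that keeps answering CRS is reported after the fourth failure.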
  • the root complex in the PCIe core is connected to the corresponding central processing unit through an advanced extensible interface (AXI) bus, which means that the root complex and the central processing unit need to send and receive data in the manner specified by the AXI bus protocol.
  • the root complex can determine that there is a problem with the PCIe port.
  • the problem may be an uncorrectable error caused by an abnormality of the PCIe port itself, or an uncorrectable error caused by an abnormality of the transmission link between the PCIe port and the downstream node. In this case, the PCIe port is an abnormal PCIe port.
  • the root complex can generate corresponding abnormal interrupt information for the PCIe port, and can mark the fault type as "AXI bus response error". It should be understood that the above content only takes the root complex detection of AXI bus response errors as an example to introduce this type of error. In other examples, AXI bus response errors can also be detected by the central processing unit without being reported by the root complex.
  • the fault type of an abnormal PCIe port may also be any other port-level uncorrectable error.
  • the fault of the abnormal PCIe port may be caused by the hardware fault of the PCIe port and the downstream node, or may be caused by the software fault of the PCIe port and the downstream node, which is not specifically limited in this application.
  • Step 202 the central processing unit determines the fault type according to the abnormal interrupt information corresponding to the abnormal PCIe port: if the fault type is an unrecoverable fault, step 203 is performed; if the fault type is a recoverable fault, step 204 is performed.
  • an unrecoverable fault refers to a fault that cannot be repaired directly or indirectly by the central processing unit and must be solved only by special personnel debugging, such as hardware damage or recording medium defect.
  • Recoverable faults refer to faults that can be repaired directly or indirectly by the central processing unit, such as faults that can be repaired by reset, upgrade, update, download patch or restart.
  • Step 203 the CPU disables the abnormal PCIe port and the corresponding communication link.
  • the central processing unit may further store a recoverable fault record table, and the recoverable fault record table records the recoverable fault types that the central processing unit can recover directly or indirectly.
  • the recoverable faults in the recoverable fault record table may include, for example, two-bit data errors indicated above, data link layer packet transmission timeout errors, too many retries for writing configuration space, AXI bus response errors, or One or more of the other recoverable port-level errors.
  • These recoverable fault types can be preset in the recoverable fault record table by R&D personnel based on their experience, learned by the central processing unit in the process of executing services and stored in the recoverable fault record table in real time, or obtained by the central processing unit through interaction with other central processing units or network devices, which is not specifically limited.
  • the central processing unit may first obtain the fault type of the abnormal PCIe port from the abnormal interrupt information, and then match it against the recoverable fault types in the recoverable fault record table. If none of the recoverable fault types in the table matches the fault type of the abnormal PCIe port, the central processing unit may locate the fault of the abnormal PCIe port as an unrecoverable fault. Since unrecoverable faults cannot be recovered by computer programs or operating techniques, nor corrected by error checking codes or other techniques, once the central processing unit detects an unrecoverable fault on a PCIe port, it can directly disable the PCIe port, so as to save the resources of the PCIe core and prevent the unrecoverable fault of the abnormal PCIe port from spreading to the entire PCIe core and causing service failure of the entire PCIe core.
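  • As a minimal sketch of the matching in step 202, the lookup against the recoverable fault record table could look as follows. The table contents, the `fault_type` key and the function name are assumptions made for illustration; the application does not specify a data format.

```python
from enum import Enum


class Action(Enum):
    DISABLE = "disable port and link"  # step 203
    RESET = "reset port and link"      # step 204


# Hypothetical recoverable fault record table: fault type -> reset method.
RECOVERABLE_FAULTS = {
    "two-bit data error": "reset",
    "data link layer packet transmission timeout": "reset",
    "too many retries writing configuration space": "reset",
    "AXI bus response error": "reset",
}


def classify(interrupt_info: dict) -> Action:
    """Match the reported fault type against the record table (step 202)."""
    fault_type = interrupt_info["fault_type"]
    if fault_type in RECOVERABLE_FAULTS:
        return Action.RESET
    # No entry matches, so the fault is treated as unrecoverable.
    return Action.DISABLE
```

  • A fault type that is absent from the table falls through to the disable branch, which mirrors the rule that any unmatched fault is located as unrecoverable.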
  • the central processing unit may disable the abnormal PCIe port and the corresponding PCIe link in various ways, for example:
  • the central processing unit can call up the configuration interface of the basic input output system (BIOS) through hot keys or instructions, select the abnormal PCIe port on the configuration interface and issue a disable command; the disable command causes the root complex, according to the instruction or by writing the configuration space, to disable the port function of the abnormal PCIe port and the node functions of all nodes mounted on the abnormal PCIe port;
  • the central processing unit can call up the onboard setting interface of the RAM chip (such as a complementary metal oxide semiconductor (CMOS) chip) on the motherboard, and set the slot of the abnormal PCIe port to "Disabled" in the onboard setting interface, so that the root complex invalidates the abnormal PCIe port, thereby unloading all nodes mounted on the abnormal PCIe port.
  • the central processing unit can also generate corresponding alarm information and push the alarm information to the user according to the contact information in the abnormal interrupt information, so that the user can learn of and repair the faulty PCIe port in time and restore the services of the PCIe port as soon as possible.
  • Step 204 the CPU resets the abnormal PCIe port and the corresponding communication link.
  • the central processing unit may also store the reset method of each recoverable fault recorded in the recoverable fault record table, and the reset method of any recoverable fault may be reset, upgrade, update, download One or more of a patch or reboot.
  • if a target recoverable fault type in the recoverable fault record table matches the fault type of the abnormal PCIe port, the central processing unit may locate the fault of the abnormal PCIe port as a recoverable fault, and may use the reset method corresponding to the target recoverable fault type to reset the abnormal PCIe port and the communication link corresponding to the abnormal PCIe port. In this way, the central processing unit can not only restore the services of the abnormal PCIe port as soon as possible by resetting, but also maintain the reliability of the PCIe core without affecting the services of other PCIe ports.
  • the reset can be accomplished by a direct method, in which the permission to write the configuration space of the ports and nodes in the PCIe core is granted to the central processing unit in advance, so that the central processing unit can write the configuration space directly.
  • the reset can also be accomplished by an indirect method in which the central processing unit sends a corresponding instruction to the root complex to drive the root complex to write the configuration space, which is not specifically limited.
  • FIG. 3 exemplarily shows a schematic flowchart corresponding to a reset method provided by an embodiment of the present application, and the method is applicable to a central processing unit, such as the central processing unit 1 or the central processing unit 2 shown in FIG. 1 .
  • the CPU indirectly writes the configuration space of ports and nodes in the PCIe core.
  • the method includes:
  • Step 301 the CPU deactivates the abnormal PCIe port.
  • the central processing unit may send a deactivation instruction for the abnormal PCIe port to the root complex, so as to instruct the root complex not to continue the current service of the abnormal PCIe port.
  • the root complex can also record the service processing progress of the abnormal PCIe port before the deactivation after deactivating the abnormal PCIe port, so as to continue to execute the service after recovery.
  • Step 302 the central processing unit disconnects the communication link between the abnormal PCIe port and the node mounted on the abnormal PCIe port.
  • the central processing unit may send a removal instruction for the abnormal PCIe port to the root complex, so as to drive the root complex to write the configuration space of each node mounted on the abnormal PCIe port, remove the nodes mounted on the abnormal PCIe port from the current communication link, and disconnect the connection relationship between the abnormal PCIe port and each node mounted on the abnormal PCIe port.
  • Step 303 the CPU fails the abnormal PCIe port.
  • the central processing unit may send an invalidation instruction for the abnormal PCIe port to the root complex, so as to drive the root complex to write the configuration space of the abnormal PCIe port and switch the abnormal PCIe port from the enabled (enable) state to the disabled (disable) state.
  • When the abnormal PCIe port is in the enabled state, the abnormal PCIe port has the ability to process the data sent by the upstream root complex or the data reported by the downstream node.
  • When the abnormal PCIe port is in the disabled state, the abnormal PCIe port does not have the ability to process the data sent by the upstream root complex or the data reported by the downstream node.
  • Step 304 the central processing unit resets the media access control (media access control, MAC) logic of the PCIe core.
  • the central processing unit may send a MAC reset command for the abnormal PCIe port to the root complex, so as to drive the root complex to initialize the MAC logic of the entire PCIe core and restore the MAC layer communication mechanism of the abnormal PCIe port.
  • in this way, the abnormality of the abnormal PCIe port can be recovered in a targeted manner, which improves the efficiency of the PCIe port reset and saves processing resources of the central processing unit and the PCIe core.
  • Step 305 the CPU resets the SerDes link parameter corresponding to the PCIe core.
  • each communication link in the PCIe core is managed by a serializer/deserializer (SerDes).
  • the SerDes is preset with default SerDes link parameters, and according to the default SerDes link parameters the SerDes converts parallel transmit data to serial transmit data, or converts serial receive data to parallel receive data.
  • the default SerDes link parameters may no longer be suitable for the PCIe core, resulting in an abnormality in the communication link where some PCIe ports in the PCIe core are located. In this case, the SerDes link parameters in the SerDes need to be adjusted to suit the current environment.
  • the central processing unit may send a link reset command for the abnormal PCIe port to the root complex, so as to drive the root complex to adaptively calibrate SerDes link parameters corresponding to the abnormal PCIe port.
  • the calibrated SerDes link parameters may be obtained by obtaining the current environmental parameters (such as temperature) and substituting them into the preset formula for calculation, or may be obtained by performing closed-loop feedback adjustment according to the adjusted execution effect until it converges. It may also be randomly selected, which is not specifically limited.
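  • The closed-loop option mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the proportional update rule, the `gain` and `tolerance` values, and the `measure_error` callback are all hypothetical, since the application does not say how the execution effect is measured or how the parameter is adjusted.

```python
def calibrate(measure_error, param, gain=0.5, tolerance=1e-3, max_rounds=100):
    """Adjust one link parameter by closed-loop feedback until it converges.

    measure_error(param) is a hypothetical callback that returns the signed
    deviation observed with that parameter (0 means ideal).
    """
    for _ in range(max_rounds):
        error = measure_error(param)
        if abs(error) < tolerance:
            return param  # converged: the observed effect is good enough
        param -= gain * error  # move against the observed deviation
    raise RuntimeError("calibration did not converge")
```

  • For example, `calibrate(lambda p: p - 3.0, 0.0)` drives the parameter toward 3.0, halving the remaining deviation in each round.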
  • Step 306 the central processing unit rebuilds the communication link between the abnormal PCIe port and the node mounted on the abnormal PCIe port.
  • the central processing unit may send a link rebuild instruction for the entire PCIe core to the root complex, so that the root complex rebuilds the entire topology of the PCIe core, or it may send the root complex a link rebuild instruction only for the abnormal PCIe port, so that the root complex directly detects the abnormal PCIe port and all nodes mounted on the abnormal PCIe port and then adds them to the existing topology of the PCIe core.
  • the root complex 1 can sequentially traverse the bus paths where the PCIe ports RP1, RP2, and RP3 set internally are located:
  • Root complex 1 traverses the downstream bus of RP1, finds switch node 1, and assigns a bus address to switch node 1 (which may include the bus number, device number, and function number (bus, device and function number, BDF), etc.); following the depth-first rule, root complex 1 continues to traverse the nodes connected to the downstream bus of switch node 1, finds end node 2, and assigns a bus address to end node 2; since end node 2 does not have a downstream bus, root complex 1 continues to traverse the nodes connected to the downstream bus of switch node 1, finds end node 3, and assigns a bus address to end node 3; since end node 3 does not have a downstream bus, root complex 1 continues to traverse the nodes connected to the downstream bus of switch node 1, finds end node 4, and assigns a bus address to end node 4; so far, the traversal of the bus link where RP1 is located is completed;
  • Root complex 1 traverses the downstream bus of RP2, finds end node 1, and assigns a bus address to end node 1; so far, the traversal of the bus link where RP2 is located is completed;
  • Root complex 1 traverses the downstream bus of RP3, finds bridge node 1, and assigns a bus address to bridge node 1; root complex 1 records the bridge address of bridge node 1 at the same time, so as to establish an association with the topology obtained by PCIe core 2 traversal; So far, the traversal of the bus link where RP3 is located is completed.
  • the root complex 1 can allocate bus addresses to all nodes in the PCIe core 1, and can construct the topology of the entire PCIe core 1.
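  • The depth-first traversal described above can be sketched as follows. The topology dictionary reproduces the example of PCIe core 1; the sequential integer used as an address is an illustrative stand-in and not a real BDF encoding, and the function name is hypothetical.

```python
# Topology of PCIe core 1 as described above: node -> downstream nodes.
TOPOLOGY = {
    "RP1": ["switch node 1"],
    "switch node 1": ["end node 2", "end node 3", "end node 4"],
    "RP2": ["end node 1"],
    "RP3": ["bridge node 1"],
}


def enumerate_core(root_ports, topology):
    """Depth-first traversal that assigns an address to each node as found."""
    addresses = {}
    next_address = 0  # illustrative counter, not a real BDF value

    def visit(node):
        nonlocal next_address
        for child in topology.get(node, []):
            addresses[child] = next_address
            next_address += 1
            visit(child)  # go deeper before moving to the next sibling

    for port in root_ports:
        visit(port)
    return addresses
```

  • Calling `enumerate_core(["RP1", "RP2", "RP3"], TOPOLOGY)` visits switch node 1 first, then end nodes 2, 3 and 4, then end node 1, and finally bridge node 1, which matches the order in the description.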
  • Step 307 the CPU activates the abnormal PCIe port.
  • the central processing unit may send a validation instruction for the abnormal PCIe port to the root complex, so as to drive the root complex to write the configuration space of the abnormal PCIe port and switch the abnormal PCIe port from the disabled state to the enabled state, thereby restoring the abnormal PCIe port's ability to process the data sent by the upstream root complex or the data reported by the downstream node.
  • Step 308 the CPU enables the abnormal PCIe port.
  • the central processing unit may send an enable instruction for the abnormal PCIe port to the root complex, so as to instruct the root complex to enable the service processing of the abnormal PCIe port.
  • the root complex can re-execute the current service of the abnormal PCIe port, and can also continue to execute the current service of the abnormal PCIe port from the service processing progress recorded when it was deactivated, so as to save the processing resources of the PCIe core and improve the service at the same time.
  • the current service may not be executed, but the new service may be directly executed after the subsequent new service arrives, which is not specifically limited.
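  • Steps 301 to 308 form a fixed sequence of instructions sent to the root complex, which can be sketched as follows. The instruction names, the `RootComplexStub` class and the stop-on-first-failure behavior are assumptions for illustration; the application does not define an instruction format or an error path.

```python
# Steps 301-308 as (step number, instruction name); names are illustrative.
RESET_SEQUENCE = [
    (301, "deactivate_port"),
    (302, "remove_nodes"),
    (303, "invalidate_port"),
    (304, "reset_mac"),
    (305, "reset_serdes_link"),
    (306, "rebuild_link"),
    (307, "validate_port"),
    (308, "enable_port"),
]


class RootComplexStub:
    """Stand-in for the root complex: it only logs what it receives."""

    def __init__(self):
        self.log = []

    def handle(self, instruction, port):
        self.log.append((instruction, port))
        return True  # a real root complex could report failure here


def reset_port(root_complex, port):
    """Send the instructions in order; stop at the first step that fails."""
    for step, instruction in RESET_SEQUENCE:
        if not root_complex.handle(instruction, port):
            return step  # number of the step that failed
    return None  # all steps completed
```

  • Keeping the sequence in a table makes the ordering explicit: the port is taken out of service and disabled before the MAC logic and SerDes link parameters are reset, and it is enabled again only after the link has been rebuilt.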
  • this solution can also handle any recoverable fault of any PCIe node (such as an end node, switch node or bridge node) that appears in the PCIe core, not just faults of end nodes or faults of a certain type, so this method can also effectively detect and repair various faults in the PCIe core, further improving the reliability of the entire PCIe core.
  • the above-mentioned first embodiment actually takes the central processing unit as the execution body as an example to introduce the specific implementation process of the fault handling, which is only an optional implementation manner.
  • the fault handling scheme can also be executed directly by the root complex.
  • when the root complex detects that an internal PCIe port is abnormal, the PCIe port and the corresponding communication link can be directly reset without reporting to the central processing unit for processing.
  • although this implementation increases the working pressure of the root complex, it can save the communication overhead between the PCIe core and the central processing unit, and can handle faults faster.
  • the central processing unit may run in a Linux operating system.
  • the virtual memory space in the Linux operating system is divided into kernel space (kernel space) and user space (user space). Kernel space is the running space of the kernel in the Linux operating system, while user space is the running space of user programs. Kernel space and user space are isolated from each other, and even if the user program crashes, the kernel space will not be affected.
  • FIG. 4 exemplarily shows a schematic diagram of the software and hardware architecture of a fault processing logic provided by an embodiment of the present application.
  • the fault processing solution can be implemented in hardware through interaction between a central processing unit and a root complex provided in the chip hardware. The chip hardware can also be connected to other peripherals, such as memory, input and output devices, or drive devices.
  • the overall software logic for fault handling is encapsulated in the central processing unit and involves the RAS firmware, the PCIe driver, the advanced configuration and power management interface (ACPI) and the SerDes firmware.
  • the program of the RAS firmware or the SerDes firmware can be pre-written into the memory; the RAS firmware is mainly used for processing interrupts, and the SerDes firmware is mainly used for rebuilding the link.
  • the central processing unit may execute the method logic corresponding to the RAS firmware or the SerDes firmware by calling the RAS firmware or the SerDes firmware in the memory.
  • ACPI defines various working interfaces between the operating system, BIOS and system hardware. ACPI can be implemented in the BIOS or system hardware and can be invoked or triggered by the operating system.
  • the PCIe driver is located in the kernel space of the Linux system, and is used to manage the enabling or disabling of each port in the PCIe core and the connection relationship between each port and each node. PCIe drivers can be open sourced to the community.
  • the memory in this embodiment of the present application may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memory.
  • the non-volatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (electrically EPROM, EEPROM) or flash memory.
  • Volatile memory may be random access memory (RAM), which acts as an external cache.
  • Many forms of RAM may be used, such as dynamic random access memory (DRAM), synchronous DRAM (SDRAM), double data rate synchronous dynamic random access memory (double data rate SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchronous link dynamic random access memory (SLDRAM), and direct rambus RAM.
  • FIG. 4 is only an exemplary introduction, and in other examples, the central processing unit and the root complex may also be provided in different chip hardware.
  • the memory and the central processing unit may be located in the same physical entity (eg, chip hardware), or may be located in different physical entities.
  • FIG. 5 exemplarily shows a schematic flowchart of another fault handling method provided by an embodiment of the present application.
  • the method is applicable to chip hardware, RAS firmware, ACPI, PCIe driver, and SerDes firmware.
  • the chip hardware is provided with a central processing unit and an abnormal PCIe port, the PCIe driver is located in the kernel space, and the kernel space interacts with the user space.
  • the method includes:
  • Step 501 when an abnormal PCIe port fails, trigger a RAS interrupt.
  • when the fault of the abnormal PCIe port is a correctable fault, the fault can be repaired by the root complex before a RAS interrupt is triggered.
  • when the fault of the abnormal PCIe port is an uncorrectable fault, the root complex cannot repair the fault, so the fault triggers the root complex to interrupt the current service of the abnormal PCIe port and report relevant information about the current RAS interrupt to the CPU, such as the identifier of the abnormal PCIe port (core port ID, such as a number), the fault type (error type), and the identifier of the PCIe core where the abnormal PCIe port is located (core ID, such as a number).
  • Step 502 the RAS firmware calls ACPI to generate a corresponding ACPI interrupt event according to the RAS interrupt and reports it to the PCIe driver.
  • the central processing unit calls the RAS firmware to process the RAS interrupt, generates the corresponding ACPI interrupt event according to the relevant information of the RAS interrupt and the relevant information (such as the identifier) of the central processing unit, and reports it to the PCIe driver in the kernel space.
  • Step 503 the PCIe driver generates fault information corresponding to the abnormal PCIe port according to the ACPI interrupt event, and adds the fault information to a work queue (ie, a preset work queue).
  • the PCIe driver responds to the ACPI interrupt event related to the PCIe core: it first extracts the identifier of the abnormal PCIe port, the fault type and the identifier of the PCIe core where the abnormal PCIe port is located from the ACPI interrupt event, then assembles this information, together with the identifier of the chip on which the PCIe core is set (socket ID, such as a number), into fault information according to the set message structure, and adds the fault information to the work queue.
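  • A minimal sketch of such a message structure is shown below. The field names follow the identifiers mentioned in the description (socket ID, core ID, core port ID, error type), but the dictionary layout of the ACPI interrupt event and the function name are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultInfo:
    socket_id: int   # chip on which the PCIe core is set
    core_id: int     # PCIe core where the abnormal port is located
    port_id: int     # abnormal PCIe port
    error_type: str  # fault type reported with the interrupt


def build_fault_info(acpi_event: dict, socket_id: int) -> FaultInfo:
    """Assemble fault information from a (hypothetical) ACPI event dict."""
    return FaultInfo(
        socket_id=socket_id,
        core_id=acpi_event["core_id"],
        port_id=acpi_event["core_port_id"],
        error_type=acpi_event["error_type"],
    )
```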
  • the work queue is located in the kernel space and is used to store each fault information that occurs in the PCIe core.
  • the central processing unit may set a corresponding work queue for each PCIe core in the kernel space, in which case the PCIe driver in any central processing unit adds the fault information that occurs in a PCIe core connected to that central processing unit to the work queue corresponding to that PCIe core. Alternatively, the central processing unit may set one work queue for all PCIe cores in the kernel space, in which case the PCIe driver in each central processing unit adds the fault information that occurs in the PCIe cores connected to that central processing unit to the same work queue. The central processing unit may also set one work queue for some of the PCIe cores in the kernel space, another work queue for another part of the PCIe cores, and so on.
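  • The first two queue layouts can be sketched as follows. The `WorkQueues` class and the key `"all"` are hypothetical names; this is an ordinary user-space illustration and not kernel work queue code.

```python
from collections import defaultdict, deque


class WorkQueues:
    """Work queues keyed by PCIe core, or one shared queue for all cores."""

    def __init__(self, shared: bool = False):
        self.shared = shared
        self.queues = defaultdict(deque)

    def add(self, core_id: int, fault_info) -> None:
        # With a shared queue every core maps to the same key.
        key = "all" if self.shared else core_id
        self.queues[key].append(fault_info)
```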
  • Step 504 the PCIe driver takes out a piece of fault information from the work queue, and determines whether the fault type in the fault information is a recoverable fault, if not, executes step 505 , if yes, executes step 506 .
  • the PCIe driver can take the fault information out of the work queue in first-in, first-out order, so as to recover the PCIe ports in order from the earliest fault to the latest; it can take the fault information out of the work queue in first-in, last-out order, so as to recover the most recent PCIe port fault first and then the earlier PCIe port faults; or it can take the fault information out of the work queue in order of fault severity from heavier to lighter, which is not specifically limited.
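  • The three retrieval orders can be compared with a short sketch. The policy names and the numeric severity attached to each fault are assumptions; the application does not say how fault degree is quantified.

```python
import heapq
from collections import deque


def take_order(faults, policy):
    """Return faults in the order they would be taken from the queue.

    faults is a list of (name, severity) tuples in arrival order.
    """
    if policy == "fifo":
        queue = deque(faults)
        return [queue.popleft() for _ in faults]
    if policy == "lifo":
        stack = list(faults)
        return [stack.pop() for _ in faults]
    if policy == "severity":
        # Negate severity for a max-heap; arrival index breaks ties.
        heap = [(-sev, i, name) for i, (name, sev) in enumerate(faults)]
        heapq.heapify(heap)
        ordered = []
        while heap:
            neg_sev, _, name = heapq.heappop(heap)
            ordered.append((name, -neg_sev))
        return ordered
    raise ValueError(f"unknown policy: {policy}")
```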
  • when the central processing unit includes multiple processing units, such as multiple central processing unit cores, the PCIe driver may also call the multiple central processing unit cores to jointly process the pieces of fault information.
  • the PCIe driver can pre-allocate the pieces of fault information to the multiple CPU cores in a balanced manner, call the most idle CPU core to process a piece of fault information to be processed, or allocate a piece of fault information to be processed to the CPU core that is best at processing the current service, which is not limited.
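  • Two of these allocation strategies can be sketched as follows. Round-robin is used here as one possible form of balanced pre-allocation, and the load figures are hypothetical; both are assumptions rather than details given in the application.

```python
def assign_balanced(faults, core_count):
    """Pre-allocate faults to CPU cores round-robin."""
    plan = {core: [] for core in range(core_count)}
    for index, fault in enumerate(faults):
        plan[index % core_count].append(fault)
    return plan


def pick_most_idle(load_by_core):
    """Return the core with the lowest current load."""
    return min(load_by_core, key=load_by_core.get)
```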
  • Step 505 the PCIe driver pushes the fault information to the user space.
  • in step 505, when the PCIe driver cannot recover the fault of the abnormal PCIe port by itself, pushing the fault to the user notifies the user in time, which facilitates early manual maintenance and prevents the PCIe port from being in an unavailable state for a long time.
  • each time the PCIe driver adds a piece of fault information to the work queue it can also push a log message to the user space, so that the user can know the abnormal situation of the entire PCIe core and the current working pressure of the central processing unit in time.
  • Step 506 the PCIe driver in conjunction with the ACPI and SerDes firmware executes the overall logic of fault handling, and the execution process includes the following steps:
  • Step 5061 the PCIe driver calls the PCIe public interface to disable the abnormal PCIe port
  • Step 5062 the PCIe driver calls the PCIe public interface to remove all nodes mounted on the abnormal PCIe port;
  • the PCIe public interface may be located in a public register, and the methods in the public register may be open sourced to the community and visible to other users.
  • the implementation manner of the PCIe public interface may refer to the existing logic, which will not be repeated here.
  • Step 5063 the PCIe driver invokes the port reset method in the ACPI interface.
  • the port reset method is programmed according to the programming language specified by the ACPI interface, and added as an interface in the ACPI interface logic.
  • the PCIe driver can execute the corresponding port reset logic by calling the interface name corresponding to the port reset method.
  • the interface name of the port reset method can be user-defined, for example, set to RP reset.
  • Step 5064 the PCIe driver first resets the MAC logic of the PCIe core according to the port reset method
  • Step 5065 the PCIe driver resets the SerDes link parameters corresponding to the PCIe core by calling the SerDes firmware according to the port reset method;
  • the port reset method may be located in the common register of the chip, and the PCIe driver directly implements the invocation of the PCIe public interface and the port reset method in the common register of the chip through in-chip messages, thereby reducing message transmission channels between chips.
  • the port reset method can also be located in the private register of the chip. After the PCIe driver calls the PCIe public interface in the public register of the chip, it calls the port reset method in the private register of the chip through in-chip messages, so as to protect the privacy of the port reset method.
  • Step 5066 the PCIe driver resets the SerDes link parameters corresponding to the PCIe core according to the SerDes firmware
  • the SerDes firmware can be programmed in any programming language, such as C++, Python, and so on.
  • When the SerDes link is reset, the SerDes link set in the chip receives the reset command driven by the PCIe driver, calibrates the parameters of the SerDes link according to the current environment, and then adaptively calibrates the parameters again according to the execution effect of the calibrated parameters, so as to restore the SerDes link corresponding to the abnormal PCIe port in the PCIe core to the best possible state.
  • the SerDes firmware may be located in a private register of the chip. Since the PCIe driver implements the SerDes link reset by calling the SerDes firmware from within the port reset method, even if the port reset method is open sourced to the community, the SerDes link reset method provided by the SerDes firmware is not visible to the outside world, which helps ensure the security of the SerDes link reset method to the greatest extent.
  • Step 5067 the PCIe driver determines that the SerDes firmware invocation is completed, and returns to the port reset method
  • Step 5068 the PCIe driver determines that the port reset method call is completed, and returns to the PCIe public interface
  • Step 5069 the PCIe driver rebuilds the topology of the PCIe core by enumerating and traversing.
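  • The nesting of steps 5061 to 5069 can be sketched as three layers of calls, with the SerDes firmware reachable only through the port reset method. The function names and the trace strings are illustrative assumptions; the real interfaces are ACPI methods and firmware routines that the application does not spell out.

```python
def serdes_firmware_reset(trace):
    """Private layer: only reachable through the port reset method."""
    trace.append("5066 reset SerDes link parameters")


def acpi_port_reset(trace):
    """Port reset method exposed through the ACPI interface."""
    trace.append("5064 reset MAC logic")
    trace.append("5065 call SerDes firmware")
    serdes_firmware_reset(trace)
    trace.append("5067 SerDes firmware returned")


def pcie_driver_recover(trace):
    """PCIe driver flow for a recoverable fault (step 506)."""
    trace.append("5061 disable port")
    trace.append("5062 remove mounted nodes")
    trace.append("5063 invoke port reset method")
    acpi_port_reset(trace)
    trace.append("5068 port reset method returned")
    trace.append("5069 rebuild topology")
```

  • Because the driver only ever calls `acpi_port_reset`, the body of `serdes_firmware_reset` can stay private even when the driver itself is published.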
  • the central processing unit can complete the fault detection, reset and service recovery of the PCIe port through the combination of software and hardware.
  • the recovery of the failure of one PCIe port will not affect the normal services of other PCIe ports.
  • even if the PCIe fault recovery driver is open-sourced to the community, it will not expose the SerDes firmware set in the chip's private register, or will expose neither the port reset method nor the SerDes firmware when both are set in the private register.
  • FIG. 6 is a schematic structural diagram of a fault processing apparatus 600 provided by an embodiment of the present application, and the fault processing apparatus 600 may be a chip or a circuit, such as a chip or a circuit that may be provided in a central processing unit.
  • the fault processing device 600 may correspond to the central processing unit in the above method.
  • the fault handling apparatus 600 may implement any one or more of the corresponding method steps as shown in FIG. 2 to FIG. 5 .
  • the fault handling device 600 may include a monitoring circuit 601 and a processing circuit 602.
  • the fault processing device 600 may further include a bus system, and the monitoring circuit 601 and the processing circuit 602 may be connected through the bus system.
  • the monitoring circuit 601 can also be connected to each PCIe port in the PCIe core through the bus system
  • the processing circuit 602 can also be connected to the root complex in the PCIe core through the bus system.
  • the monitoring circuit 601 may receive the abnormal interrupt information reported by the abnormal PCIe port and send it to the processing circuit 602 .
  • the processing circuit 602 can first determine the fault type corresponding to the abnormal PCIe port according to the abnormal interrupt information, and, when the fault type corresponding to the abnormal PCIe port is a recoverable fault, reset the abnormal PCIe port and the communication link of the abnormal PCIe port to restore the connectivity between the abnormal PCIe port and the PCIe node.
  • FIG. 7 is a schematic structural diagram of another fault processing apparatus 700 provided by an embodiment of the present application.
  • the fault processing apparatus 700 may be a chip or a circuit, such as a chip or a circuit that may be provided in a central processing unit.
  • the fault processing device 700 may correspond to the central processing unit in the above method.
  • the fault handling apparatus 700 may implement any one or more of the corresponding method steps shown in FIG. 2 to FIG. 5 .
  • the fault processing apparatus 700 may include a communication interface 701 , a determination unit 702 and a processing unit 703 .
  • the communication interface 701 may be a receiving unit or a receiver when receiving information, and the receiving unit or receiver may be a radio frequency circuit.
  • the communication interface 701 can receive the abnormal interruption information reported by the abnormal PCIe port, and the determining unit 702 can determine the fault type corresponding to the abnormal PCIe port according to the abnormal interruption information.
  • the processing unit 703 may reset the abnormal PCIe port and the communication link of the abnormal PCIe port to restore the connection relationship between the abnormal PCIe port and the PCIe node.
  • each unit in the above-mentioned fault processing apparatus 700 may refer to the implementation of the corresponding method embodiments, which will not be repeated here.
  • the division of the units of the above fault processing apparatus 700 is only a division of logical functions, and in actual implementation, all or part of them may be integrated into one physical entity, or may be physically separated.
  • the communication interface 701 may be implemented by the monitoring circuit 601 in the foregoing FIG. 6
  • the determining unit 702 and the processing unit 703 may be implemented by the processing circuit 602 in the foregoing FIG. 6 .
  • the present application further provides a fault processing system, where the fault processing system includes the central processing unit and the PCIe core described in any of the foregoing contents.
  • the PCIe core includes a root complex and at least one PCIe node, and the root complex is connected to a downstream PCIe node through at least one PCIe port set inside.
  • the central processing unit may execute the method of any one of the embodiments shown in FIG. 1 to FIG. 5 to implement fault processing for an abnormal PCIe port in the at least one PCIe port.
  • the present application also provides a computer program product, the computer program product includes: computer program code, when the computer program code is run on a computer, the computer is made to execute the steps shown in FIG. 1 to FIG. 5 .
  • the present application further provides a computer-readable storage medium, where the computer-readable storage medium stores program code, and when the program code is run on a computer, the computer is made to execute the methods shown in FIG. 1 to FIG. 5 .
  • a component may be, but is not limited to, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer.
  • an application running on a computing device and the computing device may be components.
  • One or more components may reside within a process and/or thread of execution, and a component may be localized on one computer and/or distributed between two or more computers.
  • these components can execute from various computer readable media having various data structures stored thereon.
  • a component may communicate through local and/or remote processes, for example based on a signal having one or more data packets (e.g., data from one component interacting with another component in a local system, in a distributed system, and/or across a network such as the Internet, which interacts with other systems by way of the signal).
  • the disclosed system, apparatus and method may be implemented in other manners.
  • the apparatus embodiments described above are only illustrative.
  • the division of the units is only a logical function division. In actual implementation, there may be other division methods.
  • multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the shown or discussed mutual coupling, direct coupling or communication connection may be an indirect coupling or communication connection between devices or units through some interfaces, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the functions, if implemented in the form of software functional units and sold or used as independent products, may be stored in a computer-readable storage medium.
  • the technical solution of the present application in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage medium includes a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disc, and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Systems (AREA)
  • Debugging And Monitoring (AREA)

Abstract

A fault handling method and apparatus, and a system, which are used to maintain high reliability of a peripheral component interconnect express (PCIe) system. The method comprises: a central processing unit acquiring abnormal interrupt information corresponding to an abnormal PCIe port; according to the abnormal interrupt information, determining the type of fault corresponding to the abnormal PCIe port; and when it is determined that the fault is a recoverable fault, resetting the abnormal PCIe port and a communications link of the abnormal PCIe port. By means of the fault handling solution, the communication service capability of an abnormal PCIe port can be recovered in a timely manner, without affecting other PCIe ports in a PCIe system, thereby being beneficial to maintaining the high reliability of the PCIe system.

Description

A fault handling method, device and system
Technical field
The present application relates to the field of communication technologies, and in particular to a fault handling method, device and system.
Background
Peripheral component interconnect express (PCIe) is a high-speed, short-distance communication interface. It can read and write memory quickly and supports ultra-high-bandwidth communication, and it is now widely used in fields such as networking, communications, storage, industrial equipment and consumer electronics.
The main components of a PCIe system are a root complex (RC), switch nodes (switch) and end nodes (endpoint, EP). The root complex manages all buses and all nodes in the PCIe system and acts as the bridge for communication between nodes. A root complex may contain multiple PCIe ports, through which it connects to multiple nodes, such as multiple end nodes or multiple switch nodes. A switch node connects the root complex to other switch nodes or to end nodes, and is the forwarding node for data in the PCIe system. An end node is an end device, such as a peripheral, that receives or sends data.
In existing solutions, when an uncorrectable error occurs on a PCIe port, the entire PCIe system is restarted to recover that port and keep it available. However, all PCIe ports of the PCIe system are then unavailable for the duration of the restart, which clearly reduces the reliability of the PCIe system.
Summary of the invention
The present application provides a fault handling method, device and system to solve the technical problem in the prior art that recovering an abnormal PCIe port by restarting the entire PCIe system lowers the reliability of the PCIe system.
In a first aspect, the present application provides a fault handling method applicable to a central processing unit, where the central processing unit may be connected directly or indirectly to each PCIe port in a PCIe system. The method includes: after receiving abnormal interrupt information reported by an abnormal PCIe port, the central processing unit first determines, according to the abnormal interrupt information, the fault type corresponding to the abnormal PCIe port and, when the fault type is determined to be a recoverable fault, resets the abnormal PCIe port and the communication link of the abnormal PCIe port. The communication link of the abnormal PCIe port connects the abnormal PCIe port to a PCIe node. In this design, resetting the recoverable abnormal PCIe port and its communication link restores the availability of the abnormal PCIe port in time and preserves its communication service capability without restarting the entire PCIe system. The fault handling solution therefore recovers the abnormal PCIe port without affecting the other PCIe ports in the PCIe system, which helps maintain the high reliability of the PCIe system.
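The decision described in the first aspect can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the names `FaultClass` and `handle_abnormal_port` are hypothetical, and a plain list of recorded actions stands in for real hardware operations.

```python
from enum import Enum


class FaultClass(Enum):
    """Hypothetical two-way classification of a port fault."""
    RECOVERABLE = "recoverable"
    UNRECOVERABLE = "unrecoverable"


def handle_abnormal_port(port_id, fault_class, actions):
    """Record the actions taken for one abnormal PCIe port.

    `actions` is a plain list standing in for real hardware calls
    (an assumption made only for this sketch).
    """
    if fault_class is FaultClass.RECOVERABLE:
        # Reset only the faulty port and its link; other ports are untouched.
        actions.append(("reset_port", port_id))
        actions.append(("reset_link", port_id))
        return True
    # The first-aspect text only specifies the recoverable branch.
    return False


log = []
handled = handle_abnormal_port("RP1", FaultClass.RECOVERABLE, log)
print(handled, log)
```

Nothing in the sketch touches any port other than the one passed in, which mirrors the stated benefit of not restarting the whole PCIe system.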
In a possible design, the abnormal PCIe port is provided in the root complex, and the PCIe node may be any node connected to the root complex, such as an end node, a switch node or a bridge node. With this design, the solution can detect link abnormalities not only at end nodes but also at other types of nodes (such as switch nodes or bridge nodes). By identifying and repairing the various types of link abnormality that occur across the PCIe system, it maintains the availability of the PCIe system as far as possible.
In a possible design, the central processing unit may reset the abnormal PCIe port by resetting the media access control (MAC) layer logic of the PCIe core where the abnormal PCIe port is located. Because only the MAC layer logic related to the port is reset, and other logic unrelated to the port is not, this design recovers the abnormal PCIe port in a targeted manner, improving the efficiency of the PCIe port reset while saving the processing resources of the central processing unit and the PCIe core.
In a possible design, the central processing unit may reset the communication link of the abnormal PCIe port by resetting the serializer/deserializer (SerDes) link parameters corresponding to the PCIe core where the abnormal PCIe port is located. When the surrounding environment changes, this design recalibrates the current SerDes link parameters of the abnormal PCIe port, adjusting them to suit the current environment and thereby restoring the communication quality of the abnormal PCIe port.
In a possible design, before resetting the SerDes link parameters corresponding to the PCIe core where the abnormal PCIe port is located, the central processing unit may first disconnect the communication link between the abnormal PCIe port and the PCIe nodes mounted on it; after resetting those SerDes link parameters, it rebuilds the communication link between the abnormal PCIe port and the PCIe nodes mounted on it. By removing the nodes mounted on the abnormal PCIe port before resetting the port, this design decouples the abnormal PCIe port from the other nodes in the PCIe core, which helps the abnormal PCIe port to be reset independently.
In a possible design, if the central processing unit determines that the fault type corresponding to the abnormal PCIe port is an unrecoverable fault, it may disable the abnormal PCIe port and its communication link. This saves PCIe core resources and, as far as possible, prevents the unrecoverable fault of the abnormal PCIe port from spreading to the entire PCIe core and causing a service failure of the whole core.
In a possible design, recoverable faults may include one or more of the following: a data link layer packet transmission timeout error, an error of too many retries when a transaction layer packet writes configuration space, a two-bit data error, and a response error on the advanced extensible interface (AXI) bus. This design can thus repair various types of error related to a PCIe port: the data link layer packet transmission timeout error and the configuration-write retry error relate to communication with the PCIe node, the two-bit data error relates to the port's own data storage, and the AXI bus response error relates to transfers with the central processing unit. Covering all of these helps maintain the availability of PCIe ports more comprehensively.
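As an illustration of how the listed fault types could drive the recoverable/unrecoverable decision, the following sketch enumerates the four types named above. The identifier names are hypothetical, and treating every fault outside the list as unrecoverable is an assumption of the sketch; the text only says that recoverable faults "may include" these types.

```python
from enum import Enum, auto


class FaultType(Enum):
    DLLP_TIMEOUT = auto()              # data link layer packet transmission timeout
    CFG_WRITE_RETRY_EXCEEDED = auto()  # too many retries of a TLP configuration-space write
    TWO_BIT_DATA_ERROR = auto()        # two-bit data error
    AXI_RESPONSE_ERROR = auto()        # response error on the AXI bus
    OTHER = auto()                     # placeholder for anything not listed in the text


# The four fault types the text names as recoverable.
RECOVERABLE = frozenset({
    FaultType.DLLP_TIMEOUT,
    FaultType.CFG_WRITE_RETRY_EXCEEDED,
    FaultType.TWO_BIT_DATA_ERROR,
    FaultType.AXI_RESPONSE_ERROR,
})


def is_recoverable(fault):
    """Return True if the fault is one of the listed recoverable types.

    Assumption: anything outside the list is treated as unrecoverable.
    """
    return fault in RECOVERABLE


print(is_recoverable(FaultType.DLLP_TIMEOUT), is_recoverable(FaultType.OTHER))
```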
In a possible design, the abnormal interrupt information corresponding to the abnormal PCIe port may include one or more of the following: an identifier of the abnormal PCIe port, the fault type corresponding to the abnormal PCIe port, an identifier of the PCIe core where the abnormal PCIe port is located, and an identifier of the central processing unit (CPU) connected to the PCIe core. By parsing the abnormal interrupt information, the central processing unit obtains features related to the current abnormality, from which it can derive the fault type corresponding to the abnormal PCIe port.
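One possible in-memory shape for the abnormal interrupt information is sketched below. The field names and the dictionary input format are assumptions made for illustration; the text names the four items but does not define an encoding. Every field is optional because the information "may include one or more" of them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AbnormalInterruptInfo:
    """Hypothetical record; each field is optional per the text."""
    port_id: Optional[int] = None     # identifier of the abnormal PCIe port
    fault_type: Optional[str] = None  # fault type corresponding to the port
    core_id: Optional[int] = None     # identifier of the PCIe core holding the port
    cpu_id: Optional[int] = None      # identifier of the CPU connected to that core


def parse_interrupt_info(raw):
    """Build a record from a dict, ignoring keys the text does not mention."""
    known = {"port_id", "fault_type", "core_id", "cpu_id"}
    return AbnormalInterruptInfo(**{k: v for k, v in raw.items() if k in known})


info = parse_interrupt_info(
    {"port_id": 1, "fault_type": "axi_response_error", "core_id": 0}
)
print(info)
```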
In a possible design, the central processing unit may obtain the abnormal interrupt information corresponding to the abnormal PCIe port from a preset work queue, which stores the abnormal interrupt information corresponding to the abnormal PCIe ports in each PCIe core. The preset work queue thus centrally manages the port abnormalities that occur in each PCIe core, which makes recovery of abnormal PCIe ports across the PCIe cores more flexible.
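The preset work queue can be pictured with the sketch below, which uses Python's standard `queue.Queue`. The first-in, first-out ordering, the function names and the dictionary payload are all assumptions; the text only states that the queue stores the abnormal interrupt information of the abnormal ports in each PCIe core.

```python
import queue

# One queue shared by all PCIe cores in this sketch (FIFO order is an assumption).
work_queue = queue.Queue()


def report_abnormal_port(core_id, port_id, fault_type):
    """Producer side: store one abnormal interrupt record in the queue."""
    work_queue.put({"core_id": core_id, "port_id": port_id, "fault_type": fault_type})


def drain():
    """Consumer side: take every pending record and return (core, port) pairs."""
    handled = []
    while True:
        try:
            item = work_queue.get_nowait()
        except queue.Empty:
            break
        handled.append((item["core_id"], item["port_id"]))
    return handled


report_abnormal_port(1, "RP1", "dllp_timeout")
report_abnormal_port(2, "RP3", "axi_response_error")
print(drain())
```

Because records from different cores pass through one queue, the consumer can process them in a single place regardless of which core reported them.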
In a second aspect, the present application provides a fault handling device including a processor and a memory, where the memory stores a computer program. In implementation, by invoking the computer program stored in the memory, the processor can perform the following operations: obtain abnormal interrupt information corresponding to an abnormal peripheral component interconnect express (PCIe) port; determine, according to the abnormal interrupt information, the fault type corresponding to the abnormal PCIe port; and, when that fault type is a recoverable fault, reset the abnormal PCIe port and the communication link of the abnormal PCIe port. The communication link of the abnormal PCIe port connects the PCIe port and a PCIe device.
In a possible design, the PCIe node may be one or more of an end node, a switch node or a bridge node.
In a possible design, the fault handling device may further include an advanced configuration and power interface (ACPI). By invoking the ACPI, the processor can reset the media access control (MAC) layer logic of the PCIe core where the abnormal PCIe port is located.
In a possible design, the fault handling device may further include serializer/deserializer (SerDes) firmware. Through the ACPI, after resetting the MAC layer logic of the PCIe core where the abnormal PCIe port is located, the processor may also invoke the SerDes firmware, and through the SerDes firmware it can reset the SerDes link parameters corresponding to that PCIe core.
In a possible design, the fault handling device may further include a PCIe driver. By invoking the PCIe driver, the processor can disconnect the communication link between the abnormal PCIe port and the PCIe nodes mounted on it, and then invoke the ACPI. After the SerDes link parameters corresponding to the PCIe core where the abnormal PCIe port is located have been reset by invoking the SerDes firmware, control returns to the PCIe driver, which rebuilds the communication link between the abnormal PCIe port and the PCIe nodes mounted on it.
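The call chain among the PCIe driver, the ACPI and the SerDes firmware can be traced with the sketch below. The class and method names are hypothetical, and each step only appends a message to a trace list in place of touching hardware. The order (disconnect, MAC reset, SerDes reset, rebuild) follows the designs described above.

```python
class SerDesFirmware:
    def __init__(self, trace):
        self.trace = trace

    def reset_link_params(self, port_id):
        self.trace.append(f"serdes: reset link parameters for {port_id}")


class Acpi:
    def __init__(self, serdes, trace):
        self.serdes = serdes
        self.trace = trace

    def reset_port(self, port_id):
        self.trace.append(f"acpi: reset MAC logic for {port_id}")
        # MAC reset first, then hand over to the SerDes firmware.
        self.serdes.reset_link_params(port_id)


class PcieDriver:
    def __init__(self, acpi, trace):
        self.acpi = acpi
        self.trace = trace

    def recover_port(self, port_id):
        self.trace.append(f"driver: disconnect nodes on {port_id}")
        self.acpi.reset_port(port_id)
        # Control returns here once the SerDes reset has finished.
        self.trace.append(f"driver: rebuild link on {port_id}")


trace = []
driver = PcieDriver(Acpi(SerDesFirmware(trace), trace), trace)
driver.recover_port("RP1")
print("\n".join(trace))
```

Keeping the three layers as separate objects reflects the later option of storing them in different register areas, since each layer only needs a reference to the one below it.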
In a possible design, the memory may include public registers and private registers:
In one case, the PCIe driver and the ACPI may be stored in the public registers and the SerDes firmware in the private registers, which keeps the link reset method in the SerDes firmware private and effectively protects the implementation logic of the SerDes link parameter reset;
In another case, the PCIe driver may be stored in the public registers and the ACPI and the SerDes firmware in the private registers, which keeps both the port reset method in the ACPI and the link reset method in the SerDes firmware private and effectively protects the overall fault handling logic.
In a possible design, by invoking the computer program stored in the memory, the processor can also perform the following operation: when the fault type corresponding to the abnormal PCIe port is an unrecoverable fault, disable the abnormal PCIe port and the communication link of the abnormal PCIe port.
In a possible design, recoverable faults may include one or more of the following: a data link layer packet transmission timeout error, an error of too many retries when a transaction layer packet writes configuration space, a two-bit data error, and a response error on the advanced extensible interface (AXI) bus.
In a possible design, the abnormal interrupt information corresponding to the abnormal PCIe port may include one or more of the following: an identifier of the abnormal PCIe port, the fault type corresponding to the abnormal PCIe port, an identifier of the PCIe core where the abnormal PCIe port is located, and an identifier of the central processing unit (CPU) connected to the PCIe core.
In a possible design, the fault handling device may further include a communication interface, and by invoking the computer program stored in the memory the processor specifically performs the following operations: receive, through the communication interface, the abnormal interrupt information corresponding to the abnormal PCIe port; add that information to a preset work queue; and obtain it from the preset work queue. The preset work queue stores the abnormal interrupt information corresponding to the abnormal PCIe ports in each PCIe core.
In a third aspect, the present application provides a fault handling device that includes modules, units or circuits for performing the method of any possible design of any of the above aspects. These modules, units or circuits may be implemented in hardware, or in hardware executing corresponding software.
In a fourth aspect, the present application provides a chip that may include a processor and a communication interface, where the processor reads instructions through the communication interface to execute the fault handling method according to any design of the first aspect.
In a fifth aspect, the present application provides a fault handling system including a central processing unit and a peripheral component interconnect express (PCIe) core. The PCIe core includes a root complex and at least one PCIe node, the central processing unit is connected to the root complex, and the root complex includes at least one PCIe port through which it is connected to the at least one PCIe node. The root complex may generate abnormal interrupt information corresponding to an abnormal PCIe port among the at least one PCIe port and report it to the central processing unit, and the central processing unit may perform fault handling on the abnormal PCIe port according to the fault handling method of any design of the first aspect.
In a sixth aspect, the present application provides a computer-readable storage medium storing program code which, when run on a computer, causes the computer to execute the fault handling method according to any design of the first aspect.
In a seventh aspect, the present application provides a computer program product including computer program code which, when run on a computer, causes the computer to execute the fault handling method according to any design of the first aspect.
For the beneficial effects of any design of the second to seventh aspects of the present application, refer to the beneficial effects described for the corresponding design of the first aspect; they are not repeated here.
Brief description of the drawings
FIG. 1 is a schematic diagram of a system architecture to which an embodiment of the present application is applicable;
FIG. 2 is a schematic flowchart of a fault handling method provided by an embodiment of the present application;
FIG. 3 is a schematic flowchart of a reset method provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of a software and hardware architecture of fault handling logic provided by an embodiment of the present application;
FIG. 5 is a schematic flowchart of another fault handling method provided by an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a fault handling device provided by an embodiment of the present application;
FIG. 7 is a schematic structural diagram of another fault handling device provided by an embodiment of the present application.
Detailed description
The fault handling method disclosed in this application can be applied to an electronic device that communicates based on a PCIe system. In some embodiments of the present application, the fault handling device may be the electronic device itself or an independent unit. When the fault handling device is an independent unit, the unit may be embedded in the electronic device and perform fault handling on the PCIe ports of the electronic device, so as to maintain the high reliability of the PCIe system. In other embodiments of the present application, the fault handling device may instead be a unit packaged inside the electronic device that implements the fault handling function for the PCIe ports of the electronic device. The electronic device may be a server, a storage device, a test instrument, or a portable electronic device that includes functions such as a personal digital assistant and/or a music player, for example a mobile phone, a tablet computer, a wearable device with wireless communication capability (such as a smart watch), or an in-vehicle device. Exemplary embodiments of portable electronic devices include, but are not limited to, portable electronic devices running the operating systems shown in Figure PCTCN2021073396-appb-000001 and Figure PCTCN2021073396-appb-000002 (names rendered as images in the original) or other operating systems, such as a laptop computer with a touch-sensitive surface (for example, a touch panel) or a desktop computer.
The present application is described in further detail below with reference to the accompanying drawings. It should be noted that, in the description of the present application, "at least one" means one or more, and "a plurality of" means two or more. Accordingly, in the embodiments of the present invention, "a plurality of" may also be understood as "at least two". "And/or" describes an association between associated objects and indicates that three relationships are possible; for example, "A and/or B" may mean that A exists alone, that A and B both exist, or that B exists alone. In addition, unless otherwise specified, the character "/" generally indicates an "or" relationship between the objects before and after it.
FIG. 1 shows a schematic diagram of a system architecture to which the embodiments of the present application are applicable. As shown in FIG. 1, the system architecture includes at least one central processing unit (CPU) and at least one PCIe core, which may correspond one-to-one; in FIG. 1, central processing unit 1 corresponds to PCIe core 1 and central processing unit 2 corresponds to PCIe core 2. A PCIe core is also called a PCIe system. Each PCIe core may include one root complex and at least one end node, and may also include at least one switch node and at least one bridge node. The root complex performs system initialization when the PCIe core is built and configures the communication links between nodes, so as to connect the central processing unit corresponding to the PCIe core to one or more of the switch nodes, end nodes and bridge nodes in the PCIe core. The central processing unit corresponding to a PCIe core communicates with the nodes in that PCIe core by connecting to its root complex. A switch node connects the upstream root complex to one or more downstream end nodes, switch nodes or bridge nodes. It routes data from the upstream root complex to one or more downstream nodes, routes data from each downstream node to the single upstream root complex, and can also flexibly route data from one downstream node to another downstream node in a point-to-point manner. A bridge node uses non-transparent bridges (NTB) provided in different bus systems to connect the PCIe core to other PCI systems that use other bus standards or to other PCIe cores. An end node is usually located in a terminal application (APP); it connects the terminal APP to the other nodes in the PCIe core and completes PCIe-based transaction transfers. In general, a PCIe core contains more end nodes than nodes of other types.
The following takes PCIe core 1 shown in FIG. 1 as an example to further describe how the nodes in each PCIe core are connected:
PCIe core 1 includes one root complex 1, one switch node 1, one bridge node 1 and four end nodes, namely end node 1, end node 2, end node 3 and end node 4. Switch node 1, end node 1 and bridge node 1 are downstream nodes of root complex 1 (root complex 1 is the upstream node of switch node 1, end node 1 and bridge node 1), while end node 2, end node 3 and end node 4 are downstream nodes of switch node 1 (switch node 1 is the upstream node of end node 2, end node 3 and end node 4). An upstream node and its downstream nodes may be connected by a PCI bus (see the thick black lines in FIG. 1). Root complex 1 may also contain one or more PCIe ports, such as root port (RP) 1, RP2 and RP3. Root complex 1 connects to the downstream switch node 1, end node 1 and bridge node 1 through RP1, RP2 and RP3 respectively. In this way, root complex 1 routes data to and from switch node 1 and its downstream end node 2, end node 3 and end node 4 through RP1, to and from end node 1 through RP2, and to and from nodes in PCIe core 2 (such as bridge node 2) through RP3.
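The topology of PCIe core 1 can be written down as a small tree to show which nodes hang under each root port, and therefore which nodes a reset of that port would touch. The node labels and the choice to model the root ports as tree nodes are assumptions made for this sketch.

```python
# Downstream links of PCIe core 1 as described in the text (parent -> children).
TOPOLOGY = {
    "RC1": ["RP1", "RP2", "RP3"],
    "RP1": ["switch1"],
    "RP2": ["end1"],
    "RP3": ["bridge1"],
    "switch1": ["end2", "end3", "end4"],
}


def downstream_nodes(node):
    """Return every node reachable below `node`, in depth-first order."""
    result = []
    for child in TOPOLOGY.get(node, []):
        result.append(child)
        result.extend(downstream_nodes(child))
    return result


print(downstream_nodes("RP1"))  # ['switch1', 'end2', 'end3', 'end4']
```

In this model a fault on RP1 involves switch node 1 and end nodes 2 to 4, while end node 1 under RP2 and bridge node 1 under RP3 are outside its subtree.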
It should be noted that the above is only an exemplary description. A PCIe core may include more or fewer nodes than shown in FIG. 1; for example, it may include a larger number of switch nodes, end nodes or bridge nodes, or node types other than root complexes, switch nodes, end nodes and bridge nodes. The root complex and the central processing unit may be connected one-to-one as shown in FIG. 1, or one-to-many or many-to-one; for example, one root complex may be connected to at least two central processing units, or one central processing unit may be connected to at least two root complexes. In addition, the central processing unit and the PCIe core may be deployed in the same physical entity or in different physical entities, or the central processing unit and some nodes of the PCIe core may be deployed in one physical entity while the other nodes of the PCIe core are deployed in another physical entity; this is not specifically limited.
As can be seen from FIG. 1, the root complex communicates with the other nodes in the PCIe core through its internal PCIe ports. In this case, the services between the root complex and the terminal APP are actually distributed over the communication links corresponding to the PCIe ports for processing. Keeping the connection between each PCIe port and each node normal is therefore crucial to maintaining the service processing capability of the entire PCIe core. At present, when a connection problem is detected between a PCIe port and a downstream end node, the usual practice is to restart the entire PCIe core. However, this makes all PCIe ports in the PCIe core unavailable: it not only fails to restore the services of the PCIe port with the connection problem, but also affects the services of the other PCIe ports in the PCIe core, which reduces the reliability of the entire PCIe core and even of the entire system. To solve this problem, in an optional implementation, the PCIe core is not restarted; instead, only the PCIe port that has the connection problem with the downstream end node is disabled. However, this leaves the disabled PCIe port and all downstream nodes mounted on it unavailable. Although the services of other PCIe ports are not affected, the services of the disabled PCIe port cannot be restored for a long time. In addition, both approaches can only handle connection faults when the downstream node is an end node and cannot handle connection faults when the downstream node is another type of node (such as a switch node or a bridge node), so the fault handling lacks generality.
In view of this, the present application provides a fault handling method for quickly restoring the services of an abnormal PCIe port without affecting the services of other PCIe ports, and for handling connection faults of more types of downstream nodes.
The specific implementation of the fault handling solution in the present application is described below through specific embodiments.
[Embodiment 1]
FIG. 2 exemplarily shows a schematic flowchart of a fault handling method provided by an embodiment of the present application. The method is applicable to a central processing unit in FIG. 1, such as central processing unit 1 or central processing unit 2 shown in FIG. 1. As shown in FIG. 2, the method includes:
Step 201: the central processing unit obtains abnormal interrupt information corresponding to an abnormal PCIe port.
In an optional implementation, the root complex may detect, in real time or periodically, the service processing between each internal PCIe port and the downstream nodes mounted on that port. When it detects that service processing between a PCIe port and its mounted downstream nodes is abnormal, the root complex may first interrupt the current service of the abnormal PCIe port, so that continuing to execute the abnormal service does not compromise the service accuracy of the PCIe core; it then generates abnormal interrupt information corresponding to the abnormal PCIe port from the information related to the fault, and finally reports it to the connected central processing unit. The abnormal interrupt information corresponding to the abnormal PCIe port may include the fault type, the identifier of the abnormal PCIe port, and the identifier of the PCIe core where the abnormal PCIe port is located, and may also include one or more of the type of the current service, the service processing progress, or contact person information.
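As an illustration only, the fields listed above could be grouped into a record such as the following minimal Python sketch. The class name, field names, and example values are assumptions made for this sketch; the application does not define a concrete data layout.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AbnormalInterruptInfo:
    """Hypothetical container for the abnormal interrupt information."""
    # Fields the text says the information may include.
    fault_type: str          # e.g. "two-bit data error"
    port_id: str             # identifier of the abnormal PCIe port
    core_id: str             # identifier of the PCIe core containing the port
    # Optional extras mentioned in the text.
    service_type: Optional[str] = None
    service_progress: Optional[float] = None   # assumed to be a 0.0-1.0 fraction
    contact: Optional[str] = None


def has_contact(info: AbnormalInterruptInfo) -> bool:
    """Return True when an alarm could be pushed to a contact person."""
    return info.contact is not None


info = AbnormalInterruptInfo("two-bit data error", "RP1", "PCIe core 1")
```

Making the record immutable is a design choice of this sketch: the information describes a fault that has already happened, so the receiver has no reason to modify it.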
In the above implementation, every interface inside the PCIe core is a PCIe interface, whereas the interface between the central processing unit and the root complex is not a PCIe interface and does not belong to the PCIe core. In this case, the root complex may transmit the abnormal interrupt information to the central processing unit by means other than the PCIe bus, for example, through the file transfer protocol (file transfer protocol, FTP).
In the embodiments of the present application, the fault type of the abnormal PCIe port may include one or more of the following:
Error type 1: two-bit data error
The root complex may also store an error correcting code (error correcting code, ECC) program. ECC can correct a 1-bit error (a correctable error (correctable errors, CE)) and detect a 2-bit error (an uncorrectable error (uncorrectable errors, UE)); that is, it can correct data containing only a 1-bit error into correct data, and it can detect data containing a 2-bit error but cannot correct it. In implementation, when the root complex detects that the service processing of a PCIe port is abnormal, it may first invoke ECC to preprocess the error. If only 1 bit of data is in error, the root complex can locate the erroneous bit as indicated by ECC, correct it directly, and then continue to execute the current service of the PCIe port; in this case the PCIe port is still a normal PCIe port. However, if 2 bits of data are in error, the root complex can only detect which two bits are in error and cannot correct them by itself; in this case the PCIe port is an abnormal PCIe port. The root complex may first suspend the current service of the PCIe port, then assemble the corresponding abnormal interrupt information based on the positions of the 2-bit error located by ECC and report it to the central processing unit. In this case, the root complex may mark the fault type in the generated abnormal interrupt information as "two-bit data error".
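The correct-one, detect-two behavior described above is characteristic of single-error-correcting, double-error-detecting (SECDED) codes. The sketch below uses an extended Hamming(8,4) code purely as an illustration; the application does not state which code the root complex uses, and the function names are assumptions. Note that this particular code reports that a two-bit error exists without locating the two bits.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword.

    Index 0 holds the overall parity bit; indices 1..7 are the usual
    Hamming positions (parity at 1, 2, 4; data at 3, 5, 6, 7).
    """
    code = [0] * 8
    code[3] = nibble & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2      # makes the total number of 1s even
    return code


def decode(code):
    """Return (status, nibble); nibble is None when the error is uncorrectable."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos          # XOR of the positions that hold a 1
    parity_odd = sum(code) % 2 == 1
    if parity_odd:
        # Odd number of flips, treated as a single-bit error at `syndrome`
        # (syndrome 0 means the overall parity bit itself was hit).
        code[syndrome] ^= 1
        status = "corrected"
    elif syndrome != 0:
        # Even parity but non-zero syndrome: two bits flipped, not correctable.
        return "uncorrectable", None
    else:
        status = "ok"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble
```

In terms of the text, a "corrected" result corresponds to the case where the port stays a normal PCIe port, and an "uncorrectable" result corresponds to the case where the root complex would raise abnormal interrupt information.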
Error type 2: data link layer packet transmission timeout error
The root complex may consist of an application layer, a transaction transport layer (transport layer, TL), a data link layer (data link layer, DLL), and a physical layer. When a transaction between the central processing unit and the terminal APP is processed, the application layer first initiates a transaction transmission request to the transaction transport layer. The transaction transport layer generates the corresponding transaction transport layer packet (transport layer package, TLP) and sends it to the data link layer. The data link layer adds a sequence number and a link cyclic redundancy check (link cyclic redundancy check, LCRC) code to the TLP to generate the corresponding data link layer packet (data link layer package, DLLP) and sends it to the physical layer, which transmits the transaction on the PCIe link corresponding to the PCIe port. After sending the DLLP, the data link layer waits for the physical layer to return response information indicating successful transmission. Only if this response information is received within a preset duration does the data link layer confirm that the transaction has been correctly exchanged in both directions between the data link layer and the physical layer. Under this transaction processing flow, in implementation, if the data link layer in the root complex has not received the successful-transmission response from the physical layer within the preset duration after sending the DLLP, the root complex can determine that there is a problem in the transmission between the data link layer and the physical layer. The problem may be an uncorrectable error caused by network delay, and in this case the PCIe port is an abnormal PCIe port. The root complex may generate the abnormal interrupt information corresponding to this PCIe port and mark the fault type as "data link layer packet transmission timeout error".
Error type 3: excessive configuration space write retries error
A node in the PCIe core (such as the root complex, a switch node, or an end node) can support up to 8 functions, such as audio and video functions. When a node supports multiple functions at the same time, each function of the node has its own configuration space, which stores information about that function. The configuration space may be an independent section of storage in the node; for example, its size may be 256k. Nodes other than the root complex can only see information about their own configuration space, whereas the root complex has permission to read and write the configuration space of every node. For example, the root complex can read the information in the configuration space of any node through a transaction layer packet to determine the functions supported by that node, or write the configuration space of any node through a transaction layer packet to initialize the node and configure its functions. However, if the node being written is not yet ready to respond to the root complex's configuration space write request, it returns to the root complex a transaction layer response packet with the status "configuration retry status (configuration retry status, CRS)". This indicates that the root complex failed to write the node's configuration space. As long as the number of failed writes does not exceed a preset number, the write failure is a correctable error, and the root complex can keep retrying the write. When a certain number of transaction layer response packets with the CRS status have been received, the root complex has failed to write successfully within a certain number of retries, and an uncorrectable error has occurred on the PCIe port of the PCIe link where the abnormal node is located; in this case the PCIe port is an abnormal PCIe port. The root complex may generate the corresponding abnormal interrupt information for this PCIe port and mark the fault type as "excessive configuration space write retries error".
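The retry rule above can be sketched as a bounded loop. In this minimal Python sketch, `send_config_write` is a hypothetical callable standing in for one configuration space write attempt, and `max_retries` stands in for the preset number; neither name comes from the application.

```python
def write_config_space(send_config_write, max_retries):
    """Retry a configuration write until it succeeds or the CRS limit is hit.

    send_config_write() is assumed to return "OK" on success or "CRS" when
    the target node is not ready yet.
    Returns ("ok", crs_count) or ("uncorrectable", crs_count).
    """
    crs_count = 0
    while True:
        if send_config_write() == "OK":
            return "ok", crs_count
        crs_count += 1                      # one more CRS response received
        if crs_count > max_retries:
            # Too many CRS responses: escalate to a port-level fault.
            return "uncorrectable", crs_count


# Example: the node answers CRS twice and then accepts the write.
responses = iter(["CRS", "CRS", "OK"])
result = write_config_space(lambda: next(responses), max_retries=3)
```

While the count stays at or below the limit the failure is treated as correctable and the loop simply tries again; only crossing the limit produces the result that would be reported as a fault.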
Error type 4: AXI bus response error
The root complex in the PCIe core is connected to the corresponding central processing unit through an advanced extensible interface (advanced eXtensible interface, AXI) bus. This means that the root complex and the central processing unit must send and receive data in the manner specified by the AXI bus protocol. In this case, after the root complex receives over the AXI bus a data processing request issued by the central processing unit for a PCIe port, if it cannot return a response to the central processing unit within the time specified by the AXI bus protocol, or the returned response does not conform to the format specified by the AXI bus protocol, the root complex can determine that there is a problem with that PCIe port. The problem may be an uncorrectable error caused by an abnormality of the PCIe port itself, or by an abnormality of the transmission link between the PCIe port and a downstream node; in this case the PCIe port is an abnormal PCIe port. The root complex may generate the corresponding abnormal interrupt information for this PCIe port and mark the fault type as "AXI bus response error". It should be understood that the above only takes detection of the AXI bus response error by the root complex as an example to introduce this error type. In other examples, the AXI bus response error may also be detected by the central processing unit itself, without being reported by the root complex.
It should be noted that the above lists only a few possible fault types as examples; the fault type of an abnormal PCIe port may also be any other port-level uncorrectable error. In addition, the fault of an abnormal PCIe port may be caused by a hardware fault of the PCIe port and its downstream nodes, or by a software fault of the PCIe port and its downstream nodes, which is not specifically limited in the present application.
Step 202: the central processing unit determines the fault type according to the abnormal interrupt information corresponding to the abnormal PCIe port. If the fault is an unrecoverable fault, step 203 is performed; if the fault is a recoverable fault, step 204 is performed.
In the embodiments of the present application, an unrecoverable fault is a fault that the central processing unit cannot repair directly or indirectly and that can only be resolved through commissioning by dedicated personnel, such as hardware damage or a recording medium defect. A recoverable fault is a fault that the central processing unit can repair directly or indirectly, for example, by resetting, upgrading, updating, downloading a patch, or restarting.
Step 203: the central processing unit disables the abnormal PCIe port and the corresponding communication link.
In the embodiments of the present application, the central processing unit may also store a recoverable fault record table, which records the recoverable fault types that the central processing unit can recover from directly or indirectly. The recoverable faults in the table may include, for example, one or more of the two-bit data error, the data link layer packet transmission timeout error, the excessive configuration space write retries error, and the AXI bus response error described above, or other recoverable port-level errors. These recoverable fault types may be preset in the recoverable fault record table by developers based on experience, may be learned by the central processing unit itself while executing services and stored in the table in real time, or may be derived from interaction information that the central processing unit obtains from other central processing units or network devices. This is not specifically limited.
In implementation, the central processing unit may first obtain the fault type of the abnormal PCIe port from the abnormal interrupt information and then match it against the recoverable fault types in the recoverable fault record table. If none of the recoverable fault types in the table matches the fault type of the abnormal PCIe port, the central processing unit may classify the fault of the abnormal PCIe port as an unrecoverable fault. Because an unrecoverable fault can neither be recovered using techniques external to the computer program or run nor be corrected by error-checking codes or other techniques, once the central processing unit detects an unrecoverable fault on a PCIe port, it may directly disable the PCIe port. This saves PCIe core resources and prevents the unrecoverable fault of the abnormal PCIe port from spreading to the entire PCIe core and causing a service failure of the entire PCIe core.
In the embodiments of the present application, the central processing unit may disable the abnormal PCIe port and the corresponding PCIe link in various ways. For example:
In one way, the central processing unit may bring up the configuration interface of the basic input output system (basic input output system, BIOS) through a hot key or an instruction, select the abnormal PCIe port on the configuration interface, and issue a disable command. The disable command causes the root complex to disable, by following the instruction or by writing the configuration space, the port function of the abnormal PCIe port and the node functions of all nodes mounted on the abnormal PCIe port;
In another way, the central processing unit may bring up the onboard settings interface of a RAM chip on the motherboard (such as a complementary metal oxide semiconductor (Complementary Metal Oxide Semiconductor, CMOS) chip) and set the slot of the abnormal PCIe port to "Disabled" in that interface, so that the root complex invalidates the abnormal PCIe port and then unmounts all nodes mounted on it.
For example, when the abnormal interrupt information also contains contact person information, after disabling the PCIe port and the corresponding PCIe link, the central processing unit may also generate a corresponding alarm message and push it to the user according to the contact person information in the abnormal interrupt information, so that the user can learn of and repair the faulty PCIe port in time and restore the services of the PCIe port as soon as possible.
Step 204: the central processing unit resets the abnormal PCIe port and the corresponding communication link.
In the embodiments of the present application, the central processing unit may also store a reset manner for each recoverable fault recorded in the recoverable fault record table. The reset manner of any recoverable fault may be one or more of resetting, upgrading, updating, downloading a patch, or restarting. In implementation, if the recoverable fault record table contains a target recoverable fault type that matches the fault type of the abnormal PCIe port, the central processing unit may classify the fault of the abnormal PCIe port as a recoverable fault and use the reset manner corresponding to the target recoverable fault type to reset the abnormal PCIe port and its corresponding communication link. In this way, the central processing unit can restore the services of the abnormal PCIe port as soon as possible through the reset without affecting the services of other PCIe ports, which effectively maintains the reliability of the PCIe core.
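Steps 202 to 204 amount to a table lookup followed by a branch. The following Python sketch shows one way this could look; the table contents, the reset manners assigned to each fault type, and the function name are illustrative assumptions, not values given in the application.

```python
# Hypothetical recoverable fault record table: fault type -> reset manners.
RECOVERABLE_FAULTS = {
    "two-bit data error": ["reset"],
    "data link layer packet transmission timeout error": ["reset"],
    "AXI bus response error": ["reset", "restart"],
}


def decide_action(fault_type, table=RECOVERABLE_FAULTS):
    """Return ("reset", manners) for a recoverable fault, else ("disable", [])."""
    manners = table.get(fault_type)
    if manners is None:
        # No matching entry: treat the fault as unrecoverable (step 203).
        return "disable", []
    # Matching entry found: the fault is recoverable (step 204).
    return "reset", list(manners)
```

Storing the reset manners alongside the fault types lets a single lookup answer both questions, whether the fault is recoverable and how to recover from it.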
It should be noted that, in general, only the root complex in the PCIe core can directly write the configuration space of the ports and nodes in the PCIe core; the central processing unit cannot. Therefore, to reset the abnormal PCIe port and the corresponding communication link, the permission to write the configuration space of the ports and nodes in the PCIe core may be synchronized to the central processing unit in advance, so that the central processing unit completes the reset by writing the configuration space directly; alternatively, the reset may be performed indirectly, with the central processing unit sending the corresponding instructions to the root complex to drive the root complex to write the configuration space. This is not specifically limited.
Taking the case where the reset manner is a reset operation as an example, a specific implementation of resetting the abnormal PCIe port and the corresponding communication link is described below.
FIG. 3 exemplarily shows a schematic flowchart of a reset method provided by an embodiment of the present application. The method is applicable to a central processing unit, such as central processing unit 1 or central processing unit 2 shown in FIG. 1. In this example, it is assumed that the central processing unit writes the configuration space of the ports and nodes in the PCIe core indirectly. As shown in FIG. 3, the method includes:
Step 301: the central processing unit deactivates the abnormal PCIe port.
In step 301 above, the central processing unit may send a deactivation instruction for the abnormal PCIe port to the root complex, instructing the root complex not to continue the current service of the abnormal PCIe port. After deactivating the abnormal PCIe port, the root complex may also record the service processing progress of the port at the time of deactivation, so that the service can be resumed after recovery. Deactivating the abnormal PCIe port before resetting it prevents the reset from affecting the service being executed and ensures that the service is executed accurately before and after the reset.
Step 302: the central processing unit disconnects the communication link between the abnormal PCIe port and the nodes mounted on the abnormal PCIe port.
In step 302 above, the central processing unit may send a removal instruction for the abnormal PCIe port to the root complex, driving the root complex to write the configuration space of each node mounted on the abnormal PCIe port, remove those nodes from the current communication link, and thereby disconnect the abnormal PCIe port from the nodes mounted on it. Removing the nodes mounted on the abnormal PCIe port before resetting the port decouples the abnormal PCIe port from the other nodes in the PCIe core and helps the abnormal PCIe port to be reset independently.
Step 303: the central processing unit invalidates the abnormal PCIe port.
In step 303 above, the central processing unit may send an invalidation instruction for the abnormal PCIe port to the root complex, driving the root complex to write the configuration space of the abnormal PCIe port and switch the port from the enabled (enable) state to the disabled (disable) state. Here, "enable" means "to make able". When the abnormal PCIe port is in the enabled state, it can process data delivered by the upstream root complex or reported by downstream nodes. When the abnormal PCIe port is in the disabled state, it cannot process such data.
Step 304: the central processing unit resets the media access control (media access control, MAC) logic of the PCIe core.
In step 304 above, the central processing unit may send a MAC reset instruction for the abnormal PCIe port to the root complex, driving the root complex to initialize the MAC logic of the entire PCIe core and restore the MAC-layer communication mechanism of the abnormal PCIe port. Resetting only the port-related MAC-layer logic, and not other logic unrelated to the port, recovers the abnormality of the abnormal PCIe port in a targeted manner, which improves the efficiency of the PCIe port reset and saves processing resources of the central processing unit and the PCIe core.
Step 305: the central processing unit resets the SerDes link parameters corresponding to the PCIe core.
In the embodiments of the present application, the communication links in the PCIe core are managed by a serializer/deserializer (serializer/deserializer, SerDes). Default SerDes link parameters are preset in the SerDes, and the SerDes uses them to convert parallel transmit data into serial transmit data, or serial receive data into parallel receive data. However, when the surrounding environment changes, the default SerDes link parameters may no longer suit the PCIe core, causing abnormalities on the communication links of some PCIe ports in the PCIe core. In this case, the SerDes link parameters in the SerDes need to be adjusted to suit the current environment.
In implementation, the central processing unit may send a link reset instruction for the abnormal PCIe port to the root complex, driving the root complex to adaptively calibrate the SerDes link parameters corresponding to the abnormal PCIe port. The calibrated SerDes link parameters may be calculated by obtaining the current environmental parameters (such as temperature) and substituting them into a preset formula, may be obtained by closed-loop feedback adjustment based on the effect of each adjustment until convergence, or may be selected at random. This is not specifically limited.
Step 306: the central processing unit rebuilds the communication link between the abnormal PCIe port and the nodes mounted on the abnormal PCIe port.
In step 306 above, the central processing unit may send a link rebuild instruction for the entire PCIe core to the root complex, so that the root complex rebuilds the entire topology of the PCIe core. Alternatively, it may send a link rebuild instruction for only the abnormal PCIe port, so that the root complex directly detects the abnormal PCIe port and all nodes mounted on it and adds them to the existing topology of the PCIe core.
In an optional implementation, still referring to FIG. 1 and assuming that the links of the entire PCIe core 1 are to be rebuilt, root complex 1 may traverse in turn the bus paths of its internal PCIe ports RP1, RP2, and RP3:
根复合体1遍历RP1的下游总线,发现交换节点1,为交换节点1分配总线地址(可包括总线号、节点号和功能号(bus device and function number,BDF)等);根复合体1按照深度优先规则继续遍历交换节点1的下游总线所连接的节点,发现端节点2,为端节点2分配总线地址;由于端节点2不存在下游总线,因此根复合体1继续遍历交换节点1的下游总线所连接的节点,发现端节点3,为端节点3分配总线地址;由于端节点3不存在下游总线,因此根复合体1继续遍历交换节点1的下游总线所连接的节点,发现端节点4,为端节点4分配总线地址;至此,RP1所在的总线链路遍历完成;Root complex 1 traverses the downstream bus of RP1, finds switching node 1, and assigns a bus address to switching node 1 (which may include bus number, node number, and function number (bus device and function number, BDF), etc.); root complex 1 follows The depth-first rule continues to traverse the nodes connected to the downstream bus of switch node 1, finds end node 2, and assigns a bus address to end node 2; since end node 2 does not have a downstream bus, root complex 1 continues to traverse the downstream of switch node 1. The node connected to the bus finds end node 3 and assigns a bus address to end node 3; since end node 3 does not have a downstream bus, root complex 1 continues to traverse the nodes connected to the downstream bus of switch node 1 and finds end node 4 , assign the bus address to the end node 4; so far, the bus link traversal where RP1 is located is completed;
Root complex 1 traverses the downstream bus of RP2, discovers end node 1, and assigns it a bus address. The traversal of the bus link of RP2 is then complete;
Root complex 1 traverses the downstream bus of RP3, discovers bridge node 1, and assigns it a bus address. Root complex 1 also records the bridge address of bridge node 1, so that the result can be associated with the topology obtained by traversing PCIe core 2. The traversal of the bus link of RP3 is then complete.
Through the above process, root complex 1 can assign bus addresses to all nodes in PCIe core 1 and construct the topology of the entire PCIe core 1.
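The depth-first traversal described above can be sketched as follows. This is a minimal illustration, not the root complex's actual enumeration logic: the `Node` class, the `enumerate_core` function, and the sequential integer addresses are assumptions made for the example, and real BDF assignment is more involved.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)


def enumerate_core(root_ports):
    """Walk each root port depth-first and assign addresses in discovery order."""
    topology = {}       # node name -> (root port name, hypothetical address)
    next_address = 0

    def visit(port_name, node):
        nonlocal next_address
        topology[node.name] = (port_name, next_address)
        next_address += 1
        for child in node.children:   # descend before moving on to siblings
            visit(port_name, child)

    for port_name, attached in root_ports.items():
        for node in attached:
            visit(port_name, node)
    return topology


# PCIe core 1 as described in the text: RP1 -> switch node 1 -> end nodes 2, 3, 4;
# RP2 -> end node 1; RP3 -> bridge node 1.
core1 = {
    "RP1": [Node("switch1", [Node("ep2"), Node("ep3"), Node("ep4")])],
    "RP2": [Node("ep1")],
    "RP3": [Node("bridge1")],
}
topology = enumerate_core(core1)
print(list(topology))
```

The discovery order matches the order in which the text visits the nodes: the switch node first, then its three end nodes, then the nodes under RP2 and RP3.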
It should be noted that only the nodes mounted on the previously removed abnormal PCIe port are missing from the topology of the PCIe core; the nodes mounted on the other PCIe ports are already present in it. Therefore, even when the links are rebuilt for the entire PCIe core, the nodes that the root complex finds on non-abnormal PCIe ports do not need to be added to the topology again. Only the nodes mounted on the abnormal PCIe port, which are absent from the current topology, are added. The communication links between the abnormal PCIe port and the nodes mounted on it are thus rebuilt by supplementing the existing topology.
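A minimal sketch of this supplementing step, assuming the topology is kept as a simple mapping from node name to root port (the function name `supplement_topology` and the data layout are hypothetical):

```python
def supplement_topology(existing, discovered):
    """Add only the nodes absent from the existing topology; return the names added."""
    added = []
    for name, port in discovered.items():
        if name not in existing:      # nodes already present are left untouched
            existing[name] = port
            added.append(name)
    return added


# The nodes under the abnormal port RP1 were removed earlier, so only RP2/RP3 nodes remain.
existing = {"ep1": "RP2", "bridge1": "RP3"}
# A full traversal of the core finds every node again.
discovered = {
    "switch1": "RP1", "ep2": "RP1", "ep3": "RP1", "ep4": "RP1",
    "ep1": "RP2", "bridge1": "RP3",
}
added = supplement_topology(existing, discovered)
print(added)
```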
Step 307: the central processing unit activates the abnormal PCIe port.
In step 307 above, the central processing unit may send the root complex an activation instruction for the abnormal PCIe port. This drives the root complex to write the configuration space of the abnormal PCIe port and switch the port from the disabled state to the enabled state, which restores the port's ability to process data delivered by the upstream root complex or reported by downstream nodes.
Step 308: the central processing unit enables the abnormal PCIe port.
In step 308 above, the central processing unit may send the root complex an enable instruction for the abnormal PCIe port, instructing the root complex to enable service processing on that port. The root complex may re-execute the current service of the abnormal PCIe port from the beginning. It may instead continue the current service from the progress recorded when the port was deactivated, which saves processing resources of the PCIe core and improves service processing efficiency. It may also abandon the current service and simply execute the next new service when it arrives. This is not specifically limited.
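The three options for resuming service can be expressed as a small policy choice. The sketch below is only an illustration; the `ResumePolicy` names and the idea of representing progress as an integer offset are assumptions, not part of the described method.

```python
from enum import Enum


class ResumePolicy(Enum):
    RESTART = "restart"   # re-execute the current service from the beginning
    RESUME = "resume"     # continue from the progress recorded at deactivation
    SKIP = "skip"         # drop the current service and wait for the next one


def next_start_offset(policy, recorded_progress):
    """Return the offset at which the interrupted service continues, or None to skip it."""
    if policy is ResumePolicy.RESTART:
        return 0
    if policy is ResumePolicy.RESUME:
        return recorded_progress
    return None


print(next_start_offset(ResumePolicy.RESUME, 42))
```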
In Embodiment 1 above, resetting a recoverable abnormal PCIe port and its communication link restores the availability of the port in time and maintains its communication service capability without restarting the entire PCIe core. Handling the abnormal PCIe port therefore does not affect service processing on the other PCIe ports, which helps maintain the high reliability of the PCIe system. Furthermore, the solution can handle any recoverable fault of any PCIe node in the PCIe core (such as an end node, a switch node, or a bridge node), rather than being limited to end nodes or to one fault type. It can therefore detect and repair various faults in the PCIe core, further improving the reliability of the entire PCIe core.
It should be noted that Embodiment 1 describes the fault handling process with the central processing unit as the executing entity, which is only one optional implementation. In another optional implementation, the fault handling solution may be executed directly by the root complex. In that implementation, if the root complex detects that one of its internal PCIe ports is abnormal, it can reset the PCIe port and the corresponding communication link directly, without reporting to the central processing unit. Although this increases the workload of the root complex, it saves the communication overhead between the PCIe core and the central processing unit and handles faults faster.
In the embodiments of the present application, the central processing unit may run a Linux operating system. The virtual memory space of the Linux operating system is divided into kernel space and user space. Kernel space is where the kernel of the Linux operating system runs, and user space is where user programs run. Kernel space and user space are isolated from each other, so kernel space is not affected even if a user program crashes.
Based on the Linux operating system, FIG. 4 exemplarily shows the software and hardware architecture of a fault handling logic provided by an embodiment of the present application. As shown in FIG. 4, in hardware the fault handling solution can be carried out through interaction between the central processing unit and the root complex, both provided in the chip hardware. The chip hardware can also be connected to other peripherals, such as memory, input/output devices, or drive devices. The overall software logic for fault handling is encapsulated in the central processing unit and consists of reliability, availability, and serviceability (RAS) firmware, the PCIe driver, the Advanced Configuration and Power Interface (ACPI), and SerDes firmware. The programs of the RAS firmware and the SerDes firmware can be written into the memory in advance. The RAS firmware is mainly used to process interrupts, and the SerDes firmware is mainly used to rebuild links. The central processing unit can execute the method logic of the RAS firmware or the SerDes firmware by calling that firmware in the memory. ACPI defines various working interfaces between the operating system, the BIOS, and the system hardware. ACPI can be implemented in the BIOS or the system hardware and can be called or triggered by the operating system. The PCIe driver is located in the kernel space of the Linux system and is used to manage the enabling and disabling of each port in the PCIe core and the connection relationships between ports and nodes. The PCIe driver can be open-sourced to the community.
It can be understood that the memory in the embodiments of the present application may be a volatile memory or a non-volatile memory, or may include both. The non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which is used as an external cache. By way of example and not limitation, many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchlink dynamic random access memory (SLDRAM), and direct Rambus random access memory (DR RAM). It should be noted that the memory of the systems and methods described herein is intended to include, but is not limited to, these and any other suitable types of memory.
It should be understood that FIG. 4 is only an example. In other examples, the central processing unit and the root complex may be provided in different chip hardware. In addition, the memory and the central processing unit may be located in the same physical entity (such as the chip hardware) or in different physical entities.
Embodiment 2 below further describes the specific implementation of the fault handling solution from the perspective of how the above software and hardware architecture executes it.
[Embodiment 2]
FIG. 5 exemplarily shows a schematic flowchart of another fault handling method provided by an embodiment of the present application. The method applies to chip hardware, RAS firmware, ACPI, a PCIe driver, and SerDes firmware. The chip hardware is provided with a central processing unit and an abnormal PCIe port, the PCIe driver is located in kernel space, and kernel space interacts with user space. As shown in FIG. 5, the method includes:
Step 501: when the abnormal PCIe port fails, a RAS interrupt is triggered.
In step 501 above, when the fault of the abnormal PCIe port is a correctable fault, the root complex can repair it before a RAS interrupt is triggered. When the fault is an uncorrectable fault, however, the root complex cannot repair it. The fault therefore triggers the root complex to interrupt the current service of the abnormal PCIe port and to report information about this RAS interrupt to the central processing unit, such as the identifier of the abnormal PCIe port (core port ID, for example a number), the fault type (error type), and the identifier of the PCIe core where the abnormal PCIe port is located (core ID, for example a number).
Step 502: based on the RAS interrupt, the RAS firmware calls ACPI to generate a corresponding ACPI interrupt event and reports it to the PCIe driver.
In step 502 above, the central processing unit calls the RAS firmware to process the RAS interrupt, generates the corresponding ACPI interrupt event from the information about this RAS interrupt and information about the central processing unit (such as its identifier), and reports the event to the PCIe driver in kernel space.
Step 503: the PCIe driver generates fault information for the abnormal PCIe port from the ACPI interrupt event and adds the fault information to a work queue (that is, a preset work queue).
In step 503 above, the PCIe driver responds to ACPI interrupt events related to the PCIe core. It first extracts from the ACPI interrupt event the identifier of the abnormal PCIe port, the fault type, and the identifier of the PCIe core where the abnormal PCIe port is located. It then combines this information with the identifier of the chip on which the PCIe core is provided (socket ID, for example a number), assembles the fault information according to a set message structure, and adds it to the work queue. The work queue is located in kernel space and stores the fault information that arises in the PCIe core. In implementation, the central processing unit may set up a separate work queue in kernel space for each PCIe core, in which case the PCIe driver in a given central processing unit adds fault information from a PCIe core connected to that central processing unit to the work queue of that PCIe core. Alternatively, the central processing unit may set up a single work queue in kernel space for all PCIe cores, in which case the PCIe drivers in all central processing units add the fault information from their connected PCIe cores to the same work queue. As a further alternative, the central processing unit may set up one work queue for some of the PCIe cores and another work queue for the others, and so on.
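The assembly of fault information and the per-core work queue option can be sketched as below. The field names (`socket_id`, `core_id`, `core_port_id`, `error_type`) follow the identifiers mentioned in the text, but the message layout, the dictionary form of the ACPI event, and the helper names are assumptions made for illustration only.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultInfo:
    socket_id: int      # chip on which the PCIe core is provided
    core_id: int        # PCIe core containing the abnormal port
    port_id: int        # abnormal PCIe port
    error_type: str     # fault type reported with the interrupt


def build_fault_info(acpi_event, socket_id):
    """Extract the identifiers from a (hypothetical) ACPI event and assemble the message."""
    return FaultInfo(
        socket_id=socket_id,
        core_id=acpi_event["core_id"],
        port_id=acpi_event["core_port_id"],
        error_type=acpi_event["error_type"],
    )


# One work queue per PCIe core, keyed by (socket_id, core_id).
work_queues = {}


def enqueue(info):
    work_queues.setdefault((info.socket_id, info.core_id), deque()).append(info)


event = {"core_id": 1, "core_port_id": 2, "error_type": "uncorrectable"}
info = build_fault_info(event, socket_id=0)
enqueue(info)
print(work_queues[(0, 1)][0])
```

Using a single shared queue instead only requires replacing the dictionary key with a constant.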
Step 504: the PCIe driver takes a piece of fault information from the work queue and determines whether the fault type in it is a recoverable fault. If not, step 505 is performed; if so, step 506 is performed.
In step 504 above, the PCIe driver may take fault information from the work queue in first-in, first-out order, so that PCIe ports are recovered in the order in which their faults occurred, from earliest to latest. It may instead use first-in, last-out order, so that the most recent PCIe port fault is recovered first and earlier faults afterwards. It may also take fault information from the work queue in descending order of fault severity. This is not specifically limited.
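The three dequeue orders can be compared with a short sketch. The numeric severity values and the `drain` helper are hypothetical; the text does not specify how severity is measured.

```python
import heapq
from collections import deque


def drain(faults, policy):
    """Return port names in the order the given policy would process them.

    faults: list of (port, severity) tuples in arrival order.
    """
    if policy == "fifo":
        queue = deque(faults)
        return [queue.popleft()[0] for _ in range(len(faults))]
    if policy == "lifo":
        stack = list(faults)
        return [stack.pop()[0] for _ in range(len(faults))]
    if policy == "severity":
        # Negate severity to get a max-heap; the arrival index breaks ties.
        heap = [(-sev, idx, port) for idx, (port, sev) in enumerate(faults)]
        heapq.heapify(heap)
        return [heapq.heappop(heap)[2] for _ in range(len(faults))]
    raise ValueError(f"unknown policy: {policy}")


faults = [("port0", 1), ("port1", 3), ("port2", 2)]
for policy in ("fifo", "lifo", "severity"):
    print(policy, drain(faults, policy))
```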
Exemplarily, when the central processing unit includes multiple processing units, such as multiple central processing unit cores, the PCIe driver may call several of these cores to process the pieces of fault information jointly. For example, the PCIe driver may distribute the fault information evenly across the cores in advance, may call the most idle core to process the next piece of fault information, or may assign the next piece of fault information to the core best suited to the current service. This is not specifically limited.
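The "most idle core" option can be sketched as follows, assuming load is measured simply as the number of fault records already assigned to each core. The function names and this load metric are assumptions for illustration.

```python
def pick_idle_core(pending):
    """Return the core with the fewest pending faults (lowest core id on ties)."""
    return min(pending, key=lambda core: (pending[core], core))


def dispatch(fault_ids, core_ids):
    """Assign each fault to the currently least-loaded core."""
    pending = {core: 0 for core in core_ids}
    assignment = {}
    for fault_id in fault_ids:
        core = pick_idle_core(pending)
        assignment[fault_id] = core
        pending[core] += 1
    return assignment


print(dispatch(["f0", "f1", "f2", "f3"], [0, 1]))
```

With equal per-fault cost this degenerates to an even, round-robin style distribution, which corresponds to the balanced pre-allocation option mentioned in the text.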
Step 505: the PCIe driver pushes the fault information to user space.
In step 505 above, when the PCIe driver cannot recover the fault of the abnormal PCIe port by itself, pushing the fault to the user notifies the user in time so that manual maintenance can begin early, which prevents the PCIe port from remaining unavailable for a long time.
Exemplarily, each time the PCIe driver adds a piece of fault information to the work queue, it may also push a log message to user space, so that the user learns promptly of abnormal conditions in the PCIe core and of the current workload of the central processing unit.
Step 506: the PCIe driver, together with ACPI and the SerDes firmware, executes the overall fault handling logic. The execution process includes the following steps:
Step 5061: the PCIe driver calls the PCIe public interface to deactivate the abnormal PCIe port;
Step 5062: the PCIe driver calls the PCIe public interface to remove all nodes mounted on the abnormal PCIe port;
In steps 5061 and 5062 above, the PCIe public interface may be located in a public register. The methods in the public register can be open-sourced to the community and are visible to other users. The PCIe public interface can be implemented with reference to existing logic, which is not repeated here.
Step 5063: the PCIe driver calls the port reset method in the ACPI interface.
In step 5063 above, the port reset method is written in the programming language specified for the ACPI interface and is added to the ACPI interface logic as an interface. The PCIe driver can therefore execute the corresponding port reset logic by calling the interface name of the port reset method. The interface name of the port reset method can be chosen freely, for example RP reset.
Step 5064: following the port reset method, the PCIe driver first resets the MAC logic of the PCIe core;
Step 5065: following the port reset method, the PCIe driver then calls the SerDes firmware to reset the SerDes link parameters of the PCIe core;
Exemplarily, the port reset method may be located in a public register of the chip. In that case the PCIe driver calls both the PCIe public interface and the port reset method directly in the public register through in-chip messages, which reduces the number of message transmission channels between chips. Alternatively, the port reset method may be located in a private register of the chip. In that case the PCIe driver calls the PCIe public interface in the public register and then calls the port reset method in the private register through an in-chip message call, which keeps the port reset method private.
Step 5066: following the SerDes firmware, the PCIe driver resets the SerDes link parameters of the PCIe core;
In step 5066 above, the SerDes firmware may be written in any programming language, such as C++ or Python. When the SerDes link is reset, the SerDes link provided in the chip receives the reset command from the PCIe driver. It then calibrates the SerDes link parameters to the current environment and adaptively recalibrates them according to the effect of the calibrated parameters, so as to restore the SerDes link of the abnormal PCIe port in the PCIe core to the best possible state.
Exemplarily, the SerDes firmware may be located in a private register of the chip. Because the PCIe driver implements the SerDes link reset by calling the SerDes firmware from within the port reset method, the SerDes link reset method provided by the SerDes firmware remains invisible to the outside even if the port reset method is open-sourced to the community. This helps keep the SerDes link reset method as secure as possible.
Step 5067: the PCIe driver determines that the SerDes firmware call is complete and returns to the port reset method;
Step 5068: the PCIe driver determines that the port reset method call is complete and returns to the PCIe public interface;
Step 5069: the PCIe driver rebuilds the topology of the PCIe core by enumeration and traversal.
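The nesting of calls in steps 504 to 5069 can be sketched as plain functions that record each step in a trace. This is only a model of the control flow under stated assumptions: the function names are hypothetical, no real driver, ACPI, or firmware interface is invoked, and the actual hardware operations are replaced by trace entries.

```python
def serdes_firmware_reset(port, trace):
    # Private SerDes firmware: recalibrates the link parameters for the port.
    trace.append("5066 reset SerDes link parameters")


def acpi_port_reset(port, trace):
    # Port reset method exposed through the ACPI interface.
    trace.append("5064 reset MAC logic")
    trace.append("5065 call SerDes firmware")
    serdes_firmware_reset(port, trace)
    trace.append("5067 SerDes firmware returned")


def handle_fault(port, recoverable, trace):
    """Return True if the recovery sequence ran, False if the fault was escalated."""
    if not recoverable:
        trace.append("505 push fault information to user space")
        return False
    trace.append("5061 deactivate port")
    trace.append("5062 remove attached nodes")
    trace.append("5063 call ACPI port reset method")
    acpi_port_reset(port, trace)
    trace.append("5068 port reset method returned")
    trace.append("5069 rebuild topology by enumeration")
    return True


trace = []
handle_fault("RP1", True, trace)
print("\n".join(trace))
```

Keeping the SerDes reset behind the ACPI method mirrors the layering in the text, where the driver never calls the SerDes firmware directly from the public interface.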
In Embodiment 2 above, through the cooperation of the chip hardware, the chip firmware (including the RAS firmware and the SerDes firmware), the PCIe driver, and ACPI, the central processing unit can detect faults on a PCIe port, reset the port, and restore its services by combining software and hardware. In this method, recovering from a fault on one PCIe port does not affect the normal services of the other PCIe ports. Moreover, even if the PCIe fault recovery driver is open-sourced to the community, neither the SerDes firmware placed in the chip's private register, nor the port reset method together with the SerDes firmware, is exposed. Protecting the port reset logic as far as possible reduces the probability that the reset process of an abnormal PCIe port is disturbed from outside and improves the accuracy of the reset.
According to the foregoing method, FIG. 6 is a schematic structural diagram of a fault handling apparatus 600 provided by an embodiment of the present application. The fault handling apparatus 600 may be a chip or a circuit, for example a chip or circuit that can be provided in a central processing unit. The fault handling apparatus 600 may correspond to the central processing unit in the above method and may implement the steps of any one or more of the methods shown in FIG. 2 to FIG. 5. As shown in FIG. 6, the fault handling apparatus 600 may include a monitoring circuit 601 and a processing circuit 602. The fault handling apparatus 600 may further include a bus system, through which the monitoring circuit 601 and the processing circuit 602 may be connected. The monitoring circuit 601 may also be connected through the bus system to each PCIe port in the PCIe core, and the processing circuit 602 may also be connected through the bus system to the root complex in the PCIe core.
In this embodiment of the present application, the monitoring circuit 601 may receive the abnormal interrupt information reported by the abnormal PCIe port and send it to the processing circuit 602. Correspondingly, the processing circuit 602 may first determine the fault type of the abnormal PCIe port from the abnormal interrupt information and, when that fault type is a recoverable fault, reset the abnormal PCIe port and its communication link to restore the connectivity between the abnormal PCIe port and the PCIe nodes.
For the concepts, explanations, detailed descriptions, and other steps of the fault handling apparatus 600 that relate to the technical solutions provided by the embodiments of the present application, refer to the descriptions in the foregoing methods or other embodiments; they are not repeated here.
According to the foregoing method, FIG. 7 is a schematic structural diagram of another fault handling apparatus 700 provided by an embodiment of the present application. The fault handling apparatus 700 may be a chip or a circuit, for example a chip or circuit that can be provided in a central processing unit. The fault handling apparatus 700 may correspond to the central processing unit in the above method and may implement the steps of any one or more of the methods shown in FIG. 2 to FIG. 5. As shown in FIG. 7, the fault handling apparatus 700 may include a communication interface 701, a determining unit 702, and a processing unit 703.
In this embodiment of the present application, the communication interface 701 may act as a receiving unit or receiver when receiving information, and this receiving unit or receiver may be a radio frequency circuit. In a specific implementation, the communication interface 701 may receive the abnormal interrupt information reported by the abnormal PCIe port, and the determining unit 702 may determine the fault type of the abnormal PCIe port from that information. When the fault type is a recoverable fault, the processing unit 703 may reset the abnormal PCIe port and its communication link to restore the connectivity between the abnormal PCIe port and the PCIe nodes.
For the concepts, explanations, detailed descriptions, and other steps of the fault handling apparatus 700 that relate to the technical solutions provided by the embodiments of the present application, refer to the descriptions in the foregoing methods or other embodiments; they are not repeated here.
It can be understood that, for the functions of the units in the fault handling apparatus 700, reference may be made to the implementation of the corresponding method embodiments, which is not repeated here.
It should be understood that the division of the fault handling apparatus 700 into units is only a division of logical functions. In an actual implementation the units may be fully or partly integrated into one physical entity, or may be physically separate. In this embodiment of the present application, the communication interface 701 may be implemented by the monitoring circuit 601 of FIG. 6, and the determining unit 702 and the processing unit 703 may be implemented by the processing circuit 602 of FIG. 6.
According to the method provided by the embodiments of the present application, the present application further provides a fault handling system that includes the central processing unit and the PCIe core described in any of the foregoing content. The PCIe core includes a root complex and at least one PCIe node, and the root complex is connected to downstream PCIe nodes through at least one PCIe port provided inside it. The central processing unit may execute the method of any one of the embodiments shown in FIG. 1 to FIG. 5 to handle a fault on an abnormal PCIe port among the at least one PCIe port.
According to the method provided by the embodiments of the present application, the present application further provides a computer program product that includes computer program code. When the computer program code runs on a computer, the computer is caused to execute the method of any one of the embodiments shown in FIG. 1 to FIG. 5.
According to the method provided by the embodiments of the present application, the present application further provides a computer-readable storage medium that stores program code. When the program code runs on a computer, the computer is caused to execute the method of any one of the embodiments shown in FIG. 1 to FIG. 5.
The terms "component", "module", "system", and the like used in this specification refer to a computer-related entity, hardware, firmware, a combination of hardware and software, software, or software in execution. For example, a component may be, but is not limited to, a process running on a processor, a processor, an object, an executable file, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a computing device and the computing device itself may be components. One or more components may reside within a process and/or thread of execution, and a component may be located on one computer and/or distributed between two or more computers. In addition, these components can execute from various computer-readable media that have various data structures stored on them. Components may communicate through local and/or remote processes, for example according to a signal having one or more data packets (for example, data from two components interacting with another component in a local system, in a distributed system, and/or across a network, such as the Internet, that interacts with other systems by way of the signal).
Those of ordinary skill in the art will appreciate that the various illustrative logical blocks and steps described in connection with the embodiments disclosed herein can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or in software depends on the specific application and the design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each particular application, but such an implementation should not be considered to go beyond the scope of this application.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above may be understood by referring to the corresponding processes in the foregoing method embodiments. Details are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus, and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative. The division into units is merely a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in another form.
The units described as separate components may or may not be physically separate, and components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.
When the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of this application essentially, or the part contributing to the prior art, or a part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of this application. The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing descriptions are merely specific implementations of this application, but the protection scope of this application is not limited thereto. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (22)

  1. A fault handling method, characterized in that the method comprises:
    obtaining abnormal interrupt information corresponding to an abnormal peripheral component interconnect express (PCIe) port;
    determining, according to the abnormal interrupt information, a fault type corresponding to the abnormal PCIe port; and
    when the fault type corresponding to the abnormal PCIe port is a recoverable fault, resetting the abnormal PCIe port and a communication link of the abnormal PCIe port;
    wherein the communication link of the abnormal PCIe port is used to connect the abnormal PCIe port and a PCIe node.
  2. The method according to claim 1, characterized in that the PCIe node is an end node, a switch node, or a bridge node.
  3. The method according to claim 1 or 2, characterized in that the resetting the abnormal PCIe port comprises:
    resetting the medium access control (MAC) layer logic of the PCIe core in which the abnormal PCIe port is located.
  4. The method according to any one of claims 1 to 3, characterized in that the resetting the communication link of the abnormal PCIe port comprises:
    resetting the serializer/deserializer (SerDes) link parameters corresponding to the PCIe core in which the abnormal PCIe port is located.
  5. The method according to claim 4, characterized in that:
    before the resetting the serializer/deserializer (SerDes) link parameters corresponding to the PCIe core in which the abnormal PCIe port is located, the method further comprises:
    disconnecting the communication link between the abnormal PCIe port and the PCIe node mounted on the abnormal PCIe port; and
    after the resetting the serializer/deserializer (SerDes) link parameters corresponding to the PCIe core in which the abnormal PCIe port is located, the method further comprises:
    re-establishing the communication link between the abnormal PCIe port and the PCIe node mounted on the abnormal PCIe port.
  6. The method according to any one of claims 1 to 5, characterized in that the method further comprises:
    when the fault type corresponding to the abnormal PCIe port is an unrecoverable fault, disabling the abnormal PCIe port and the communication link of the abnormal PCIe port.
  7. The method according to any one of claims 1 to 6, characterized in that the recoverable fault comprises one or more of the following:
    a data link layer packet transmission timeout error, an error of too many retries of a transaction layer packet write to configuration space, a two-bit data error, and a response error of an Advanced eXtensible Interface (AXI) bus.
  8. The method according to any one of claims 1 to 7, characterized in that the abnormal interrupt information corresponding to the abnormal PCIe port comprises one or more of the following:
    an identifier of the abnormal PCIe port, the fault type corresponding to the abnormal PCIe port, an identifier of the PCIe core in which the abnormal PCIe port is located, and an identifier of the central processing unit (CPU) to which the PCIe core is connected.
  9. The method according to any one of claims 1 to 8, characterized in that the obtaining abnormal interrupt information corresponding to an abnormal peripheral component interconnect express (PCIe) port comprises:
    obtaining the abnormal interrupt information corresponding to the abnormal PCIe port from a preset work queue;
    wherein the preset work queue is used to store abnormal interrupt information corresponding to abnormal PCIe ports in each PCIe core.
  10. A fault handling apparatus, characterized by comprising a processor and a memory, wherein the memory stores a computer program;
    the processor performs the following operations by invoking the computer program stored in the memory:
    obtaining abnormal interrupt information corresponding to an abnormal peripheral component interconnect express (PCIe) port;
    determining, according to the abnormal interrupt information, a fault type corresponding to the abnormal PCIe port; and
    when the fault type corresponding to the abnormal PCIe port is a recoverable fault, resetting the abnormal PCIe port and a communication link of the abnormal PCIe port;
    wherein the communication link of the abnormal PCIe port is used to connect the PCIe port and a PCIe device.
  11. The apparatus according to claim 10, characterized in that the PCIe node is an end node, a switch node, or a bridge node.
  12. The apparatus according to claim 10 or 11, characterized in that the fault handling apparatus further comprises an advanced configuration and power interface (ACPI);
    the processor specifically performs the following operation by invoking the computer program stored in the memory:
    by invoking the ACPI, resetting the medium access control (MAC) layer logic of the PCIe core in which the abnormal PCIe port is located.
  13. The apparatus according to claim 12, characterized in that the fault handling apparatus further comprises serializer/deserializer (SerDes) firmware;
    the processor further performs the following operations by invoking the computer program stored in the memory:
    by invoking the ACPI, after resetting the medium access control (MAC) layer logic of the PCIe core in which the abnormal PCIe port is located, invoking the SerDes firmware; and
    by invoking the SerDes firmware, resetting the SerDes link parameters corresponding to the PCIe core in which the abnormal PCIe port is located.
  14. The apparatus according to claim 13, characterized in that the fault handling apparatus further comprises a PCIe driver;
    the processor further performs the following operations by invoking the computer program stored in the memory:
    by invoking the PCIe driver, disconnecting the communication link between the abnormal PCIe port and the PCIe node mounted on the abnormal PCIe port, and invoking the ACPI;
    by invoking the SerDes firmware, after resetting the SerDes link parameters corresponding to the PCIe core in which the abnormal PCIe port is located, returning to invoke the PCIe driver; and
    after returning to invoke the PCIe driver, re-establishing the communication link between the abnormal PCIe port and the PCIe node mounted on the abnormal PCIe port.
  15. The apparatus according to claim 14, characterized in that the memory comprises a public register and a private register;
    the PCIe driver and the ACPI are stored in the public register, and the SerDes firmware is stored in the private register; or
    the PCIe driver is stored in the public register, and the ACPI and the SerDes firmware are stored in the private register.
  16. The apparatus according to any one of claims 10 to 15, characterized in that the processor further performs the following operation by invoking the computer program stored in the memory:
    when the fault type corresponding to the abnormal PCIe port is an unrecoverable fault, disabling the abnormal PCIe port and the communication link of the abnormal PCIe port.
  17. The apparatus according to any one of claims 10 to 16, characterized in that the recoverable fault comprises one or more of the following:
    a data link layer packet transmission timeout error, an error of too many retries of a transaction layer packet write to configuration space, a two-bit data error, and a response error of an Advanced eXtensible Interface (AXI) bus.
  18. The apparatus according to any one of claims 10 to 17, characterized in that the abnormal interrupt information corresponding to the abnormal PCIe port comprises one or more of the following:
    an identifier of the abnormal PCIe port, the fault type corresponding to the abnormal PCIe port, an identifier of the PCIe core in which the abnormal PCIe port is located, and an identifier of the central processing unit (CPU) to which the PCIe core is connected.
  19. The apparatus according to any one of claims 10 to 18, characterized by further comprising a communication interface;
    the processor specifically performs the following operations by invoking the computer program stored in the memory:
    receiving, through the communication interface, the abnormal interrupt information corresponding to the abnormal PCIe port;
    adding the abnormal interrupt information corresponding to the abnormal PCIe port to a preset work queue, wherein the preset work queue is used to store abnormal interrupt information corresponding to abnormal PCIe ports in each PCIe core; and
    obtaining the abnormal interrupt information corresponding to the abnormal PCIe port from the preset work queue.
  20. A fault handling system, characterized by comprising a central processing unit and a peripheral component interconnect express (PCIe) core, wherein the PCIe core comprises a root complex and at least one PCIe node, and the central processing unit is connected to the root complex; the root complex comprises at least one PCIe port, and the root complex is connected to the at least one PCIe node through the at least one PCIe port;
    the root complex is configured to generate abnormal interrupt information corresponding to an abnormal PCIe port among the at least one PCIe port and to report the abnormal interrupt information to the central processing unit; and
    the central processing unit is configured to perform fault handling on the abnormal PCIe port according to the fault handling method according to any one of claims 1 to 9.
  21. A computer-readable storage medium, characterized in that the computer-readable storage medium stores program code, and when the program code is run on a computer, the computer is caused to perform the method according to any one of claims 1 to 9.
  22. A computer program product, characterized by comprising computer program code, wherein when the computer program code is run on a computer, the computer is caused to perform the method according to any one of claims 1 to 9.
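
The control flow recited in claims 1 and 3 to 9 can be sketched in Python as follows. This is only an illustrative sketch, not the applicant's implementation: every name in it (`FaultType`, `InterruptInfo`, `handle_fault`, `drain_work_queue`, and the action strings) is hypothetical, the hardware operations are modelled as plain strings rather than register accesses, and the `OTHER` member is an assumed stand-in for any fault outside the recoverable list of claim 7. The ordering of the recovery steps (disconnect, MAC reset, SerDes parameter reset, link re-establishment) follows claims 5, 13, and 14.

```python
from collections import deque
from dataclasses import dataclass
from enum import Enum, auto


class FaultType(Enum):
    # The four recoverable fault kinds listed in claim 7.
    DLLP_TIMEOUT = auto()              # data link layer packet transmission timeout
    CFG_WRITE_RETRY_EXCEEDED = auto()  # too many retries of a TLP write to config space
    TWO_BIT_DATA_ERROR = auto()
    AXI_RESPONSE_ERROR = auto()
    # Hypothetical catch-all for faults treated as unrecoverable (assumption).
    OTHER = auto()


RECOVERABLE_FAULTS = frozenset({
    FaultType.DLLP_TIMEOUT,
    FaultType.CFG_WRITE_RETRY_EXCEEDED,
    FaultType.TWO_BIT_DATA_ERROR,
    FaultType.AXI_RESPONSE_ERROR,
})


@dataclass(frozen=True)
class InterruptInfo:
    """Fields of the abnormal interrupt information named in claim 8."""
    port_id: int
    fault_type: FaultType
    core_id: int
    cpu_id: int


def handle_fault(info: InterruptInfo) -> list[str]:
    """Return the ordered actions for one abnormal PCIe port."""
    if info.fault_type in RECOVERABLE_FAULTS:
        # Recoverable: reset the port and its link (claims 1, 3, 4, 5).
        return [
            "disconnect_link",      # before the SerDes reset (claim 5)
            "reset_mac",            # MAC logic of the PCIe core (claim 3)
            "reset_serdes_params",  # SerDes link parameters (claim 4)
            "rebuild_link",         # after the SerDes reset (claim 5)
        ]
    # Unrecoverable: disable the port and its link (claim 6).
    return ["disable_port", "disable_link"]


def drain_work_queue(queue: deque) -> list[tuple[int, list[str]]]:
    """Take records from the preset work queue (claim 9) in FIFO order."""
    results = []
    while queue:
        info = queue.popleft()
        results.append((info.port_id, handle_fault(info)))
    return results
```

In this sketch an interrupt handler would append `InterruptInfo` records to the `deque`, and a separate worker would call `drain_work_queue`; decoupling the two through a queue is one plausible reading of the "preset work queue" in claims 9 and 19.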
PCT/CN2021/073396 2021-01-22 2021-01-22 Fault handling method and apparatus, and system WO2022155919A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2021/073396 WO2022155919A1 (en) 2021-01-22 2021-01-22 Fault handling method and apparatus, and system
CN202180090841.4A CN116724297A (en) 2021-01-22 2021-01-22 Fault processing method, device and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/073396 WO2022155919A1 (en) 2021-01-22 2021-01-22 Fault handling method and apparatus, and system

Publications (1)

Publication Number Publication Date
WO2022155919A1 true WO2022155919A1 (en) 2022-07-28

Family

ID=82548392

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/073396 WO2022155919A1 (en) 2021-01-22 2021-01-22 Fault handling method and apparatus, and system

Country Status (2)

Country Link
CN (1) CN116724297A (en)
WO (1) WO2022155919A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120144231A1 (en) * 2009-03-26 2012-06-07 Nobuo Yagi Arrangements detecting reset pci express bus in pci express path, and disabling use of pci express device
CN103440188A (en) * 2013-08-29 2013-12-11 福建星网锐捷网络有限公司 Method and device for detecting PCIE hardware faults
CN103618618A (en) * 2013-11-13 2014-03-05 福建星网锐捷网络有限公司 Line card fault recovery method and related device based on distributed PCIE system
CN109815043A (en) * 2019-01-25 2019-05-28 华为技术有限公司 Fault handling method, relevant device and computer storage medium
CN110457164A (en) * 2019-07-08 2019-11-15 华为技术有限公司 The method, apparatus and server of equipment management

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115470048A (en) * 2022-10-28 2022-12-13 摩尔线程智能科技(北京)有限责任公司 Reset method, system-on-chip, electronic device and storage medium
CN115470048B (en) * 2022-10-28 2023-01-17 摩尔线程智能科技(北京)有限责任公司 Reset method, system-on-chip, electronic device and storage medium
CN116582471A (en) * 2023-07-14 2023-08-11 珠海星云智联科技有限公司 PCIE equipment, PCIE data capturing system and server
CN116582471B (en) * 2023-07-14 2023-09-19 珠海星云智联科技有限公司 PCIE equipment, PCIE data capturing system and server

Also Published As

Publication number Publication date
CN116724297A (en) 2023-09-08

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21920317

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 202180090841.4

Country of ref document: CN

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21920317

Country of ref document: EP

Kind code of ref document: A1