WO2012167461A1 - Method and system for realizing interconnection fault-tolerance between cpus - Google Patents

Method and system for realizing interconnection fault-tolerance between CPUs

Publication number
WO2012167461A1
Authority
WO
WIPO (PCT)
Prior art keywords
link
fpga
connection
interface module
control logic
Prior art date
Application number
PCT/CN2011/076471
Other languages
French (fr)
Chinese (zh)
Inventor
常胜
王海彬
张�杰
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to PCT/CN2011/076471 (WO2012167461A1)
Priority to CN201180001259.2A (CN102763087B)
Priority to US13/707,188 (US8909979B2)
Publication of WO2012167461A1 publication Critical patent/WO2012167461A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1415 Saving, restoring, recovering or retrying at system level
    • G06F11/1443 Transmit or communication errors

Definitions

  • the present invention relates to the field of communications technologies, and in particular, to a method and system for implementing fault tolerance between CPUs.
  • PCB: Printed Circuit Board
  • NC: Node Controller
  • HP uses NC node controllers and switch modules to interconnect the CPUs, which makes the overall interconnect architecture complex.
  • the solution adds two chips to the system, which implement the NC node control function and the switch module function respectively. Because the solution uses switch modules to exchange data between NCs, each switch module has to make a hop decision, which increases data transmission delay, lowers system performance, and raises cost.
  • the current CPU interconnection schemes therefore have poor scalability, long data transmission delay, and low system performance.
  • in addition, among the links that implement the CPU interconnection, an error on any single link may cause the interconnection between the CPUs involved to become abnormal, and no prior art provides a fault-tolerance solution for the interconnection between CPUs.
  • the present invention solves the above technical problems in the background art, and proposes a method and system for implementing fault tolerance between CPUs, which can improve the scalability of interconnection between CPUs and realize fault tolerance between CPUs.
  • An embodiment of the present invention provides a method for implementing interconnection fault tolerance between CPUs, where the method includes: a first CPU is connected to a first QuickPath Interconnect (QPI) interface module of a first field-programmable gate array (FPGA), and a second CPU is connected to a second QPI interface module of a second FPGA
  • a first serializer/deserializer (SerDes) interface module of the first FPGA is connected to a second SerDes interface module of the second FPGA, and is connected to the first QPI interface module through a first control logic module
  • the second SerDes interface module of the second FPGA is connected to the second QPI interface module through a second control logic module, to implement the interconnection between the first CPU and the second CPU; wherein a data channel for transmitting link connection state information and link control signals is added to each of the first SerDes interface module and the second SerDes interface module; and the control logic module between the corresponding QPI interface module and SerDes interface module in each of the first FPGA and the second FPGA is configured to monitor the state of the transmission link connection between the peer FPGA and its corresponding CPU, and to control the state of the transmission link connection between the local FPGA and its corresponding CPU;
  • when any connection link that implements the interconnection between the first CPU and the second CPU fails, the FPGA connected to the faulty link sends a link control signal to the faulty link through its added data channel to restore the faulty link to the normal state;
  • when the faulty link is restored to the normal state, each FPGA enables the normal-state links connected to it, and the links that implement the interconnection between the first CPU and the second CPU are connected.
  • A system for implementing interconnection fault tolerance between CPUs, comprising at least a first CPU, a second CPU, a first FPGA, and a second FPGA
  • the first CPU is connected to the first QPI interface module of the first FPGA
  • the second CPU is connected to the second QPI interface module of the second FPGA
  • the first SerDes interface module of the first FPGA is connected to the second SerDes interface module of the second FPGA, and is connected to the first QPI interface module through the first control logic module
  • the second SerDes interface module of the second FPGA is connected to the second QPI interface module through the second control logic module, to implement the interconnection between the first CPU and the second CPU
  • a data channel for transmitting link connection state information and link control signals is added to each of the first SerDes interface module and the second SerDes interface module
  • the control logic module between the corresponding QPI interface module and SerDes interface module in each of the first FPGA and the second FPGA is configured to monitor the state of the transmission link connection between the peer FPGA and its corresponding CPU, and to control the state of the transmission link connection between the local FPGA and its corresponding CPU;
  • when any connection link that implements the interconnection between the first CPU and the second CPU fails, the FPGA connected to the faulty link sends a link control signal to the faulty link through its added data channel to restore the faulty link to the normal state;
  • when the faulty link returns to the normal state, each FPGA enables the normal-state links connected to it, and the links that implement the interconnection between the first CPU and the second CPU are connected.
  • because the interconnection between CPUs is implemented through FPGAs, the number of interconnected CPUs can be increased or decreased by increasing or decreasing the number of dedicated FPGAs. Therefore, the scalability of the interconnection between CPUs can be improved. In addition, because a data channel is added on each FPGA, when any connection link between the interconnected CPUs fails, the connection state information of each link and the link control signals are transmitted through the data channel, which achieves interconnection fault tolerance between CPUs.
  • FIG. 1 is a schematic structural diagram of an FPGA for implementing interconnection between CPUs according to an embodiment of the present invention
  • FIG. 2 is a schematic structural diagram of implementing CPU interconnection by using the FPGA of FIG. 1;
  • FIG. 3 is a schematic diagram of link connection involved in the interconnection architecture in FIG. 2.
  • FIG. 1 is a schematic structural diagram of an FPGA for implementing interconnection between CPUs according to an embodiment of the present invention.
  • FPGA: Field-Programmable Gate Array
  • QPI: QuickPath Interconnect
  • SerDes: Serializer/Deserializer
  • the QPI interface module 11 of the FPGA is connected to the QPI interface of the CPU and is responsible for high-speed data transmission with the CPU; a control logic module 13 is disposed between the QPI interface module 11 and the SerDes interface module 12; the SerDes interface module 12 is connected to the control logic module 13 and, through a high-speed cable (CXP cable), to the SerDes interface module 12 on another FPGA.
  • FIG. 2 shows the structure of the CPU interconnect using the above FPGA.
  • when the links interconnecting the first CPU and the second CPU are in a normal state, the QPI interface module 11 on the FPGA converts the serial QPI data sent by the CPU into parallel QPI data. This conversion lowers the frequency of the QPI data to match the data processing frequency inside the FPGA.
  • the SerDes interface module 12 on the FPGA converts the parallel QPI data received from the QPI interface module 11 into high-speed serial SerDes data and sends it to the peer CPU through the SerDes interface module 12 on the other FPGA; it also receives the high-speed serial SerDes data sent by the SerDes interface module on the peer FPGA, converts it into parallel QPI data, and sends it to the CPU connected to its own FPGA.
  • because the SerDes interface module converts the DC-characteristic QPI data interface, which does not support long-distance cable interconnection and topology, into an AC-characteristic SerDes interface, long-distance high-speed cable interconnection and topology are supported, which enables high-speed interconnection of processors across boards.
  • at least one data channel is added to the original data channels in the SerDes interface module. Unlike the original data channels, the added data channel is not used for data transmission between the interconnected CPUs; it is used to transmit, between the FPGAs, the connection state information of each link and the link control signals.
  • a control logic module is disposed between the QPI interface module and the SerDes interface module in the FPGA. It monitors the state of the transmission link connection between the peer FPGA and its corresponding CPU, and controls the state of the transmission link connection between the local FPGA and its corresponding CPU.
  • an embodiment of the present invention provides an implementation method for interconnecting fault tolerance between CPUs.
  • the first CPU is connected to the first QPI interface module of the first FPGA (FPGA0), the second CPU is connected to the second QPI interface module of the second FPGA (FPGA1), and the first SerDes interface module of the first FPGA is connected to the second SerDes interface module of the second FPGA and, through the first control logic module, to the first QPI interface module
  • the second SerDes interface module of the second FPGA is connected to the second QPI interface module through the second control logic module, to implement the interconnection between the first CPU and the second CPU; wherein a first data channel for transmitting link connection state information and link control signals is added to the first SerDes interface module, and a second data channel for transmitting link connection state information and link control signals is added to the second SerDes interface module; the control logic module between the corresponding QPI interface module and SerDes interface module in each of the first FPGA and the second FPGA is configured to monitor the state of the transmission link connection between the peer FPGA and its corresponding CPU, and to control the state of the transmission link connection between the local FPGA and its corresponding CPU
  • the interconnect architecture in the embodiment of the present invention involves the QPI link between CPU0 and FPGA0, the high-speed SerDes link between FPGA0 and FPGA1, and the QPI link between FPGA1 and CPU1. If any of these three links fails, the interconnection between CPU0 and CPU1 becomes abnormal.
  • when any connection link that implements the interconnection between the first CPU and the second CPU fails, the FPGA connected to the faulty link sends a link control signal to the faulty link through its added data channel to restore the faulty link to the normal state;
  • when the faulty link is restored to the normal state, each FPGA enables the normal-state links connected to it, and the links that implement the interconnection between the first CPU and the second CPU are connected.
  • because the interconnection between CPUs is implemented through FPGAs, the number of interconnected CPUs can be increased or decreased by increasing or decreasing the number of dedicated FPGAs. Therefore, the scalability of the interconnection between CPUs can be improved. In addition, because a data channel is added on each FPGA, when any connection link between the interconnected CPUs fails, the connection state information of each link and the link control signals are transmitted through the data channel, which achieves interconnection fault tolerance between CPUs.
  • FIG. 3 is a schematic diagram of the links involved in the interconnection architecture in the embodiment of the present invention.
  • the corresponding links include: the QPI link between CPU0 and FPGA0 (the A link), the high-speed SerDes link between FPGA0 and FPGA1 (the B link), and the QPI link between FPGA1 and CPU1 (the C link).
  • the fault-tolerant solution in the embodiment of the present invention handles an abnormal state on any of the A, B, and C links.
  • when the control logic module in the first FPGA or the second FPGA detects that the B link between the first SerDes interface module and the second SerDes interface module has failed, the control logic modules in the first FPGA and the second FPGA send link control signals to the B link through their respective added data channels to restore the B link to the normal state. At the same time, the first control logic module in the first FPGA, using the data channel added in the first SerDes interface module, keeps the A link between the first QPI interface module and the first CPU in a reset state, and the second control logic module in the second FPGA, using the data channel added in the second SerDes interface module, keeps the C link between the second QPI interface module and the second CPU in a reset state, while waiting for the B link to be re-established.
  • when the B link returns to normal, the first control logic module controls the connection of the A link through the data channel added in the first SerDes interface module; meanwhile, the second control logic module controls the connection of the C link through the data channel added in the second SerDes interface module.
  • the first FPGA controls the QPI initialization process of the A link and, at the same time, the second FPGA controls the QPI initialization process of the C link, which connects the A and C links.
  • the first CPU and the second CPU can then start normal link communication, which completes the interconnection between the first CPU and the second CPU.
  • when the first control logic module detects that the A link is faulty and the second control logic module detects that the C link is faulty, the first control logic module sends the fault information of the A link to the second FPGA through the data channel added in the first SerDes interface module; at the same time, the second control logic module sends the fault information of the C link to the first FPGA through the data channel added in the second SerDes interface module.
  • that is, the first FPGA and the second FPGA exchange their local QPI link states through their respective added data channels;
  • whichever control logic module of the first FPGA and the second FPGA first receives the fault signal of a connection link sends a link control signal, through its added data channel, to the connection link to its corresponding CPU to restore that link to the normal state; it also sends a link control signal to the peer FPGA through its added data channel to control the recovery operation that the control logic module of the peer FPGA performs on its corresponding connection link.
  • the first FPGA and the second FPGA exchange link control operation signals so that the A and C links each enter the initialization process and are re-established.
  • the first control logic module can detect the B link abnormality and therefore keeps the A link in the reset state through the data channel added in the first SerDes interface module; meanwhile, the second control logic module can also detect the B link abnormality and therefore keeps the C link in the reset state through the data channel added in the second SerDes interface module, waiting for the B link to be re-established;
  • the first control logic module and the second control logic module send link control signals to the B link through their respective added data channels to restore the B link to the normal state; once the B link returns to normal, the first control logic module controls the connection of the A link through the data channel added in the first SerDes interface module, and the second control logic module controls the connection of the C link through the data channel added in the second SerDes interface module.
  • the failure of the A link and the B link are normal, and the B link is taken as an example.
  • the interconnection fault tolerance scheme between the first CPU and the second CPU is similar to this embodiment.
  • when the A link fails and the B and C links are normal, the first control logic module sends the fault information of the A link to the second FPGA through the data channel added in the first SerDes interface module, and sends a link control signal to the second FPGA, so that the second control logic module puts the C link in a reset state through the data channel added in the second SerDes interface module; the first control logic module then continues to send link control signals to the second FPGA through the data channel added in the first SerDes interface module, so that the first control logic module controls the connection of the A link through the data channel added in the first SerDes interface module while the second control logic module controls the connection of the C link through the data channel added in the second SerDes interface module, which re-establishes the A link.
  • the interconnection fault-tolerance scheme between the first CPU and the second CPU when the C link fails and the A and B links are normal is similar to this embodiment.
  • the FPGA-based transparent transmission function implements the interconnection architecture between CPUs through a high-speed cable connection, and the fault-tolerant solution of the embodiment of the present invention handles the fault states that may occur on the interconnection links between the interconnected CPUs, so that a faulty link returns to the normal state in time and the interconnected CPUs maintain a stable working state.
  • the embodiment of the present invention further provides a system for implementing interconnection fault tolerance between CPUs, where the system includes at least a first CPU, a second CPU, a first FPGA, and a second FPGA; the first CPU is connected to the first QPI interface module of the first FPGA, and the second CPU is connected to the second QPI interface module of the second FPGA
  • the first SerDes interface module of the first FPGA is connected to the second SerDes interface module of the second FPGA, and is connected to the first QPI interface module through the first control logic module.
  • the second SerDes interface module of the second FPGA is connected to the second QPI interface module through the second control logic module, to implement the interconnection between the first CPU and the second CPU; wherein a data channel for transmitting link connection state information and link control signals is added to each of the first SerDes interface module and the second SerDes interface module; and the control logic module between the corresponding QPI interface module and SerDes interface module in each of the first FPGA and the second FPGA is configured to monitor the state of the transmission link connection between the peer FPGA and its corresponding CPU, and to control the state of the transmission link connection between the local FPGA and its corresponding CPU;
  • when any connection link that implements the interconnection between the first CPU and the second CPU fails, the FPGA connected to the faulty link sends a link control signal to the faulty link through its added data channel to restore the faulty link to the normal state;
  • when the faulty link returns to the normal state, each FPGA enables the normal-state links connected to it, and the links that implement the interconnection between the first CPU and the second CPU are connected.
  • because the interconnection between CPUs is implemented through FPGAs, the number of interconnected CPUs can be increased or decreased by increasing or decreasing the number of dedicated FPGAs. Therefore, the scalability of the interconnection between CPUs can be improved. In addition, because a data channel is added on each FPGA, when any connection link between the interconnected CPUs fails, the connection state information of each link and the link control signals are transmitted through the data channel, which achieves interconnection fault tolerance between CPUs.
  • when the control logic module in the first FPGA and/or the second FPGA detects that the second connection link between the first SerDes interface module and the second SerDes interface module has failed
  • the control logic modules in the first FPGA and the second FPGA are configured to send link control signals to the second connection link through their respective added data channels to restore the second connection link to the normal state;
  • the first control logic module in the first FPGA, using the data channel added in the first SerDes interface module, keeps the first connection link between the first QPI interface module and the first CPU in a reset state
  • the second control logic module in the second FPGA, using the data channel added in the second SerDes interface module, keeps the third connection link between the second QPI interface module and the second CPU in a reset state.
  • when the second connection link returns to normal, the first control logic module controls the connection of the first connection link through the data channel added in the first SerDes interface module.
  • the second control logic module controls the connection of the third connection link through the data channel added in the second SerDes interface module.
  • when the first control logic module detects that the first connection link is faulty and the second control logic module detects that the third connection link is faulty
  • the first control logic module is configured to send the fault information of the first connection link to the second FPGA through the data channel added in the first SerDes interface module; meanwhile, the second control logic module sends the fault information of the third connection link to the first FPGA through the data channel added in the second SerDes interface module;
  • whichever control logic module of the first FPGA and the second FPGA first receives the fault signal of a connection link sends a link control signal, through its added data channel, to the connection link to its corresponding CPU to restore that link to the normal state; it also sends a link control signal to the peer FPGA through its added data channel to control the recovery operation that the control logic module of the peer FPGA performs on its corresponding connection link.
  • the first (second) control logic module is configured to keep the first (third) connection link in a reset state through the data channel added in the first (second) SerDes interface module; meanwhile, the second (first) control logic module keeps the third (first) connection link in a reset state through the data channel added in the second (first) SerDes interface module, and the first control logic module and the second control logic module send link control signals to the second connection link through their respective added data channels to restore the second connection link to the normal state.
  • when the second connection link returns to normal, the first (second) control logic module controls the connection of the first (third) connection link through the data channel added in the first (second) SerDes interface module, and the second (first) control logic module controls the connection of the third (first) connection link through the data channel added in the second (first) SerDes interface module.
  • the first (second) control logic module is configured to send the fault information of the first (third) connection link to the second (first) FPGA through the data channel added in the first (second) SerDes interface module, and to send a link control signal to the second (first) FPGA, so that the second (first) control logic module puts the third (first) connection link in a reset state through the data channel added in the second (first) SerDes interface module;
  • the first (second) control logic module is further configured to continue sending link control signals to the second (first) FPGA through the data channel added in the first (second) SerDes interface module, so that the first control logic module controls the connection of the first connection link through the data channel added in the first SerDes interface module, and the second control logic module controls the connection of the third connection link through the data channel added in the second SerDes interface module.
  • the FPGA-based transparent transmission function implements the interconnection architecture between inter-board CPUs through a high-speed cable connection, and the fault-tolerant solution of the embodiment of the present invention handles the fault states that may occur on the interconnection links between the interconnected CPUs, so that a faulty link returns to the normal state in time and the interconnected CPUs maintain a stable working state.
  • the description is relatively simple, and the relevant parts can be referred to the description of the method embodiment.
  • the modules described as separate components may or may not be physically separate, and the components displayed as modules may or may not be physical modules; that is, they may be located in one place or distributed across multiple network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. Those of ordinary skill in the art can understand and implement the embodiment without creative effort.
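
The exchange over the added data channel for the case where only the A link fails can be sketched as follows. This is a minimal illustration and is not taken from the patent: the message fields (kind, link, value), the value strings, and the names SideChannelMessage, PeerFpga, and a_link_fault_messages are all hypothetical, and the patent does not specify any message format. The sketch only shows the order of events described in this section: FPGA0 reports the A link fault, asks FPGA1 to hold the C link in reset, and later asks it to reconnect.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SideChannelMessage:
    """Hypothetical message carried on the added SerDes data channel."""
    kind: str   # "status" (link connection state) or "control" (link control signal)
    link: str   # "A", "B" or "C"
    value: str  # assumed values: "faulty", "hold_reset", "connect"


class PeerFpga:
    """Hypothetical receiver side: FPGA1 reacting to messages from FPGA0."""

    def __init__(self) -> None:
        self.c_link = "connected"    # state of the local QPI link (C link)
        self.peer_a_link = "normal"  # last state reported for the peer's A link

    def receive(self, msg: SideChannelMessage) -> None:
        if msg.kind == "status" and msg.link == "A":
            # Record the fault information sent by the peer FPGA.
            self.peer_a_link = msg.value
        elif msg.kind == "control" and msg.link == "C":
            # Apply the link control signal to the local C link.
            self.c_link = "reset" if msg.value == "hold_reset" else "connected"


def a_link_fault_messages() -> list:
    """Messages FPGA0 sends when only the A link has failed (assumed order)."""
    return [
        SideChannelMessage("status", "A", "faulty"),
        SideChannelMessage("control", "C", "hold_reset"),
        SideChannelMessage("control", "C", "connect"),
    ]
```

Feeding the three messages to a PeerFpga instance in order moves its C link from connected to reset and back to connected, which mirrors the hold-then-reconnect behavior described for the second control logic module.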

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Hardware Redundancy (AREA)
  • Logic Circuits (AREA)

Abstract

A method for realizing interconnection fault-tolerance between CPUs comprises: data channels for status information of transmission link connection and link control signals are added both in a first SerDes (Serial Deserial) interface module of a first FPGA (Field-Programmable Gate Array) and a second SerDes interface module of a second FPGA; a control logical module monitors the status of the transmission link connection between an opposite end FPGA and the corresponding CPU and controls the status of the transmission link connection between a local end FPGA and the corresponding CPU; when any connection link for realizing interconnection between the first CPU and the second CPU has a fault, the FPGA connected with the fault link sends a link control signal to the fault link through self-added data channel so as to recover the normal status of the fault link; when the fault link has been recovered to be the normal status, each FPGA uses each link with normal status of the own connection respectively, thereby realizing the connection of each link interconnected between the first CPU and the second CPU.
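
The link chain summarized in the abstract can be modeled in a few lines. The following Python sketch is illustrative only and is not part of the patent text. It assumes the link naming used in the detailed description (A for the CPU0-FPGA0 QPI link, B for the FPGA0-FPGA1 SerDes link, C for the FPGA1-CPU1 QPI link); the names Interconnect, path_up, and responders are hypothetical. It shows that the CPU-to-CPU path is usable only when all three links are normal, and which FPGAs are attached to a faulty link and would therefore be the ones sending link control signals.

```python
from dataclasses import dataclass, field
from enum import Enum


class LinkState(Enum):
    NORMAL = "normal"
    FAULTY = "faulty"


# Which FPGA(s) terminate each link: A = CPU0-FPGA0 (QPI),
# B = FPGA0-FPGA1 (SerDes), C = FPGA1-CPU1 (QPI).
ADJACENT_FPGAS = {
    "A": ("FPGA0",),
    "B": ("FPGA0", "FPGA1"),
    "C": ("FPGA1",),
}


@dataclass
class Interconnect:
    """Hypothetical model of the three links between CPU0 and CPU1."""
    links: dict = field(
        default_factory=lambda: {name: LinkState.NORMAL for name in "ABC"}
    )

    def path_up(self) -> bool:
        # The CPUs can communicate only if every link in the chain is normal.
        return all(state is LinkState.NORMAL for state in self.links.values())

    def responders(self) -> set:
        # FPGAs attached to at least one faulty link.
        return {
            fpga
            for name, state in self.links.items()
            if state is LinkState.FAULTY
            for fpga in ADJACENT_FPGAS[name]
        }
```

For example, marking only the B link as faulty makes path_up() return False and responders() return both FPGA0 and FPGA1, since the SerDes link is terminated by both FPGAs.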

Description

Method and system for implementing interconnection fault tolerance between CPUs
Technical Field
The present invention relates to the field of communications technologies, and in particular, to a method and system for implementing interconnection fault tolerance between CPUs.
Background
In the prior art, there are two solutions for implementing interconnection between CPUs:
One implementation is IBM's full direct connection over a PCB (Printed Circuit Board) to interconnect the CPUs. Each IBM Power CPU has 7 high-speed interconnect interfaces and can be interconnected with 7 other Power CPUs simultaneously. Eight Power CPUs can form an 8P system through full direct connection. However, because the Power CPU integrates the function of an NC (Node Controller), the cost is high. Limited by the number of interconnect interfaces on a Power CPU, a CPU system built from Power CPUs has poor scalability and low flexibility;
Another implementation is that HP uses NC node controllers and switch modules to interconnect the CPUs, which makes the overall interconnect architecture complex. The solution adds two chips to the system, which implement the NC node control function and the switch module function respectively. Because the solution uses switch modules to exchange data between NCs, each switch module has to make a hop decision, which increases data transmission delay, lowers system performance, and raises cost.
Therefore, the current CPU interconnection schemes have poor scalability, long data transmission delay, and low system performance. In addition, among the links that implement the CPU interconnection, an error on any single link may cause the interconnection between the CPUs involved to become abnormal, and no prior art provides a fault-tolerance solution for the interconnection between CPUs.
发明内容 Summary of the invention
本发明为解决背景技术中存在的上述技术问题, 而提出一种 CPU间互联 容错的实现方法及系统, 能够提高 CPU间互连的扩展性, 实现 CPU间互连容 错。  The present invention solves the above technical problems in the background art, and proposes a method and system for implementing fault tolerance between CPUs, which can improve the scalability of interconnection between CPUs and realize fault tolerance between CPUs.
本发明的技术解决方案是:  The technical solution of the present invention is:
本发明实施例提供一种 CPU间互联容错的实现方法, 所述方法包括: 第一 CPU连接第一现场可编程门阵列 FPGA的第一快速通道互联 QPI接 口模块、 第二 CPU连接第二 FPGA的第二 QPI接口模块, 第一 FPGA的第一 串解串 SerDes接口模块连接第二 FPGA的第二 SerDes接口模块、并通过第一 控制逻辑模块连接第一 QPI接口模块, 第二 FPGA的第二 SerDes接口模块通 过第二控制逻辑模块与第二 QPI接口模块相连, 以实现所述第一 CPU和第二 CPU之间的互联; 其中, 所述第一 SerDes接口模块和第二 SerDes接口模块中 均增设有传输链路连接状态信息和链路控制信号的数据通道; 所述第一 FPGA 和第二 FPGA中在相应 QPI接口模块和 SerDes接口模块之间的控制逻辑模块, 用于监测对端 FPGA与相应 CPU之间传输链路连接的状态,并控制本端 FPGA 与相应 CPU之间传输链路连接的状态; An embodiment of the present invention provides a method for implementing fault tolerance between CPUs, where the method includes: connecting, by a first CPU, a first fast channel interconnect QPI connection of a first field programmable gate array FPGA The second module is connected to the second QPI interface module of the second FPGA, and the first serial deserialized SerDes interface module of the first FPGA is connected to the second SerDes interface module of the second FPGA, and is connected by the first control logic module. a second SerDes interface module of the second FPGA is connected to the second QPI interface module by the second control logic module to implement interconnection between the first CPU and the second CPU; wherein the first SerDes A data channel for transmitting link connection state information and a link control signal is added to the interface module and the second SerDes interface module; and the control between the corresponding QPI interface module and the SerDes interface module in the first FPGA and the second FPGA a logic module, configured to monitor a state of a transmission link connection between the peer FPGA and the corresponding CPU, and control a state of a transmission link connection between the local FPGA and the corresponding CPU;
when any connection link used to interconnect the first CPU and the second CPU fails, the FPGA connected to the faulty link sends a link control signal toward the faulty link through its added data channel, so as to restore the faulty link to the normal state;
when the faulty link has returned to the normal state, each FPGA enables the normal-state links connected to it, and the links interconnecting the first CPU and the second CPU are connected.
A system for implementing interconnection fault tolerance between CPUs, the system including at least a first CPU, a second CPU, a first FPGA and a second FPGA; the first CPU is connected to a first QPI interface module of the first FPGA, and the second CPU is connected to a second QPI interface module of the second FPGA; a first SerDes interface module of the first FPGA is connected to a second SerDes interface module of the second FPGA and, through a first control logic module, to the first QPI interface module; the second SerDes interface module of the second FPGA is connected to the second QPI interface module through a second control logic module, so as to interconnect the first CPU and the second CPU; wherein a data channel for transmitting link connection state information and link control signals is added to each of the first SerDes interface module and the second SerDes interface module; and the control logic module located between the QPI interface module and the SerDes interface module in each of the first FPGA and the second FPGA is configured to monitor the state of the transmission link connection between the peer FPGA and its CPU, and to control the state of the transmission link connection between the local FPGA and its CPU;
when any connection link used to interconnect the first CPU and the second CPU fails, the FPGA connected to the faulty link sends a link control signal toward the faulty link through its added data channel, so as to restore the faulty link to the normal state; when the faulty link has returned to the normal state, each FPGA enables the normal-state links connected to it, and the links interconnecting the first CPU and the second CPU are connected.
In the embodiments of the present invention, an FPGA is provided for each CPU and, based on the transparent transmission function of the FPGA, the CPUs are interconnected through the connection between the FPGAs. When the number of interconnected CPUs increases or decreases, the change can be accommodated by increasing or decreasing the number of dedicated FPGAs, which improves the scalability of the interconnection between CPUs. In addition, a data channel is added on each FPGA; when any connection link between the interconnected CPUs fails, the connection state information of each link of the CPU interconnection and the link control signals are transmitted through this data channel, which makes the interconnection between CPUs fault tolerant.
Description of Drawings
FIG. 1 is a schematic structural diagram of an FPGA used to interconnect CPUs according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a CPU interconnection implemented with the FPGA of FIG. 1;
FIG. 3 is a schematic diagram of the link connections involved in the interconnection architecture of FIG. 2.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
First, FIG. 1 is a schematic structural diagram of an FPGA used to interconnect CPUs according to an embodiment of the present invention. The FPGA (field-programmable gate array) is provided with a QPI (QuickPath Interconnect) interface module 11 and a SerDes (serializer/deserializer) interface module 12. The QPI interface module 11 of the FPGA is connected to the QPI interface of a CPU and is responsible for high-speed data transmission with the CPU. A control logic module 13 is further provided between the QPI interface module and the SerDes interface module. The SerDes interface module 12 is connected to the control logic module 13, and is connected to the SerDes interface module 12 on another FPGA through a high-speed cable (CXP cable).
FIG. 2 is a schematic structural diagram of a CPU interconnection implemented with the above FPGA.
For ease of description, the interconnection of two CPUs is used as an example. The two interconnected CPUs are named the first CPU (CPU0) and the second CPU (CPU1), and each is connected to an FPGA, namely the first FPGA (FPGA0) and the second FPGA (FPGA1) respectively.
When the links interconnecting the first CPU and the second CPU are in the normal state, the QPI interface module 11 on the FPGA converts the serial QPI data sent by the CPU into parallel QPI data. This conversion lowers the frequency of the QPI data so that it matches the data processing frequency inside the FPGA.
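The frequency reduction obtained by widening a serial stream can be illustrated with a short calculation. The following is a minimal sketch rather than the patent's implementation: the 6.4 Gbit/s lane rate and the 32-bit internal word width are illustrative assumptions, and `parallel_clock_hz` and `deserialize` are hypothetical helper names.

```python
def parallel_clock_hz(serial_bit_rate_hz, parallel_width_bits):
    """Word rate required once a serial stream is widened to N-bit words."""
    if parallel_width_bits <= 0:
        raise ValueError("parallel width must be positive")
    return serial_bit_rate_hz / parallel_width_bits


def deserialize(bits, width):
    """Group a serial bit sequence into parallel words of `width` bits."""
    if len(bits) % width != 0:
        raise ValueError("bit count must be a multiple of the word width")
    return [tuple(bits[i:i + width]) for i in range(0, len(bits), width)]


# Assumed figures, for illustration only: one 6.4 Gbit/s lane widened to 32 bits.
print(parallel_clock_hz(6.4e9, 32))               # 200000000.0
print(deserialize([1, 0, 1, 1, 0, 0, 1, 0], 4))   # [(1, 0, 1, 1), (0, 0, 1, 0)]
```

Under these assumed figures the internal logic only has to run at 200 MHz, which is the kind of reduction the paragraph above relies on.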
The SerDes interface module 12 on the FPGA converts the parallel QPI data received from the QPI interface module 11 into high-speed serial SerDes data and sends it to the peer CPU through the SerDes interface module 12 on the other FPGA. It also receives the high-speed serial SerDes data sent by the SerDes interface module on the peer FPGA, converts it into parallel QPI data, and sends it to the CPU to which it is connected.
QPI data has DC characteristics that do not support long-distance cable interconnection and topologies. Because the SerDes interface module converts this data into SerDes data with AC characteristics, long-distance, high-speed cable interconnection and topologies can be supported, which enables high-speed interconnection of processors on different boards. In this embodiment of the present invention, at least one data channel is added on top of the original data channels in the SerDes interface module. Unlike the original data channels, the added data channel is not used to transmit data between the interconnected CPUs. Instead, it is used to transmit, between the FPGAs, the connection state information of each link of the interconnection and the link control signals.
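What travels on the added data channel can be pictured as a small status-and-control word exchanged between the two FPGAs. The patent does not define a message format, so the sketch below is purely an assumption: the field layout, the `LinkState` and `Control` values, and the `encode`/`decode` helpers are all hypothetical.

```python
from enum import IntEnum


class LinkState(IntEnum):
    """Hypothetical link states reported over the added channel."""
    DOWN = 0
    RESET = 1
    UP = 2


class Control(IntEnum):
    """Hypothetical link control signals sent to the peer FPGA."""
    NONE = 0
    HOLD_RESET = 1
    START_INIT = 2


def encode(qpi_state, serdes_state, control):
    """Pack local QPI state, SerDes state and a control signal into 6 bits."""
    return (int(qpi_state) << 4) | (int(serdes_state) << 2) | int(control)


def decode(word):
    """Unpack a word produced by encode()."""
    return (LinkState((word >> 4) & 0x3),
            LinkState((word >> 2) & 0x3),
            Control(word & 0x3))


# FPGA0 reports: local QPI link down, SerDes link up, peer should hold its QPI link in reset.
word = encode(LinkState.DOWN, LinkState.UP, Control.HOLD_RESET)
print(word)  # 9
```

Keeping this traffic on its own channel means state and control messages do not share the channels that carry CPU-to-CPU data.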
In addition, a control logic module is provided in the FPGA between the QPI interface module and the SerDes interface module. It monitors the state of the transmission link connection between the peer FPGA and its CPU, and controls the state of the transmission link connection between the local FPGA and its CPU.
Corresponding to the above interconnection architecture, an embodiment of the present invention provides a method for implementing interconnection fault tolerance between CPUs.
In a specific implementation, the first CPU is connected to the first QPI interface module of the first FPGA (FPGA0), and the second CPU is connected to the second QPI interface module of the second FPGA (FPGA1); the first SerDes interface module of the first FPGA is connected to the second SerDes interface module of the second FPGA and, through the first control logic module, to the first QPI interface module; the second SerDes interface module of the second FPGA is connected to the second QPI interface module through the second control logic module, so as to interconnect the first CPU and the second CPU; wherein a first data channel for transmitting link connection state information and link control signals is added to the first SerDes interface module, and a second data channel for transmitting link connection state information and link control signals is added to the second SerDes interface module; and the control logic module located between the QPI interface module and the SerDes interface module in each of the first FPGA and the second FPGA is configured to monitor the state of the transmission link connection between the peer FPGA and its CPU, and to control the state of the transmission link connection between the local FPGA and its CPU;
Thus, the interconnection architecture in this embodiment of the present invention involves the QPI link between CPU0 and FPGA0, the high-speed SerDes link between FPGA0 and FPGA1, and the QPI link between FPGA1 and CPU1. A failure of any one of these three links causes an interconnection abnormality between CPU0 and CPU1.
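Because the three links form a single chain, the end-to-end interconnection is healthy only when all of them are up. A minimal sketch of that check, with hypothetical names (`LINKS`, `interconnect_ok`, `faulty_links`) chosen only for illustration:

```python
# The three links of the CPU0-to-CPU1 path, in order.
LINKS = ("CPU0-FPGA0 QPI", "FPGA0-FPGA1 SerDes", "FPGA1-CPU1 QPI")


def interconnect_ok(link_up):
    """True only if every link in the chain is up."""
    return all(link_up[name] for name in LINKS)


def faulty_links(link_up):
    """Names of the links that are currently down, in chain order."""
    return [name for name in LINKS if not link_up[name]]


status = {"CPU0-FPGA0 QPI": True, "FPGA0-FPGA1 SerDes": False, "FPGA1-CPU1 QPI": True}
print(interconnect_ok(status))  # False
print(faulty_links(status))     # ['FPGA0-FPGA1 SerDes']
```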
In this embodiment of the present invention, when any connection link used to interconnect the first CPU and the second CPU fails, the FPGA connected to the faulty link sends a link control signal toward the faulty link through its added data channel, so as to restore the faulty link to the normal state;
when the faulty link has returned to the normal state, each FPGA enables the normal-state links connected to it, and the links interconnecting the first CPU and the second CPU are connected.
In this embodiment of the present invention, an FPGA is provided for each CPU and, based on the transparent transmission function of the FPGA, the CPUs are interconnected through the connection between the FPGAs. When the number of interconnected CPUs increases or decreases, the change can be accommodated by increasing or decreasing the number of dedicated FPGAs, which improves the scalability of the interconnection between CPUs. In addition, a data channel is added on each FPGA; when any connection link between the interconnected CPUs fails, the connection state information of each link of the CPU interconnection and the link control signals are transmitted through this data channel, which makes the interconnection between CPUs fault tolerant.
To facilitate a full understanding of the technical solutions of the embodiments of the present invention, they are described clearly and completely below with reference to the accompanying drawings.
FIG. 3 is a schematic diagram of the link connections involved in the interconnection architecture in this embodiment of the present invention. The links are: the QPI link between CPU0 and FPGA0 (the A link), the high-speed SerDes link between FPGA0 and FPGA1 (the B link), and the QPI link between FPGA1 and CPU1 (the C link). Any of the A, B and C links may enter a fault state in which it works abnormally. The fault-tolerance scheme in the embodiments of the present invention therefore handles an abnormal state on any of the A, B and C links.
Embodiment 1
When the control logic module in the first FPGA or the second FPGA detects that the B link between the first SerDes interface module and the second SerDes interface module has failed, the control logic modules in the first FPGA and the second FPGA send link control signals toward the B link through their respective added data channels to restore the B link to the normal state. At the same time, the first control logic module in the first FPGA, through the data channel added in the first SerDes interface module, keeps the A link between the first QPI interface module and the first CPU in the reset state, and the second control logic module in the second FPGA, through the data channel added in the second SerDes interface module, keeps the C link between the second QPI interface module and the second CPU in the reset state, waiting for the B link to be established successfully. Once the B link has returned to normal, the first control logic module controls the connection of the A link through the data channel added in the first SerDes interface module and, at the same time, the second control logic module controls the connection of the C link through the data channel added in the second SerDes interface module.
In a specific implementation, after the B link has been established successfully, the first FPGA controls the QPI initialization procedure of the A link and, at the same time, the second FPGA controls the QPI initialization procedure of the C link, so that the A and C links are connected.
After the A, B and C links have all been established, the first CPU and the second CPU can begin normal link communication, which completes the interconnection between the first CPU and the second CPU.
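The ordering in Embodiment 1 (hold A and C in reset, restore B, then initialize A and C) can be sketched as a small simulation. This is an illustrative model, not FPGA logic: `QpiLink`, `recover_from_b_failure` and the `retrain_b` callback are hypothetical names, and the `max_attempts` bound is an assumption added so the sketch terminates, whereas the text simply waits until the B link recovers.

```python
class QpiLink:
    """State of one FPGA-to-CPU QPI link (the A link or the C link)."""

    def __init__(self, name):
        self.name = name
        self.state = "up"


def recover_from_b_failure(link_a, link_c, retrain_b, max_attempts=3):
    """Hold A and C in reset, retry B, then initialize A and C together.

    retrain_b is a callable returning True once the B link is re-established.
    Returns (success, events), where events records the order of actions.
    """
    events = []
    for link in (link_a, link_c):
        link.state = "reset"
        events.append(f"hold {link.name} in reset")
    for attempt in range(1, max_attempts + 1):
        if retrain_b():
            events.append(f"B restored on attempt {attempt}")
            for link in (link_a, link_c):
                link.state = "up"
                events.append(f"initialize {link.name}")
            return True, events
        events.append(f"B retrain attempt {attempt} failed")
    return False, events


outcomes = iter([False, True])  # B fails to retrain once, then succeeds
ok, events = recover_from_b_failure(QpiLink("A"), QpiLink("C"), lambda: next(outcomes))
for event in events:
    print(event)
```

Holding both QPI links in reset until B is back keeps either CPU from starting QPI initialization toward a peer it cannot yet reach.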
Embodiment 2
When the first control logic module detects that the A link has failed and the second control logic module detects that the C link has failed, the first control logic module sends the fault information of the A link to the second FPGA through the data channel added in the first SerDes interface module; at the same time, the second control logic module sends the fault information of the C link to the first FPGA through the data channel added in the second SerDes interface module. In other words, the first FPGA and the second FPGA exchange the state of their local QPI links through their respective added data channels;
the control logic module of whichever of the first FPGA and the second FPGA first receives the fault information of a connection link sends, through its added data channel, a link control signal toward the connection link to its own CPU to restore that link to the normal state; it also sends, through its added data channel, a link control signal to the peer FPGA so that the control logic module of the peer FPGA starts the recovery of its own connection link.
By exchanging link control signals with each other, the first FPGA and the second FPGA cause the A and C links to enter their initialization procedures, which completes the re-establishment of the A and C links.
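The rule in Embodiment 2 is that whichever FPGA first receives the peer's fault information starts recovery of its own QPI link and then signals the peer to do the same. A minimal sketch of that ordering follows; `recover_a_and_c` and the action strings are hypothetical, and which FPGA receives first is passed in as an input rather than modeled.

```python
# Which QPI link each FPGA owns in the two-CPU example.
OWN_LINK = {"FPGA0": "A", "FPGA1": "C"}


def recover_a_and_c(first_receiver):
    """Return the ordered recovery actions when both A and C have failed.

    first_receiver is the FPGA ("FPGA0" or "FPGA1") that first receives
    the peer's fault information over the added data channel.
    """
    if first_receiver not in OWN_LINK:
        raise ValueError(f"unknown FPGA: {first_receiver!r}")
    peer = "FPGA1" if first_receiver == "FPGA0" else "FPGA0"
    return [
        f"{first_receiver}: send link control signal to restore link {OWN_LINK[first_receiver]}",
        f"{first_receiver}: send link control signal to {peer}",
        f"{peer}: start recovery of link {OWN_LINK[peer]}",
    ]


for action in recover_a_and_c("FPGA1"):
    print(action)
```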
Embodiment 3
When the A and B links fail and the C link is normal, the first control logic module can detect the abnormal links and therefore, through the data channel added in the first SerDes interface module, keeps the A link in the reset state; at the same time, the second control logic module can detect that the B link is abnormal and therefore, through the data channel added in the second SerDes interface module, keeps the C link in the reset state, waiting for the B link to be re-established;
in addition, the first control logic module and the second control logic module send link control signals toward the B link through their respective added data channels to restore the B link to the normal state. Once the B link has returned to normal, the first control logic module controls the connection of the A link through the data channel added in the first SerDes interface module and, at the same time, the second control logic module controls the connection of the C link through the data channel added in the second SerDes interface module.
This embodiment is described using the case in which the A and B links fail and the C link is normal. When the B and C links fail and the A link is normal, the interconnection fault-tolerance scheme between the first CPU and the second CPU is similar to this embodiment.
Embodiment 4
When the A link fails and the B and C links are normal, the first control logic module sends the fault information of the A link to the second FPGA through the data channel added in the first SerDes interface module, and sends a link control signal to the second FPGA, so that the second control logic module, through the data channel added in the second SerDes interface module, places the C link in the reset state. The first control logic module then continues to send link control signals to the second FPGA through the data channel added in the first SerDes interface module, so that the first control logic module controls the connection of the A link through the data channel added in the first SerDes interface module while the second control logic module controls the connection of the C link through the data channel added in the second SerDes interface module, thereby re-establishing the A link.
This embodiment is described using the case in which the A link fails and the B and C links are normal. When the C link fails and the A and B links are normal, the interconnection fault-tolerance scheme between the first CPU and the second CPU is similar to this embodiment.
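Embodiments 1 to 4 each cover a particular set of failed links, and the mapping can be summarized as a lookup table. The sketch below is only a reading aid: `classify_fault` is a hypothetical helper, and returning `None` for combinations the text does not describe (for example, all three links failing at once) is an assumption rather than something the patent specifies.

```python
# Set of failed links -> embodiment that describes the recovery procedure.
EMBODIMENT_FOR_FAULT = {
    frozenset({"B"}): 1,
    frozenset({"A", "C"}): 2,
    frozenset({"A", "B"}): 3,
    frozenset({"B", "C"}): 3,  # handled "similarly" to A+B according to the text
    frozenset({"A"}): 4,
    frozenset({"C"}): 4,       # handled "similarly" to A alone
}


def classify_fault(failed_links):
    """Return the embodiment number for a set of failed links, or None."""
    return EMBODIMENT_FOR_FAULT.get(frozenset(failed_links))


print(classify_fault({"B"}))            # 1
print(classify_fault(["C", "A"]))       # 2
print(classify_fault({"B", "C"}))       # 3
print(classify_fault({"A"}))            # 4
print(classify_fault({"A", "B", "C"}))  # None
```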
As the above embodiments show, in an interconnection architecture in which CPUs on different boards are interconnected through high-speed cables based on the transparent transmission function of FPGAs, the fault-tolerance solution of the embodiments of the present invention handles the fault states that may occur on the links interconnecting the CPUs, so that a faulty link returns to the normal state in time and the interconnected CPUs maintain a stable working state.
Correspondingly, an embodiment of the present invention further provides a system for implementing interconnection fault tolerance between CPUs. The system includes at least a first CPU, a second CPU, a first FPGA and a second FPGA; the first CPU is connected to a first QPI interface module of the first FPGA, and the second CPU is connected to a second QPI interface module of the second FPGA; a first SerDes interface module of the first FPGA is connected to a second SerDes interface module of the second FPGA and, through a first control logic module, to the first QPI interface module; the second SerDes interface module of the second FPGA is connected to the second QPI interface module through a second control logic module, so as to interconnect the first CPU and the second CPU; wherein a data channel for transmitting link connection state information and link control signals is added to each of the first SerDes interface module and the second SerDes interface module; and the control logic module located between the QPI interface module and the SerDes interface module in each of the first FPGA and the second FPGA is configured to monitor the state of the transmission link connection between the peer FPGA and its CPU, and to control the state of the transmission link connection between the local FPGA and its CPU;
when any connection link used to interconnect the first CPU and the second CPU fails, the FPGA connected to the faulty link sends a link control signal toward the faulty link through its added data channel, so as to restore the faulty link to the normal state; when the faulty link has returned to the normal state, each FPGA enables the normal-state links connected to it, and the links interconnecting the first CPU and the second CPU are connected.
In the above system embodiment, an FPGA is provided for each CPU and, based on the transparent transmission function of the FPGA, the CPUs are interconnected through the connection between the FPGAs. When the number of interconnected CPUs increases or decreases, the change can be accommodated by increasing or decreasing the number of dedicated FPGAs, which improves the scalability of the interconnection between CPUs. In addition, a data channel is added on each FPGA; when any connection link between the interconnected CPUs fails, the connection state information of each link of the CPU interconnection and the link control signals are transmitted through this data channel, which makes the interconnection between CPUs fault tolerant.
In a specific implementation, when the control logic module in the first FPGA and/or the second FPGA detects that the second connection link, between the first SerDes interface module and the second SerDes interface module, has failed, the control logic modules in the first FPGA and the second FPGA are configured to send link control signals toward the second connection link through their respective added data channels to restore it to the normal state. At the same time, the first control logic module in the first FPGA, through the data channel added in the first SerDes interface module, keeps the first connection link, between the first QPI interface module and the first CPU, in the reset state, and the second control logic module in the second FPGA, through the data channel added in the second SerDes interface module, keeps the third connection link, between the second QPI interface module and the second CPU, in the reset state. Once the second connection link has returned to normal, the first control logic module controls the connection of the first connection link through the data channel added in the first SerDes interface module and, at the same time, the second control logic module controls the connection of the third connection link through the data channel added in the second SerDes interface module.
When the first control logic module detects that the first connection link has failed and the second control logic module detects that the third connection link has failed, the first control logic module is configured to send the fault information of the first connection link to the second FPGA through the data channel added in the first SerDes interface module; at the same time, the second control logic module sends the fault information of the third connection link to the first FPGA through the data channel added in the second SerDes interface module;
the control logic module of whichever of the first FPGA and the second FPGA first receives the fault information of a connection link sends, through its added data channel, a link control signal toward the connection link to its own CPU to restore that link to the normal state; it also sends, through its added data channel, a link control signal to the peer FPGA so that the control logic module of the peer FPGA starts the recovery of its own connection link.
When the first (or third) connection link and the second connection link fail while the third (or first) connection link is normal, the first (or second) control logic module is configured to keep the first (or third) connection link in the reset state through the data channel added in the first (or second) SerDes interface module; at the same time, the second (or first) control logic module keeps the third (or first) connection link in the reset state through the data channel added in the second (or first) SerDes interface module. The first control logic module and the second control logic module send link control signals toward the second connection link through their respective added data channels to restore it to the normal state. Once the second connection link has returned to normal, the first (or second) control logic module controls the connection of the first (or third) connection link through the data channel added in the first (or second) SerDes interface module and, at the same time, the second (or first) control logic module controls the connection of the third (or first) connection link through the data channel added in the second (or first) SerDes interface module.
When the first (or third) connection link fails, the first (or second) control logic module is configured to send the fault information of the first (or third) connection link to the second (or first) FPGA through the data channel added in the first (or second) SerDes interface module, and to send a link control signal to the second (or first) FPGA, so that the second (or first) control logic module, through the data channel added in the second (or first) SerDes interface module, places the third (or first) connection link in the reset state;
the first (or second) control logic module is further configured to continue sending link control signals to the second (or first) FPGA through the data channel added in the first (or second) SerDes interface module, so that the first control logic module controls the connection of the first connection link through the data channel added in the first SerDes interface module while the second control logic module controls the connection of the third connection link through the data channel added in the second SerDes interface module.
Therefore, in the above system for implementing interconnection fault tolerance between CPUs, the interconnection architecture between CPUs on different boards is built on the transparent transmission function of the FPGAs and on high-speed cable connections. The fault tolerance solution of the embodiments of the present invention can handle the fault states that may occur on the interconnection links between the interconnected CPUs, so that a faulty link is restored to the normal state in time and the interconnected CPUs remain in a stable working state.
Because the system embodiments basically correspond to the method embodiments, they are described briefly; for related details, refer to the description of the method embodiments. The system embodiments described above are merely illustrative: the modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical modules; that is, they may be located in one place or distributed across multiple network modules. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the embodiments. A person of ordinary skill in the art can understand and implement the embodiments without creative effort.
The above description of the disclosed embodiments enables a person skilled in the art to implement or use the present invention. Various modifications to these embodiments will be apparent to a person skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the embodiments of the present invention. Therefore, the embodiments of the present invention are not limited to the embodiments shown herein, but are to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method for implementing interconnection fault tolerance between CPUs, wherein the method comprises: connecting a first CPU to a first QuickPath Interconnect (QPI) interface module of a first field programmable gate array (FPGA), and connecting a second CPU to a second QPI interface module of a second FPGA, wherein a first serializer/deserializer (SerDes) interface module of the first FPGA is connected to a second SerDes interface module of the second FPGA and is connected to the first QPI interface module through a first control logic module, and the second SerDes interface module of the second FPGA is connected to the second QPI interface module through a second control logic module, so as to interconnect the first CPU and the second CPU; wherein a data channel for transmitting link connection state information and link control signals is added to each of the first SerDes interface module and the second SerDes interface module; and the control logic module located between the corresponding QPI interface module and SerDes interface module in each of the first FPGA and the second FPGA is configured to monitor the state of the transmission link connection between the peer FPGA and its corresponding CPU, and to control the state of the transmission link connection between the local FPGA and its corresponding CPU;
when any connection link used to interconnect the first CPU and the second CPU fails, the FPGA connected to the faulty link sends a link control signal to the faulty link through its added data channel to restore the faulty link to the normal state; and
when the faulty link returns to the normal state, each FPGA enables each of its connected links that is in the normal state, so as to connect the links that interconnect the first CPU and the second CPU.
2. The method for implementing interconnection fault tolerance between CPUs according to claim 1, wherein, when any connection link between the first CPU and the second CPU fails, the sending, by the FPGA connected to the faulty link, of a link control signal to the faulty link through its added data channel to restore the faulty link to the normal state specifically comprises:
when the control logic module in the first FPGA or the second FPGA detects that a second connection link between the first SerDes interface module and the second SerDes interface module has failed, the control logic modules in the first FPGA and the second FPGA send link control signals to the second connection link through their respective added data channels to restore the second connection link to the normal state; at the same time, the first control logic module in the first FPGA holds a first connection link between the first QPI interface module and the first CPU in the reset state through the data channel added to the first SerDes interface module, and the second control logic module in the second FPGA holds a third connection link between the second QPI interface module and the second CPU in the reset state through the data channel added to the second SerDes interface module; once the second connection link returns to normal, the first control logic module controls the connection of the first connection link through the data channel added to the first SerDes interface module, and at the same time the second control logic module controls the connection of the third connection link through the data channel added to the second SerDes interface module.
3. The method for implementing interconnection fault tolerance between CPUs according to claim 2, wherein, when any connection link between the first CPU and the second CPU fails, the sending, by the FPGA connected to the faulty link, of a link control signal to the faulty link through its added data channel to restore the faulty link to the normal state specifically comprises:
when the first control logic module detects that the first connection link has failed and the second control logic module detects that the third connection link has failed, the first control logic module sends the fault information of the first connection link to the second FPGA through the data channel added to the first SerDes interface module; at the same time, the second control logic module sends the fault information of the third connection link to the first FPGA through the data channel added to the second SerDes interface module; and
the control logic module of whichever of the first FPGA and the second FPGA first receives the fault information of a connection link sends, through its added data channel, a link control signal to the connection link that connects to its corresponding CPU to restore that connection link to the normal state, and sends, through its added data channel, a link control signal to the peer FPGA to instruct the control logic module of the peer FPGA to initiate the recovery operation for its own corresponding connection link.
4. The method for implementing interconnection fault tolerance between CPUs according to claim 2, wherein, when any connection link between the first CPU and the second CPU fails, the sending, by the FPGA connected to the faulty link, of a link control signal to the faulty link through its added data channel to restore the faulty link to the normal state specifically comprises:
when the first connection link and the second connection link fail while the third connection link is normal, the first control logic module holds the first connection link in the reset state through the data channel added to the first SerDes interface module; at the same time, the second control logic module holds the third connection link in the reset state through the data channel added to the second SerDes interface module, and the first control logic module and the second control logic module send link control signals to the second connection link through their respective added data channels to restore the second connection link to the normal state; once the second connection link returns to normal, the first control logic module controls the connection of the first connection link through the data channel added to the first SerDes interface module, and at the same time the second control logic module controls the connection of the third connection link through the data channel added to the second SerDes interface module.
5. The method for implementing interconnection fault tolerance between CPUs according to claim 2, wherein, when any connection link between the first CPU and the second CPU fails, the sending, by the FPGA connected to the faulty link, of a link control signal to the faulty link through its added data channel to restore the faulty link to the normal state specifically comprises:
when the first connection link fails, the first control logic module sends the fault information of the first connection link to the second FPGA through the data channel added to the first SerDes interface module, and sends a link control signal to the second FPGA, so that the second control logic module places the third connection link in the reset state through the data channel added to the second SerDes interface module; and
the first control logic module continues to send link control signals to the second FPGA through the data channel added to the first SerDes interface module, so that the first control logic module controls the connection of the first connection link through the data channel added to the first SerDes interface module and, at the same time, the second control logic module controls the connection of the third connection link through the data channel added to the second SerDes interface module.
6. A system for implementing interconnection fault tolerance between CPUs, wherein the system comprises at least a first CPU, a second CPU, a first FPGA and a second FPGA; the first CPU is connected to a first QPI interface module of the first FPGA, and the second CPU is connected to a second QPI interface module of the second FPGA; a first SerDes interface module of the first FPGA is connected to a second SerDes interface module of the second FPGA and is connected to the first QPI interface module through a first control logic module, and the second SerDes interface module of the second FPGA is connected to the second QPI interface module through a second control logic module, so as to interconnect the first CPU and the second CPU; wherein a data channel for transmitting link connection state information and link control signals is added to each of the first SerDes interface module and the second SerDes interface module; and the control logic module located between the corresponding QPI interface module and SerDes interface module in each of the first FPGA and the second FPGA is configured to monitor the state of the transmission link connection between the peer FPGA and its corresponding CPU, and to control the state of the transmission link connection between the local FPGA and its corresponding CPU;
when any connection link used to interconnect the first CPU and the second CPU fails, the FPGA connected to the faulty link sends a link control signal to the faulty link through its added data channel to restore the faulty link to the normal state; and when the faulty link returns to the normal state, each FPGA enables each of its connected links that is in the normal state, so as to connect the links that interconnect the first CPU and the second CPU.
7. The system for implementing interconnection fault tolerance between CPUs according to claim 6, wherein, when the control logic module in the first FPGA or the second FPGA detects that a second connection link between the first SerDes interface module and the second SerDes interface module has failed,
the control logic modules in the first FPGA and the second FPGA are configured to send link control signals to the second connection link through their respective added data channels to restore the second connection link to the normal state; at the same time, the first control logic module in the first FPGA holds a first connection link between the first QPI interface module and the first CPU in the reset state through the data channel added to the first SerDes interface module, and the second control logic module in the second FPGA holds a third connection link between the second QPI interface module and the second CPU in the reset state through the data channel added to the second SerDes interface module; once the second connection link returns to normal, the first control logic module controls the connection of the first connection link through the data channel added to the first SerDes interface module, and at the same time the second control logic module controls the connection of the third connection link through the data channel added to the second SerDes interface module.
8. The system for implementing interconnection fault tolerance between CPUs according to claim 7, wherein, when the first control logic module detects that the first connection link has failed and the second control logic module detects that the third connection link has failed,
the first control logic module is configured to send the fault information of the first connection link to the second FPGA through the data channel added to the first SerDes interface module; at the same time, the second control logic module sends the fault information of the third connection link to the first FPGA through the data channel added to the second SerDes interface module; and
the control logic module of whichever of the first FPGA and the second FPGA first receives the fault information of a connection link sends, through its added data channel, a link control signal to the connection link that connects to its corresponding CPU to restore that connection link to the normal state, and sends, through its added data channel, a link control signal to the peer FPGA to instruct the control logic module of the peer FPGA to initiate the recovery operation for its own corresponding connection link.
9. The system for implementing interconnection fault tolerance between CPUs according to claim 7, wherein, when the first connection link and the second connection link fail while the third connection link is normal,
the first control logic module is configured to hold the first connection link in the reset state through the data channel added to the first SerDes interface module; at the same time, the second control logic module holds the third connection link in the reset state through the data channel added to the second SerDes interface module, and the first control logic module and the second control logic module send link control signals to the second connection link through their respective added data channels to restore the second connection link to the normal state; once the second connection link returns to normal, the first control logic module controls the connection of the first connection link through the data channel added to the first SerDes interface module, and at the same time the second control logic module controls the connection of the third connection link through the data channel added to the second SerDes interface module.
10. The system for implementing interconnection fault tolerance between CPUs according to claim 7, wherein, when the first connection link fails,
the first control logic module is configured to send the fault information of the first connection link to the second FPGA through the data channel added to the first SerDes interface module, and to send a link control signal to the second FPGA, so that the second control logic module places the third connection link in the reset state through the data channel added to the second SerDes interface module; and
the first control logic module is further configured to continue sending link control signals to the second FPGA through the data channel added to the first SerDes interface module, so that the first control logic module controls the connection of the first connection link through the data channel added to the first SerDes interface module and, at the same time, the second control logic module controls the connection of the third connection link through the data channel added to the second SerDes interface module.
PCT/CN2011/076471 2011-06-27 2011-06-28 Method and system for realizing interconnection fault-tolerance between cpus WO2012167461A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
PCT/CN2011/076471 WO2012167461A1 (en) 2011-06-28 2011-06-28 Method and system for realizing interconnection fault-tolerance between cpus
CN201180001259.2A CN102763087B (en) 2011-06-28 2011-06-28 Method and system for realizing interconnection fault-tolerance between CPUs
US13/707,188 US8909979B2 (en) 2011-06-27 2012-12-06 Method and system for implementing interconnection fault tolerance between CPU

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2011/076471 WO2012167461A1 (en) 2011-06-28 2011-06-28 Method and system for realizing interconnection fault-tolerance between cpus

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2011/076430 Continuation-In-Part WO2012103712A1 (en) 2011-06-27 2011-06-27 Cpu interconnect device

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US13/707,188 Continuation US8909979B2 (en) 2011-06-27 2012-12-06 Method and system for implementing interconnection fault tolerance between CPU

Publications (1)

Publication Number Publication Date
WO2012167461A1 (en) 2012-12-13

Family

ID=47056378

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2011/076471 WO2012167461A1 (en) 2011-06-27 2011-06-28 Method and system for realizing interconnection fault-tolerance between cpus

Country Status (2)

Country Link
CN (1) CN102763087B (en)
WO (1) WO2012167461A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117493259A (en) * 2023-12-28 2024-02-02 苏州元脑智能科技有限公司 Data storage system, method and server

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103034613A (en) * 2012-12-12 2013-04-10 深圳市华力特电气股份有限公司 Data communication method between processors and FPGA (field programmable gate array) equipment
CN106055436A (en) * 2016-05-19 2016-10-26 浪潮电子信息产业股份有限公司 Method for testing QPI data lane Degrade function
CN107579936A (en) * 2017-09-11 2018-01-12 北京腾凌科技有限公司 Message transmitting method, controller and storage system
CN107515601A (en) * 2017-09-22 2017-12-26 北京腾凌科技有限公司 Control device and method
CN113246117B (en) * 2020-02-11 2023-08-22 株式会社日立制作所 Control method and equipment of robot and building management system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101634959A (en) * 2009-08-21 2010-01-27 北京航空航天大学 Dual redundant fault-tolerant system based on embedded type CPU,
CN101819556A (en) * 2010-03-26 2010-09-01 北京经纬恒润科技有限公司 Signal-processing board
CN101833491A (en) * 2010-04-26 2010-09-15 浪潮电子信息产业股份有限公司 Method for realizing design and FPGA of link detection circuit in node interconnection system

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117493259A (en) * 2023-12-28 2024-02-02 苏州元脑智能科技有限公司 Data storage system, method and server
CN117493259B (en) * 2023-12-28 2024-04-05 苏州元脑智能科技有限公司 Data storage system, method and server

Also Published As

Publication number Publication date
CN102763087B (en) 2015-03-11
CN102763087A (en) 2012-10-31

Similar Documents

Publication Publication Date Title
JP5021037B2 (en) Communication system having master / slave structure
JP4782823B2 (en) User terminal, master unit, communication system and operation method thereof
WO2012167461A1 (en) Method and system for realizing interconnection fault-tolerance between cpus
CN101710314B (en) High-speed peripheral component interconnection switching controller and realizing method thereof
US9300574B2 (en) Link aggregation emulation for virtual NICs in a cluster server
TWI454088B (en) Method, system and computer readable medium for detecting a switch failure and managing switch failover in a fiber channel over ethernet network
KR100699386B1 (en) Methods and apparatuses for the physical layer initialization of a link-based system interconnect
WO2013097485A1 (en) Disk array, storage system, and method for switching data storage paths
WO2011137797A1 (en) Method and system for data transmission in ethernet
CN101977139A (en) Route retransmission realization device and method, and switching equipment
WO2012019464A1 (en) Inter-plate interconnection device and method in router cluster
US8909979B2 (en) Method and system for implementing interconnection fault tolerance between CPU
WO2012103736A1 (en) Data processing node, system and method
CN101645915A (en) Disk array host channel daughter card, on-line switching system and switching method thereof
CN106888142B (en) E1 double-ring network with ring self-healing function
JP2006087102A (en) Apparatus and method for transparent recovery of switching arrangement
CN105763488B (en) Data center aggregation core switch and backboard thereof
CN109995681B (en) Device and method for realizing double-master-control main-standby switching by single chip
JP2016100843A (en) Relay device
WO2009067855A1 (en) Method for implementing a computer system or local area network
US7082100B2 (en) Storage system adapter and method of using same
WO2012000338A1 (en) Method and system for achieving main/standby switch for single boards
WO2014019346A1 (en) Dynamic link configuration device and method for multipath server
CN110968540A (en) Redundant high-speed backplate of two stars types based on VPX
CN113852514A (en) Data processing system with uninterrupted service, processing equipment switching method and connecting equipment

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 201180001259.2

Country of ref document: CN

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11867525

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 11867525

Country of ref document: EP

Kind code of ref document: A1