CN104579802A - Method for quickly recovering faults of multi-path server

Info

Publication number: CN104579802A
Application number: CN201510080647.4A
Authority: CN
Inventors: 王岩; 薛广营; 黄小东
Original assignee: Inspur Electronic Information Industry Co Ltd
Current assignee: IEIT Systems Co Ltd
Priority date: 2015-02-15
Filing date: 2015-02-15
Publication date: 2015-04-29

Abstract

The present invention provides a method for fast fault recovery in multi-way servers and relates to multi-way server architecture technology. The DMI bus of the PCH is connected to the master CPU and one slave CPU through a PCIe switch chip, and switching of the switch chip is jointly controlled by the PCH and the BMC. When the slave CPU fails, the system masks the slave CPU; when the master CPU fails, the BIOS or BMC automatically switches the DMI bus to the slave CPU and masks the failed master CPU, so that the system can recover quickly from the fault. Fault masking is thereby achieved for any CPU in the server, which reduces the downtime of server fault recovery and minimizes the losses caused by system downtime due to CPU failure.

Description

A Fast Fault Recovery Method for Multi-way Servers

Technical Field

The invention relates to multi-way server architecture technology, and in particular to a method for fast fault recovery in a multi-way server.

Background

In a conventional multi-way server architecture, the DMI bus of the south bridge chip (PCH) is connected to the master CPU, as shown in Fig. 1. At power-on, the PCH obtains the system configuration information, device drivers, and self-test routines from the BIOS, and completes the self-test of all CPUs and memory over the DMI bus to the master CPU. After the self-test completes, the BIOS begins booting the operating system and startup finishes. In this design the system can mask a failed slave CPU. If the master CPU fails, however, the DMI bus between it and the PCH cannot work, the BIOS program cannot be loaded, and the system cannot mask the master CPU. Fault recovery then requires manually replacing the master CPU, which increases server downtime and is highly undesirable for servers running critical applications.

Summary of the Invention

To solve this problem, the present invention proposes a new method for fast fault recovery in a multi-way server.

The technical solution of the present invention is as follows:

The DMI bus of the PCH is connected to the master CPU and one slave CPU through a PCIe switch chip, and switching of the switch chip is jointly controlled by the PCH and the management controller (BMC). Because the DMI bus uses the PCIe protocol, a PCIe switch chip can preserve the signal integrity of the DMI bus. Under this design, when the slave CPU fails, the system can mask that slave CPU; when the master CPU fails, the BIOS or BMC automatically switches the DMI bus to the slave CPU and masks the failed master CPU, so that the system can recover quickly from the fault. Fault masking is thereby achieved for any CPU in the server, which greatly reduces downtime during fault recovery and minimizes the losses caused by system downtime due to CPU failure. Dual control by the PCH and the BMC ensures that the switch chip switches stably and quickly when the master CPU fails.

The control signal of the switch chip is jointly controlled by a GPIO port of the PCH and by the BMC. The control signal selects whether the DMI bus of the PCH is connected to the master CPU or to the slave CPU.

By default the switch chip selects the DMI bus of the master CPU and the control signal is at a high level; in this default state both the GPIO port of the PCH and the BMC release control of the signal. If the master CPU fails while the system is running, the BMC detects the fault, automatically pulls the control signal low, and restarts the system once. The DMI bus switchover is complete after the restart.
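The runtime path above can be illustrated with a minimal Python sketch. It is not part of the patent: the function names, the per-tick health samples, and the log strings are hypothetical, and the sketch only mirrors the described order of events (fault detected, control signal pulled low, one system restart, DMI routed to the slave CPU).

```python
MASTER, SLAVE = "master", "slave"


def dmi_route(select_level: int) -> str:
    """High (1) selects the master CPU's DMI bus, low (0) the slave's."""
    return MASTER if select_level == 1 else SLAVE


def bmc_monitor(health_samples):
    """Walk through master-CPU health samples as a BMC might (hypothetical).

    Returns the final control-signal level and a log of (tick, action).
    """
    select_level = 1  # default: both controllers released, signal high
    log = []
    for tick, master_ok in enumerate(health_samples):
        if not master_ok and select_level == 1:
            select_level = 0  # BMC pulls the control signal low
            log.append((tick, "BMC pulls control signal low"))
            log.append((tick, "system restart"))
            log.append((tick, f"DMI routed to {dmi_route(select_level)} CPU"))
    return select_level, log


if __name__ == "__main__":
    level, events = bmc_monitor([True, True, False, False])
    for tick, action in events:
        print(tick, action)
```

Once the signal is low, later failing samples produce no further actions in this sketch, reflecting that the switchover happens once.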

If the master CPU fails during the power-on self-test, the BIOS responds automatically according to the CPU self-test code: it drives the GPIO port of the PCH to pull the control signal of the switch chip low, switches to the slave CPU, and performs a warm restart to run the self-test again, which completes the DMI bus switchover.
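The power-on self-test path can be sketched in the same spirit. The following Python is an illustration only: `PostResult`, `bios_post_step`, and the action strings are hypothetical names, and the "continue boot" action after masking a slave CPU is an assumption, since the text says only that the slave CPU is masked.

```python
from enum import Enum


class PostResult(Enum):
    """Hypothetical self-test outcomes; the patent only mentions a self-test code."""
    OK = "ok"
    MASTER_CPU_FAULT = "master_cpu_fault"
    SLAVE_CPU_FAULT = "slave_cpu_fault"


def bios_post_step(result: PostResult, select_level: int):
    """Return (new_select_level, actions) for one self-test pass.

    select_level: 1 routes DMI to the master CPU, 0 to the slave CPU.
    """
    if result is PostResult.MASTER_CPU_FAULT and select_level == 1:
        # The BIOS drives the PCH GPIO low, then warm-restarts so that
        # the next self-test pass runs through the slave CPU.
        return 0, [
            "PCH GPIO pulls control signal low",
            "warm restart",
            "re-run self-test",
        ]
    if result is PostResult.SLAVE_CPU_FAULT:
        # Assumption: boot continues on the master CPU after masking.
        return select_level, ["mask slave CPU", "continue boot"]
    return select_level, ["continue boot"]


if __name__ == "__main__":
    print(bios_post_step(PostResult.MASTER_CPU_FAULT, 1))
```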

With this design, when the master CPU fails, the BIOS or BMC automatically switches the DMI bus to the slave CPU and masks the failed master CPU, so that the system can recover quickly from the fault. This greatly reduces server downtime during fault recovery and minimizes the losses caused by system downtime due to CPU failure.

Brief Description of the Drawings

Fig. 1 is a schematic diagram of the connection structure in the prior art.

Fig. 2 is a schematic diagram of the connection structure of the present invention.

Detailed Description

The content of the present invention is described in more detail below.

As shown in Fig. 2:

1. The invention consists of a master CPU, a slave CPU, a switch chip, a PCH, and a BMC.

2. The DMI buses of both the master CPU and the slave CPU are connected to the switch chip, and the other side of the switch chip is connected to the PCH of the system. The control signal of the switch chip is jointly controlled by a GPIO port of the PCH and by the BMC, and selects whether the DMI bus of the PCH is connected to the master CPU or to the slave CPU.

3. By default the switch chip selects the DMI bus of the master CPU (control signal at a high level); in this default state both the GPIO port of the PCH and the BMC release control of the signal. If the master CPU fails while the operating system is running, the BMC detects the fault, automatically pulls the control signal low, and restarts the system once. The DMI bus switchover is complete after the restart.

4. If the master CPU fails during the power-on self-test, the BIOS responds automatically according to the CPU self-test code: it drives the GPIO port of the PCH to pull the control signal of the switch chip low, switches to the slave CPU, and performs a warm restart to run the self-test again, which completes the DMI bus switchover.
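The joint control of the switch chip described in items 2 and 3 can be summarized as a small truth table. The Python sketch below assumes the control signal behaves like a pulled-up line that either controller can pull low; the patent states only that the signal is high when both controllers release it and that either one may pull it low, so the electrical model and all names here are illustrative assumptions.

```python
from itertools import product


def select_level(pch_gpio_pulls_low: bool, bmc_pulls_low: bool) -> int:
    """The line reads high (1) only when both controllers have released it."""
    return 0 if (pch_gpio_pulls_low or bmc_pulls_low) else 1


def routed_cpu(level: int) -> str:
    """High routes the PCH's DMI bus to the master CPU, low to the slave CPU."""
    return "master CPU" if level == 1 else "slave CPU"


def truth_table():
    """Enumerate every combination of the two controllers' states."""
    rows = []
    for pch, bmc in product([False, True], repeat=2):
        level = select_level(pch, bmc)
        rows.append((pch, bmc, level, routed_cpu(level)))
    return rows


if __name__ == "__main__":
    for pch, bmc, level, cpu in truth_table():
        print(f"PCH low={pch!s:5} BMC low={bmc!s:5} level={level} -> {cpu}")
```

Under this model the master CPU is selected only in the default state, and either controller acting alone is enough to move the DMI bus to the slave CPU, which is the point of the dual-control arrangement.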

Claims

1. A method for fast fault recovery in a multi-way server, characterized in that:

The DMI bus of the PCH is connected to the master CPU and one slave CPU through a PCIe switch chip, and switching of the switch chip is jointly controlled by the PCH and the BMC; when the slave CPU fails, the system masks the slave CPU; when the master CPU fails, the BIOS or BMC automatically switches the DMI bus to the slave CPU and masks the failed master CPU, so that the system can recover quickly from the fault, thereby achieving fault masking of any CPU in the server.

2. The method according to claim 1, characterized in that the control signal of the switch chip is jointly controlled by a GPIO port of the PCH and by the BMC, and the control signal selects whether the DMI bus of the PCH is connected to the master CPU or to the slave CPU.

3. The method according to claim 2, characterized in that the switch chip selects the DMI bus of the master CPU by default, the control signal being at a high level, and in the default state both the GPIO port of the PCH and the BMC release control of the control signal; when the master CPU fails while the system is running, the BMC detects the fault of the master CPU, automatically pulls the control signal low, and performs one system restart, the switching of the DMI bus being completed after the restart.

4. The method according to claim 3, characterized in that, when the master CPU fails during the power-on self-test, the BIOS responds automatically according to the CPU self-test code, drives the GPIO port of the PCH to pull the control signal of the switch chip low, switches to the slave CPU, and performs a warm restart to run the self-test again, thereby completing the switching of the DMI bus.