CN104579802A - Method for quickly recovering faults of multi-path server - Google Patents
- Publication number
- CN104579802A (application number CN201510080647.4A)
- Authority
- CN
- China
- Prior art keywords
- cpu
- pch
- bmc
- control signal
- switch chip
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Hardware Redundancy (AREA)
Abstract
The invention provides a method for fast fault recovery in a multi-way server and relates to multi-way server architecture. The DMI bus of the PCH is connected to the master CPU and to one slave CPU through a PCIe switch chip, and switching of the switch chip is controlled jointly by the PCH and the BMC. When a slave CPU fails, the system masks that slave CPU. When the master CPU fails, the BIOS or the BMC automatically switches the DMI bus to the slave CPU and masks the failed master CPU, so that the system recovers quickly from the fault; in other words, a fault in any CPU of the server can be masked. This reduces the downtime incurred during fault recovery and minimizes the losses caused by system downtime due to a CPU fault.
Description
Technical field
The invention relates to multi-way server architecture, and in particular to a method for fast fault recovery in a multi-way server.
Background
In a conventional multi-way server architecture, the DMI bus of the south bridge chip (PCH) is connected to the master CPU, as shown in Fig. 1. At power-on, the PCH fetches the system configuration information, device drivers, self-test routines and so on from the BIOS, and the self-test of all CPUs and memory is carried out over the DMI bus between the PCH and the master CPU. When the self-test completes, the BIOS starts booting the operating system and power-on finishes. In this design the system can mask a failed slave CPU. If the master CPU fails, however, the DMI bus between it and the PCH stops working, the BIOS program cannot be loaded, and the system cannot mask the master CPU. Recovery then requires manually replacing the master CPU, which lengthens server downtime and is highly undesirable for servers running critical applications.
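The asymmetry described above (a failed slave CPU can be masked, a failed master CPU cannot) can be illustrated with a small model. This is only a sketch: the `Cpu` class, the `conventional_boot` function and the CPU names are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass


@dataclass
class Cpu:
    name: str
    healthy: bool = True


def conventional_boot(master: Cpu, slaves: list[Cpu]) -> list[str]:
    """Model a boot where the PCH's DMI bus is wired only to the master CPU."""
    if not master.healthy:
        # The BIOS is reached over the master's DMI link, so nothing can run.
        raise RuntimeError("master CPU failed: BIOS cannot load, manual replacement needed")
    # A failed slave CPU is simply masked (left out of the usable set).
    return [master.name] + [cpu.name for cpu in slaves if cpu.healthy]
```

In this model a slave fault only shrinks the list of usable CPUs, while a master fault aborts the boot entirely, which is the gap the invention addresses.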
Summary of the invention
To solve this problem, the invention proposes a new method for fast fault recovery in a multi-way server.
The technical solution of the invention is as follows:
The DMI bus of the PCH is connected to the master CPU and to one slave CPU through a PCIe switch chip, and switching of the switch chip is controlled jointly by the PCH and the baseboard management controller (BMC). Because the DMI bus uses the PCIe protocol, a PCIe switch chip preserves the signal integrity of the DMI bus. With this design, when a slave CPU fails, the system can mask that slave CPU. When the master CPU fails, the BIOS or the BMC automatically switches the DMI bus to the slave CPU and masks the failed master CPU, so that the system recovers quickly from the fault. A fault in any CPU of the server can therefore be masked, which greatly reduces the downtime incurred during fault recovery and minimizes the losses caused by system downtime due to a CPU fault. Having both the PCH and the BMC control the switchover ensures that the switch chip switches stably and quickly when the master CPU fails.
The control signal of the switch chip is driven jointly by a GPIO port of the PCH and by the BMC; this control signal selects whether the DMI bus of the PCH is connected to the master CPU or to the slave CPU.
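The shared control signal can be sketched as follows. The patent does not say how the two drivers are combined electrically; this sketch assumes a pulled-up line that either driver can pull low (a wired-AND), with high selecting the master CPU as in the default state described in this section. The names `Route`, `control_signal_high` and `dmi_route` are hypothetical.

```python
from enum import Enum


class Route(Enum):
    MASTER = "master CPU"
    SLAVE = "slave CPU"


def control_signal_high(pch_gpio_pulls_low: bool, bmc_pulls_low: bool) -> bool:
    """Assumed wired-AND line: high only while both drivers release it."""
    return not (pch_gpio_pulls_low or bmc_pulls_low)


def dmi_route(pch_gpio_pulls_low: bool = False, bmc_pulls_low: bool = False) -> Route:
    """High selects the master CPU; low selects the slave CPU."""
    if control_signal_high(pch_gpio_pulls_low, bmc_pulls_low):
        return Route.MASTER
    return Route.SLAVE
```

Under this assumption either controller can force the switchover on its own, which matches the dual-control intent: the BMC acts at runtime and the BIOS (through the PCH GPIO) acts during self-test.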
By default the switch chip selects the DMI bus of the master CPU and the control signal is high; in this default state both the GPIO port of the PCH and the BMC release control of the signal. If the master CPU fails while the system is running, the BMC detects the fault, automatically pulls the control signal low and restarts the system once; the DMI bus switchover is complete after the restart.
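A minimal sketch of the runtime path, seen from the BMC, might look like this. How the BMC detects the master CPU fault is not specified in the patent, so the sketch takes it as a boolean input; the `Board` class and `bmc_handle_runtime_status` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Board:
    control_high: bool = True      # default: both drivers released, line is high
    dmi_target: str = "master"     # CPU currently attached to the PCH's DMI bus
    restarts: int = 0

    def restart(self) -> None:
        """The switch route is re-evaluated as the system comes back up."""
        self.restarts += 1
        self.dmi_target = "master" if self.control_high else "slave"


def bmc_handle_runtime_status(board: Board, master_cpu_failed: bool) -> str:
    """BMC-side reaction while the OS is running."""
    if master_cpu_failed and board.control_high:
        board.control_high = False  # BMC pulls the control signal low
        board.restart()             # a single system restart completes the switchover
    return board.dmi_target
```

The guard on `board.control_high` keeps the handler idempotent in this model: once the signal is low, repeated fault reports do not trigger further restarts.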
If the master CPU fails during the power-on self-test, the BIOS responds automatically according to the CPU self-test code: it drives the GPIO port of the PCH to pull the switch chip's control signal low, switches to the slave CPU, and performs a warm restart to rerun the self-test, which completes the DMI bus switchover.
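The self-test path can be sketched in the same spirit. The self-test code values are not given in the patent, so `POST_OK` and the fault code used in the usage note are hypothetical, as is the `power_on_self_test` function. The error raised when both CPUs fail is an assumption added for completeness; the patent does not cover that case.

```python
POST_OK = 0x00  # hypothetical "CPU passed self-test" code


def power_on_self_test(post_codes: dict[str, int]) -> tuple[str, int]:
    """Return (CPU that ends up on the DMI bus, number of warm restarts).

    post_codes maps "master"/"slave" to hypothetical self-test result codes.
    """
    control_high = True          # default: switch routes DMI to the master CPU
    warm_restarts = 0
    if post_codes["master"] != POST_OK:
        control_high = False     # BIOS drives the PCH GPIO to pull the signal low
        warm_restarts += 1       # warm restart, then the self-test runs again
        if post_codes["slave"] != POST_OK:
            raise RuntimeError("no healthy CPU available for the DMI bus")
    return ("master" if control_high else "slave"), warm_restarts
```

For example, `power_on_self_test({"master": 0x15, "slave": POST_OK})` models a master CPU that fails self-test: the route moves to the slave CPU after one warm restart.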
With this design, when the master CPU fails, the BIOS or the BMC automatically switches the DMI bus to the slave CPU and masks the failed master CPU. The system can therefore recover quickly from the fault, which greatly reduces the downtime incurred during fault recovery and minimizes the losses caused by system downtime due to a CPU fault.
Brief description of the drawings
Fig. 1 is a schematic diagram of the connection structure of the prior art.
Fig. 2 is a schematic diagram of the connection structure of the invention.
Detailed description
The invention is described in more detail below.
As shown in Fig. 2:
1. The invention consists of a master CPU, a slave CPU, a switch chip, a PCH and a BMC.
2. The DMI buses of the master CPU and the slave CPU are both connected to the switch chip, and the other side of the switch chip is connected to the PCH of the system. The control signal of the switch chip is driven jointly by a GPIO port of the PCH and by the BMC, and this control signal selects whether the DMI bus of the PCH is connected to the master CPU or to the slave CPU.
3. By default the switch chip selects the DMI bus of the master CPU (control signal high); in this default state both the GPIO port of the PCH and the BMC release control of the signal. If the master CPU fails while the operating system is running, the BMC detects the fault, automatically pulls the control signal low and restarts the system once; the DMI bus switchover is complete after the restart.
4. If the master CPU fails during the power-on self-test, the BIOS responds automatically according to the CPU self-test code: it drives the GPIO port of the PCH to pull the switch chip's control signal low, switches to the slave CPU, and performs a warm restart to rerun the self-test, which completes the DMI bus switchover.
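Taken together with the slave-CPU case from the summary, the embodiment amounts to a small decision table keyed on which CPU failed and when. The sketch below encodes that table; the function name, the string labels and the `"unspecified"` restart value for the slave case (the patent does not say whether a restart is involved) are all illustrative assumptions.

```python
def recovery_action(failed_cpu: str, phase: str) -> dict[str, str]:
    """Map (failed CPU, phase) to the recovery behaviour described in the text.

    failed_cpu: "master" or "slave"; phase: "self_test" or "runtime".
    """
    if phase not in ("self_test", "runtime"):
        raise ValueError(f"unknown phase: {phase}")
    if failed_cpu == "slave":
        # The slave is masked; the DMI bus stays on the master CPU.
        return {"actor": "system", "control_signal": "high",
                "dmi_target": "master", "restart": "unspecified"}
    if failed_cpu == "master":
        if phase == "runtime":
            return {"actor": "BMC", "control_signal": "low",
                    "dmi_target": "slave", "restart": "system restart"}
        return {"actor": "BIOS via PCH GPIO", "control_signal": "low",
                "dmi_target": "slave", "restart": "warm restart"}
    raise ValueError(f"unknown CPU role: {failed_cpu}")
```

Both master-fault rows end with the control signal low and the DMI bus on the slave CPU; they differ only in which controller acts and in the kind of restart.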
Claims (4)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510080647.4A CN104579802A (en) | 2015-02-15 | 2015-02-15 | Method for quickly recovering faults of multi-path server |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510080647.4A CN104579802A (en) | 2015-02-15 | 2015-02-15 | Method for quickly recovering faults of multi-path server |
Publications (1)
Publication Number | Publication Date |
---|---|
CN104579802A true CN104579802A (en) | 2015-04-29 |
Family
ID=53095067
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510080647.4A Pending CN104579802A (en) | 2015-02-15 | 2015-02-15 | Method for quickly recovering faults of multi-path server |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104579802A (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105718333A (en) * | 2016-01-26 | 2016-06-29 | 山东超越数控电子有限公司 | Twin-channel server mainboard main-slave CPU switching device and switching control method thereof |
CN106294277A (en) * | 2015-12-29 | 2017-01-04 | 北京典赞科技有限公司 | The SMP of a kind of Based PC IE bus calculates system |
CN106844113A (en) * | 2017-03-10 | 2017-06-13 | 郑州云海信息技术有限公司 | The server failure recovery system and method for a kind of use redundancy PCH |
CN107003914A (en) * | 2016-10-31 | 2017-08-01 | 华为技术有限公司 | Start the method and enabled device of physical equipment |
CN107682179A (en) * | 2017-08-31 | 2018-02-09 | 郑州云海信息技术有限公司 | A kind of server collocation method and device based on prestored information |
CN107688540A (en) * | 2017-09-11 | 2018-02-13 | 郑州云海信息技术有限公司 | A kind of method that long-range Debug is carried out using BMC |
CN110162502A (en) * | 2019-04-15 | 2019-08-23 | 深圳市同泰怡信息技术有限公司 | A kind of server for realizing various configurations based on central processing unit |
CN110764829A (en) * | 2019-09-21 | 2020-02-07 | 苏州浪潮智能科技有限公司 | A method and system for isolating CPU of a multi-channel server |
CN115454730A (en) * | 2022-09-16 | 2022-12-09 | 苏州浪潮智能科技有限公司 | A server dual redundant CPU device and switching method |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101894060A (en) * | 2010-06-25 | 2010-11-24 | 福建星网锐捷网络有限公司 | Fault detection method and modular device |
US20130173839A1 (en) * | 2011-12-31 | 2013-07-04 | Huawei Technologies Co., Ltd. | Switch disk array, storage system and data storage path switching method |
CN104125049A (en) * | 2014-08-08 | 2014-10-29 | 浪潮电子信息产业股份有限公司 | Redundancy implementation method of PCIE (Peripheral Component Interface Express) device based on BRICKLAND platform |
2015-02-15: CN application CN201510080647.4A filed (published as CN104579802A); status: Pending
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106294277A (en) * | 2015-12-29 | 2017-01-04 | 北京典赞科技有限公司 | The SMP of a kind of Based PC IE bus calculates system |
CN105718333A (en) * | 2016-01-26 | 2016-06-29 | 山东超越数控电子有限公司 | Twin-channel server mainboard main-slave CPU switching device and switching control method thereof |
WO2018076351A1 (en) * | 2016-10-31 | 2018-05-03 | 华为技术有限公司 | Method and enabling device for starting physical device |
CN107003914A (en) * | 2016-10-31 | 2017-08-01 | 华为技术有限公司 | Start the method and enabled device of physical equipment |
CN107003914B (en) * | 2016-10-31 | 2020-11-13 | 华为技术有限公司 | Method and enabling device for starting physical equipment |
EP3764234A1 (en) * | 2016-10-31 | 2021-01-13 | Huawei Technologies Co. Ltd. | Method and enable apparatus for starting physical device |
US11068348B2 (en) | 2016-10-31 | 2021-07-20 | Huawei Technologies Co., Ltd. | Method and enable apparatus for starting physical device |
CN106844113A (en) * | 2017-03-10 | 2017-06-13 | 郑州云海信息技术有限公司 | The server failure recovery system and method for a kind of use redundancy PCH |
CN106844113B (en) * | 2017-03-10 | 2020-09-29 | 苏州浪潮智能科技有限公司 | Server fault recovery system and method adopting redundant PCH |
CN107682179A (en) * | 2017-08-31 | 2018-02-09 | 郑州云海信息技术有限公司 | A kind of server collocation method and device based on prestored information |
CN107688540A (en) * | 2017-09-11 | 2018-02-13 | 郑州云海信息技术有限公司 | A kind of method that long-range Debug is carried out using BMC |
CN110162502A (en) * | 2019-04-15 | 2019-08-23 | 深圳市同泰怡信息技术有限公司 | A kind of server for realizing various configurations based on central processing unit |
CN110764829A (en) * | 2019-09-21 | 2020-02-07 | 苏州浪潮智能科技有限公司 | A method and system for isolating CPU of a multi-channel server |
CN110764829B (en) * | 2019-09-21 | 2022-07-08 | 苏州浪潮智能科技有限公司 | Multi-path server CPU isolation method and system |
CN115454730A (en) * | 2022-09-16 | 2022-12-09 | 苏州浪潮智能科技有限公司 | A server dual redundant CPU device and switching method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104579802A (en) | Method for quickly recovering faults of multi-path server | |
US10671498B2 (en) | Method and apparatus for redundancy in active-active cluster system | |
TWI633487B (en) | Method and computer system for automatically recovering the bios image file | |
CN102419719A (en) | Computer system and method for starting same | |
CN104050061B (en) | A kind of Based PC Ie bus many master control board redundancies standby system | |
CN109815043A (en) | Troubleshooting method, related equipment and computer storage medium | |
WO2016091033A1 (en) | Method and server for presenting initialization degree of hardware in server | |
CN104570721B (en) | Redundant manipulator master slave mode determines method | |
CN105700969A (en) | Server system | |
CN104199753B (en) | A kind of virtual machine application service fault recovery system and its fault recovery method | |
CN107844330A (en) | A kind of method and system of enhancing ARM startup of server code reliabilities | |
CN105653405B (en) | A kind of fault handling method and system of Generic Bootstrap | |
CN107168829B (en) | Method and system for ensuring safe and reliable operation of double BIOS of server system | |
WO2016033941A1 (en) | Boot on-line upgrading apparatus and method | |
US10379931B2 (en) | Computer system | |
CN102891762B (en) | The system and method for network data continuously | |
CN108737153A (en) | Block chain disaster recovery and backup systems, method, server and computer readable storage medium | |
CN106886441A (en) | A kind of server system and FLASH collocation methods | |
US20120030504A1 (en) | High reliability computer system and its configuration method | |
TWI528287B (en) | Server system | |
CN111858148A (en) | A PCIE Switch chip configuration file recovery system and method | |
CN102426512A (en) | A virtualization-based implementation method of storage dual-controller disk array | |
TWI857744B (en) | A dual processing module device and control method thereof | |
CN104166599A (en) | Method for recovering delivery configuration by restarting ARM device | |
TWI541724B (en) | Circuit and method for writing bios code into bios |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
WD01 | Invention patent application deemed withdrawn after publication | | Application publication date: 20150429 |