CN104579802A - Method for quickly recovering faults of multi-path server - Google Patents

Publication number
CN104579802A
Authority
CN
China
Prior art keywords
cpu
pch
fault
host cpu
bmc
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510080647.4A
Other languages
Chinese (zh)
Inventor
王岩
薛广营
黄小东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Electronic Information Industry Co Ltd
Original Assignee
Inspur Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Electronic Information Industry Co Ltd filed Critical Inspur Electronic Information Industry Co Ltd
Priority to CN201510080647.4A priority Critical patent/CN104579802A/en
Publication of CN104579802A publication Critical patent/CN104579802A/en
Pending legal-status Critical Current

Landscapes

  • Hardware Redundancy (AREA)

Abstract

The invention provides a method for fast fault recovery of a multi-path server and relates to multi-path server architecture. The DMI bus of the PCH is connected to a master CPU and a slave CPU through a PCIE switch chip, and switching of the switch chip is controlled jointly by the PCH and the BMC. When the slave CPU fails, the system masks the slave CPU. When the master CPU fails, the BIOS or the BMC automatically switches the DMI bus to the slave CPU and masks the faulty master CPU, so the system recovers quickly from the fault; that is, a fault in any one CPU in the server can be masked. This reduces server downtime during fault recovery and minimizes the loss caused by system downtime due to CPU faults.

Description

Method for fast fault recovery of a multi-path server
Technical field
The present invention relates to multi-path server architecture, and in particular to a method for fast fault recovery of a multi-path server.
Background art
In a common multi-path server architecture, the DMI bus of the south bridge chip (PCH) is connected to the master CPU, as shown in Fig. 1. When the system powers on, the PCH obtains the system's configuration information, device drivers, self-test program and so on from the BIOS, and the self-test of all CPUs and memory is carried out over the DMI bus between the PCH and the master CPU. After the self-test completes, the BIOS begins booting the operating system, which completes startup. In this design the system can mask a faulty slave CPU. If the master CPU fails, however, the DMI bus between it and the PCH cannot work, the BIOS program cannot be loaded, and the system cannot mask the master CPU. Recovery then requires manually replacing the master CPU, which lengthens server downtime and is a serious drawback for servers running critical applications.
Summary of the invention
To solve this problem, the present invention proposes a new method for fast fault recovery of a multi-path server.
The technical solution of the present invention is as follows:
The DMI bus of the PCH is connected to the master CPU and one slave CPU through a PCIE switch chip, and switching of the switch chip is controlled jointly by the PCH and the baseboard management controller (BMC). Because the DMI bus uses the PCIE protocol, a PCIE switch chip can preserve the signal integrity of the DMI bus. Under this scheme, when the slave CPU fails, the system masks that slave CPU. When the master CPU fails, the BIOS or the BMC automatically switches the DMI bus to the slave CPU and masks the faulty master CPU, so the system recovers quickly from the fault. A fault in any one CPU in the server can therefore be masked, which greatly reduces server downtime during fault recovery and minimizes the loss caused by system downtime due to CPU faults. Dual control of the switchover by the PCH and the BMC ensures that the switch chip switches stably and quickly when the master CPU fails.
The control signal of the switch chip is controlled jointly by GPIO ports of the PCH and the BMC; the control signal selects whether the DMI bus of the PCH is connected to the master CPU or to the slave CPU.
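The shared control signal can be pictured with a small model. The sketch below is illustrative only: it assumes the line is pulled up and that both GPIO ports behave as open-drain drivers (the text says only that both release the signal by default and that it is then high), and the class and attribute names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SwitchControlLine:
    """Hypothetical model of the switch chip's select line.

    Assumption: the line has a pull-up and both GPIO ports are open-drain,
    so each controller can either release the line or pull it low.
    """
    pch_pulls_low: bool = False  # PCH GPIO releases the line by default
    bmc_pulls_low: bool = False  # BMC GPIO releases the line by default

    def level(self) -> int:
        # The line stays high only while both controllers release it.
        return 0 if (self.pch_pulls_low or self.bmc_pulls_low) else 1

    def dmi_target(self) -> str:
        # High selects the master CPU; low selects the slave CPU.
        return "master" if self.level() == 1 else "slave"


line = SwitchControlLine()
print(line.dmi_target())   # default state: both GPIO ports released
line.bmc_pulls_low = True  # either controller alone can force the switch
print(line.dmi_target())
```

Under this assumption either controller can force the switchover on its own, which is one way to read the dual-control design.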
By default the switch chip selects the DMI bus of the master CPU, with the control signal at high level; in this default state the GPIO ports of both the PCH and the BMC release control of the signal. If the master CPU fails while the system is running, the BMC detects the fault, automatically pulls the control signal low, and restarts the system once; the DMI bus switchover is complete after the restart.
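The runtime path handled by the BMC can be sketched as a short simulation. This is a minimal sketch under stated assumptions, not BMC firmware: the `Server` class, the `bmc_poll` function and the health flag passed to it are hypothetical, and the text does not describe how the BMC detects the fault.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Server:
    """Hypothetical server state seen by the BMC."""
    control_high: bool = True          # default: switch selects the master CPU
    dmi_target: str = "master"
    masked: Set[str] = field(default_factory=set)
    restarts: int = 0

    def restart(self) -> None:
        # The new switch setting takes effect across the restart.
        self.restarts += 1
        self.dmi_target = "master" if self.control_high else "slave"


def bmc_poll(server: Server, master_healthy: bool) -> bool:
    """Return True if the BMC performed a failover on this poll."""
    if master_healthy or "master" in server.masked:
        return False
    server.control_high = False   # pull the control signal low
    server.masked.add("master")   # mask the faulty master CPU
    server.restart()              # a single system restart
    return True


srv = Server()
bmc_poll(srv, master_healthy=False)
print(srv.dmi_target, srv.restarts)
```

The guard on `server.masked` keeps the sketch from restarting repeatedly once the master CPU has already been masked.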
If the master CPU fails during the power-on self-test, the BIOS responds automatically according to the CPU's self-test code: it drives the GPIO port of the PCH to pull the switch chip's control signal low, switches to the slave CPU, and performs a warm restart and self-test again, completing the DMI bus switchover.
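The power-on path handled by the BIOS can be sketched in the same spirit. The function name and the `post_ok` dictionary of self-test results are hypothetical, and the case where both CPUs fail is not covered by the text, so the sketch simply reports that no DMI target is available.

```python
from typing import Dict, List


def power_on_self_test(post_ok: Dict[str, bool]) -> Dict[str, object]:
    """Hypothetical BIOS self-test flow with DMI failover.

    post_ok maps "master" and "slave" to their self-test result.
    """
    control_high = True        # default: switch selects the master CPU
    masked: List[str] = []
    warm_restarts = 0

    if not post_ok["master"]:
        control_high = False   # PCH GPIO pulls the control signal low
        masked.append("master")
        warm_restarts += 1     # warm restart, then self-test again

    active = "master" if control_high else "slave"
    if active == "master" and not post_ok["slave"]:
        masked.append("slave")  # a faulty slave CPU is simply masked

    return {
        "dmi_target": active if post_ok[active] else None,
        "masked": masked,
        "warm_restarts": warm_restarts,
    }


print(power_on_self_test({"master": False, "slave": True}))
```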
With this design, when the master CPU fails, the BIOS or the BMC automatically switches the DMI bus to the slave CPU and masks the faulty master CPU, so the system recovers quickly from the fault. This greatly reduces server downtime during fault recovery and minimizes the loss caused by system downtime due to CPU faults.
Description of the drawings
Fig. 1 is a schematic diagram of the connection structure of the prior art.
Fig. 2 is a schematic diagram of the connection structure of the present invention.
Detailed description
The invention is described in more detail below.
As shown in Fig. 2:
1. The invention consists of a master CPU, a slave CPU, a switch chip, a PCH and a BMC.
2. The DMI buses of the master CPU and the slave CPU are both connected to the switch chip, and the other end of the switch chip is connected to the system's PCH. The control signal of the switch chip is controlled jointly by GPIO ports of the PCH and the BMC; the control signal selects whether the DMI bus of the PCH is connected to the master CPU or to the slave CPU.
3. By default the switch chip selects the DMI bus of the master CPU (control signal at high level); in this default state the GPIO ports of both the PCH and the BMC release control of the signal. If the master CPU fails while the operating system is running, the BMC detects the fault, automatically pulls the control signal low, and restarts the system once; the DMI bus switchover is complete after the restart.
4. If the master CPU fails during the power-on self-test, the BIOS responds automatically according to the CPU's self-test code: it drives the GPIO port of the PCH to pull the switch chip's control signal low, switches to the slave CPU, and performs a warm restart and self-test again, completing the DMI bus switchover.
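The difference from the prior-art connection of Fig. 1 can be summarized as a decision function. This is an illustrative sketch only; the function name, the `has_switch` flag and the return convention are assumptions made for the example.

```python
from typing import List, Optional, Set, Tuple


def recovery_outcome(failed: Set[str], has_switch: bool) -> Tuple[Optional[str], List[str]]:
    """Return (CPU that the DMI bus ends up on, CPUs that are masked).

    has_switch=False models the prior art (DMI wired to the master CPU only);
    has_switch=True models the switch-chip arrangement of Fig. 2.
    A DMI target of None means manual replacement is still required.
    """
    if "master" not in failed:
        # A faulty slave CPU can be masked in both architectures.
        return "master", sorted(failed)
    if has_switch and "slave" not in failed:
        # Master fault: switch the DMI bus to the slave CPU and mask the master.
        return "slave", ["master"]
    return None, []


print(recovery_outcome({"master"}, has_switch=False))
print(recovery_outcome({"master"}, has_switch=True))
```

Only the master-fault case differs between the two calls, which is the case the invention targets.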

Claims (4)

1. A method for fast fault recovery of a multi-path server, characterized in that:
the DMI bus of the PCH is connected to a master CPU and a slave CPU through a PCIE switch chip, and switching of the switch chip is controlled jointly by the PCH and the BMC; when the slave CPU fails, the system masks the slave CPU; when the master CPU fails, the BIOS or the BMC automatically switches the DMI bus to the slave CPU and masks the faulty master CPU, so that the system recovers quickly from the fault, thereby masking a fault in any one CPU in the server.
2. The method according to claim 1, characterized in that the control signal of the switch chip is controlled jointly by GPIO ports of the PCH and the BMC, and the control signal selects whether the DMI bus of the PCH is connected to the master CPU or to the slave CPU.
3. The method according to claim 2, characterized in that by default the switch chip selects the DMI bus of the master CPU, with the control signal at high level, and in this default state the GPIO ports of both the PCH and the BMC release control of the signal; if the master CPU fails while the system is running, the BMC detects the fault, automatically pulls the control signal low, and restarts the system once, and the DMI bus switchover is complete after the restart.
4. The method according to claim 3, characterized in that if the master CPU fails during the power-on self-test, the BIOS responds automatically according to the CPU's self-test code, drives the GPIO port of the PCH to pull the switch chip's control signal low, switches to the slave CPU, and performs a warm restart and self-test again, completing the DMI bus switchover.
CN201510080647.4A 2015-02-15 2015-02-15 Method for quickly recovering faults of multi-path server Pending CN104579802A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510080647.4A CN104579802A (en) 2015-02-15 2015-02-15 Method for quickly recovering faults of multi-path server

Publications (1)

Publication Number Publication Date
CN104579802A true CN104579802A (en) 2015-04-29

Family

ID=53095067

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510080647.4A Pending CN104579802A (en) 2015-02-15 2015-02-15 Method for quickly recovering faults of multi-path server

Country Status (1)

Country Link
CN (1) CN104579802A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101894060A (en) * 2010-06-25 2010-11-24 福建星网锐捷网络有限公司 Fault detection method and modular device
US20130173839A1 (en) * 2011-12-31 2013-07-04 Huawei Technologies Co., Ltd. Switch disk array, storage system and data storage path switching method
CN104125049A (en) * 2014-08-08 2014-10-29 浪潮电子信息产业股份有限公司 Redundancy implementation method of PCIE (Peripheral Component Interface Express) device based on BRICKLAND platform

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106294277A (en) * 2015-12-29 2017-01-04 北京典赞科技有限公司 SMP computing system based on PCIE bus
CN105718333A (en) * 2016-01-26 2016-06-29 山东超越数控电子有限公司 Twin-channel server mainboard main-slave CPU switching device and switching control method thereof
WO2018076351A1 (en) * 2016-10-31 2018-05-03 华为技术有限公司 Method and enabling device for starting physical device
CN107003914A (en) * 2016-10-31 2017-08-01 华为技术有限公司 Start the method and enabled device of physical equipment
CN107003914B (en) * 2016-10-31 2020-11-13 华为技术有限公司 Method and enabling device for starting physical equipment
EP3764234A1 (en) * 2016-10-31 2021-01-13 Huawei Technologies Co. Ltd. Method and enable apparatus for starting physical device
US11068348B2 (en) 2016-10-31 2021-07-20 Huawei Technologies Co., Ltd. Method and enable apparatus for starting physical device
CN106844113A (en) * 2017-03-10 2017-06-13 郑州云海信息技术有限公司 Server fault recovery system and method using a redundant PCH
CN106844113B (en) * 2017-03-10 2020-09-29 苏州浪潮智能科技有限公司 Server fault recovery system and method adopting redundant PCH
CN107682179A (en) * 2017-08-31 2018-02-09 郑州云海信息技术有限公司 Server configuration method and device based on prestored information
CN107688540A (en) * 2017-09-11 2018-02-13 郑州云海信息技术有限公司 Method for remote debugging using a BMC
CN110162502A (en) * 2019-04-15 2019-08-23 深圳市同泰怡信息技术有限公司 Server realizing multiple configurations based on a central processing unit
CN110764829A (en) * 2019-09-21 2020-02-07 苏州浪潮智能科技有限公司 Multi-path server CPU isolation method and system
CN110764829B (en) * 2019-09-21 2022-07-08 苏州浪潮智能科技有限公司 Multi-path server CPU isolation method and system

Similar Documents

Publication Publication Date Title
CN104579802A (en) Method for quickly recovering faults of multi-path server
CN109815043B (en) Fault processing method, related equipment and computer storage medium
CN106528097B (en) Version synchronization method and electronic device for dual BIOS firmware
CN105974879B (en) Redundant control device, system and control method in digital I&C system
CN203786723U (en) Dual redundant system based on X86 PC/104 embedded CPU modules
CN104050061B (en) Multi-master control board redundancy backup system based on PCIe bus
US10223103B2 (en) Rom flashing method and intelligent terminal
CN103136048B (en) Computer system
CN104503783A (en) Method and server for presenting initialization degree of server hardware
US20200310933A1 (en) Device fault processing method, apparatus, and system
CN104424037B (en) Method and device for dynamically patching a function
CN107844330A (en) Method and system for enhancing the reliability of ARM server boot code
CN103488498A (en) Computer booting method and computer
CN105242980A (en) Complementary watchdog system and complementary watchdog monitoring method
CN102708027B (en) Method and system for avoiding communication device outages
US20140157051A1 (en) Method and device for debugging a mips-structure cpu with southbridge and northbridge chipsets
CN104484243B (en) Highly reliable system, device and method combining virtual machine fault-tolerance technology and high-availability cluster technology
CN104731675A (en) Intelligent redundancy backup method for BIOS in server system
CN104461762A (en) Automatic restart method for a halted device
CN102200933A (en) System BIOS (Basic Input/Output System) automatic restoring method based on dual SPI (Serial Peripheral interface) Flashes
CN104125049A (en) Redundancy implementation method of PCIE (Peripheral Component Interface Express) device based on BRICKLAND platform
CN103235755A (en) Basic input output system (BIOS) remote network debugging method
US20120030504A1 (en) High reliability computer system and its configuration method
CN103873516A (en) HA method and system for improving usage rate of physical servers in cloud computing resource pool
US9026838B2 (en) Computer system, host-bus-adaptor control method, and program thereof

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20150429
