CN104579802A - Method for quickly recovering faults of multi-path server - Google Patents
- Publication number: CN104579802A
- Application number: CN201510080647.4A
- Authority: CN (China)
- Prior art keywords: cpu, pch, fault, host cpu, bmc
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Hardware Redundancy (AREA)
Abstract
The invention provides a method for fast fault recovery in a multi-path server and relates to multi-path server architecture technology. The DMI bus of the PCH is connected to a main CPU and a slave CPU through a PCIE switch chip, and switching of the switch chip is jointly controlled by the PCH and the BMC. When the slave CPU fails, the system masks the slave CPU. When the main CPU fails, the BIOS or the BMC automatically switches the DMI bus to the slave CPU and masks the faulty main CPU, so that the system recovers quickly from the fault; fault masking is thus achieved for any single CPU in the server. This reduces server downtime during fault recovery and minimizes the loss caused by system downtime due to a CPU fault.
Description
Technical field
The present invention relates to multi-path (multi-processor) server architecture technology, and in particular to a method for fast fault recovery in a multi-path server.
Background art
In a common multi-path server architecture, the DMI bus of the south bridge chip (PCH) is connected to the main CPU, as shown in Fig. 1. When the system powers on, the PCH obtains the system's configuration information, device drivers, self-test programs and so on from the BIOS, and the self-test of all CPUs and memory is carried out over the DMI bus between the PCH and the main CPU. After the self-test completes, the BIOS begins booting the operating system and startup finishes. In this design the system can mask a faulty slave CPU. If the main CPU fails, however, the DMI bus between it and the PCH cannot work, the BIOS program cannot be loaded, and the system cannot mask the main CPU. Fault recovery then requires manually replacing the main CPU, which increases server downtime; this is highly unfavorable for servers running critical applications.
Summary of the invention
To solve this problem, the present invention proposes a new method for fast fault recovery in a multi-path server.
The technical solution of the present invention is as follows:
The DMI bus of the PCH is connected to the main CPU and one slave CPU through a PCIE switch chip, and switching of the switch chip is jointly controlled by the PCH and the baseboard management controller (BMC). Because the DMI bus uses the PCIE protocol, a PCIE switch chip can preserve the signal integrity of the DMI bus. Under this scheme, when the slave CPU fails, the system can mask that slave CPU; when the main CPU fails, the BIOS or the BMC can automatically switch the DMI bus to the slave CPU and mask the faulty main CPU. The system can thus recover quickly from the fault: fault masking is achieved for any single CPU in the server, downtime during fault recovery is greatly reduced, and the loss caused by system downtime due to a CPU fault is minimized. Dual control by the PCH and the BMC ensures that the switch chip can switch stably and quickly when the main CPU fails.
The control signal of the switch chip is jointly controlled by GPIO ports of the PCH and the BMC; the control signal selects whether the DMI bus of the PCH is connected to the main CPU or to the slave CPU.
By default the switch chip selects the DMI bus of the main CPU and the control signal is high; in this default state the GPIO ports of the PCH and the BMC both release control of the signal. If the main CPU fails while the system is running, the BMC detects the fault, automatically pulls the control signal low, and performs one system restart; the DMI bus switchover is complete after the restart.
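The shared control signal can be illustrated with a small software model. This is only a sketch under stated assumptions: the source does not give the electrical implementation, so the line is modeled here as resting high and going low whenever either the PCH or the BMC pulls it low. The names `SwitchControlLine` and `bmc_on_runtime_fault` are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass

HIGH, LOW = 1, 0


@dataclass
class SwitchControlLine:
    """Model of the switch chip's select signal (assumed wired-AND behavior)."""
    pch_pulls_low: bool = False  # False means the PCH GPIO has released the line
    bmc_pulls_low: bool = False  # False means the BMC GPIO has released the line

    def level(self) -> int:
        # The line stays high only while both controllers release it.
        return LOW if (self.pch_pulls_low or self.bmc_pulls_low) else HIGH

    def selected_cpu(self) -> str:
        # High selects the main CPU's DMI bus; low selects the slave CPU's.
        return "main" if self.level() == HIGH else "slave"


def bmc_on_runtime_fault(line: SwitchControlLine, main_cpu_faulty: bool) -> bool:
    """Runtime path: pull the line low on a main CPU fault.

    Returns True when a system restart should follow (hypothetical convention).
    """
    if not main_cpu_faulty:
        return False
    line.bmc_pulls_low = True
    return True
```

In this model the default state selects the main CPU, and a single call to `bmc_on_runtime_fault(line, True)` moves the selection to the slave CPU before the restart.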
If the main CPU fails during the power-on self-test, the BIOS responds automatically according to the CPU's self-test code: it drives the PCH GPIO port to pull the switch chip's control signal low, switching to the slave CPU, and then performs a warm restart and runs the self-test again, which completes the DMI bus switchover.
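The power-on path described above can be sketched as a short control loop. This is an illustrative model rather than BIOS code: `bios_power_on` and the `self_test` callback are hypothetical names, and raising an error when the slave CPU also fails is an assumption, since the source does not describe that case.

```python
from typing import Callable


def bios_power_on(self_test: Callable[[str], bool]) -> str:
    """Return the CPU the system boots from after the self-test.

    self_test(cpu) is a hypothetical callback that reports whether the
    self-test passes when the DMI bus is routed to the given CPU.
    """
    pch_gpio_low = False  # released: the switch chip selects the main CPU
    for _ in range(2):  # first pass, then at most one warm restart
        selected = "slave" if pch_gpio_low else "main"
        if self_test(selected):
            return selected
        if selected == "slave":
            break  # nothing left to switch to
        # Main CPU failed: pull the control signal low through the PCH GPIO,
        # then loop again, which stands in for the warm restart and re-test.
        pch_gpio_low = True
    # Assumption: the source does not say what happens if both CPUs fail.
    raise RuntimeError("self-test failed on both CPUs")
```

For example, `bios_power_on(lambda cpu: cpu == "slave")` models a faulty main CPU and returns `"slave"` after one simulated warm restart.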
With this design, when the main CPU fails, the BIOS or the BMC can automatically switch the DMI bus to the slave CPU and mask the faulty main CPU, so the system recovers quickly from the fault. Downtime during fault recovery is greatly reduced, and the loss caused by system downtime due to a CPU fault is minimized.
Brief description of the drawings
Fig. 1 is a schematic diagram of the connection structure of the prior art.
Fig. 2 is a schematic diagram of the connection structure of the present invention.
Detailed description of the embodiment
The content of the present invention is described in more detail below.
As shown in Fig. 2:
1. The invention consists of a main CPU, a slave CPU, a switch chip, a PCH and a BMC.
2. The DMI buses of the main CPU and the slave CPU are both connected to the switch chip, and the other side of the switch chip is connected to the system's PCH. The control signal of the switch chip is jointly controlled by GPIO ports of the PCH and the BMC; the control signal selects whether the DMI bus of the PCH is connected to the main CPU or to the slave CPU.
3. By default the switch chip selects the DMI bus of the main CPU (control signal high); in this default state the GPIO ports of the PCH and the BMC both release control of the signal. If the main CPU fails while the operating system is running, the BMC detects the fault, automatically pulls the control signal low, and performs one system restart; the DMI bus switchover is complete after the restart.
4. If the main CPU fails during the power-on self-test, the BIOS responds automatically according to the CPU's self-test code: it drives the PCH GPIO port to pull the switch chip's control signal low, switching to the slave CPU, and then performs a warm restart and runs the self-test again, which completes the DMI bus switchover.
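The fault cases covered by the embodiment can be summarized as a small decision table. The sketch below only restates the behavior described above; the `Phase` enum, the `recovery_action` function and the dictionary keys are hypothetical names chosen for illustration, and the restart behavior for a slave CPU fault is left as `None` because the source does not specify it.

```python
from enum import Enum
from typing import Optional


class Phase(Enum):
    POST = "power-on self-test"
    RUNTIME = "operating system running"


def recovery_action(faulty_cpu: str, phase: Phase) -> dict:
    """Map a single-CPU fault to the recovery steps described in the text."""
    restart: Optional[str]
    if faulty_cpu == "slave":
        # The system masks the slave CPU; the DMI bus stays on the main CPU.
        actor, switch_to, restart = "system", None, None  # restart: not stated
    elif faulty_cpu == "main":
        if phase is Phase.RUNTIME:
            # The BMC detects the fault, pulls the control signal low, restarts.
            actor, restart = "BMC", "system restart"
        else:
            # The BIOS reacts to the self-test code through the PCH GPIO.
            actor, restart = "BIOS", "warm restart"
        switch_to = "slave"
    else:
        raise ValueError(f"unknown CPU role: {faulty_cpu!r}")
    return {
        "actor": actor,
        "switch_dmi_to": switch_to,
        "mask": faulty_cpu,
        "restart": restart,
    }
```

Calling `recovery_action("main", Phase.POST)` yields the BIOS-driven path, while `recovery_action("main", Phase.RUNTIME)` yields the BMC-driven path.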
Claims (4)
1. A method for fast fault recovery in a multi-path server, characterized in that:
the DMI bus of the PCH is connected to the main CPU and one slave CPU through a PCIE switch chip, and switching of the switch chip is jointly controlled by the PCH and the BMC; when the slave CPU fails, the system masks that slave CPU; when the main CPU fails, the BIOS or the BMC automatically switches the DMI bus to the slave CPU and masks the faulty main CPU, so that the system recovers quickly from the fault, thereby achieving fault masking of any single CPU in the server.
2. The method according to claim 1, characterized in that the control signal of the switch chip is jointly controlled by GPIO ports of the PCH and the BMC, and the control signal selects whether the DMI bus of the PCH is connected to the main CPU or to the slave CPU.
3. The method according to claim 2, characterized in that by default the switch chip selects the DMI bus of the main CPU and the control signal is high, and in this default state the GPIO ports of the PCH and the BMC both release control of the signal; if the main CPU fails while the system is running, the BMC detects the fault, automatically pulls the control signal low, and performs one system restart, the DMI bus switchover being complete after the restart.
4. The method according to claim 3, characterized in that if the main CPU fails during the power-on self-test, the BIOS responds automatically according to the CPU's self-test code, drives the PCH GPIO port to pull the switch chip's control signal low so as to switch to the slave CPU, and then performs a warm restart and runs the self-test again, completing the DMI bus switchover.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510080647.4A CN104579802A (en) | 2015-02-15 | 2015-02-15 | Method for quickly recovering faults of multi-path server |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510080647.4A CN104579802A (en) | 2015-02-15 | 2015-02-15 | Method for quickly recovering faults of multi-path server |
Publications (1)
Publication Number | Publication Date |
---|---|
CN104579802A (en) | 2015-04-29 |
Family
ID=53095067
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510080647.4A (Pending), published as CN104579802A (en) | 2015-02-15 | 2015-02-15 | Method for quickly recovering faults of multi-path server |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104579802A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101894060A (en) * | 2010-06-25 | 2010-11-24 | 福建星网锐捷网络有限公司 | Fault detection method and modular device |
US20130173839A1 (en) * | 2011-12-31 | 2013-07-04 | Huawei Technologies Co., Ltd. | Switch disk array, storage system and data storage path switching method |
CN104125049A (en) * | 2014-08-08 | 2014-10-29 | 浪潮电子信息产业股份有限公司 | Redundancy implementation method of PCIE (Peripheral Component Interface Express) device based on BRICKLAND platform |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106294277A (en) * | 2015-12-29 | 2017-01-04 | 北京典赞科技有限公司 | SMP computing system based on PCIE bus |
CN105718333A (en) * | 2016-01-26 | 2016-06-29 | 山东超越数控电子有限公司 | Twin-channel server mainboard main-slave CPU switching device and switching control method thereof |
WO2018076351A1 (en) * | 2016-10-31 | 2018-05-03 | 华为技术有限公司 | Method and enabling device for starting physical device |
CN107003914A (en) * | 2016-10-31 | 2017-08-01 | 华为技术有限公司 | Method and enabling device for starting physical equipment |
CN107003914B (en) * | 2016-10-31 | 2020-11-13 | 华为技术有限公司 | Method and enabling device for starting physical equipment |
EP3764234A1 (en) * | 2016-10-31 | 2021-01-13 | Huawei Technologies Co. Ltd. | Method and enable apparatus for starting physical device |
US11068348B2 (en) | 2016-10-31 | 2021-07-20 | Huawei Technologies Co., Ltd. | Method and enable apparatus for starting physical device |
CN106844113A (en) * | 2017-03-10 | 2017-06-13 | 郑州云海信息技术有限公司 | Server fault recovery system and method adopting redundant PCH |
CN106844113B (en) * | 2017-03-10 | 2020-09-29 | 苏州浪潮智能科技有限公司 | Server fault recovery system and method adopting redundant PCH |
CN107682179A (en) * | 2017-08-31 | 2018-02-09 | 郑州云海信息技术有限公司 | Server configuration method and device based on pre-stored information |
CN107688540A (en) * | 2017-09-11 | 2018-02-13 | 郑州云海信息技术有限公司 | Method for remote debugging using a BMC |
CN110162502A (en) * | 2019-04-15 | 2019-08-23 | 深圳市同泰怡信息技术有限公司 | Server implementing multiple configurations based on central processing units |
CN110764829A (en) * | 2019-09-21 | 2020-02-07 | 苏州浪潮智能科技有限公司 | Multi-path server CPU isolation method and system |
CN110764829B (en) * | 2019-09-21 | 2022-07-08 | 苏州浪潮智能科技有限公司 | Multi-path server CPU isolation method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20150429 |