WO2016106965A1 - Server self-healing method and device - Google Patents

Server self-healing method and device

Info

Publication number
WO2016106965A1
Authority
WO
WIPO (PCT)
Prior art keywords
memory
bmc
information
abnormal
abnormality
Application number
PCT/CN2015/073265
Other languages
French (fr)
Chinese (zh)
Inventor
李军
Original Assignee
中兴通讯股份有限公司
Application filed by 中兴通讯股份有限公司
Publication of WO2016106965A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance

Definitions

  • The present invention relates to the field of servers, and in particular to a method and apparatus for server self-healing.
  • In a server, the BMC (Baseboard Management Controller, the out-of-band management module) monitors the working status of the server, manages server power-on and power-off, and handles server abnormalities promptly and raises alarms.
  • The BMC exists as independent firmware. It can accept commands from the SMM (System Management Module) and report monitored server abnormality information to the SMM. It can also provide a B/C (Browser/Client, a management interface browser/client), accept control commands or control policies issued by the B/C, and return the current or historical health status of the server to the B/C.
  • The reliability of server memory directly affects the stability and reliability of the board. A memory fault directly interrupts service and, in severe cases, brings the system down.
  • Embodiments of the present invention provide a method and apparatus for server self-healing, to reduce the need for on-site manual intervention and operation when a server fails.
  • An embodiment of the present invention provides a method for server self-healing, the method comprising:
  • the out-of-band management module (BMC) of the server board receives abnormality information sent by the basic input/output system (BIOS), where the abnormality information includes a memory abnormality type and an abnormal memory module identifier;
  • the BMC or the system management module (SMM) generates isolated memory information according to the abnormality information, and performs corresponding processing on the server board;
  • the BMC sends the isolated memory information to the BIOS, where the isolated memory information instructs the BIOS to isolate the corresponding abnormal memory.
  • Optionally, the memory abnormality type includes an unrecoverable memory error;
  • the BMC or the SMM generating the isolated memory information according to the abnormality information, and performing corresponding processing on the server board, includes:
  • when the memory abnormality type of the abnormality information received by the BMC is the unrecoverable memory error and the BMC is configured with a healing function, the BMC generates the isolated memory information according to the abnormal memory module identifier, and powers the server board off and then on again;
  • or, when the memory abnormality type of the abnormality information received by the BMC is the unrecoverable memory error and the BMC is not configured with a healing function, the BMC forwards the abnormality information to the SMM, and the SMM generates the isolated memory information according to the abnormal memory module identifier, and powers the server board off and then on again.
  • Optionally, the memory abnormality type includes a recoverable memory error;
  • the BMC or the SMM generating the isolated memory information according to the abnormality information, and performing corresponding processing on the server board, includes:
  • when the memory abnormality type of the abnormality information received by the BMC is a recoverable memory error and the BMC is configured with a healing function, the BMC counts the number and frequency of recoverable memory errors for the abnormal memory module corresponding to the abnormality information; when the counted number or frequency of recoverable memory errors reaches the set isolation threshold, the BMC generates the isolated memory information according to the information of that abnormal memory module, and powers the server board off and then on again;
  • or, when the memory abnormality type of the abnormality information received by the BMC is a recoverable memory error and the BMC is not configured with a healing function, the BMC forwards the abnormality information to the SMM; the SMM counts the number and frequency of recoverable memory errors for the abnormal memory module corresponding to the abnormality information; when the counted number or frequency of recoverable memory errors reaches the set isolation threshold, the SMM generates the isolated memory information according to the information of that abnormal memory module, and powers the server board off and then on again.
  • Optionally, the BMC sending the isolated memory information to the BIOS includes:
  • the BMC sends the isolated memory information that it generated to the BIOS; or, after receiving the isolated memory information generated by the SMM, the BMC sends that isolated memory information to the BIOS.
  • Optionally, after the BMC receives the abnormality information sent by the BIOS, the method further includes:
  • the BMC sends the abnormality information to the management interface browser/client (B/C).
  • The invention also provides a device for server self-healing, the device comprising:
  • an information processing module, configured to receive abnormality information sent by the basic input/output system (BIOS) of the server board, where the abnormality information includes an abnormality type and an abnormal memory module identifier;
  • an exception handling module, configured to generate isolated memory information according to the abnormality information, and perform corresponding processing on the server board;
  • an isolation module, configured to send the isolated memory information to the BIOS, where the isolated memory information instructs the BIOS to isolate the corresponding memory.
  • Optionally, the exception handling module is configured such that:
  • when the memory abnormality type of the abnormality information received by the BMC is an unrecoverable memory error and the BMC is configured with a healing function, the isolated memory information is generated by the BMC according to the abnormal memory module identifier, and the server board is powered off and then on again;
  • or, when the memory abnormality type of the abnormality information received by the BMC is an unrecoverable memory error and the BMC is not configured with a healing function, the abnormality information is forwarded by the BMC to the SMM, and the SMM generates the isolated memory information according to the abnormal memory module identifier, and powers the server board off and then on again.
  • Optionally, the exception handling module is configured such that:
  • when the memory abnormality type of the abnormality information received by the BMC is a recoverable memory error and the BMC is configured with a healing function, the BMC counts the number and frequency of recoverable memory errors for the abnormal memory module corresponding to the abnormality information; when the counted number or frequency of recoverable memory errors reaches the set isolation threshold, the BMC generates the isolated memory information according to the information of that abnormal memory module, and powers the server board off and then on again;
  • or, when the BMC is not configured with a healing function, the abnormality information is forwarded by the BMC to the SMM; the SMM counts the number and frequency of recoverable memory errors for the abnormal memory module corresponding to the abnormality information; when the counted number or frequency reaches the set isolation threshold, the SMM generates the isolated memory information according to the information of that abnormal memory module, and powers the server board off and then on again.
  • The isolation module is configured to send the isolated memory information to the BIOS.
  • The information processing module is further configured to send the abnormality information to the management interface browser/client (B/C).
  • An embodiment of the present invention further provides a computer-readable storage medium storing a computer program, where the computer program includes program instructions that, when executed by a server device, cause the device to perform the above server self-healing method.
  • Through the cooperation of the BMC, the BIOS (Basic Input Output System) and the SMM, the above solution achieves automatic self-healing of the server and reduces the possibility of on-site manual intervention and operation.
  • FIG. 1 is a schematic structural diagram of a server management system according to an embodiment of the present invention.
  • FIG. 2 is a flowchart of a method for server self-healing according to an embodiment of the present invention.
  • FIG. 3 is a schematic structural diagram of a device for server self-healing according to an embodiment of the present invention.
  • FIG. 4 is a flowchart of a method for server self-healing according to another embodiment of the present invention.
  • The server management system shown in FIG. 1 includes an SMM and a plurality of slave nodes, namely the BMC on each server; each server also has a BIOS.
  • The SMM is connected to the BMC of each server through channels such as IPMB (Intelligent Platform Management Bus) or LAN (Local Area Network), and the BMC and the BIOS can communicate over various types of physical channels.
  • This system structure provides the physical channel through which the SMM manages server memory abnormalities.
  • The server uses memory that supports ECC, which provides the hardware prerequisite for discovering memory abnormalities promptly.
  • The main function of the B/C is to configure how the BMC handles memory abnormalities, for example a policy of restarting the board and isolating the fault when the frequency of recoverable memory faults on a given memory module exceeds a threshold.
  • The B/C can also query memory faults and provides a board power-down interface.
  • An embodiment of the present invention provides a method for server self-healing, where the method includes:
  • Step S100: The out-of-band management module (BMC) of the server board receives abnormality information sent by the basic input/output system (BIOS), where the abnormality information includes a memory abnormality type and an abnormal memory module identifier;
  • Step S102: The BMC or the system management module (SMM) generates isolated memory information according to the abnormality information, and performs corresponding processing on the server board;
  • Step S104: The BMC sends the isolated memory information to the BIOS, where the isolated memory information instructs the BIOS to isolate the corresponding abnormal memory.
  • Optionally, the abnormality type includes an unrecoverable memory error;
  • the BMC or the SMM generating the isolated memory information according to the abnormality information, and performing corresponding processing on the board, includes:
  • when the abnormality type of the abnormality information received by the BMC is an unrecoverable memory error and the BMC is configured with a healing function, the BMC generates the isolated memory information according to the abnormal memory module identifier, and powers the board off and then on again;
  • or, when the BMC is not configured with a healing function, the BMC forwards the abnormality information to the SMM, and the SMM generates the isolated memory information according to the abnormal memory module identifier, and powers the board off and then on again.
  • Optionally, the abnormality type includes a recoverable memory error;
  • the BMC or the SMM generating the isolated memory information according to the abnormality information, and performing corresponding processing on the board, includes:
  • when the abnormality type of the abnormality information received by the BMC is a recoverable memory error and the BMC is configured with a healing function, the BMC counts the number and frequency of recoverable memory errors for the abnormal memory module corresponding to the abnormality information; when the counted number or frequency of recoverable memory errors reaches the set isolation threshold, the BMC generates the isolated memory information according to the information of that abnormal memory module, and powers the board off and then on again;
  • or, when the BMC is not configured with a healing function, the BMC forwards the abnormality information to the SMM; the SMM counts the number and frequency of recoverable memory errors for the abnormal memory module corresponding to the abnormality information; when the counted number or frequency reaches the set isolation threshold, the SMM generates the isolated memory information according to the information of that abnormal memory module, and powers the board off and then on again.
  • Optionally, the BMC sending the isolated memory information to the BIOS includes:
  • the BMC sends the isolated memory information that it generated to the BIOS; or, after receiving the isolated memory information generated by the SMM, the BMC sends that isolated memory information to the BIOS.
  • Optionally, after the BMC receives the abnormality information sent by the BIOS, the method further includes:
  • the BMC sends the abnormality information to the management interface browser/client (B/C).
  • An embodiment of the present invention further provides a device for server self-healing, where the device includes a processor, a program storage device, and a data storage device, and further includes:
  • an information processing module 11, configured to receive abnormality information sent by the basic input/output system (BIOS) of the server board, where the abnormality information includes a memory abnormality type and an abnormal memory module identifier;
  • an exception handling module 12, configured to generate isolated memory information according to the abnormality information, and perform corresponding processing on the server board;
  • an isolation module 13, configured to send the isolated memory information to the BIOS, where the isolated memory information instructs the BIOS to isolate the corresponding abnormal memory.
  • Optionally, the abnormality type includes an unrecoverable memory error;
  • the exception handling module 12 generating isolated memory information according to the abnormality information, and performing corresponding processing on the board, means that:
  • when the BMC is configured with a healing function, the isolated memory information is generated by the BMC according to the abnormal memory module identifier, and the server board is powered off and then on again;
  • or, when the BMC is not configured with a healing function, the abnormality information is forwarded by the BMC to the SMM, and the SMM generates the isolated memory information according to the abnormal memory module identifier, and powers the server board off and then on again.
  • Optionally, the abnormality type includes a recoverable memory error;
  • the exception handling module 12 generating isolated memory information according to the abnormality information, and performing corresponding processing on the board, means that:
  • when the memory abnormality type of the abnormality information received by the BMC is a recoverable memory error and the BMC is configured with a healing function, the BMC counts the number and frequency of recoverable memory errors for the abnormal memory module corresponding to the abnormality information; when the counted number or frequency of recoverable memory errors reaches the set isolation threshold, the BMC generates the isolated memory information according to the information of that abnormal memory module, and powers the server board off and then on again;
  • or, when the BMC is not configured with a healing function, the abnormality information is forwarded by the BMC to the SMM; the SMM counts the number and frequency of recoverable memory errors for the abnormal memory module corresponding to the abnormality information; when the counted number or frequency reaches the set isolation threshold, the SMM generates the isolated memory information according to the information of that abnormal memory module, and powers the server board off and then on again.
  • The isolation module 13 is adapted to send the isolated memory information to the BIOS.
  • The information processing module 11 is further adapted to send the abnormality information to the management interface browser/client (B/C).
  • FIG. 4 is a flowchart of a method for server self-healing according to another embodiment of the present invention. In this flow:
  • The BIOS is responsible for detecting memory abnormalities, distinguishing recoverable (single-bit) ECC errors from unrecoverable (double-bit) ECC errors, and locating the fault to a specific physical memory module; when the system starts again after self-healing, the BIOS isolates the abnormal memory module so that it is no longer used.
  • The BMC is responsible for forwarding memory abnormalities reported by the BIOS to the SMM, or for directly performing the SMM function described in step 3, and for reporting the faulty memory module information to the BIOS when the server is powered on again.
  • The SMM receives the memory fault information forwarded by the out-of-band management module and tracks abnormality counts separately for each memory module. Based on the severity of the memory abnormality and the frequency of abnormalities, the SMM determines whether to perform self-healing on the specified abnormal board.
  • Whether a memory error occurs during the BIOS startup phase or while the OS is running, the BIOS can detect it; the BIOS resolves the memory module corresponding to the error and reports it to the BMC; the BMC reports the memory error to the SMM or makes it available for B/C queries.
  • The BMC or the SMM also counts the number of errors of each type that occur within a period of time; the counts can be kept per memory module.
  • Step A: The BMC of the server board receives an unrecoverable memory error reported by the BIOS, or the SMM receives an unrecoverable memory error reported by the BMC. The BMC/SMM automatically powers the board off and then on again, and then step B is performed;
  • Step B: After the board is powered on again, the BMC actively sends the number of the memory module on which the unrecoverable fault was detected to the BIOS, and step C is performed;
  • Step C: After receiving the number, the BIOS masks the memory module that had the unrecoverable fault, that is, this memory is not used after startup.
  • Step A: The BMC of the server board receives a recoverable memory error reported by the BIOS, or the SMM receives a recoverable memory error reported by the BMC, records the number and frequency of such abnormalities per memory module, and compares them with the set threshold; if the set threshold is reached, the BMC/SMM automatically powers the board off and then on again, and then step B is performed;
  • Step B: After the board is powered on again, the BMC actively sends to the BIOS the number of the memory module whose recoverable faults were last detected to have reached the set threshold, and step C is performed;
  • If it is the SMM that performs the power-off and power-on, the SMM generates the number of the memory module to be isolated and sends it to the BMC, which forwards it to the BIOS.
  • Step C: After receiving the numbers, the BIOS masks the memory modules reported by the BMC, that is, these memory modules are not used after startup.
  • The above operation ensures that memory with frequent abnormalities is isolated, so that memory with latent problems is automatically isolated in advance, which safeguards the stability and reliability of the system.
  • If the BIOS sends an unrecoverable fault, the memory module named in the fault information is the one to be isolated. If the BMC receives a recoverable fault, the fault count for the corresponding memory module is updated, and a memory module that reaches the isolation threshold must be isolated. For recoverable faults, the threshold can be a frequency threshold (the number of occurrences within a given period) or a total-count threshold, and different policies can be configured according to specific implementation requirements.
  • The above technical solution detects server memory abnormalities through the BMC, the SMM, and the BIOS, and performs self-healing control according to the set policy.
  • A memory abnormality can be narrowed down to a specific memory module. Whether the board is powered off and then on again is decided according to serious memory abnormalities and the frequency of abnormalities on a specific memory module within a fixed time. When the board initializes after power-on, the abnormal memory is isolated and no longer used. This avoids the situation in which the server cannot recover automatically from a memory abnormality and must be recovered manually on site, reduces the likelihood of manual intervention when an abnormality occurs, greatly improves system reliability, and shortens server recovery time.
  • The above modules or steps may be implemented by a general-purpose computing device; they may be centralized on a single computing device or distributed over a network of multiple computing devices. Alternatively, they may be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by a computing device, or they may be fabricated as individual integrated circuit modules, or several of the modules or steps may be fabricated as a single integrated circuit module.
  • The technical solution provided by the present invention detects server memory abnormalities through the BMC, the SMM, and the BIOS, and isolates an abnormal memory module according to the severity of the memory abnormality and the frequency of abnormalities on that memory module within a fixed time. This avoids the situation in which the server cannot recover automatically from a memory abnormality and must be recovered manually on site, reduces the likelihood of manual intervention when an abnormality occurs, greatly improves system reliability, and shortens server recovery time.
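Steps A to C of the FIG. 4 flow can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the patent's implementation: the class names `Bios` and `Bmc`, their methods, and the module labels such as `DIMM_A2` are hypothetical, and a real power cycle and the BMC-to-BIOS channel are not modeled.

```python
class Bios:
    """Hypothetical BIOS model: masks memory modules named by the BMC."""

    def __init__(self, installed_modules):
        self.installed_modules = list(installed_modules)
        self.masked = set()

    def receive_isolation(self, module_ids):
        # Step C: mask the reported modules so they are not used after startup.
        self.masked.update(module_ids)

    def usable_modules(self):
        return [m for m in self.installed_modules if m not in self.masked]


class Bmc:
    """Hypothetical BMC model: remembers modules to isolate across a power cycle."""

    def __init__(self, bios):
        self.bios = bios
        self.pending_isolation = set()
        self.power_cycles = 0

    def on_isolation_decided(self, module_id):
        # Step A: a fault that warrants isolation triggers power-off then power-on.
        self.pending_isolation.add(module_id)
        self.power_cycle_board()

    def power_cycle_board(self):
        self.power_cycles += 1
        # Step B: once the board is powered on again, send the module numbers to the BIOS.
        self.bios.receive_isolation(self.pending_isolation)


bios = Bios(["DIMM_A1", "DIMM_A2", "DIMM_B1"])
bmc = Bmc(bios)
bmc.on_isolation_decided("DIMM_A2")
print(bios.usable_modules())
```

The BMC keeps the pending list itself because, in the flow described, it is the BMC that actively sends the module numbers to the BIOS after the board powers on again.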

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)
  • Techniques For Improving Reliability Of Storages (AREA)
  • Power Sources (AREA)

Abstract

Provided is a server self-healing method, said method comprising: a baseboard management controller (BMC) receives exception information sent by a basic input/output system (BIOS), said exception information comprising memory exception type and exception memory module identifier; according to said exception information, the BMC or system management module (SMM) generates quarantine memory information, and processes the board accordingly; the BMC sends the quarantine memory information to the BIOS, said quarantine memory information being used for instructing the BIOS to quarantine the corresponding exception memory. By means of the coordination of the BMC, the BIOS, and the SMM, the described solution accomplishes automatic self-healing of a server, which reduces the possibility of on-site manual intervention and operation and restores the server to a state of normal operation as quickly as possible.

Description

Method and device for server self-healing
Technical field
The present invention relates to the field of servers, and in particular to a method and apparatus for server self-healing.
Background
Operators currently face enormous challenges. They must be able to integrate network resources quickly to provide users with the latest services, while also reducing network procurement costs, operation and maintenance costs, and fault recovery time. Operators own large numbers of servers fitted with large amounts of memory, and server abnormalities caused by memory faults are common; they reduce the stability of the services operators provide and increase fault recovery time and maintenance costs.
In a server, the BMC (Baseboard Management Controller, the out-of-band management module) monitors the working status of the server, manages server power-on and power-off, and handles server abnormalities promptly and raises alarms. The BMC exists as independent firmware. It can accept commands from the SMM (System Management Module) and report monitored server abnormality information to the SMM. It can also provide a B/C (Browser/Client, a management interface browser/client), accept control commands or control policies issued by the B/C, and return the current or historical health status of the server to the B/C. The reliability of server memory directly affects the stability and reliability of the board; memory problems directly interrupt service and, in severe cases, cause a crash. Although most high-performance, high-reliability servers use memory with ECC (Error Checking and Correcting), the resulting improvement in system reliability is limited, mainly in two respects. First, after a correctable ECC error occurs, memory with ECC can correct it automatically, but if such errors occur frequently, the memory has a serious latent defect; automatic correction is therefore a relatively passive approach, because the latent defect is not eliminated. Second, after an uncorrectable ECC error or another unrecoverable error, the system suffers serious consequences such as a blue screen or a crash; without out-of-band involvement, on-site personnel must shut the server down and replace the memory.
Summary of the invention
Embodiments of the present invention provide a method and apparatus for server self-healing, to reduce the need for on-site manual intervention and operation when a server fails.
To solve the above technical problem, an embodiment of the present invention provides a method for server self-healing, the method comprising:
the out-of-band management module (BMC) of the server board receives abnormality information sent by the basic input/output system (BIOS), where the abnormality information includes a memory abnormality type and an abnormal memory module identifier;
the BMC or the system management module (SMM) generates isolated memory information according to the abnormality information, and performs corresponding processing on the server board;
the BMC sends the isolated memory information to the BIOS, where the isolated memory information instructs the BIOS to isolate the corresponding abnormal memory.
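The information exchanged in these three steps can be pictured as two small records. The following is an illustrative sketch only: the patent does not specify message formats, so `ErrorType`, `AbnormalityInfo`, `IsolationInfo`, `generate_isolation_info`, and the label `DIMM_A1` are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from enum import Enum


class ErrorType(Enum):
    """Memory abnormality types named in the method (hypothetical encoding)."""
    RECOVERABLE = "recoverable"
    UNRECOVERABLE = "unrecoverable"


@dataclass(frozen=True)
class AbnormalityInfo:
    """Sent by the BIOS to the BMC: abnormality type plus memory module identifier."""
    error_type: ErrorType
    module_id: str


@dataclass(frozen=True)
class IsolationInfo:
    """Sent by the BMC to the BIOS: the memory modules the BIOS should isolate."""
    module_ids: frozenset


def generate_isolation_info(info: AbnormalityInfo) -> IsolationInfo:
    # Simplest case of the second step: name the reported module for isolation.
    return IsolationInfo(module_ids=frozenset({info.module_id}))


report = AbnormalityInfo(ErrorType.UNRECOVERABLE, "DIMM_A1")
isolation = generate_isolation_info(report)
print(sorted(isolation.module_ids))
```

Keeping the identifier in both records is what lets the BIOS isolate exactly the module it originally reported.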
Optionally, the memory abnormality type includes an unrecoverable memory error;
the BMC or the SMM generating the isolated memory information according to the abnormality information, and performing corresponding processing on the server board, includes:
when the memory abnormality type of the abnormality information received by the BMC is the unrecoverable memory error and the BMC is configured with a healing function, the BMC generates the isolated memory information according to the abnormal memory module identifier, and powers the server board off and then on again;
or,
when the memory abnormality type of the abnormality information received by the BMC is the unrecoverable memory error and the BMC is not configured with a healing function, the BMC forwards the abnormality information to the SMM, and the SMM generates the isolated memory information according to the abnormal memory module identifier, and powers the server board off and then on again.
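The two branches differ only in which component acts. A minimal sketch of that decision, assuming a hypothetical `Decision` record and `handle_unrecoverable` function (neither name appears in the patent):

```python
from typing import NamedTuple


class Decision(NamedTuple):
    handler: str       # "BMC" or "SMM": who generates the isolated memory information
    isolate: tuple     # identifiers of the memory modules to isolate
    power_cycle: bool  # whether the board is powered off and then on again


def handle_unrecoverable(module_id: str, bmc_heal_enabled: bool) -> Decision:
    """Route an unrecoverable memory error according to the BMC's healing configuration."""
    # With the healing function configured the BMC acts itself;
    # otherwise it forwards the abnormality information to the SMM.
    handler = "BMC" if bmc_heal_enabled else "SMM"
    # In both branches the reported module is isolated and the board is power-cycled.
    return Decision(handler=handler, isolate=(module_id,), power_cycle=True)


decision_bmc = handle_unrecoverable("DIMM_A1", bmc_heal_enabled=True)
decision_smm = handle_unrecoverable("DIMM_A1", bmc_heal_enabled=False)
print(decision_bmc.handler, decision_smm.handler)
```

An unrecoverable error needs no counting: a single occurrence is enough to isolate the module.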
Optionally, the abnormality type includes a recoverable memory error;
the BMC or the SMM generating the isolated memory information according to the abnormality information, and performing corresponding processing on the server board, includes:
when the memory abnormality type of the abnormality information received by the BMC is a recoverable memory error and the BMC is configured with a healing function, the BMC counts the number and frequency of recoverable memory errors for the abnormal memory module corresponding to the abnormality information; when the counted number or frequency of recoverable memory errors reaches the set isolation threshold, the BMC generates the isolated memory information according to the information of that abnormal memory module, and powers the server board off and then on again;
or,
when the memory abnormality type of the abnormality information received by the BMC is a recoverable memory error and the BMC is not configured with a healing function, the BMC forwards the abnormality information to the SMM; the SMM counts the number and frequency of recoverable memory errors for the abnormal memory module corresponding to the abnormality information; when the counted number or frequency of recoverable memory errors reaches the set isolation threshold, the SMM generates the isolated memory information according to the information of that abnormal memory module, and powers the server board off and then on again.
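The count-or-frequency check could be kept per memory module as follows. This is a sketch under assumptions: the patent does not define how frequency is measured, so a sliding time window is assumed here, and `RecoverableErrorTracker`, its parameters, and the threshold values are hypothetical.

```python
from collections import defaultdict, deque


class RecoverableErrorTracker:
    """Per-module recoverable error statistics (hypothetical helper for the BMC or SMM)."""

    def __init__(self, total_threshold, freq_threshold, window_seconds):
        self.total_threshold = total_threshold  # isolate after this many errors overall
        self.freq_threshold = freq_threshold    # or this many errors within the window
        self.window_seconds = window_seconds
        self._totals = defaultdict(int)
        self._recent = defaultdict(deque)

    def record(self, module_id, timestamp):
        """Record one recoverable error; return True if the module should be isolated."""
        self._totals[module_id] += 1
        recent = self._recent[module_id]
        recent.append(timestamp)
        # Drop events that have fallen out of the sliding window.
        while recent and timestamp - recent[0] > self.window_seconds:
            recent.popleft()
        return (self._totals[module_id] >= self.total_threshold
                or len(recent) >= self.freq_threshold)


tracker = RecoverableErrorTracker(total_threshold=100, freq_threshold=3, window_seconds=60)
burst = [tracker.record("DIMM_A1", t) for t in (0, 10, 20)]
sparse = [tracker.record("DIMM_B1", t) for t in (0, 100, 200)]
print(burst, sparse)
```

Three errors within 60 seconds trip the frequency threshold for `DIMM_A1`, while the same number of errors spread over 200 seconds leaves `DIMM_B1` in service.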
Optionally, the BMC sending the isolated memory information to the BIOS includes:
the BMC sends the isolated memory information that it generated to the BIOS; or, after receiving the isolated memory information generated by the SMM, the BMC sends that isolated memory information to the BIOS.
Optionally, after the BMC receives the abnormality information sent by the BIOS, the method further includes:
the BMC sends the abnormality information to the management interface browser/client (B/C).
The present invention also provides a server self-healing apparatus, the apparatus comprising:
an information processing module, configured to receive exception information sent by the basic input/output system (BIOS) of a server board, the exception information including an exception type and an abnormal memory module identifier;
an exception handling module, configured to generate isolated memory information according to the exception information and to process the server board accordingly;
an isolation module, configured to send the isolated memory information to the BIOS, the isolated memory information being used to instruct the BIOS to isolate the corresponding memory.
Optionally, the exception handling module is configured such that:
When the memory exception type in the exception information received by the BMC is an unrecoverable memory error and the BMC is configured with the healing function, the BMC generates the isolated memory information according to the abnormal memory module identifier and powers the server board off and then on again;
or,
When the memory exception type in the exception information received by the BMC is an unrecoverable memory error and the BMC is not configured with the healing function, the BMC forwards the exception information to the SMM, and the SMM generates the isolated memory information according to the abnormal memory module identifier and powers the server board off and then on again.
Optionally, the exception handling module is configured such that:
When the memory exception type in the exception information received by the BMC is a recoverable memory error and the BMC is configured with the healing function, the BMC counts the number and frequency of recoverable memory errors for the abnormal memory module corresponding to the exception information; when the counted number or frequency reaches the configured isolation threshold, the BMC generates the isolated memory information from the information of that abnormal memory module and powers the server board off and then on again;
or,
When the memory exception type in the exception information received by the BMC is a recoverable memory error and the BMC is not configured with the healing function, the BMC forwards the exception information to the SMM; the SMM counts the number and frequency of recoverable memory errors for the abnormal memory module corresponding to the exception information, and when the counted number or frequency reaches the configured isolation threshold, the SMM generates the isolated memory information from the information of that abnormal memory module and powers the server board off and then on again.
Optionally, the isolation module being configured to send the isolated memory information to the BIOS means that:
The BMC sends the isolated memory information that it generated to the BIOS; or, after receiving the isolated memory information generated by the SMM, the BMC sends that isolated memory information to the BIOS.
Optionally, the information processing module is further configured to send the exception information to the interface browser (B)/client (C).
An embodiment of the present invention further provides a computer-readable storage medium storing a computer program; the computer program includes program instructions which, when executed by a server device, enable the device to perform the server self-healing method described above.
In the above solution, the BMC, the BIOS (Basic Input Output System) and the SMM cooperate to complete automatic self-healing of the server, which reduces the need for manual on-site intervention and operation and restores the server to its normal working state as quickly as possible.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of a server management system architecture according to an embodiment of the present invention;
FIG. 2 is a flowchart of a server self-healing method according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a server self-healing apparatus according to an embodiment of the present invention;
FIG. 4 is a flowchart of a server self-healing method according to another embodiment of the present invention.
Preferred Embodiments of the Invention
Embodiments of the present application are described in detail below with reference to the accompanying drawings. It should be noted that, where no conflict arises, the embodiments in the present application and the features in those embodiments may be combined with one another arbitrarily.
Embodiment 1
The server management system structure shown in FIG. 1 contains an SMM and several slave nodes, namely the BMCs on the individual servers, and every server board has a BIOS. The SMM is connected to the BMC of each server in various ways, such as IPMB (Intelligent Platform Management Bus) or LAN (Local Area Network), and the BMC and the BIOS can communicate over various types of physical channels; this system structure provides the physical channel through which the SMM manages server memory exceptions. In a server system, servers mostly use memory that supports ECC (error-correcting code), which is the hardware prerequisite for detecting memory exceptions in time. The main role of the B/C is to configure how the BMC handles memory exceptions, for example by configuring a policy such as: when the frequency of recoverable memory faults on a given memory module exceeds a certain threshold, restart the board and isolate the faulty module. In addition, the B/C can query memory fault occurrences and provides an operation interface for powering the board on and off.
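The kind of policy that the B/C pushes to the BMC can be pictured with a small Python sketch. The source does not define a policy format, so the names `MemoryPolicy`, `Action` and every field below are hypothetical and serve only to illustrate the example rule in the paragraph above.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    """Hypothetical outcomes of evaluating a policy."""
    LOG_ONLY = "log_only"
    POWER_CYCLE_AND_ISOLATE = "power_cycle_and_isolate"


@dataclass(frozen=True)
class MemoryPolicy:
    """Illustrative policy record; field names are assumptions, not from the source."""
    max_recoverable_errors: int   # threshold for recoverable faults per module
    window_seconds: int           # length of the observation period
    action: Action                # what to do once the threshold is exceeded

    def decide(self, errors_in_window: int) -> Action:
        # The example rule above fires when the frequency is "greater than" the threshold.
        if errors_in_window > self.max_recoverable_errors:
            return self.action
        return Action.LOG_ONLY


policy = MemoryPolicy(max_recoverable_errors=3, window_seconds=60,
                      action=Action.POWER_CYCLE_AND_ISOLATE)
print(policy.decide(4))
```

Keeping the policy as a plain immutable record makes it easy to hand the same rule to either the BMC or the SMM, whichever ends up doing the counting.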
As shown in FIG. 2, an embodiment of the present invention provides a server self-healing method, the method comprising:
Step S100: The out-of-band management module (BMC) of a server board receives exception information sent by the basic input/output system (BIOS), the exception information including a memory exception type and an abnormal memory module identifier;
Step S102: The BMC or the system management module (SMM) generates isolated memory information according to the exception information and processes the server board accordingly;
Step S104: The BMC sends the isolated memory information to the BIOS, the isolated memory information being used to instruct the BIOS to isolate the corresponding abnormal memory.
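Steps S100 to S104 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the types `ExceptionInfo`, `ErrorType` and `Bios` and the function `handle_exception` are hypothetical names, only the unrecoverable case is acted on, and the power cycle that the board goes through between S102 and S104 is not modelled.

```python
from dataclasses import dataclass
from enum import Enum


class ErrorType(Enum):
    RECOVERABLE = "recoverable"
    UNRECOVERABLE = "unrecoverable"


@dataclass(frozen=True)
class ExceptionInfo:
    error_type: ErrorType   # memory exception type reported by the BIOS
    dimm_id: str            # abnormal memory module identifier


class Bios:
    """Stand-in for the BIOS; records which modules it was told to isolate."""

    def __init__(self):
        self.isolated = set()

    def isolate(self, dimm_ids):
        self.isolated.update(dimm_ids)


def handle_exception(info, bmc_healing_enabled, bios):
    """Return (deciding component, isolation info) for one report.

    S100: the BMC has received `info` from the BIOS.
    S102: the BMC decides if healing is configured on it, otherwise the SMM does.
    S104: in either case the BMC is the one that hands the result to the BIOS.
    """
    decider = "BMC" if bmc_healing_enabled else "SMM"
    if info.error_type is not ErrorType.UNRECOVERABLE:
        # Recoverable errors are only counted at this point (counting omitted here).
        return decider, []
    isolation_info = [info.dimm_id]
    bios.isolate(isolation_info)
    return decider, isolation_info


bios = Bios()
print(handle_exception(ExceptionInfo(ErrorType.UNRECOVERABLE, "DIMM_A1"), False, bios))
```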
Preferably, the exception type includes an unrecoverable memory error;
The BMC or the system management module (SMM) generating isolated memory information according to the exception information and processing the board accordingly includes:
When the exception type in the exception information received by the BMC is an unrecoverable memory error and the BMC is configured with the healing function, the BMC generates the isolated memory information according to the abnormal memory module identifier and powers the board off and then on again;
or,
When the exception type in the exception information received by the BMC is an unrecoverable memory error and the BMC is not configured with the healing function, the BMC forwards the exception information to the SMM, and the SMM generates the isolated memory information according to the abnormal memory module identifier and powers the board off and then on again.
Preferably, the exception type includes a recoverable memory error;
The BMC or the system management module (SMM) generating isolated memory information according to the exception information and processing the board accordingly includes:
When the exception type in the exception information received by the BMC is a recoverable memory error and the BMC is configured with the healing function, the BMC counts the number and frequency of recoverable memory errors for the abnormal memory module corresponding to the exception information; when the counted number or frequency reaches the configured isolation threshold, the BMC generates the isolated memory information from the information of that abnormal memory module and powers the board off and then on again;
or,
When the exception type in the exception information received by the BMC is a recoverable memory error and the BMC is not configured with the healing function, the BMC forwards the exception information to the SMM; the SMM counts the number and frequency of recoverable memory errors for the abnormal memory module corresponding to the exception information, and when the counted number or frequency reaches the configured isolation threshold, the SMM generates the isolated memory information from the information of that abnormal memory module and powers the board off and then on again.
Preferably, the BMC sending the isolated memory information to the BIOS includes:
The BMC sends the isolated memory information that it generated to the BIOS; or, after receiving the isolated memory information generated by the SMM, the BMC sends that isolated memory information to the BIOS.
Preferably, after the BMC receives the exception information sent by the BIOS, the method further includes:
The BMC sends the exception information to the interface browser (B)/client (C).
As shown in FIG. 3, an embodiment of the present invention further provides a server self-healing apparatus; the apparatus includes a processor, a program storage device and a data storage device, and further includes:
an information processing module 11, adapted to receive exception information sent by the basic input/output system (BIOS) of a server board, the exception information including a memory exception type and an abnormal memory module identifier;
an exception handling module 12, adapted to generate isolated memory information according to the exception information and to process the server board accordingly;
an isolation module 13, adapted to send the isolated memory information to the BIOS, the isolated memory information being used to instruct the BIOS to isolate the corresponding abnormal memory.
Preferably, the exception type includes an unrecoverable memory error;
The exception handling module 12 being adapted to generate isolated memory information according to the exception information and to process the board accordingly means that:
When the exception type in the exception information received by the BMC is the unrecoverable memory error and the BMC is configured with the healing function, the BMC generates the isolated memory information according to the abnormal memory module identifier and powers the server board off and then on again;
or,
When the memory exception type in the exception information received by the BMC is the unrecoverable memory error and the BMC is not configured with the healing function, the BMC forwards the exception information to the SMM, and the SMM generates the isolated memory information according to the abnormal memory module identifier and powers the server board off and then on again.
Preferably, the exception type includes a recoverable memory error;
The exception handling module 12 being adapted to generate isolated memory information according to the exception information and to process the board accordingly means that:
When the memory exception type in the exception information received by the BMC is a recoverable memory error and the BMC is configured with the healing function, the BMC counts the number and frequency of recoverable memory errors for the abnormal memory module corresponding to the exception information; when the counted number or frequency reaches the configured isolation threshold, the BMC generates the isolated memory information from the information of that abnormal memory module and powers the server board off and then on again;
or,
When the memory exception type in the exception information received by the BMC is a recoverable memory error and the BMC is not configured with the healing function, the BMC forwards the exception information to the SMM; the SMM counts the number and frequency of recoverable memory errors for the abnormal memory module corresponding to the exception information, and when the counted number or frequency reaches the configured isolation threshold, the SMM generates the isolated memory information from the information of that abnormal memory module and powers the server board off and then on again.
Preferably, the isolation module 13 being adapted to send the isolated memory information to the BIOS means that:
The BMC sends the isolated memory information that it generated to the BIOS; or, after receiving the isolated memory information generated by the SMM, the BMC sends that isolated memory information to the BIOS.
Preferably, the information processing module 11 is further adapted to send the exception information to the interface browser (B)/client (C).
Embodiment 2
FIG. 4 is a flowchart of a server self-healing method according to another embodiment of the present invention. In this method:
The BIOS is responsible for detecting memory exceptions. It can distinguish recoverable single-bit ECC errors from unrecoverable double-bit ECC errors and can locate a fault to a specific physical memory module; when the system starts again after self-healing, the BIOS isolates the abnormal memory module so that it is no longer used.
The BMC is responsible for forwarding the memory exceptions reported by the BIOS to the SMM, or for directly performing the SMM functions described below, and for reporting the faulty memory module information to the BIOS when the server is powered on again.
The SMM receives the memory fault information forwarded by the out-of-band management module (BMC), keeps exception counts per memory module, and decides, based on the severity of the memory exception and the frequency of occurrence, whether to perform self-healing on the affected board.
In this embodiment, a memory error that occurs on the server during the BIOS startup phase or while the OS is running can be detected by the BIOS; the BIOS identifies the memory module on which the error occurred and reports it to the BMC; the BMC reports the memory error to the SMM or makes it available for query by the B/C.
At the same time, the BMC or the SMM counts how many errors of each type occur within a period of time; the counts can be kept per memory module.
It should be noted that, in the server healing process of this embodiment, different processing flows are followed depending on the memory error type, which is either an unrecoverable memory error or a recoverable memory error. The processing flows for the two types are described separately below.
I. For unrecoverable memory errors
Step A: The BMC of the server board receives an unrecoverable memory error reported by the BIOS, or the SMM receives an unrecoverable memory error forwarded by the BMC; the BMC/SMM automatically powers the board off and then on again, and then step B is performed;
Step B: After the board is powered on again, the BMC proactively sends the BIOS the numbers of the memory modules on which unrecoverable faults were last detected, and then step C is performed;
Step C: On receiving them, the BIOS masks the memory modules on which unrecoverable faults occurred, that is, memory with unrecoverable faults is not used after this boot.
These operations provide automatic self-healing for such severe memory errors and reduce manual intervention; without them, such faults require on-site personnel to resolve.
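Steps A to C above can be sketched in Python as follows. This is an illustration only, assuming a hypothetical `Bmc` class that remembers faulty modules across the power cycle and a hypothetical `bios_boot` function that stands in for BIOS memory initialization; the module names are made up.

```python
class Bmc:
    """Hypothetical BMC model that keeps the isolation list across a power cycle."""

    def __init__(self):
        self.pending_isolation = []   # module numbers to hand to the BIOS at next boot
        self.power_cycles = 0

    def on_unrecoverable_error(self, dimm_id):
        # Step A: remember the faulty module, then power the board off and on again.
        if dimm_id not in self.pending_isolation:
            self.pending_isolation.append(dimm_id)
        self.power_cycles += 1

    def isolation_list_for_boot(self):
        # Step B: after power-on, the BMC proactively sends the list to the BIOS.
        return list(self.pending_isolation)


def bios_boot(installed_dimms, isolation_list):
    """Step C: mask the listed modules and return the ones that remain in use."""
    masked = set(isolation_list)
    return [d for d in installed_dimms if d not in masked]


bmc = Bmc()
bmc.on_unrecoverable_error("DIMM2")
usable = bios_boot(["DIMM0", "DIMM1", "DIMM2", "DIMM3"], bmc.isolation_list_for_boot())
print(usable)   # the faulty DIMM2 is no longer in the list
```

The list is held by the BMC rather than the BIOS because the BMC, as an out-of-band controller, keeps running while the board itself is powered off.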
II. For recoverable memory errors
Step A: The BMC of the server board receives a recoverable memory error reported by the BIOS, or the SMM receives a recoverable memory error reported by the BMC; the number and frequency of such exceptions are recorded per memory module and compared with the preset occurrence threshold. If the threshold is reached, the BMC/SMM automatically powers the board off and then on again, and then step B is performed;
Step B: After the board is powered on again, the BMC proactively sends the BIOS the codes of the memory modules whose recoverable faults were last found to have reached the threshold, and then step C is performed;
It should be noted that, if the SMM performs the power-off and power-on handling, the SMM generates the codes of the memory modules to be isolated and sends them to the BMC, and the BMC forwards them to the BIOS.
Step C: On receiving them, the BIOS masks the memory modules reported by the BMC, that is, these memory modules are not used after this boot.
These operations ensure that memory on which exceptions occur frequently is isolated, so that memory with latent problems is isolated automatically and ahead of time, which keeps the system stable and reliable.
It should be noted that, if the BIOS reports an unrecoverable fault, the memory module named in that fault information is the one that needs to be isolated. If the BMC receives a recoverable fault, the occurrences must be counted for the corresponding memory module, and only a memory module that reaches the isolation threshold needs to be isolated. For recoverable faults, the threshold can be set either on the number of occurrences within a certain period of time, that is, a frequency threshold, or on the total number of occurrences; different policies can be configured according to the requirements of the specific implementation.
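The two kinds of threshold described above, a total count and a count within a time window, can be sketched in Python as follows. This is one possible bookkeeping scheme rather than the method of the source: the class `RecoverableErrorCounter`, its parameters and the sliding-window approach are assumptions made for illustration.

```python
from collections import defaultdict, deque


class RecoverableErrorCounter:
    """Per-module counter with an optional total threshold and an optional
    windowed (frequency) threshold. All names here are hypothetical."""

    def __init__(self, total_threshold=None, window_threshold=None, window_seconds=3600.0):
        self.total_threshold = total_threshold
        self.window_threshold = window_threshold
        self.window_seconds = window_seconds
        self._totals = defaultdict(int)
        self._recent = defaultdict(deque)   # module id -> timestamps inside the window

    def record(self, dimm_id, timestamp):
        """Record one recoverable error; return True if the module reaches a threshold."""
        self._totals[dimm_id] += 1
        recent = self._recent[dimm_id]
        recent.append(timestamp)
        # Drop timestamps that have fallen out of the observation window.
        while recent and timestamp - recent[0] > self.window_seconds:
            recent.popleft()
        hit_total = (self.total_threshold is not None
                     and self._totals[dimm_id] >= self.total_threshold)
        hit_window = (self.window_threshold is not None
                      and len(recent) >= self.window_threshold)
        return hit_total or hit_window


counter = RecoverableErrorCounter(window_threshold=3, window_seconds=10.0)
for t in (0.0, 5.0, 20.0, 25.0, 29.0):
    print(t, counter.record("DIMM0", t))
```

In this run only the last report returns True, because it is the first moment at which three errors on the same module fall within ten seconds of each other.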
In the above technical solution, the BMC, the SMM and the BIOS detect server memory exceptions and perform self-healing control according to the configured policy. Memory exceptions can be resolved down to the specific memory module, and the severity of the exception and the frequency with which a specific module fails within a fixed period determine whether the board is powered off and on again; when the board is initialized after power-on, the abnormal memory is isolated and no longer used. This avoids the situation in which a server cannot recover automatically from a memory exception and must be restored manually on site, reduces the need for manual intervention when an exception occurs, greatly improves system reliability, and shortens server fault recovery time.
It should be emphasized that those skilled in the art will understand that the policies and steps covered in the embodiments of the present invention can be implemented with general-purpose computing devices; they can be concentrated on a single computing device or distributed over a network of multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device, or they can each be made into separate integrated circuit modules, or several of the modules or steps can be made into a single integrated circuit module.
The above are only preferred embodiments of the present invention and are not intended to limit it; those skilled in the art may make various modifications and changes to the present invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection. Those of ordinary skill in the art will understand that all or some of the steps of the above method can be completed by a program instructing the relevant hardware, and that the program can be stored in a computer-readable storage medium such as a read-only memory, a magnetic disk or an optical disc. Optionally, all or some of the steps of the above embodiments can also be implemented with one or more integrated circuits; accordingly, each module in the above embodiments can be implemented in hardware or as a software functional module. The present application is not limited to any particular combination of hardware and software.
Industrial Applicability
In the technical solution provided by the present invention, the BMC, the SMM and the BIOS detect server memory exceptions and, according to the severity of the exception and the frequency with which a specific memory module fails within a fixed period, isolate the abnormal memory module when the board is initialized after being powered on again. This avoids the situation in which a server cannot recover automatically from a memory exception and must be restored manually on site, reduces the need for manual intervention when an exception occurs, greatly improves system reliability, and shortens server fault recovery time.

Claims (11)

  1. A server self-healing method, comprising:
    an out-of-band management module (BMC) of a server board receiving exception information sent by a basic input/output system (BIOS), the exception information including a memory exception type and an abnormal memory module identifier;
    the BMC or a system management module (SMM) generating isolated memory information according to the exception information and processing the server board accordingly;
    the BMC sending the isolated memory information to the BIOS, the isolated memory information being used to instruct the BIOS to isolate the corresponding abnormal memory.
  2. The method of claim 1, wherein:
    the memory exception type includes an unrecoverable memory error;
    the BMC or the system management module (SMM) generating isolated memory information according to the exception information and processing the server board accordingly includes:
    when the memory exception type in the exception information received by the BMC is the unrecoverable memory error and the BMC is configured with a healing function, the BMC generating the isolated memory information according to the abnormal memory module identifier and powering the server board off and then on again;
    or,
    when the memory exception type in the exception information received by the BMC is the unrecoverable memory error and the BMC is not configured with a healing function, the BMC forwarding the exception information to the SMM, and the SMM generating the isolated memory information according to the abnormal memory module identifier and powering the server board off and then on again.
  3. The method of claim 1, wherein:
    the memory exception type includes a recoverable memory error;
    the BMC or the system management module (SMM) generating isolated memory information according to the exception information and processing the server board accordingly includes:
    when the memory exception type in the exception information received by the BMC is a recoverable memory error and the BMC is configured with a healing function, the BMC counting the number and frequency of recoverable memory errors for the abnormal memory module corresponding to the exception information; when the counted number or frequency of recoverable memory errors reaches a configured isolation threshold, the BMC generating the isolated memory information from the information of that abnormal memory module and powering the server board off and then on again;
    or,
    when the memory exception type in the exception information received by the BMC is a recoverable memory error and the BMC is not configured with a healing function, the BMC forwarding the exception information to the SMM; the SMM counting the number and frequency of recoverable memory errors for the abnormal memory module corresponding to the exception information, and when the counted number or frequency of recoverable memory errors reaches a configured isolation threshold, the SMM generating the isolated memory information from the information of that abnormal memory module and powering the server board off and then on again.
  4. The method of any one of claims 1 to 3, wherein:
    the BMC sending the isolated memory information to the BIOS includes:
    the BMC sending the isolated memory information generated by the BMC to the BIOS; or, after receiving the isolated memory information generated by the SMM, the BMC sending that isolated memory information to the BIOS.
  5. The method of claim 4, wherein:
    after the BMC receives the exception information sent by the BIOS, the method further includes:
    the BMC sending the exception information to the interface browser (B)/client (C).
  6. A server self-healing device, comprising:
    an information processing module, configured to receive exception information sent by the basic input/output system (BIOS) of a server board, the exception information including a memory exception type and an abnormal memory module identifier;
    an exception handling module, configured to generate isolated memory information according to the exception information and to process the server board accordingly;
    an isolation module, configured to send the isolated memory information to the BIOS, the isolated memory information being used to instruct the BIOS to isolate the corresponding abnormal memory.
  7. The device according to claim 6, wherein:
    the exception handling module is configured such that:
    when the memory exception type in the exception information received by the BMC is an unrecoverable memory error and the BMC is configured with a healing function, the BMC generates the isolated memory information according to the abnormal memory module identifier and powers the server board off and then on again;
    or,
    when the memory exception type in the exception information received by the BMC is an unrecoverable memory error and the BMC is not configured with a healing function, the BMC forwards the exception information to the SMM, and the SMM generates the isolated memory information according to the abnormal memory module identifier and powers the server board off and then on again.
  8. The device according to claim 6, wherein:
    the exception handling module is configured such that:
    when the memory exception type in the exception information received by the BMC is a recoverable memory error and the BMC is configured with a healing function, the BMC counts the number and frequency of recoverable memory errors for the abnormal memory module corresponding to the exception information; when the counted number or frequency of recoverable memory errors reaches a set isolation threshold, the BMC generates the isolated memory information according to the information of the abnormal memory module and powers the server board off and then on again;
    or,
    when the memory exception type in the exception information received by the BMC is a recoverable memory error and the BMC is not configured with a healing function, the BMC forwards the exception information to the SMM; the SMM counts the number and frequency of recoverable memory errors for the abnormal memory module corresponding to the exception information, and when the counted number or frequency of recoverable memory errors reaches the set isolation threshold, the SMM generates the isolated memory information according to the information of the abnormal memory module and powers the server board off and then on again.
  9. The device according to any one of claims 6 to 8, wherein:
    the isolation module being configured to send the isolated memory information to the BIOS means that:
    the BMC sends to the BIOS the isolated memory information generated by the BMC; or the BMC, after receiving the isolated memory information generated by the SMM, sends that isolated memory information to the BIOS.
  10. The device according to claim 9, wherein:
    the information processing module is further configured to send the exception information to the interface browser/client (B/C).
  11. A computer-readable storage medium storing a computer program, the computer program comprising program instructions which, when executed by a server device, cause the device to perform the method according to any one of claims 1 to 5.
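
The decision flow recited in the claims above can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the names `IsolationTracker` and `handle_exception`, the threshold values, the sliding time window used to measure "frequency", and the dictionary used for the isolated memory information are all assumptions made for illustration. The claims only require that a count or a frequency be compared against a set isolation threshold, and that the BMC handle the event when its healing function is configured and the SMM handle it otherwise.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field
from enum import Enum


class ErrorType(Enum):
    RECOVERABLE = "recoverable"
    UNRECOVERABLE = "unrecoverable"


@dataclass
class IsolationTracker:
    """Per-module statistics for recoverable errors (hypothetical helper)."""
    count_threshold: int = 10      # assumed isolation threshold on total count
    freq_threshold: int = 3        # assumed isolation threshold within one window
    window_seconds: float = 60.0   # assumed window used to measure frequency
    totals: dict = field(default_factory=lambda: defaultdict(int))
    recent: dict = field(default_factory=lambda: defaultdict(deque))

    def record(self, dimm_id: str, now: float) -> bool:
        """Record one recoverable error; return True if isolation is due."""
        self.totals[dimm_id] += 1
        window = self.recent[dimm_id]
        window.append(now)
        # Drop timestamps that have fallen out of the frequency window.
        while window and now - window[0] > self.window_seconds:
            window.popleft()
        return (self.totals[dimm_id] >= self.count_threshold
                or len(window) >= self.freq_threshold)


def handle_exception(error_type, dimm_id, now, healing_enabled, tracker):
    """Return (handler, isolation_info); isolation_info is None if no action.

    The handler is the BMC when its healing function is configured and the
    SMM otherwise. The tracker stands in for whichever of the two keeps the
    statistics.
    """
    handler = "BMC" if healing_enabled else "SMM"
    # Unrecoverable errors isolate at once; recoverable ones only on threshold.
    if error_type is ErrorType.UNRECOVERABLE or tracker.record(dimm_id, now):
        # The board is power-cycled and the BMC passes this info to the BIOS.
        return handler, {"isolate": dimm_id, "power_cycle": True}
    return handler, None
```

With `freq_threshold=3`, for example, three recoverable errors on the same module within 60 seconds trigger isolation, while errors spaced further apart trigger it only once the total count reaches `count_threshold`.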
PCT/CN2015/073265 2014-12-31 2015-02-25 Server self-healing method and device WO2016106965A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201410852000.4 2014-12-31
CN201410852000.4A CN105808394B (en) 2014-12-31 2014-12-31 Server self-healing method and device

Publications (1)

Publication Number Publication Date
WO2016106965A1 true WO2016106965A1 (en) 2016-07-07

Family

ID=56284051

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/073265 WO2016106965A1 (en) 2014-12-31 2015-02-25 Server self-healing method and device

Country Status (2)

Country Link
CN (1) CN105808394B (en)
WO (1) WO2016106965A1 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108595307A (en) * 2018-05-03 2018-09-28 广州供电局有限公司 A kind of automatic self-healing method based on IT O&Ms
CN110187994A (en) * 2019-05-28 2019-08-30 北京星网锐捷网络技术有限公司 A kind of failure separation method, equipment and fault isolation system
CN112948160A (en) * 2021-02-26 2021-06-11 山东英信计算机技术有限公司 Method and device for positioning and repairing memory ECC problem
CN113535509A (en) * 2021-06-10 2021-10-22 中国长城科技集团股份有限公司 Memory bank abnormity detection method and device and BMC
CN113868001A (en) * 2021-09-10 2021-12-31 苏州浪潮智能科技有限公司 Method and system for checking memory repair result and computer storage medium
CN115269245A (en) * 2022-07-21 2022-11-01 超聚变数字技术有限公司 Memory fault processing method and computing device

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106789185A (en) * 2016-12-02 2017-05-31 国网四川省电力公司信息通信公司 A kind of information technoloy equipment management method based on outband management
CN107077408A (en) 2016-12-05 2017-08-18 华为技术有限公司 Method, computer system, baseboard management controller and the system of troubleshooting
CN107066361A (en) * 2017-04-17 2017-08-18 南京百敖软件有限公司 The method and apparatus that a kind of utilization BMC disables corrupted internal memory
CN107038098A (en) * 2017-04-28 2017-08-11 郑州云海信息技术有限公司 It is a kind of to pass through the method that network carries out server memory diagnosis in batches
CN110046061A (en) * 2019-03-01 2019-07-23 华为技术有限公司 EMS memory error treating method and apparatus
CN110262917A (en) * 2019-05-15 2019-09-20 平安科技(深圳)有限公司 Host self-healing method, device, computer equipment and storage medium
CN110457164A (en) * 2019-07-08 2019-11-15 华为技术有限公司 The method, apparatus and server of equipment management
CN112732477B (en) * 2021-04-01 2021-06-29 四川华鲲振宇智能科技有限责任公司 Method for fault isolation by out-of-band self-checking
CN113176963B (en) * 2021-04-29 2022-11-11 山东英信计算机技术有限公司 PCIe fault self-repairing method, device, equipment and readable storage medium
CN115495301A (en) * 2021-06-18 2022-12-20 华为技术有限公司 Fault processing method, device, equipment and system
CN113608908B (en) * 2021-07-28 2023-12-22 烽火超微信息科技有限公司 Server fault processing method, system, equipment and readable storage medium
CN114816822A (en) * 2022-05-07 2022-07-29 宝德计算机系统股份有限公司 Server management method, device and system based on memory fault
CN115080331A (en) * 2022-07-09 2022-09-20 超聚变数字技术有限公司 Fault processing method and computing device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102681909A (en) * 2012-04-28 2012-09-19 浪潮电子信息产业股份有限公司 Server early-warning method based on memory errors
CN103425545A (en) * 2013-08-20 2013-12-04 浪潮电子信息产业股份有限公司 System fault tolerance method for multiprocessor server
US20140095948A1 (en) * 2012-09-28 2014-04-03 International Business Machines Corporation Memory testing in a data processing system

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7308603B2 (en) * 2004-10-18 2007-12-11 International Business Machines Corporation Method and system for reducing memory faults while running an operating system
CN102222025A (en) * 2011-06-17 2011-10-19 华为数字技术有限公司 Method and device for eliminating memory failure
CN103514068A (en) * 2012-06-28 2014-01-15 北京百度网讯科技有限公司 Method for automatically locating internal storage faults
CN103631721A (en) * 2012-08-23 2014-03-12 华为技术有限公司 Method and system for isolating bad blocks in internal storage
CN103279406B (en) * 2013-05-31 2015-12-23 华为技术有限公司 A kind of partition method of internal memory and device

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108595307A (en) * 2018-05-03 2018-09-28 广州供电局有限公司 A kind of automatic self-healing method based on IT O&Ms
CN110187994A (en) * 2019-05-28 2019-08-30 北京星网锐捷网络技术有限公司 A kind of failure separation method, equipment and fault isolation system
CN112948160A (en) * 2021-02-26 2021-06-11 山东英信计算机技术有限公司 Method and device for positioning and repairing memory ECC problem
CN112948160B (en) * 2021-02-26 2023-02-28 山东英信计算机技术有限公司 Method and device for positioning and repairing memory ECC problem
CN113535509A (en) * 2021-06-10 2021-10-22 中国长城科技集团股份有限公司 Memory bank abnormity detection method and device and BMC
CN113868001A (en) * 2021-09-10 2021-12-31 苏州浪潮智能科技有限公司 Method and system for checking memory repair result and computer storage medium
CN113868001B (en) * 2021-09-10 2023-08-08 苏州浪潮智能科技有限公司 Method, system and computer storage medium for checking memory repair result
CN115269245A (en) * 2022-07-21 2022-11-01 超聚变数字技术有限公司 Memory fault processing method and computing device
CN115269245B (en) * 2022-07-21 2024-03-19 超聚变数字技术有限公司 Memory fault processing method and computing device

Also Published As

Publication number Publication date
CN105808394A (en) 2016-07-27
CN105808394B (en) 2020-09-04

Similar Documents

Publication Publication Date Title
WO2016106965A1 (en) Server self-healing method and device
EP2674865A1 (en) MANAGEMENT COMPUTER AND METHOD FOR ROOT CAUSE ANALYSiS
WO2015169199A1 (en) Anomaly recovery method for virtual machine in distributed environment
EP2776928A1 (en) Systems and methods for automatic replacement and repair of communications network devices
WO2017107656A1 (en) Virtualized network element failure self-healing method and device
US8112518B2 (en) Redundant systems management frameworks for network environments
US11848889B2 (en) Systems and methods for improved uptime for network devices
CN104113428A (en) Apparatus management device and method
CN110865907B (en) Method and system for providing service redundancy between master server and slave server
CN114090184B (en) Method and equipment for realizing high availability of virtualization cluster
JP2013130901A (en) Monitoring server and network device recovery system using the same
CN108199901B (en) Hardware repair reporting method, system, device, hardware management server and storage medium
CN106294795A (en) A kind of data base's changing method and system
CN106411643B (en) BMC detection method and device
CN105849699B (en) Method for controlling data center architecture equipment
US8965993B2 (en) Entrusted management method for a plurality of rack systems
CN115190046B (en) Detection method, detection device and computing equipment of server cluster
CN113868001B (en) Method, system and computer storage medium for checking memory repair result
US20130138803A1 (en) Method for monitoring a plurality of rack systems
CN116069373A (en) BMC firmware upgrading method, device and medium thereof
US11954509B2 (en) Service continuation system and service continuation method between active and standby virtual servers
JP5631285B2 (en) Fault monitoring system and fault monitoring method
US20140297724A1 (en) Network element monitoring system and server
WO2014010021A1 (en) Information processing device, information processing system, method for controlling information processing device, and program for controlling information processing device
JP7436737B1 (en) Server management system that supports multi-vendors

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15874658

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15874658

Country of ref document: EP

Kind code of ref document: A1