WO2016106965A1

WO2016106965A1 - Server self-healing method and device

Info

Publication number: WO2016106965A1
Application number: PCT/CN2015/073265
Authority: WO
Inventors: 李军
Original assignee: 中兴通讯股份有限公司
Priority date: 2014-12-31
Filing date: 2015-02-25
Publication date: 2016-07-07
Also published as: CN105808394A; CN105808394B

Abstract

Provided is a server self-healing method, said method comprising: a baseboard management controller (BMC) receives exception information sent by a basic input/output system (BIOS), said exception information comprising memory exception type and exception memory module identifier; according to said exception information, the BMC or system management module (SMM) generates quarantine memory information, and processes the board accordingly; the BMC sends the quarantine memory information to the BIOS, said quarantine memory information being used for instructing the BIOS to quarantine the corresponding exception memory. By means of the coordination of the BMC, the BIOS, and the SMM, the described solution accomplishes automatic self-healing of a server, which reduces the possibility of on-site manual intervention and operation and restores the server to a state of normal operation as quickly as possible.

Description

Method and device for self-healing of server

Technical field

The present invention relates to the field of servers, and in particular to a method and apparatus for server self-healing.

Background technique

At present, operators face enormous challenges. They must be able to integrate network resources quickly to provide users with the latest services, while also reducing network procurement costs, operation and maintenance costs, and failure recovery time. Operators own a large number of servers, each fitted with a large amount of memory. A memory failure can make a server behave abnormally, which reduces the stability of the services the operator provides and increases recovery time and maintenance cost.

On a server, the BMC (Baseboard Management Controller) monitors the working status of the server, manages powering the server on and off, and handles and raises alarms when the server is abnormal. The BMC exists as stand-alone firmware. It can accept commands from the SMM (System Management Module) and report monitored server exception information to the SMM. It can also provide a B/C (Browser/Client) management interface, through which it accepts control commands or control policies issued by the B/C and returns the current or historical health status of the server to the B/C.

The reliability of server memory directly affects the stability and reliability of the board. If the memory is faulty, service is interrupted, and in severe cases the system may go down. Although most high-performance, high-reliability servers use memory with ECC (Error Checking and Correcting), the resulting improvement in system reliability is limited, mainly in the following respects. First, although memory with the ECC function can automatically correct a correctable ECC error, frequent occurrences indicate that the memory has a serious hidden fault; automatic error correction is therefore a relatively passive approach, because the hidden fault is not eliminated. Second, after an uncorrectable ECC error or another unrecoverable error, the system suffers serious consequences such as a blue screen or downtime. Without out-of-band participation, only on-site personnel can shut down the server and replace the memory.

Summary of the invention

Embodiments of the present invention provide a method and apparatus for server self-healing, to reduce the need for on-site manual intervention and operation when a server fails.

To solve the above technical problem, an embodiment of the present invention provides a method for server self-healing, the method comprising:

The out-of-band management module (BMC) of the server board receives the abnormal information sent by the basic input/output system (BIOS), and the abnormal information includes a memory exception type and an abnormal memory module identifier;

The BMC or the system management module SMM generates the isolated memory information according to the abnormality information, and performs corresponding processing on the server board;

The BMC sends the isolated memory information to the BIOS, where the isolated memory information is used to instruct the BIOS to isolate the corresponding abnormal memory.
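
The three steps above can be pictured with a small data model. This is only a minimal sketch under stated assumptions: the text does not specify any message format, so the names `MemoryErrorType`, `ExceptionInfo`, `IsolationInfo`, and `generate_isolation_info`, as well as the slot label `DIMM_A1`, are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class MemoryErrorType(Enum):
    """Memory exception types named in the text (enum values are assumed)."""
    RECOVERABLE = "recoverable"
    UNRECOVERABLE = "unrecoverable"


@dataclass(frozen=True)
class ExceptionInfo:
    """Exception information the BIOS reports to the BMC."""
    error_type: MemoryErrorType
    module_id: str  # identifier of the abnormal memory module (assumed slot label)


@dataclass(frozen=True)
class IsolationInfo:
    """Isolated-memory information the BMC later sends back to the BIOS."""
    module_ids: tuple


def generate_isolation_info(info: ExceptionInfo) -> IsolationInfo:
    """Build isolation info for the module named in the exception info."""
    return IsolationInfo(module_ids=(info.module_id,))


# Hypothetical report for one faulty module.
report = ExceptionInfo(MemoryErrorType.UNRECOVERABLE, "DIMM_A1")
isolation = generate_isolation_info(report)
```

Keeping the two records separate mirrors the text: the exception information flows from the BIOS to the BMC, and the isolation information flows back.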

Optionally, the memory exception type includes an unrecoverable memory error;

The BMC or the system management module SMM generates the isolated memory information according to the abnormality information, and performs corresponding processing on the server board, including:

When the memory abnormality type of the abnormality information received by the BMC is the unrecoverable memory error, and the BMC is configured with a healing function, the BMC generates the isolated memory information according to the abnormal memory module identifier, and performs a power-off and power-on operation on the server board;

or,

When the memory abnormality type of the abnormality information received by the BMC is the unrecoverable memory error, and the BMC is not configured with a healing function, the BMC forwards the abnormality information to the SMM, and the SMM generates the isolated memory information according to the abnormal memory module identifier and performs a power-off and power-on operation on the server board.
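
The two alternatives for an unrecoverable error amount to a routing decision on whether the BMC is configured with the healing function. The following is a hedged sketch of that decision only; the `Controller` class, its log strings, and `handle_unrecoverable` are illustrative names that do not come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class Controller:
    """Stand-in for a BMC or an SMM; it only records the actions it takes."""
    name: str
    log: list = field(default_factory=list)

    def isolate_and_power_cycle(self, module_id: str) -> dict:
        # Generate isolation info from the module identifier, then power-cycle.
        self.log.append(f"{self.name}: generate isolation info for {module_id}")
        self.log.append(f"{self.name}: power off board, then power on")
        return {"isolate": [module_id], "generated_by": self.name}


def handle_unrecoverable(module_id: str, bmc: Controller, smm: Controller,
                         bmc_healing_enabled: bool) -> dict:
    """Route an unrecoverable memory error to the BMC or to the SMM."""
    if bmc_healing_enabled:
        return bmc.isolate_and_power_cycle(module_id)
    # Healing is not configured on the BMC: forward the report to the SMM.
    bmc.log.append(f"{bmc.name}: forward exception for {module_id} to {smm.name}")
    return smm.isolate_and_power_cycle(module_id)
```

Calling `handle_unrecoverable("DIMM_B2", Controller("BMC"), Controller("SMM"), bmc_healing_enabled=False)` leaves a single "forward" entry in the BMC log and the two healing actions in the SMM log.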

Optionally, the exception type includes a recoverable memory error;

When the memory abnormality type of the abnormality information received by the BMC is a recoverable memory error, and the BMC is configured with a healing function, the BMC collects count and frequency statistics on recoverable memory errors for the abnormal memory module corresponding to the abnormality information; when the counted number or frequency of recoverable memory errors reaches the set isolation threshold, the BMC generates the isolated memory information according to the information of the abnormal memory module, and performs a power-off and power-on operation on the server board;

or,

When the memory abnormality type of the abnormality information received by the BMC is a recoverable memory error, and the BMC is not configured with a healing function, the BMC forwards the abnormality information to the SMM; the SMM collects count and frequency statistics on recoverable memory errors for the abnormal memory module corresponding to the abnormality information; when the counted number or frequency of recoverable memory errors reaches the set isolation threshold, the SMM generates the isolated memory information according to the information of the abnormal memory module, and performs a power-off and power-on operation on the server board.
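
The count-or-frequency check for recoverable errors could be kept per memory module roughly as follows. This is a sketch under assumptions: the text does not say how frequency is measured, so a sliding time window is assumed here, and `RecoverableErrorTracker` with its parameters is a hypothetical name.

```python
from collections import defaultdict, deque


class RecoverableErrorTracker:
    """Counts recoverable errors per memory module (all names are illustrative)."""

    def __init__(self, count_threshold: int, freq_threshold: int, window_s: float):
        self.count_threshold = count_threshold  # total errors that trigger isolation
        self.freq_threshold = freq_threshold    # errors within window_s that trigger it
        self.window_s = window_s                # assumed sliding-window length, seconds
        self.totals = defaultdict(int)
        self.recent = defaultdict(deque)        # timestamps still inside the window

    def record(self, module_id: str, timestamp: float) -> bool:
        """Record one error; return True when the module should be isolated."""
        self.totals[module_id] += 1
        window = self.recent[module_id]
        window.append(timestamp)
        # Drop timestamps that have fallen out of the sliding window.
        while window and timestamp - window[0] > self.window_s:
            window.popleft()
        return (self.totals[module_id] >= self.count_threshold
                or len(window) >= self.freq_threshold)
```

The same tracker could run in either the BMC or the SMM, since the text assigns the statistics to whichever of the two performs the healing.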

Optionally, the sending, by the BMC, the isolated memory information to the BIOS includes:

The BMC sends the isolated memory information generated by the BMC to the BIOS; or the BMC sends the isolated memory information to the BIOS after receiving the isolated memory information generated by the SMM.

Optionally, after the BMC receives the abnormal information sent by the BIOS, the method further includes:

The BMC sends the exception information to the browser/client (B/C) management interface.

The invention also provides a device for self-healing of a server, the device comprising:

An information processing module is configured to receive abnormal information sent by the basic input/output system (BIOS) of the server board, where the abnormal information includes a memory exception type and an abnormal memory module identifier;

An exception processing module is configured to generate isolated memory information according to the abnormality information, and perform corresponding processing on the server board;

The isolation module is configured to send the isolated memory information to the BIOS, where the isolated memory information is used to instruct the BIOS to isolate the corresponding memory.

Optionally, the exception handling module is set to:

When the memory abnormality type of the abnormality information received by the BMC is an unrecoverable memory error, and the BMC is configured with a healing function, the isolated memory information is generated by the BMC according to the abnormal memory module identifier, and a power-off and power-on operation is performed on the server board;

Or,

When the memory abnormality type of the abnormal information received by the BMC is an unrecoverable memory error, and the BMC is not configured with a healing function, the abnormal information is forwarded by the BMC to the SMM, and the SMM generates the isolated memory information according to the abnormal memory module identifier and performs a power-off and power-on operation on the server board.

Optionally, the exception handling module is configured to:

When the memory abnormality type of the abnormality information received by the BMC is a recoverable memory error, and the BMC is configured with a healing function, the BMC collects count and frequency statistics on recoverable memory errors for the abnormal memory module corresponding to the abnormality information; when the counted number or frequency of recoverable memory errors reaches the set isolation threshold, the BMC generates the isolated memory information according to the information of the abnormal memory module, and performs a power-off and power-on operation on the server board;

or,

When the memory abnormality type of the abnormal information received by the BMC is a recoverable memory error, and the BMC is not configured with a healing function, the abnormal information is forwarded by the BMC to the SMM; the SMM collects count and frequency statistics on recoverable memory errors for the abnormal memory module corresponding to the abnormal information; when the counted number or frequency of recoverable memory errors reaches the set isolation threshold, the SMM generates the isolated memory information according to the information of the abnormal memory module, and performs a power-off and power-on operation on the server board.

Optionally, the isolating module is configured to send the isolated memory information to the BIOS as follows:

The BMC sends the isolated memory information generated by the BMC to the BIOS; or, after receiving the isolated memory information generated by the SMM, the BMC sends that isolated memory information to the BIOS.

Optionally, the information processing module is further configured to send the exception information to the browser/client (B/C) management interface.

The embodiment of the present invention further provides a computer-readable storage medium storing a computer program, where the computer program includes program instructions that, when executed by a server device, enable the device to execute the server self-healing method described above.

In the above solution, the BMC, the BIOS (Basic Input Output System), and the SMM cooperate so that the server heals itself automatically, which reduces the likelihood of on-site manual intervention and operation and restores the server to its normal working state as quickly as possible.

Brief description of the drawings

FIG. 1 is a schematic structural diagram of a server management system according to an embodiment of the present invention;

FIG. 2 is a flow chart of a method for self-healing of a server according to an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a device for self-healing of a server according to an embodiment of the present invention;

FIG. 4 is a flow chart of a method for server self-healing according to another embodiment of the present invention.

Preferred embodiment of the invention

Embodiments of the present application will be described in detail below with reference to the accompanying drawings. It should be noted that, in the case of no conflict, the features in the embodiments and the embodiments in the present application may be arbitrarily combined with each other.

Embodiment 1

The server management system structure shown in FIG. 1 includes an SMM and a plurality of slave nodes, that is, the BMCs on each server, and each server has a BIOS. The SMM is connected to the BMC of each server through channels such as IPMB (Intelligent Platform Management Bus) or LAN (Local Area Network), and the BMC and the BIOS can communicate through various types of physical channels. This system structure gives the SMM a physical channel for managing server memory anomalies. In the server system, the servers use memory that supports the ECC function, which provides the hardware prerequisite for discovering memory exceptions in time. The main function of the B/C is to configure how the BMC handles memory exceptions. For example, a policy can be configured to restart a board and isolate the fault when the frequency of recoverable memory faults on a certain memory module exceeds a certain threshold. In addition, the B/C can query memory failures and provide an interface for powering down the board.
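
A policy of the kind the B/C might configure on the BMC can be written down as a small validated record. The field names, defaults, and the dict-based input below are assumptions made for illustration; the text does not define a policy format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealingPolicy:
    """Hypothetical policy a B/C operator might push to the BMC."""
    healing_enabled: bool       # whether the BMC heals locally or defers to the SMM
    freq_threshold: int         # recoverable errors per window that trigger isolation
    window_seconds: int         # length of the counting window
    restart_on_threshold: bool  # power-cycle the board when the threshold is hit

    def __post_init__(self):
        # Reject thresholds that could never be reached or would always fire.
        if self.freq_threshold < 1 or self.window_seconds < 1:
            raise ValueError("thresholds and windows must be positive")


def policy_from_dict(raw: dict) -> HealingPolicy:
    """Build a policy from a plain dict, e.g. one decoded from a web form."""
    return HealingPolicy(
        healing_enabled=bool(raw.get("healing_enabled", False)),
        freq_threshold=int(raw["freq_threshold"]),
        window_seconds=int(raw["window_seconds"]),
        restart_on_threshold=bool(raw.get("restart_on_threshold", True)),
    )
```

Validating at construction time means a malformed policy is rejected when it is configured rather than when a memory error arrives.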

As shown in FIG. 2, an embodiment of the present invention provides a method for self-healing a server, where the method includes:

Step S100: The out-of-band management module (BMC) of the server board receives the abnormal information sent by the basic input/output system (BIOS), where the abnormal information includes a memory exception type and an abnormal memory module identifier;

Step S102: The BMC or the system management module SMM generates the isolated memory information according to the abnormality information, and performs corresponding processing on the server board.

Step S104: The BMC sends the isolated memory information to the BIOS, where the isolated memory information is used to instruct the BIOS to isolate the corresponding abnormal memory.

Preferably, the exception type includes an unrecoverable memory error;

The BMC or the system management module SMM generates the isolated memory information according to the abnormal information, and performs corresponding processing on the board, including:

When the abnormality type of the abnormality information received by the BMC is an unrecoverable memory error, and the BMC is configured with a healing function, the BMC generates the isolated memory information according to the abnormal memory module identifier, and performs power-off and power-on operations on the board;

or,

When the abnormal type of the abnormality information received by the BMC is an unrecoverable memory error, and the BMC is not configured with a healing function, the BMC forwards the abnormality information to the SMM, and the SMM generates the isolated memory information according to the abnormal memory module identifier and performs power-off and power-on operations on the board.

Preferably, the type of exception includes a recoverable memory error;

When the abnormality type of the abnormality information received by the BMC is a recoverable memory error, and the BMC is configured with a healing function, the BMC collects count and frequency statistics on recoverable memory errors for the abnormal memory module corresponding to the abnormality information; when the counted number or frequency of recoverable memory errors reaches the set isolation threshold, the BMC generates the isolated memory information based on the information of the abnormal memory module, and performs power-off and power-on operations on the board;

or,

When the abnormal type of the abnormality information received by the BMC is a recoverable memory error, and the BMC is not configured with a healing function, the BMC forwards the abnormality information to the SMM; the SMM collects count and frequency statistics on recoverable memory errors for the abnormal memory module corresponding to the abnormality information; when the counted number or frequency of recoverable memory errors reaches the set isolation threshold, the SMM generates the isolated memory information according to the information of the abnormal memory module, and performs power-off and power-on operations on the board.

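Preferably, the sending, by the BMC, of the isolated memory information to the BIOS includes:

The BMC sends the isolated memory information generated by the BMC to the BIOS; or the BMC sends the isolated memory information to the BIOS after receiving the isolated memory information generated by the SMM.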

Preferably, after the BMC receives the abnormality information sent by the BIOS, the method further includes:

The BMC sends the exception information to the browser/client (B/C) management interface.

As shown in FIG. 3, an embodiment of the present invention further provides a device for self-healing a server, where the device includes a processor, a program storage device, and a data storage device, and further includes:

The information processing module 11 is configured to receive abnormal information sent by the basic input/output system (BIOS) of the server board, where the abnormal information includes a memory exception type and an abnormal memory module identifier;

The exception processing module 12 is configured to generate isolated memory information according to the abnormality information, and perform corresponding processing on the server board;

The isolation module 13 is configured to send the isolated memory information to the BIOS, where the isolated memory information is used to instruct the BIOS to isolate the corresponding abnormal memory.

Preferably, the exception type includes an unrecoverable memory error;

The exception handling module 12 is configured to generate isolated memory information according to the abnormal information, and perform corresponding processing on the board:

When the abnormal type of the abnormality information received by the BMC is the unrecoverable memory error, and the BMC is configured with a healing function, the isolated memory information is generated by the BMC according to the abnormal memory module identifier, and power-off and power-on operations are performed on the server board;

or,

When the memory abnormality type of the abnormality information received by the BMC is the unrecoverable memory error, and the BMC is not configured with a healing function, the abnormal information is forwarded by the BMC to the SMM, and the SMM generates the isolated memory information according to the abnormal memory module identifier and performs a power-off and power-on operation on the server board.

Preferably, the type of exception includes a recoverable memory error;

or,

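Preferably, the isolating module 13 is configured to send the isolated memory information to the BIOS as follows:

The BMC sends the isolated memory information generated by the BMC to the BIOS; or, after receiving the isolated memory information generated by the SMM, the BMC sends that isolated memory information to the BIOS.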

Preferably, the information processing module 11 is further configured to send the exception information to the browser/client (B/C) management interface.

Embodiment 2

FIG. 4 is a flowchart of a method for self-healing a server according to another embodiment of the present invention, in which:

The BIOS is responsible for detecting memory exceptions, distinguishing recoverable single-bit ECC errors from unrecoverable double-bit ECC errors, and locating the fault to a specific physical memory module. When the system starts again after self-healing, the BIOS isolates the abnormal memory module so that it is no longer used.

The BMC is responsible for forwarding the memory exceptions reported by the BIOS to the SMM, or for directly performing the SMM function described in the next paragraph, and for reporting the faulty memory module information to the BIOS when the server is powered on again.

The SMM receives the memory fault information forwarded by the out-of-band management module (BMC), and tracks which memory module is abnormal and how many times. Based on the type of memory abnormality and the frequency with which it occurs, the SMM determines whether to perform self-healing processing on the specified abnormal board.

In this embodiment, a memory error may occur on the server during the BIOS startup phase or during the OS running phase, and in either case the error can be detected by the BIOS. The BIOS identifies the memory module corresponding to the memory error and reports it to the BMC; the BMC reports the memory error to the SMM, or makes it available for B/C queries.

At the same time, the BMC or SMM also counts the number of errors of each type that occur within a period of time; the counts can be kept per memory module.
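
Per-module, per-type statistics over a period can be kept with a simple counter. The sketch below assumes each event is a `(timestamp, module_id, error_type)` tuple; that layout, the function name `summarize_errors`, and the sample values are all invented for the example.

```python
from collections import Counter


def summarize_errors(events, start, end):
    """Count errors per (module, error type) with start <= timestamp < end.

    `events` is an iterable of (timestamp, module_id, error_type) tuples;
    this tuple layout is an assumption made for the example.
    """
    counts = Counter()
    for timestamp, module_id, error_type in events:
        if start <= timestamp < end:
            counts[(module_id, error_type)] += 1
    return counts


# Hypothetical event log; timestamps are seconds since some reference point.
events = [
    (5, "DIMM_A1", "recoverable"),
    (12, "DIMM_A1", "recoverable"),
    (20, "DIMM_B2", "unrecoverable"),
    (95, "DIMM_A1", "recoverable"),  # outside the period queried below
]
summary = summarize_errors(events, start=0, end=60)
```

A summary of this shape would also serve the B/C query mentioned earlier, since it can be filtered by module or by error type.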

It should be noted that, in the process of the server healing in this embodiment, different processing flows are performed according to the type of memory error, and the memory error types include unrecoverable memory errors and recoverable memory errors. The unrecoverable memory error and recoverable memory error handling flow are described as follows.

First, for unrecoverable memory errors

Step A: The BMC of the server board receives an unrecoverable memory error reported by the BIOS, or the SMM receives an unrecoverable memory error from the BMC. The BMC/SMM automatically powers the board off and then on again, and then step B is performed;

Step B: After the board is powered on again, the BMC actively sends the number of the memory module that detected the unrecoverable fault to the BIOS, and performs step C;

Step C: After receiving the number, the BIOS masks the memory module that has the unrecoverable fault, that is, the memory module that has the unrecoverable fault is not used after startup.

Through the above operations, such serious memory errors can be healed automatically and manual intervention is reduced; otherwise, faults of this kind would have to be resolved by on-site personnel.
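
Steps A to C for an unrecoverable error can be simulated with two toy classes. This is a sketch, not the described firmware: the `Bios` and `Bmc` classes are hypothetical, and it is assumed that the BMC's list of faulty modules survives the board power cycle.

```python
class Bios:
    """Toy BIOS: masks the modules the BMC reports at start-up."""

    def __init__(self, installed):
        self.installed = list(installed)
        self.masked = set()

    def receive_isolation(self, module_ids):
        # Step C: mask the reported modules so they are not used after startup.
        self.masked.update(module_ids)

    def usable_modules(self):
        return [m for m in self.installed if m not in self.masked]


class Bmc:
    """Toy BMC: remembers faulty modules across a board power cycle."""

    def __init__(self, bios):
        self.bios = bios
        self.pending = []      # assumed to survive the board power cycle
        self.power_cycles = 0

    def on_unrecoverable_error(self, module_id):
        # Step A: note the faulty module, then power the board off and on.
        self.pending.append(module_id)
        self.power_cycles += 1
        self.on_board_power_on()

    def on_board_power_on(self):
        # Step B: actively send the faulty module numbers to the BIOS.
        self.bios.receive_isolation(self.pending)
```

With three installed modules and one unrecoverable fault reported, `usable_modules()` returns the remaining two in their installed order.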

Second, for recoverable memory errors

Step A: The BMC of the server board receives a recoverable memory error reported by the BIOS, or the SMM receives a recoverable memory error reported by the BMC. The BMC/SMM records the number and frequency of such errors per memory module and compares them with the set threshold. If the set threshold is reached, the BMC/SMM automatically powers the board off and then on again, and then step B is performed;

Step B: After the board is powered on again, the BMC actively sends to the BIOS the number of the memory module whose recoverable faults most recently reached the set threshold, and step C is performed;

It should be noted that, if it is the SMM that powers the board off and on, the SMM generates the number of the memory module to be isolated and sends it to the BMC, which forwards it to the BIOS.

Step C: After receiving the number, the BIOS masks the memory module reported by the BMC, that is, the memory module is not used after startup.

The above operations ensure that memory with frequent abnormalities is isolated: memory with hidden faults is automatically isolated in advance, which ensures the stability and reliability of the system.
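
When the BMC has no healing function, it acts as a relay in both directions: BIOS reports go up to the SMM, and the SMM's isolation decision comes back down to the BIOS. The sketch below shows only that relay; the class names are hypothetical, a plain total count stands in for the configured threshold, and the board power cycle is omitted.

```python
class BiosStub:
    """Records which modules it has been told to mask."""

    def __init__(self):
        self.masked = []

    def mask(self, module_id):
        self.masked.append(module_id)


class Smm:
    """Counts recoverable errors per module and decides when to isolate."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.counts = {}

    def on_recoverable(self, bmc, module_id):
        self.counts[module_id] = self.counts.get(module_id, 0) + 1
        if self.counts[module_id] == self.threshold:
            # The SMM generates the module number and hands it to the BMC.
            bmc.forward_isolation(module_id)


class BmcRelay:
    """BMC without the healing function: relays in both directions."""

    def __init__(self, bios, smm):
        self.bios = bios
        self.smm = smm

    def report_recoverable(self, module_id):
        self.smm.on_recoverable(self, module_id)  # forward the BIOS report upward

    def forward_isolation(self, module_id):
        self.bios.mask(module_id)                 # forward the SMM decision downward
```

Comparing with `==` rather than `>=` keeps the example from sending the same module to the BIOS twice; a real implementation would need its own rule for repeated reports.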

It should be noted that, for an unrecoverable fault, the memory module named in the fault information sent by the BIOS is the memory module that needs to be isolated. For a recoverable fault received by the BMC, the number of occurrences is counted per memory module, and a memory module that reaches the isolation threshold needs to be isolated. For recoverable faults, the threshold can be set either on the number of occurrences within a certain period of time (a frequency threshold) or on the total number of occurrences, and different strategies can be configured according to specific implementation requirements.
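
The selection rule in this paragraph can be stated compactly: an unrecoverable fault isolates its module immediately, while a recoverable fault isolates a module only once its count reaches the threshold. The sketch uses the total-count variant of the threshold; `modules_to_isolate` and the `(module_id, is_recoverable)` pair format are assumptions for illustration.

```python
def modules_to_isolate(reports, total_threshold):
    """Return module ids to isolate, in the order they first qualified.

    `reports` is a sequence of (module_id, is_recoverable) pairs (assumed format).
    """
    counts = {}
    selected = []
    for module_id, is_recoverable in reports:
        if is_recoverable:
            counts[module_id] = counts.get(module_id, 0) + 1
            qualifies = counts[module_id] >= total_threshold
        else:
            qualifies = True  # an unrecoverable fault isolates immediately
        if qualifies and module_id not in selected:
            selected.append(module_id)
    return selected
```

For the reports `[("A", True), ("B", False), ("A", True), ("C", True), ("A", True)]` with a threshold of 3, module B qualifies on its single unrecoverable fault and module A on its third recoverable one, while C never qualifies.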

In the above technical solution, the BMC, the SMM, and the BIOS cooperate to detect server memory abnormalities and perform self-healing control according to the set policy. A memory abnormality can be traced to a specific memory module, and whether the board is powered off and then on again is determined by the severity of the abnormality and by how often a specific memory module becomes abnormal within a fixed time. When the board is initialized after power-on, the abnormal memory is isolated and is no longer used. This removes the original limitation that a server could not recover automatically from a memory abnormality and had to be recovered manually on site. It reduces the likelihood of manual intervention when an abnormality occurs, greatly improves the reliability of the system, and shortens the recovery time of the server.

It should be emphasized that those skilled in the art will understand that the policies and steps covered in the embodiments of the present invention may be implemented by a general-purpose computing device, and may be centralized on a single computing device or distributed over a network of multiple computing devices. Alternatively, they may be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; or they may be fabricated separately as individual integrated circuit modules, or a plurality of the modules or steps may be implemented as a single integrated circuit module.

The above is only the preferred embodiment of the present invention and is not intended to limit the present invention; various modifications and changes can be made to the present invention. Any modifications, equivalent substitutions, improvements, etc. made within the spirit and scope of the present invention are intended to be included within the scope of the present invention. One of ordinary skill in the art will appreciate that all or a portion of the steps described above can be accomplished by a program that instructs the associated hardware, and the program may be stored in a computer-readable storage medium such as a read-only memory, a magnetic disk, or an optical disk. Optionally, all or part of the steps of the foregoing embodiments may also be implemented by using one or more integrated circuits. Accordingly, each module/unit in the foregoing embodiments may be implemented in the form of hardware or in the form of a software function module. This application is not limited to any specific combination of hardware and software.

Industrial applicability

In the technical solution provided by the present invention, the BMC, the SMM, and the BIOS cooperate to detect server memory abnormalities, and the abnormal memory module is isolated according to the severity of the memory abnormality and the frequency with which a specific memory module becomes abnormal within a fixed time. This avoids the original problem that a server could not recover automatically from a memory abnormality and had to be recovered manually on site, reduces the likelihood of manual intervention when an abnormality occurs, greatly improves the reliability of the system, and shortens the recovery time of the server.

Claims

A method for server self-healing, including:

The out-of-band management module (BMC) of the server board receives the abnormal information sent by the basic input/output system (BIOS), and the abnormal information includes a memory exception type and an abnormal memory module identifier;

The BMC or the system management module SMM generates the isolated memory information according to the abnormality information, and performs corresponding processing on the server board;

The BMC sends the isolated memory information to the BIOS, where the isolated memory information is used to instruct the BIOS to isolate the corresponding abnormal memory.
The method of claim 1 wherein:

The memory exception type includes an unrecoverable memory error;

The BMC or the system management module SMM generates the isolated memory information according to the abnormality information, and performs corresponding processing on the server board, including:

When the memory abnormality type of the abnormality information received by the BMC is the unrecoverable memory error, and the BMC is configured with a healing function, the BMC generates the isolated memory information according to the abnormal memory module identifier, and performs power-off and power-on operations on the server board;

or,

When the memory abnormality type of the abnormality information received by the BMC is the unrecoverable memory error, and the BMC is not configured with a healing function, the BMC forwards the abnormality information to the SMM, and the SMM generates the isolated memory information according to the abnormal memory module identifier and performs a power-off and power-on operation on the server board.
The method of claim 1 wherein:

The memory exception type includes a recoverable memory error;

The BMC or the system management module SMM generates the isolated memory information according to the abnormality information, and performs corresponding processing on the server board, including:

When the memory abnormality type of the abnormality information received by the BMC is a recoverable memory error, and the BMC is configured with a healing function, the BMC collects count and frequency statistics on recoverable memory errors for the abnormal memory module corresponding to the abnormality information; when the counted number or frequency of recoverable memory errors reaches the set isolation threshold, the BMC generates the isolated memory information according to the information of the abnormal memory module, and performs power-off and power-on operations on the server board;

or,

When the memory abnormality type of the abnormality information received by the BMC is a recoverable memory error, and the BMC is not configured with a healing function, the BMC forwards the abnormality information to the SMM; the SMM collects count and frequency statistics on recoverable memory errors for the abnormal memory module corresponding to the abnormality information; when the counted number or frequency of recoverable memory errors reaches the set isolation threshold, the SMM generates the isolated memory information according to the information of the abnormal memory module, and performs a power-off and power-on operation on the server board.
A method as claimed in any one of claims 1 to 3 wherein:

Sending, by the BMC, the isolated memory information to the BIOS includes:

The BMC sends the isolated memory information generated by the BMC to the BIOS; or the BMC sends the isolated memory information to the BIOS after receiving the isolated memory information generated by the SMM.
The method of claim 4 wherein:

After the BMC receives the abnormality information sent by the BIOS, the method further includes:

The BMC sends the exception information to the browser/client (B/C) management interface.
A device for self-healing of a server, comprising:

An information processing module is configured to receive abnormal information sent by the basic input/output system (BIOS) of the server board, where the abnormal information includes a memory exception type and an abnormal memory module identifier;

An exception processing module is configured to generate isolated memory information according to the abnormality information, and perform corresponding processing on the server board;

The isolation module is configured to send the isolated memory information to the BIOS, where the isolated memory information is used to instruct the BIOS to isolate the corresponding abnormal memory.
The apparatus of claim 6 wherein:

The exception handling module is set to:

When the memory exception type of the exception information received by the BMC is an unrecoverable memory error, and the BMC is configured with a healing function, the BMC generates the isolated memory information according to the abnormal memory module identifier, and performs a power-off and power-on operation on the server board;

or,

When the memory abnormality type of the abnormal information received by the BMC is an unrecoverable memory error, and the BMC is not configured with a healing function, the abnormal information is forwarded by the BMC to the SMM, and the SMM generates the isolated memory information according to the abnormal memory module identifier and performs a power-off and power-on operation on the server board.
The apparatus of claim 6 wherein:

The exception handling module is set to:

When the memory abnormality type of the abnormality information received by the BMC is a recoverable memory error, and the BMC is configured with a healing function, the BMC collects count and frequency statistics on recoverable memory errors for the abnormal memory module corresponding to the abnormality information; when the counted number or frequency of recoverable memory errors reaches the set isolation threshold, the BMC generates the isolated memory information according to the information of the abnormal memory module, and performs a power-off and power-on operation on the server board;

or,

When the memory abnormality type of the abnormal information received by the BMC is a recoverable memory error, and the BMC is not configured with a healing function, the abnormal information is forwarded by the BMC to the SMM; the SMM collects count and frequency statistics on recoverable memory errors for the abnormal memory module corresponding to the abnormal information; when the counted number or frequency of recoverable memory errors reaches the set isolation threshold, the SMM generates the isolated memory information according to the information of the abnormal memory module, and performs power-off and power-on operations on the server board.
A device according to any of claims 6 to 8, wherein:

The isolating module is configured to send the isolated memory information to the BIOS as follows:

The BMC sends the isolated memory information generated by the BMC to the BIOS; or, after receiving the isolated memory information generated by the SMM, the BMC sends that isolated memory information to the BIOS.
The apparatus of claim 9 wherein:

The information processing module is further configured to send the exception information to the browser/client (B/C) management interface.
A computer readable storage medium storing a computer program, the computer program comprising program instructions which, when executed by a server device, cause the device to perform the method of any of claims 1-5.