CN113110953A

CN113110953A - Error reporting collection method of multi-partition multi-node server

Info

Publication number: CN113110953A
Application number: CN202110402331.8A
Authority: CN
Inventors: 张莉
Original assignee: Shandong Yingxin Computer Technology Co Ltd
Current assignee: Shandong Yingxin Computer Technology Co Ltd
Priority date: 2021-04-14
Filing date: 2021-04-14
Publication date: 2021-07-13

Abstract

The invention provides an error reporting collection method for a multi-partition multi-node server, comprising the following steps: acquiring the partition condition of the computing nodes of the multi-way server, and determining the main computing node according to the partition condition so as to identify the main BMC and the nodes governed by the main BMC; the CPU sends error reporting information to the CPLD of its node; the CPLD stores the error reporting information in an internal register and sends a signal to the main BMC; and the main BMC collects the CPLD internal register information of the nodes it governs and locates the source node and CPU of the error reporting information. For a multi-node server, the invention temporarily stores all abnormal information of a node in a CPLD register on the mainboard, and uses the BMC (baseboard management controller) to read the register state in the CPLD so as to determine exactly where a specific piece of abnormal information comes from.

Description

Error reporting collection method of multi-partition multi-node server

Technical Field

The invention belongs to the technical field of multi-node servers, and particularly relates to an error reporting collection method of a multi-partition multi-node server.

Background

As servers are applied ever more widely in government, finance, healthcare, energy and other industries, the demand for large core databases, virtualization consolidation, in-memory computing and high-performance computing keeps growing. Two-way, four-way and eight-way servers have appeared in succession; that is, a plurality of CPUs are concentrated on one main control board so that the server can execute in a multi-way, parallel manner, which greatly improves its processing performance.

As the rates of CPU-supported protocols such as PCIe and UPI increase and the number of CPU cores grows, CPU power consumption rises. On Intel's latest Eagle Stream platform, the maximum power consumption of a CPU reaches 350 W. In an eight-way server, to meet the requirements of high density and heat dissipation, two-way mainboards are used; that is, four identical mainboards are combined to form an eight-way system, so as to gain more heat-dissipation space.

To ensure safe operation of the server, error reporting is essential information while the server is working, and each computing node has its own error reporting information. The error reporting information of the CPU includes THERMTRIP, ERROR0/1/2, PROC_HOT, MEM_HOT, etc. In a multi-way server, the CPLD of the main computing node generally collects these signals from all nodes in hardware and then reports them to the BMC in a unified manner. As a result, when the system fails, the cause of the abnormality cannot be determined quickly. As shown in Fig. 1, in single-partition mode the error reporting information of each node is aggregated to the CPLD on node 0, and CPLD0 then sends it to BMC0 on that node (in single-partition mode, the BMC of node 0 is the main BMC and its CPLD is the main CPLD).

As can be seen from the figure, when the system fails, only the kind of failure can be determined; the abnormal node or abnormal CPU cannot be determined quickly. Moreover, in a multi-node server, because the CPLD of the main computing node collects the error reporting information of all nodes, the number of error reporting signals from the CPLD to the BMC increases, the number of GPIOs used on the CPLD chip also increases, a CPLD with more GPIOs has to be selected, and the cost rises.

Disclosure of Invention

In view of the above-mentioned deficiencies of the prior art, the present invention provides an error reporting collection method for a multi-partition multi-node server to solve the above-mentioned technical problems.

The invention provides an error reporting collection method of a multi-partition multi-node server, which comprises the following steps:

acquiring the partition condition of the computing nodes of the multi-way server, and determining the main computing node according to the partition condition so as to identify the main BMC and the nodes governed by the main BMC;

the CPU sends the error reporting information to the CPLD of the node;

the CPLD stores the error reporting information into an internal register and sends a signal to the main BMC;

and the main BMC collects the CPLD internal register information of the nodes it governs and locates the source node and CPU of the error reporting information.

Further, the method further comprises:

and storing the error reporting information in a system log of the main BMC.

Further, acquiring the partition condition of the computing nodes of the multi-way server includes:

determining the partition condition of the computing nodes according to the MODE signal set by the jumper caps on the management board;

the multi-way server is an eight-way server, and the partition condition of the computing nodes comprises: single partition, double partition, quad partition.

Further, the method further comprises:

and acquiring an MS signal on the mainboard, and acquiring the master-slave relationship of the computing node according to the MS signal.

Further, the obtaining of the main computing node according to the partition condition so as to determine the main BMC and the nodes governed by the main BMC includes:

the single partition has one main computing node, and the BMC of the main computing node governs the error reporting information of the other three computing nodes;

the double partition has two main computing nodes, and the BMC of each main computing node governs the error reporting information of one other computing node;

in the quad partition, all computing nodes are main computing nodes, and each BMC governs the error reporting information of its own node.

Further, in the above-mentioned case,

the CPU error reporting information comprises: an overheating trigger power-off alarm signal, an operation error signal, a processor overheating signal and a memory overheating signal.

The beneficial effect of the invention is that,

the invention provides an error reporting collection method for a multi-partition multi-node server. For a multi-node server, all abnormal information of a node is temporarily stored in a CPLD register on the mainboard, and the BMC (baseboard management controller) reads the register state in the CPLD to determine exactly where a specific piece of abnormal information comes from. The log can then be read through the BMC to quickly locate the source of the error reporting information, which reduces the time and cost of maintaining an abnormal machine and improves maintenance efficiency.

In addition, the invention has reliable design principle, simple structure and very wide application prospect.

Drawings

In order to more clearly illustrate the embodiments or technical solutions in the prior art of the present invention, the drawings used in the description of the embodiments or prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained based on these drawings without creative efforts.

Fig. 1 is a schematic diagram of the prior art structure of the present invention.

FIG. 2 is a schematic flow diagram of a method of one embodiment of the invention.

FIG. 3 is a schematic diagram of the partitioning and node location of the compute nodes of the eight-way server in one embodiment of the invention.

Fig. 4 is a schematic diagram of an error reporting mechanism of an eight-way server according to an embodiment of the present invention.

Detailed Description

In order to make those skilled in the art better understand the technical solution of the present invention, the technical solution in the embodiment of the present invention will be clearly and completely described below with reference to the drawings in the embodiment of the present invention, and it is obvious that the described embodiment is only a part of the embodiment of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In the description of the present invention, it should be noted that, unless otherwise explicitly specified or limited, the terms "mounted," "coupled," and "connected" are to be construed broadly: a connection may be fixed, removable, or integral; it may be mechanical or electrical; and it may be direct, indirect through an intervening medium, or an internal communication between two elements. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art according to the specific situation.

The following explains key terms appearing in the present invention.

CPLD: Complex Programmable Logic Device, short for Complex PLD, a logic device more complex than a PLD.

BMC: Baseboard Management Controller.

The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings.

As shown in fig. 2, the method 100 includes:

step 110, acquiring the partition condition of the computing nodes of the multi-way server, and determining the main computing node according to the partition condition, thereby identifying the main BMC and the nodes governed by the main BMC;

step 120, the CPU sends the error information to the CPLD of the node;

step 130, the CPLD stores the error reporting information in an internal register and sends a signal to the main BMC;

step 140, the master BMC collects the CPLD internal register information of the managed node, and locates the source node and the CPU reporting the error information.

Optionally, as an embodiment of the present invention, the method further includes:

and storing the error reporting information in a system log of the main BMC.

Optionally, as an embodiment of the present invention, the obtaining the partition condition of the compute node of the multi-way server includes:

Optionally, as an embodiment of the present invention, the obtaining a main computing node according to a partition condition, so as to determine the main BMC and its managed nodes includes:

Alternatively, as an embodiment of the present invention,

For a multi-way server, such as an eight-way server, each computing node has management chips such as a BMC and a CPLD, and the error reporting signals output by the CPU include: THERMTRIP signals, ERROR[2:0] signals, PROC_HOT signals, MEM_HOT_OUT signals, and the like.

Each CPU sends its error reporting signals to the CPLD. When the CPLD receives such a signal, it stores the information in an internal register and then sends an alert signal to notify the BMC that an error has been received. After receiving the alert signal, the BMC reads the CPLD internal register information of all the nodes, so that the BMC of each main computing node knows which CPU of which node reported the error, and then records the abnormal information in the log. When the server needs maintenance, the BMC log shows at a glance which error was reported at which position, so that the system can be maintained in a more targeted manner.
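The latch-and-alert behavior described above can be sketched as a small Python model. This is only an illustration under stated assumptions: the class name, the choice of one register per CPU and the bit positions are hypothetical, since the description does not give the CPLD register map.

```python
# Hypothetical bit positions inside one per-CPU error register
# (the description does not specify the actual register layout).
ERROR_BITS = {
    "THERMTRIP": 0,
    "ERROR0": 1,
    "ERROR1": 2,
    "ERROR2": 3,
    "PROC_HOT": 4,
    "MEM_HOT_OUT": 5,
}


class CpldModel:
    """Toy model of one node's CPLD: latches CPU error signals and raises an alert."""

    def __init__(self, cpu_count=2):
        # One latched register per CPU on this node; all bits start cleared.
        self.registers = [0] * cpu_count
        # Models the alert line towards the main BMC.
        self.alert = False

    def report(self, cpu, signal):
        """Latch `signal` from `cpu` into its register and assert the alert."""
        self.registers[cpu] |= 1 << ERROR_BITS[signal]
        self.alert = True

    def read_registers(self):
        """What the BMC would read back: a copy of the latched registers."""
        return list(self.registers)


cpld = CpldModel()
cpld.report(cpu=1, signal="PROC_HOT")
print(cpld.alert, cpld.read_registers())  # prints: True [0, 16]
```

Because the error is latched rather than forwarded on a dedicated wire, the CPLD needs only one alert line to the BMC regardless of how many error signals it monitors.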

In order to facilitate understanding of the present invention, the principle of the error reporting collection method of the server of the present invention is described below, taking an eight-way server as an example, and referring to fig. 3 and 4, a further description is made of the error reporting collection method of a multi-partition multi-node server provided by the present invention.

Two jumper caps are provided on the management board, by which the eight-way server can be set to a single partition, a double partition or a quad partition: MODE[1:0] = 11 indicates that the system is a single partition, MODE[1:0] = 10 or 01 indicates a double partition, and MODE[1:0] = 00 indicates a quad partition. Each mainboard has an MS[1:0] signal connected to the middle backplane: MS[1:0] = 11 indicates that the computing node is at position 0, MS[1:0] = 10 indicates position 1, MS[1:0] = 01 indicates position 2, and MS[1:0] = 00 indicates position 3.
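The two encodings above can be written out as a short Python sketch. The function and enum names are illustrative only; the bit values are those given in the description, and the observation that the node position equals the bitwise inverse of MS[1:0] is simply a compact way of reproducing the listed mapping.

```python
from enum import Enum


class PartitionMode(Enum):
    SINGLE = "single"
    DUAL = "dual"
    QUAD = "quad"


def decode_mode(mode_bits):
    """Map the 2-bit MODE[1:0] jumper value to a partition mode."""
    if mode_bits not in (0b00, 0b01, 0b10, 0b11):
        raise ValueError("MODE[1:0] must be a 2-bit value")
    if mode_bits == 0b11:
        return PartitionMode.SINGLE
    if mode_bits == 0b00:
        return PartitionMode.QUAD
    return PartitionMode.DUAL  # both 0b10 and 0b01 select the double partition


def decode_node_position(ms_bits):
    """Map MS[1:0] to a node position: 11 -> 0, 10 -> 1, 01 -> 2, 00 -> 3."""
    if ms_bits not in (0b00, 0b01, 0b10, 0b11):
        raise ValueError("MS[1:0] must be a 2-bit value")
    return ~ms_bits & 0b11


print(decode_mode(0b10).value, decode_node_position(0b01))  # prints: dual 2
```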

When the system is a single partition, computing node 0 is the main computing node, and its BMC needs to gather the information of the other 3 nodes; when the system is a double partition, computing nodes 0 and 2 are both main computing nodes, BMC0 needs to collect the information of node 1, and BMC2 needs to collect the information of node 3; when the system is a quad partition, every computing node is a main computing node, and each BMC gathers the information of its own node. The BMC on a computing node determines whether it is on a main computing node from the partition-mode signal and the node position. If so, after receiving an alert signal it acts as the main BMC, accesses the CPLDs of all the nodes it governs through the SMBus, and obtains the information from the CPLD internal registers; if not, the BMC does not access any CPLD through the SMBus, so as to avoid SMBus link abnormalities.
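The master/slave decision made by each BMC can be sketched as a lookup table in Python. The table contents follow the three partition cases described above; the names are hypothetical, and including the master's own node in its polling list is an assumption based on the statement that the main BMC accesses the CPLD of its own node as well as those of the slave nodes.

```python
# For each partition mode: master node position -> node positions whose
# CPLDs that master's BMC reads (its own node plus its slave nodes).
GOVERNED_NODES = {
    "single": {0: [0, 1, 2, 3]},
    "dual": {0: [0, 1], 2: [2, 3]},
    "quad": {0: [0], 1: [1], 2: [2], 3: [3]},
}


def is_master(mode, position):
    """True if the BMC at `position` is a main BMC in partition `mode`."""
    return position in GOVERNED_NODES[mode]


def nodes_to_poll(mode, position):
    """CPLDs this BMC may read over SMBus; an empty list for a slave BMC."""
    return list(GOVERNED_NODES[mode].get(position, []))


for position in range(4):
    print(position, is_master("dual", position), nodes_to_poll("dual", position))
```

A slave BMC receives an empty list and therefore never drives the SMBus, which matches the stated goal of avoiding link abnormalities caused by several BMCs accessing the same CPLDs.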

The CPLD sets its SMBus address according to the position of its node: node 0 is set to 0x20, node 1 to 0x22, node 2 to 0x24, and node 3 to 0x26. Because the SMBus address of each CPLD is bound to the node position and differs from node to node, no address conflict arises when the main BMC accesses the CPLD of its own node and the CPLDs of the slave nodes, in any partition mode.
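The position-to-address binding can be captured in a few lines of Python. The addresses are the ones listed above; the helper names are illustrative, and the reverse lookup is an assumed convenience showing how a read from a given address identifies the source node.

```python
# SMBus address of each node's CPLD, keyed by node position (from the description).
CPLD_SMBUS_ADDRESS = {0: 0x20, 1: 0x22, 2: 0x24, 3: 0x26}

# Reverse lookup: the address a register read came from identifies the node.
NODE_BY_ADDRESS = {addr: node for node, addr in CPLD_SMBUS_ADDRESS.items()}

# Every node must have a distinct address, otherwise the reverse map shrinks.
assert len(NODE_BY_ADDRESS) == len(CPLD_SMBUS_ADDRESS)


def cpld_address(position):
    """SMBus address of the CPLD on the node at `position`."""
    return CPLD_SMBUS_ADDRESS[position]


def node_for_address(address):
    """Node position whose CPLD answers at `address`."""
    return NODE_BY_ADDRESS[address]


print(hex(cpld_address(2)), node_for_address(0x26))  # prints: 0x24 3
```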

Take the THERMTRIP signal of CPU1 in a single partition as an example. When the temperature of CPU1 reaches the maximum safe operating temperature, CPU1 sends the CPU1_THERMTRIP_N signal to CPLD1. After receiving the signal, CPLD1 sets the register representing the THERMTRIP of CPU1 to valid and then sends an alert signal to the main BMC, BMC0. After receiving the CPLD_ALERT_N signal, BMC0 sends a command to the CPLDs of all the nodes to read the register information inside all the CPLDs. The BMC analyzes the content read back and determines which node's CPLD internal register has been set, thereby locating CPU1; it then determines which error CPU1 has reported, stores the abnormal information in the log, and displays it in the UI. A user can therefore identify the abnormal node directly from the log information and maintain it in a targeted manner, without spending time on the normal nodes.
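The BMC-side analysis in this example can be sketched in Python as a scan over the registers read back from each node. The register layout (one register per CPU, one bit per signal) and the reading of "CPU1 on CPLD1" as CPU index 1 on node 1 are assumptions made for illustration; the description does not define the register map or the CPU numbering.

```python
# Hypothetical register layout: bit index -> signal name (not given in the source).
SIGNAL_NAMES = ["THERMTRIP", "ERROR0", "ERROR1", "ERROR2", "PROC_HOT", "MEM_HOT_OUT"]


def locate_errors(register_dump):
    """Turn {node: [register of CPU0, register of CPU1, ...]} into log entries.

    Each entry is a (node, cpu, signal) tuple for one latched error bit.
    """
    entries = []
    for node in sorted(register_dump):
        for cpu, value in enumerate(register_dump[node]):
            for bit, name in enumerate(SIGNAL_NAMES):
                if value & (1 << bit):
                    entries.append((node, cpu, name))
    return entries


# Illustrative read-back in single-partition mode: only the THERMTRIP bit of
# CPU1 on node 1 is set; every other register is clear.
dump = {0: [0, 0], 1: [0, 0b000001], 2: [0, 0], 3: [0, 0]}
for node, cpu, name in locate_errors(dump):
    print(f"node {node} CPU{cpu}: {name}")  # prints: node 1 CPU1: THERMTRIP
```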

Although the present invention has been described in detail with reference to the drawings and the preferred embodiments, the present invention is not limited thereto. Those skilled in the art can make various equivalent modifications or substitutions to the embodiments of the present invention without departing from its spirit and scope, and such modifications or substitutions shall fall within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. An error reporting collection method for a multi-partition multi-node server, comprising:

the CPU sends the error reporting information to the CPLD of the node;

2. The method of claim 1, further comprising:

and storing the error reporting information in a system log of the main BMC.

3. The method of claim 1, wherein the obtaining the partition status of the compute nodes of the multi-partition multi-node server comprises:

4. The method of claim 1, further comprising:

5. The method of claim 3, wherein the step of obtaining the master computing node according to the partition condition to determine the master BMC and its managed nodes comprises:

6. The method of claim 1,