CN113110953A - Error reporting collection method of multi-partition multi-node server - Google Patents

Error reporting collection method of multi-partition multi-node server

Info

Publication number
CN113110953A
Authority
CN
China
Prior art keywords
node
partition
main
bmc
error reporting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110402331.8A
Other languages
Chinese (zh)
Inventor
张莉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Yingxin Computer Technology Co Ltd
Original Assignee
Shandong Yingxin Computer Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Yingxin Computer Technology Co Ltd filed Critical Shandong Yingxin Computer Technology Co Ltd
Priority to CN202110402331.8A priority Critical patent/CN113110953A/en
Publication of CN113110953A publication Critical patent/CN113110953A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0766 Error or fault reporting or storing
    • G06F11/0787 Storage of error reports, e.g. persistent data storage, storage using memory protection
    • G06F11/0772 Means for error signaling, e.g. using interrupts, exception flags, dedicated error registers
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention provides an error reporting collection method for a multi-partition multi-node server, comprising the following steps: obtaining the partition condition of the computing nodes of a multi-way server, and determining the main computing node from the partition condition so as to identify the main BMC and the nodes it manages; each CPU sends its error reporting information to the CPLD of its node; the CPLD stores the error reporting information in an internal register and sends a signal to the main BMC; and the main BMC collects the internal CPLD register information of the nodes it manages and locates the source node and CPU of the error reporting information. For a multi-node server, the invention temporarily stores all of a node's abnormality information in a CPLD register on the mainboard, and the BMC (baseboard management controller) reads the register state in the CPLD to determine exactly where a given abnormality originated.

Description

Error reporting collection method of multi-partition multi-node server
Technical Field
The invention belongs to the technical field of multi-node servers, and particularly relates to an error reporting collection method of a multi-partition multi-node server.
Background
As servers are deployed ever more widely in government, finance, healthcare, energy, and other industries, demand keeps growing for large core databases, virtualization consolidation, in-memory computing, and high-performance computing. Two-way, four-way, and eight-way servers have appeared in succession; that is, multiple CPUs are concentrated on one main control board so that the server can execute multiple workloads in parallel, which greatly improves its processing performance.
As the data rates of CPU-supported protocols such as PCIe and UPI rise and CPU core counts grow, CPU power consumption rises as well. On Intel's latest Eagle Stream platform, the maximum power consumption of a single CPU reaches 350 W. To meet density and heat-dissipation requirements, an eight-way server is built from two-way mainboards: four identical two-way mainboards are combined into one eight-way system, which leaves more room for heat dissipation.
To keep a server running safely, fault reports are essential information during operation, and each computing node produces its own. CPU error signals include THERMTRIP, ERROR0/1/2, PROC_HOT, MEM_HOT, and so on. In a multi-way server, the hardware design usually has the CPLD of the main computing node collect these signals from all nodes and report them to the BMC together, so when the system fails the cause of the abnormality cannot be determined quickly. As shown in Fig. 1, in single-partition mode the error reporting information of every node is aggregated at the CPLD on node 0 (CPLD0), which then passes it to BMC0 on the same node (in single-partition mode, the BMC of node 0 is the main BMC and its CPLD is the main CPLD).
As the figure shows, when the system fails one can determine only what kind of fault occurred, not which node or which CPU is abnormal. Moreover, in a multi-node server, having the CPLD of the main computing node collect the error reporting information of all nodes increases both the number of error signals that the CPLD must report to the BMC and the number of GPIOs the CPLD chip uses, so a CPLD with more GPIOs must be selected, which raises cost.
Disclosure of Invention
In view of the above-mentioned deficiencies of the prior art, the present invention provides an error reporting collection method for a multi-partition multi-node server to solve the above-mentioned technical problems.
The invention provides an error reporting collection method for a multi-partition multi-node server, comprising:
obtaining the partition condition of the computing nodes of the multi-way server, and determining the main computing node from the partition condition so as to identify the main BMC and the nodes it manages;
each CPU sends its error reporting information to the CPLD of its node;
the CPLD stores the error reporting information in an internal register and sends a signal to the main BMC;
and the main BMC collects the internal CPLD register information of the nodes it manages and locates the source node and CPU of the error reporting information.
Further, the method further comprises:
storing the error reporting information in a system log of the main BMC.
Further, obtaining the partition condition of the computing nodes of the multi-way server comprises:
determining the partition condition of the computing nodes from the MODE signal set by the jumper caps on the management board;
the multi-way server is an eight-way server, and the partition condition of the computing nodes is one of: single partition, dual partition, or quad partition.
Further, the method further comprises:
obtaining the MS signal on the mainboard, and determining the master-slave relationship of the computing nodes from the MS signal.
Further, determining the main computing node from the partition condition so as to identify the main BMC and the nodes it manages comprises:
in the single-partition condition there is one main computing node, and the BMC of the main computing node manages the error reporting information of the other three computing nodes;
in the dual-partition condition there are two main computing nodes, and the BMC of each main computing node manages the error reporting information of one other computing node;
in the quad-partition condition every computing node is a main computing node, and each BMC manages the error reporting information of its own node.
Further,
the CPU error reporting information comprises: a thermal-trip (overheat shutdown) alarm signal, an operation error signal, a processor-hot signal, and a memory-hot signal.
The beneficial effects of the invention are as follows:
for a multi-node server, the error reporting collection method temporarily stores all of a node's abnormality information in a CPLD register on the mainboard, and the BMC (baseboard management controller) reads the register state in the CPLD to determine exactly where a given abnormality originated; the log can then be read through the BMC to locate the error quickly, which reduces the time and cost of servicing a faulty machine and improves maintenance efficiency.
In addition, the invention has a reliable design principle, a simple structure, and very broad application prospects.
Drawings
To illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings used in their description are briefly introduced below. Those skilled in the art can obviously derive other drawings from these drawings without creative effort.
Fig. 1 is a schematic diagram of the prior-art structure.
FIG. 2 is a schematic flow diagram of a method of one embodiment of the invention.
FIG. 3 is a schematic diagram of the partitioning and node location of the compute nodes of the eight-way server in one embodiment of the invention.
Fig. 4 is a schematic diagram of an error reporting mechanism of an eight-way server according to an embodiment of the present invention.
Detailed Description
In order to make those skilled in the art better understand the technical solution of the present invention, the technical solution in the embodiment of the present invention will be clearly and completely described below with reference to the drawings in the embodiment of the present invention, and it is obvious that the described embodiment is only a part of the embodiment of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description of the present invention, it should be noted that, unless otherwise explicitly specified or limited, the terms "mounted," "coupled," and "connected" are to be construed broadly: a connection may be fixed, removable, or integral; it may be mechanical or electrical; it may be direct or made indirectly through an intervening medium; and it may be an internal connection between two elements. Those of ordinary skill in the art can understand the specific meaning of these terms in the present invention according to the specific situation.
The following explains key terms appearing in the present invention.
CPLD: Complex Programmable Logic Device, short for Complex PLD; a logic device more complex than a simple PLD.
BMC: Baseboard Management Controller.
The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings.
FIG. 2 is a schematic flow diagram of a method of one embodiment of the invention.
As shown in fig. 2, the method 100 includes:
step 110, obtaining the partition condition of the computing nodes of the multi-way server, and determining the main computing node from the partition condition so as to identify the main BMC and the nodes it manages;
step 120, each CPU sends its error reporting information to the CPLD of its node;
step 130, the CPLD stores the error reporting information in an internal register and sends a signal to the main BMC;
step 140, the main BMC collects the internal CPLD register information of the nodes it manages and locates the source node and CPU of the error reporting information.
Optionally, as an embodiment of the present invention, the method further includes:
and storing the error reporting information in a system log of the main BMC.
Optionally, as an embodiment of the present invention, obtaining the partition condition of the computing nodes of the multi-way server includes:
determining the partition condition of the computing nodes from the MODE signal set by the jumper caps on the management board;
the multi-way server is an eight-way server, and the partition condition of the computing nodes is one of: single partition, dual partition, or quad partition.
Optionally, as an embodiment of the present invention, the method further includes:
obtaining the MS signal on the mainboard, and determining the master-slave relationship of the computing nodes from the MS signal.
Optionally, as an embodiment of the present invention, determining the main computing node from the partition condition so as to identify the main BMC and the nodes it manages includes:
in the single-partition condition there is one main computing node, and the BMC of the main computing node manages the error reporting information of the other three computing nodes;
in the dual-partition condition there are two main computing nodes, and the BMC of each main computing node manages the error reporting information of one other computing node;
in the quad-partition condition every computing node is a main computing node, and each BMC manages the error reporting information of its own node.
Optionally, as an embodiment of the present invention,
the CPU error reporting information includes: a thermal-trip (overheat shutdown) alarm signal, an operation error signal, a processor-hot signal, and a memory-hot signal.
In a multi-way server such as an eight-way server, every computing node carries management chips such as a BMC and a CPLD, and the error signals output by each CPU include the THERMTRIP signal, the ERROR[2:0] signals, the PROC_HOT signal, the MEM_HOT_OUT signal, and so on.
Each CPU sends its error signals to the CPLD. When the CPLD receives one, it stores the information in an internal register and then sends an alert signal to tell the BMC that an error has been received. After receiving the alert signal, the main BMC reads the internal CPLD registers of all the nodes it manages, so the BMC of each main computing node knows which CPU on which node reported the error, and it records the abnormality in its log. When the server needs servicing, the BMC log shows quickly which error was reported at which location, so the system can be maintained in a targeted way.
To make the invention easier to understand, the principle of the error reporting collection method is described below using an eight-way server as an example, with reference to Figs. 3 and 4.
Two jumper caps on the management board set the eight-way server to single-partition, dual-partition, or quad-partition mode: MODE[1:0] = 11 indicates single partition, MODE[1:0] = 10 or 01 indicates dual partition, and MODE[1:0] = 00 indicates quad partition. Each mainboard has an MS[1:0] signal routed to the midplane that encodes the position of its computing node: MS[1:0] = 11 is position 0, MS[1:0] = 10 is position 1, MS[1:0] = 01 is position 2, and MS[1:0] = 00 is position 3.
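The two encodings above can be written down as a small lookup. The following is only a sketch in Python: the names `PartitionMode`, `decode_mode`, and `decode_position` are hypothetical and do not come from the patent, and the way firmware would actually sample the MODE and MS pins is not specified in the text.

```python
from enum import Enum


class PartitionMode(Enum):
    """Partition modes of the eight-way server (value = number of partitions)."""
    SINGLE = 1
    DUAL = 2
    QUAD = 4


def decode_mode(mode_bits: int) -> PartitionMode:
    """Map the 2-bit MODE[1:0] jumper value to a partition mode."""
    if mode_bits not in range(4):
        raise ValueError("MODE[1:0] must be a 2-bit value")
    if mode_bits == 0b11:
        return PartitionMode.SINGLE
    if mode_bits == 0b00:
        return PartitionMode.QUAD
    return PartitionMode.DUAL  # 0b10 or 0b01


def decode_position(ms_bits: int) -> int:
    """Map MS[1:0] to the node position: 11 -> 0, 10 -> 1, 01 -> 2, 00 -> 3."""
    if ms_bits not in range(4):
        raise ValueError("MS[1:0] must be a 2-bit value")
    return 3 - ms_bits
```

The position encoding is simply the bitwise inverse of MS[1:0], which is why `3 - ms_bits` reproduces the table.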
In single-partition mode, computing node 0 is the main computing node and its BMC must gather the information of the other three nodes. In dual-partition mode, computing nodes 0 and 2 are both main computing nodes: BMC0 gathers the information of node 1 and BMC2 gathers the information of node 3. In quad-partition mode, every computing node is a main computing node and each BMC gathers the information of its own node. The BMC on each computing node uses the partition-mode signal and the node position to decide whether it sits on a main computing node. If it does, it acts as the main BMC: after receiving an alert signal it accesses the CPLDs of the nodes it manages over SMBus and reads their internal registers. If it does not, it must not access any CPLD over SMBus, which avoids abnormal behavior on the SMBus link.
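The master/slave decision described above can be sketched as follows. This is an illustration only: the function names are hypothetical, the partition mode is passed as the number of partitions (1, 2, or 4), and the sketch assumes that a main BMC also reads the CPLD on its own node, which matches the later statement that the main BMC accesses the CPLD of its node and the CPLDs of the slave nodes.

```python
from typing import List

NODE_COUNT = 4  # four two-way mainboards form the eight-way system


def main_node_of(position: int, partitions: int) -> int:
    """Return the position of the main computing node for the given node."""
    if partitions not in (1, 2, 4):
        raise ValueError("partitions must be 1, 2, or 4")
    if position not in range(NODE_COUNT):
        raise ValueError("position must be 0..3")
    nodes_per_partition = NODE_COUNT // partitions
    return position - position % nodes_per_partition


def managed_nodes(position: int, partitions: int) -> List[int]:
    """Nodes whose CPLD registers the BMC at `position` should read.

    A BMC on a slave node gets an empty list: it must stay off the SMBus.
    """
    if main_node_of(position, partitions) != position:
        return []
    nodes_per_partition = NODE_COUNT // partitions
    return list(range(position, position + nodes_per_partition))
```

For example, in dual-partition mode `managed_nodes(2, 2)` gives nodes 2 and 3, while `managed_nodes(3, 2)` is empty because node 3 is a slave of node 2.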
Each CPLD sets its SMBus address according to the position of its node: 0x20 for node 0, 0x22 for node 1, 0x24 for node 2, and 0x26 for node 3. Because the CPLD's SMBus address is bound to the node position and no two nodes share an address, no address conflict arises when the main BMC accesses its own node's CPLD and the CPLDs of the slave nodes, in any partition mode.
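The four addresses listed above follow a fixed stride of 2 from a base of 0x20, so the mapping can be expressed as a one-line formula. The sketch below uses hypothetical names (`CPLD_BASE_ADDR`, `cpld_smbus_address`); the text does not say whether these are 7-bit or 8-bit addresses, so the values are used exactly as given.

```python
CPLD_BASE_ADDR = 0x20   # address of the CPLD on node 0, as stated in the text
CPLD_ADDR_STRIDE = 2    # spacing between consecutive node addresses


def cpld_smbus_address(position: int) -> int:
    """Return the SMBus address of the CPLD on the node at `position` (0..3)."""
    if position not in range(4):
        raise ValueError("position must be 0..3")
    return CPLD_BASE_ADDR + CPLD_ADDR_STRIDE * position


# Every node position maps to a distinct address, so no two CPLDs collide.
ADDRESSES = [cpld_smbus_address(p) for p in range(4)]
```

Since the address depends only on the node position and not on the partition mode, the same table holds in single-, dual-, and quad-partition configurations.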
Take the THERMTRIP signal of CPU1 in single-partition mode as an example. When the temperature of CPU1 reaches the maximum safe operating temperature, CPU1 sends the CPU1_THERMTRIP_N signal to CPLD1. On receiving it, CPLD1 sets the register bit that represents THERMTRIP for CPU1 and then sends an alert signal to the main BMC, BMC0. After receiving the CPLD_ALERT_N signal, BMC0 sends commands to the CPLDs of all nodes and reads all of their internal registers. The BMC parses what it read, determines which node's CPLD register is set and thereby locates CPU1, then determines which error CPU1 reported, stores the abnormality in the log, and displays it in the UI. The user can thus identify the abnormal node directly from the log and service it specifically, without spending time on healthy nodes.
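The latch, alert, and read-back sequence in this example can be modeled in a few lines of Python. This is a simulation sketch, not firmware: the register layout (one bitmask per CPU, one bit per signal), the assumption of two CPUs per node with local indices 0 and 1, and the names `Cpld`, `latch`, and `collect` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Assumed bit order inside each per-CPU register (not given in the patent).
SIGNALS = ("THERMTRIP", "ERROR0", "ERROR1", "ERROR2", "PROC_HOT", "MEM_HOT")


@dataclass
class Cpld:
    """Model of one node's CPLD: a bitmask register per local CPU plus an alert line."""
    node: int
    regs: Dict[int, int] = field(default_factory=lambda: {0: 0, 1: 0})
    alert: bool = False

    def latch(self, cpu: int, signal: str) -> None:
        """Store an error signal from a CPU and raise the alert to the main BMC."""
        self.regs[cpu] |= 1 << SIGNALS.index(signal)
        self.alert = True


def collect(cplds: List[Cpld]) -> List[Tuple[int, int, str]]:
    """Main BMC side: read every managed CPLD and list (node, cpu, signal) entries."""
    found = []
    for cpld in cplds:
        for cpu, mask in sorted(cpld.regs.items()):
            for bit, name in enumerate(SIGNALS):
                if mask & (1 << bit):
                    found.append((cpld.node, cpu, name))
        cpld.alert = False  # alert is considered serviced once the register is read
    return found


# Single-partition example: a CPU on node 1 trips, and the main BMC locates it.
nodes = [Cpld(n) for n in range(4)]
nodes[1].latch(1, "THERMTRIP")
log_entries = collect(nodes)
```

Each tuple in `log_entries` carries the node, the CPU, and the signal name, which is the information the BMC would write to its log.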
Although the present invention has been described in detail with reference to the drawings and the preferred embodiments, it is not limited to them. Those skilled in the art can make various equivalent modifications or substitutions to the embodiments without departing from the spirit and scope of the present invention, and any change or substitution that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention falls within its scope of protection. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (6)

1. An error reporting collection method for a multi-partition multi-node server, comprising:
obtaining the partition condition of the computing nodes of the multi-way server, and determining the main computing node from the partition condition so as to identify the main BMC and the nodes it manages;
each CPU sends its error reporting information to the CPLD of its node;
the CPLD stores the error reporting information in an internal register and sends a signal to the main BMC;
and the main BMC collects the internal CPLD register information of the nodes it manages and locates the source node and CPU of the error reporting information.
2. The method of claim 1, further comprising:
storing the error reporting information in a system log of the main BMC.
3. The method of claim 1, wherein obtaining the partition condition of the computing nodes of the multi-way server comprises:
determining the partition condition of the computing nodes from the MODE signal set by the jumper caps on the management board;
the multi-way server is an eight-way server, and the partition condition of the computing nodes is one of: single partition, dual partition, or quad partition.
4. The method of claim 1, further comprising:
obtaining the MS signal on the mainboard, and determining the master-slave relationship of the computing nodes from the MS signal.
5. The method of claim 3, wherein determining the main computing node from the partition condition so as to identify the main BMC and the nodes it manages comprises:
in the single-partition condition there is one main computing node, and the BMC of the main computing node manages the error reporting information of the other three computing nodes;
in the dual-partition condition there are two main computing nodes, and the BMC of each main computing node manages the error reporting information of one other computing node;
in the quad-partition condition every computing node is a main computing node, and each BMC manages the error reporting information of its own node.
6. The method of claim 1, wherein
the CPU error reporting information comprises: a thermal-trip (overheat shutdown) alarm signal, an operation error signal, a processor-hot signal, and a memory-hot signal.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110402331.8A CN113110953A (en) 2021-04-14 2021-04-14 Error reporting collection method of multi-partition multi-node server

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110402331.8A CN113110953A (en) 2021-04-14 2021-04-14 Error reporting collection method of multi-partition multi-node server

Publications (1)

Publication Number Publication Date
CN113110953A true CN113110953A (en) 2021-07-13

Family

ID=76717244

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110402331.8A Pending CN113110953A (en) 2021-04-14 2021-04-14 Error reporting collection method of multi-partition multi-node server

Country Status (1)

Country Link
CN (1) CN113110953A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104794033A (en) * 2015-04-29 2015-07-22 浪潮电子信息产业股份有限公司 CPU low-frequency fault positioning method and device based on BMC
US20190286590A1 (en) * 2018-03-14 2019-09-19 Quanta Computer Inc. Cpld cache application in a multi-master topology system
CN111459751A (en) * 2020-03-20 2020-07-28 苏州浪潮智能科技有限公司 High-end server management system
CN112000501A (en) * 2020-08-07 2020-11-27 苏州浪潮智能科技有限公司 Management system for multi-node partition server to access I2C equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210713