CN105357064A

CN105357064A - Node fault recording method of high-end fault tolerance server

Info

Publication number: CN105357064A
Application number: CN201510931667.8A
Authority: CN
Inventors: 黄家明; 乔英良; 李冠广
Original assignee: Shandong Mass Institute Of Information Technology
Current assignee: Shandong Mass Institute Of Information Technology
Priority date: 2015-12-15
Filing date: 2015-12-15
Publication date: 2016-02-24

Abstract

The invention discloses a node fault recording method for a high-end fault-tolerant server. The design process comprises: first, defining the fault management LOG levels that the RMC applies to the nodes inside the rack, where RMC is the Racks Management Controller and LOG is a log file; second, defining the fault management LOG content that the RMC records for the nodes inside the rack; and finally, defining the trigger conditions of the RMC's fault management LOG for the nodes inside the rack. Compared with the prior art, the method enables the RMC to complete rack-level node fault LOG recording and fault management effectively, so that a user can manage the fault information of a high-end fault-tolerant server as conveniently as that of a single server. The method is practical and easy to popularize.

Description

A node fault recording method for a high-end fault-tolerant server

Technical field

The present invention relates to the technical field of computer servers, and specifically to a practical node fault recording method for a high-end fault-tolerant server.

Background technology

As users' computing demands grow, they require ever higher computing performance from a single computer. A high-end fault-tolerant server is a multi-way server. Compared with a traditional server, it has great advantages in computing performance and reliability, and it is increasingly widely used in fields with strict requirements on real-time performance, reliability, and availability. A high-end fault-tolerant server also integrates multiple computing nodes in a single rack. The RMC performs fault management and fault recording for all nodes of the rack in a unified manner, and the number of nodes it manages is large, so node fault management and fault recording pose new challenges for a high-end fault-tolerant server compared with a traditional server.

At present, the specific content of fault management and fault recording for high-end fault-tolerant servers is not clearly defined. If the approach used for traditional servers is followed, the large number of nodes makes it difficult to meet management requirements.

Summary of the invention

The technical task of the present invention is to address the above shortcomings by providing a practical node fault recording method for a high-end fault-tolerant server.

A node fault recording method for a high-end fault-tolerant server, whose specific design process is as follows:

First, define the fault management LOG levels that the RMC applies to the nodes inside the rack, where RMC is the rack management controller (RacksManagementController) and LOG is a log file;

Second, define the fault management LOG content that the RMC records for the nodes inside the rack;

Finally, define the trigger conditions of the RMC's fault management LOG for the nodes inside the rack.

The management LOG levels comprise three parts: information (Info), alarm (warning), and critical (critical). Info covers a node's recovery from a fault state and the user's normal operation information; warning covers changes to a node's asset information, removal of a node, and a node temperature above the warning threshold; critical covers a node failure, a node temperature above the critical alarm threshold, and a problem with a node's link.
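The three levels can be pictured as a small enumeration. The following Python fragment is only an illustrative sketch: the patent names the levels but does not prescribe a representation, so the class name `LogLevel`, the numeric ordering, and the helper `highest_level` are assumptions introduced here.

```python
from enum import IntEnum


class LogLevel(IntEnum):
    """The three fault management LOG levels named in the text.

    The numeric values are an assumption used only to rank severity.
    """
    INFO = 0
    WARNING = 1
    CRITICAL = 2


# Labels as they are spelled in the text.
LEVEL_LABELS = {
    LogLevel.INFO: "Info",
    LogLevel.WARNING: "warning",
    LogLevel.CRITICAL: "critical",
}


def highest_level(levels):
    """Return the most severe level among the given levels (hypothetical helper)."""
    return max(levels)
```

Ranking the levels numerically is a design convenience: it lets a caller summarize the state of a whole rack by its most severe outstanding entry.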

The LOG content comprises: the level of the LOG, the type of the device (NODE), and a specific description of the node fault (EventDescription).
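A record with these three fields might be modeled as follows. This is a minimal sketch, not the patent's own format: the field names, the `|` separator, and the `to_line`/`from_line` helpers are assumptions chosen for illustration, since the text lists only the fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    """One fault management LOG record with the three fields from the text."""
    level: str              # "Info", "warning", or "critical"
    device_type: str        # the text names the device type NODE
    event_description: str  # EventDescription in the text

    def to_line(self) -> str:
        # Hypothetical on-disk layout: three fields joined by " | ".
        return f"{self.level} | {self.device_type} | {self.event_description}"

    @classmethod
    def from_line(cls, line: str) -> "LogEntry":
        # Split at most twice so the description may itself contain "|".
        level, device_type, description = (p.strip() for p in line.split("|", 2))
        return cls(level, device_type, description)
```

For example, `LogEntry("critical", "NODE", "Was fail").to_line()` yields a single text line that `LogEntry.from_line` can parse back into an equal record.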

The trigger condition refers to reading the content of the node fault management log file (LOG) and, based on that content, distinguishing faults of the Info level, the warning level, or the critical level.
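Reading the log and separating entries by level could look like the sketch below. It assumes a hypothetical line layout of `level | device type | description`, which the patent does not specify, and the function name `group_by_level` is likewise invented for this example.

```python
import io
from collections import defaultdict

# The three level names used in the text.
LEVELS = ("Info", "warning", "critical")


def group_by_level(log_file):
    """Read LOG lines and group their descriptions by level.

    Each non-empty line is assumed to look like
    "level | device type | description" (a hypothetical layout).
    """
    groups = defaultdict(list)
    for raw in log_file:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        level, _device, description = (p.strip() for p in line.split("|", 2))
        if level not in LEVELS:
            raise ValueError(f"unknown LOG level: {level!r}")
        groups[level].append(description)
    return dict(groups)


# Usage with an in-memory stand-in for the log file.
sample = io.StringIO(
    "Info | NODE | Was power on\n"
    "critical | NODE | Was fail\n"
    "\n"
    "Info | NODE | Was OK\n"
)
grouped = group_by_level(sample)
```

Any open text file can be passed in place of the `io.StringIO` object, because the function only iterates over lines.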

The log trigger events of the Info level comprise: a node recovers from a fault, a node's BMC address changes, a node is inserted, a node's BMC is restarted, a node is powered on, and a node is powered off.

The log trigger events of the warning level comprise: a node's memory capacity changes, a node's SSD capacity changes, a node's hard disk capacity changes, a node is removed, a node is not in place, a node's rack position changes, a node's ID changes, and a node's temperature exceeds the warning threshold.

The log trigger events of the critical level comprise: a node fails, a node's communication link fails, and a node's temperature exceeds the critical threshold.
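The three event lists above amount to a lookup table from event to level. The sketch below shows one way to encode it; the short event keys such as `node_removed` are hypothetical names introduced for this example and do not appear in the patent.

```python
# Trigger events grouped by level, paraphrased from the three lists above.
# The event keys are hypothetical identifiers chosen for this sketch.
TRIGGER_EVENTS = {
    "Info": [
        "node_recovered", "bmc_address_changed", "node_inserted",
        "bmc_restarted", "node_powered_on", "node_powered_off",
    ],
    "warning": [
        "memory_capacity_changed", "ssd_capacity_changed",
        "hdd_capacity_changed", "node_removed", "node_absent",
        "rack_position_changed", "node_id_changed",
        "temperature_over_warning_threshold",
    ],
    "critical": [
        "node_failed", "communication_link_failed",
        "temperature_over_critical_threshold",
    ],
}

# Reverse index: event key -> level.
LEVEL_OF_EVENT = {
    event: level
    for level, events in TRIGGER_EVENTS.items()
    for event in events
}


def level_for(event):
    """Return the LOG level for a trigger event; raises KeyError if unknown."""
    return LEVEL_OF_EVENT[event]
```

Keeping the table as data rather than as branching code makes it easy to check that every event belongs to exactly one level.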

The node fault recording method for a high-end fault-tolerant server according to the present invention has the following advantages:

The method provides a way to record node faults that suits the requirements of a high-end fault-tolerant server. The RMC of the high-end fault-tolerant server takes charge of fault management and fault LOG recording for all computing nodes, so rack-level node fault LOG recording and fault management can be completed effectively. A user can therefore manage the fault information of the high-end fault-tolerant server as conveniently as that of a single server. The method is practical and easy to popularize.

Embodiment

The invention is further described below with reference to a specific embodiment.

The invention provides a node fault recording method for a high-end fault-tolerant server, covering both the method of recording node faults and the main content of the fault records. It targets the centralized and hierarchical management of nodes in a high-end fault-tolerant server, where devices such as computing nodes are numerous and recording node faults is therefore relatively complex for the RMC.

With this method, the RMC of the high-end fault-tolerant server takes charge of fault management and fault LOG recording for all computing nodes, so rack-level node fault LOG recording and fault management can be completed effectively, and a user can manage the fault information of the high-end fault-tolerant server as conveniently as that of a single server.

Its specific design process is:

The log trigger events of the Info level comprise:

Was OK: the node recovers from a fault;

BMC IP Mode change to Static: the node's BMC address changes;

Was added: the node is inserted;

BMC reset: the node's BMC is restarted;

Was power on: the node is powered on;

Was power off: the node is powered off.

The log trigger events of the warning level comprise:

Memory capacity was changed to xxx: the node's memory capacity changes;

SSD disk capacity was changed to xxx: the node's SSD capacity changes;

HDD disk capacity was changed to xxx: the node's hard disk capacity changes;

Was removed: the node is removed;

Was absent: the node is not in place;

Was from xxxx (location ID) to xxxx: the node's rack position changes;

ID from xxx to yyy: the node's ID changes;

Ambient temperature was xxx over xxx degree.: the node's temperature exceeds the warning threshold.

The log trigger events of the critical level comprise:

Was fail: the node fails;

Communication was fail: the node's communication link fails;

Ambient temperature was xxx over xxx degree: the node's temperature exceeds the critical threshold.
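The temperature message is used at both the warning level and the critical level, differing only in which threshold is exceeded. The sketch below shows how such an entry might be produced. It assumes the message reads "Ambient temperature was xxx over xxx degree" with the two xxx placeholders standing for the measured temperature and the exceeded threshold. The function name, the parameter names, and the strict greater-than comparison are assumptions, and the patent gives no threshold values.

```python
def temperature_event(temperature, warning_threshold, critical_threshold):
    """Return (level, description) for a node temperature reading, or None.

    Assumes warning_threshold <= critical_threshold. The critical threshold
    is checked first so that a reading above both is reported once, at the
    more severe level.
    """
    if temperature > critical_threshold:
        level, threshold = "critical", critical_threshold
    elif temperature > warning_threshold:
        level, threshold = "warning", warning_threshold
    else:
        return None  # within limits: no LOG entry is triggered
    description = f"Ambient temperature was {temperature} over {threshold} degree"
    return level, description
```

With hypothetical thresholds of 70 and 80, a reading of 75 produces a warning entry, a reading of 85 produces a critical entry, and a reading of 60 produces nothing.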

The above embodiment is only a specific example of the present invention. The scope of patent protection of the invention includes but is not limited to the above embodiment; any suitable change or substitution that conforms to the claims of this node fault recording method for a high-end fault-tolerant server and is made by a person of ordinary skill in the art shall fall within the scope of patent protection of the invention.

Claims

1. A node fault recording method for a high-end fault-tolerant server, characterized in that its specific design process is: first, define the fault management LOG levels that the RMC applies to the nodes inside the rack, where RMC is the rack management controller (RacksManagementController) and LOG is a log file;

2. The node fault recording method for a high-end fault-tolerant server according to claim 1, characterized in that the management LOG levels comprise three parts: information (Info), alarm (warning), and critical (critical), wherein Info covers a node's recovery from a fault state and the user's normal operation information; warning covers changes to a node's asset information, removal of a node, and a node temperature above the warning threshold; and critical covers a node failure, a node temperature above the critical alarm threshold, and a problem with a node's link.

3. The node fault recording method for a high-end fault-tolerant server according to claim 1, characterized in that the LOG content comprises: the level of the LOG, the type of the device (NODE), and a specific description of the node fault (EventDescription).

4. The node fault recording method for a high-end fault-tolerant server according to claim 2, characterized in that the trigger condition refers to reading the content of the node fault management log file (LOG) and, based on that content, distinguishing faults of the Info level, the warning level, or the critical level.

5. The node fault recording method for a high-end fault-tolerant server according to claim 4, characterized in that the log trigger events of the Info level comprise: a node recovers from a fault, a node's BMC address changes, a node is inserted, a node's BMC is restarted, a node is powered on, and a node is powered off.

6. The node fault recording method for a high-end fault-tolerant server according to claim 4, characterized in that the log trigger events of the warning level comprise: a node's memory capacity changes, a node's SSD capacity changes, a node's hard disk capacity changes, a node is removed, a node is not in place, a node's rack position changes, a node's ID changes, and a node's temperature exceeds the warning threshold.

7. The node fault recording method for a high-end fault-tolerant server according to claim 4, characterized in that the log trigger events of the critical level comprise: a node fails, a node's communication link fails, and a node's temperature exceeds the critical threshold.