CN105357064A - Node fault recording method of high-end fault tolerance server - Google Patents

Info

Publication number
CN105357064A
Authority
CN
China
Prior art keywords
node
fault
log
critical
rmc
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510931667.8A
Other languages
Chinese (zh)
Inventor
黄家明
乔英良
李冠广
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Mass Institute Of Information Technology
Original Assignee
Shandong Mass Institute Of Information Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Mass Institute Of Information Technology
Priority to CN201510931667.8A
Publication of CN105357064A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/069 Management of faults, events, alarms or notifications using logs of notifications; Post-processing of notifications
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/22 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G06F11/2205 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using arrangements specific to the hardware being tested
    • G06F11/2236 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using arrangements specific to the hardware being tested to test CPU or processors
    • G06F11/2242 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using arrangements specific to the hardware being tested to test CPU or processors in multi-processor systems, e.g. one processor becoming the test master
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery

Abstract

The invention discloses a node fault recording method for a high-end fault-tolerant server. The design process comprises three steps: first, defining the fault management LOG levels used by the RMC for the nodes inside the rack, where RMC stands for Racks Management Controller and LOG denotes a log file; second, defining the fault management LOG content recorded by the RMC for those nodes; and finally, defining the conditions that trigger the RMC's fault management LOG for those nodes. Compared with the prior art, the method allows the RMC to effectively complete rack-level node fault LOG recording and fault management, so that a user can manage the fault information of a high-end fault-tolerant server as conveniently as that of a single server. The method is practical and easy to popularize.

Description

A node fault recording method for a high-end fault-tolerant server
Technical field
The present invention relates to the technical field of computer servers, and specifically to a practical node fault recording method for a high-end fault-tolerant server.
Background art
As users' computing demands grow, they require ever higher computing performance from a single computer. A high-end fault-tolerant server is a multi-way server. Compared with a traditional server, it has significant advantages in computing performance and reliability, and it is increasingly widely used in fields with strict requirements on real-time behavior, reliability, and availability. A high-end fault-tolerant server also gathers multiple computing nodes in a single cabinet. Fault management and fault recording for the nodes of the whole cabinet are carried out uniformly by the RMC. Because the RMC manages a large number of nodes, node fault management and fault recording for a high-end fault-tolerant server pose new challenges compared with a traditional server.
At present, the specific content of fault management and fault recording for high-end fault-tolerant servers is not clearly defined. If the approach used for traditional servers is followed, the large number of nodes makes it difficult to meet management requirements.
Summary of the invention
The technical task of the present invention is to address the above shortcomings by providing a practical node fault recording method for a high-end fault-tolerant server.
A node fault recording method for a high-end fault-tolerant server, whose specific design process is as follows:
First, define the LOG levels used by the RMC for fault management of the nodes inside the cabinet. Here, RMC is the cabinet management controller (Racks Management Controller) and LOG is a log file;
Second, define the LOG content recorded by the RMC for fault management of the nodes inside the cabinet;
Finally, define the trigger conditions of the RMC's fault management LOG for the nodes inside the cabinet.
The management LOG levels comprise three levels: information (Info), alarm (warning), and critical. Info covers the recovery of a node from a fault state and normal user operations. Warning covers changes to a node's asset information, removal of a node, and node temperature exceeding the warning threshold. Critical covers a node fault, node temperature exceeding the critical alarm threshold, and a problem with the node link.
The LOG content comprises: the level of the LOG, the device type (NODE), and a specific description of the node fault (EventDescription).
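The three LOG fields listed above can be pictured as a small record type. The following Python sketch is illustrative only: the class names, the `node_id` field, and the single-line text layout are assumptions, since the source names the fields but does not define a concrete format.

```python
from dataclasses import dataclass
from enum import Enum


class LogLevel(Enum):
    """The three LOG levels named in the text."""
    INFO = "Info"
    WARNING = "warning"
    CRITICAL = "critical"


@dataclass(frozen=True)
class NodeLogEntry:
    """One RMC log record: level, device type, and event description."""
    level: LogLevel
    event_description: str     # the EventDescription field from the text
    node_id: int               # hypothetical: which node in the cabinet
    device_type: str = "NODE"  # the device type named in the text

    def to_line(self) -> str:
        # Hypothetical one-line layout; the source does not specify one.
        return (f"[{self.level.value}] {self.device_type} "
                f"{self.node_id}: {self.event_description}")


entry = NodeLogEntry(LogLevel.CRITICAL, "Communication was fail", node_id=3)
print(entry.to_line())
```

Keeping the level as an enumeration rather than a free string means a record cannot carry a level outside the three that are defined.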
The trigger condition means that the content of the node fault management LOG file is read and, based on that content, each fault is identified as an Info-level, warning-level, or critical-level fault.
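Reading the LOG content and sorting faults by level might look like the sketch below. It assumes a hypothetical pipe-separated line layout (`<level>|<device type>|<EventDescription>`); the source does not define the file format, so the layout and the function name are placeholders.

```python
from collections import defaultdict

KNOWN_LEVELS = ("Info", "warning", "critical")


def group_by_level(log_text: str) -> dict:
    """Read LOG content and file each record under its level."""
    grouped = defaultdict(list)
    for raw in log_text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        # Assumed layout: "<level>|<device type>|<EventDescription>"
        level, sep, rest = line.partition("|")
        if sep and level in KNOWN_LEVELS:
            grouped[level].append(rest)
        else:
            # Keep lines that do not match the assumed layout for inspection.
            grouped["unrecognized"].append(line)
    return dict(grouped)


sample = """\
Info|NODE|Was power on
critical|NODE|Was fail
warning|NODE|Was removed
garbage line
"""
result = group_by_level(sample)
print(result)
```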
The log trigger events of the Info level comprise: the node recovers from a fault, the BMC address of the node changes, the node is inserted, the node BMC is restarted, the node powers on, and the node powers off.
The log trigger events of the warning level comprise: the node memory capacity changes, the SSD capacity of the node changes, the node hard disk capacity changes, the node is removed, the node is absent, the node rack position changes, the node ID changes, and the node temperature exceeds the warning threshold.
The log trigger events of the critical level comprise: the node fails, the node communication link fails, and the node temperature exceeds the critical threshold.
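The assignment of trigger events to levels described above amounts to a lookup table. The sketch below encodes the seventeen listed events; the enumeration member names and the table layout are illustrative choices, not identifiers taken from the source.

```python
from enum import Enum


class NodeEvent(Enum):
    # Info-level events
    RECOVERED = "node recovers from a fault"
    BMC_ADDRESS_CHANGED = "BMC address of the node changes"
    INSERTED = "node is inserted"
    BMC_RESTARTED = "node BMC is restarted"
    POWERED_ON = "node powers on"
    POWERED_OFF = "node powers off"
    # warning-level events
    MEMORY_CHANGED = "node memory capacity changes"
    SSD_CHANGED = "SSD capacity of the node changes"
    HDD_CHANGED = "node hard disk capacity changes"
    REMOVED = "node is removed"
    ABSENT = "node is absent"
    RACK_POSITION_CHANGED = "node rack position changes"
    ID_CHANGED = "node ID changes"
    TEMP_OVER_WARNING = "node temperature exceeds the warning threshold"
    # critical-level events
    FAILED = "node fails"
    LINK_FAILED = "node communication link fails"
    TEMP_OVER_CRITICAL = "node temperature exceeds the critical threshold"


LEVEL_EVENTS = {
    "Info": [NodeEvent.RECOVERED, NodeEvent.BMC_ADDRESS_CHANGED,
             NodeEvent.INSERTED, NodeEvent.BMC_RESTARTED,
             NodeEvent.POWERED_ON, NodeEvent.POWERED_OFF],
    "warning": [NodeEvent.MEMORY_CHANGED, NodeEvent.SSD_CHANGED,
                NodeEvent.HDD_CHANGED, NodeEvent.REMOVED, NodeEvent.ABSENT,
                NodeEvent.RACK_POSITION_CHANGED, NodeEvent.ID_CHANGED,
                NodeEvent.TEMP_OVER_WARNING],
    "critical": [NodeEvent.FAILED, NodeEvent.LINK_FAILED,
                 NodeEvent.TEMP_OVER_CRITICAL],
}

# Invert the table so each event maps directly to its level.
LEVEL_OF = {event: level
            for level, events in LEVEL_EVENTS.items()
            for event in events}

print(LEVEL_OF[NodeEvent.REMOVED])
```

An over-temperature condition appears twice on purpose: the same measurement is a warning or a critical event depending on which threshold it crosses.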
The node fault recording method for a high-end fault-tolerant server according to the present invention has the following advantages:
The method provides a way to record node faults of a high-end fault-tolerant server and suits the requirements of such servers. The RMC of the high-end fault-tolerant server performs fault management and fault LOG recording for all computing nodes, and can effectively complete cabinet-level node fault LOG recording and fault management, so that a user can manage the fault information of a high-end fault-tolerant server as conveniently as that of a single server. The method is practical and easy to popularize.
Embodiment
The invention is further described below with reference to a specific embodiment.
The invention provides a node fault recording method for a high-end fault-tolerant server, covering both the method of recording node faults and the main content of the fault records. The method targets the centralized and hierarchical management of the nodes of a high-end fault-tolerant server, where computing nodes and other devices are numerous and the RMC's recording of node faults is therefore relatively complex.
With this method, the RMC of the high-end fault-tolerant server performs fault management and fault LOG recording for all computing nodes, and can effectively complete cabinet-level node fault LOG recording and fault management, so that a user can manage the fault information of a high-end fault-tolerant server as conveniently as that of a single server.
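One way to picture the "manage it like a single server" goal is a single cabinet-level log merged from per-node records. The sketch below is an assumption about how such a view could be built, not the mechanism the patent describes: the tuple layout `(timestamp, node_id, level, description)` and both function names are hypothetical, and each node's records are assumed to be already in time order.

```python
import heapq


def merge_node_logs(per_node_logs: dict) -> list:
    """Merge time-ordered per-node records into one cabinet-level list."""
    # heapq.merge keeps the overall order as long as every input is sorted.
    return list(heapq.merge(*per_node_logs.values()))


def at_level(records: list, level: str) -> list:
    """Return only the records logged at the given level."""
    return [record for record in records if record[2] == level]


node_logs = {
    1: [(10, 1, "Info", "Was power on"), (30, 1, "critical", "Was fail")],
    2: [(20, 2, "warning", "Was removed")],
}
cabinet_log = merge_node_logs(node_logs)
print(cabinet_log)
print(at_level(cabinet_log, "critical"))
```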
Its specific design process is as follows:
First, define the LOG levels used by the RMC for fault management of the nodes inside the cabinet. Here, RMC is the cabinet management controller (Racks Management Controller) and LOG is a log file;
Second, define the LOG content recorded by the RMC for fault management of the nodes inside the cabinet;
Finally, define the trigger conditions of the RMC's fault management LOG for the nodes inside the cabinet.
The management LOG levels comprise three levels: information (Info), alarm (warning), and critical. Info covers the recovery of a node from a fault state and normal user operations. Warning covers changes to a node's asset information, removal of a node, and node temperature exceeding the warning threshold. Critical covers a node fault, node temperature exceeding the critical alarm threshold, and a problem with the node link.
The LOG content comprises: the level of the LOG, the device type (NODE), and a specific description of the node fault (EventDescription).
The trigger condition means that the content of the node fault management LOG file is read and, based on that content, each fault is identified as an Info-level, warning-level, or critical-level fault.
The log trigger events of the Info level comprise:
Was OK: the node recovers from a fault;
BMC IP Mode change to Static: the BMC address of the node changes;
Was added: the node is inserted;
BMC reset: the node BMC is restarted;
Was power on: the node powers on;
Was power off: the node powers off.
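Several of the Info-level events above correspond to a change between two observed node states. The sketch below derives them by comparing a "before" and an "after" snapshot. The `NodeState` fields are hypothetical, the message strings are the event descriptions written with word spacing (an assumption about the intended form), and the two BMC events are left out because they depend on BMC data not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeState:
    """Hypothetical snapshot of what the RMC observes for one node."""
    present: bool
    powered: bool
    healthy: bool


def info_events(before: NodeState, after: NodeState) -> list:
    """Return Info-level event descriptions for a state transition."""
    events = []
    if not before.present and after.present:
        events.append("Was added")
    if not before.healthy and after.healthy:
        events.append("Was OK")
    if not before.powered and after.powered:
        events.append("Was power on")
    if before.powered and not after.powered:
        events.append("Was power off")
    return events


old = NodeState(present=True, powered=False, healthy=True)
new = NodeState(present=True, powered=True, healthy=True)
print(info_events(old, new))
```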
The log trigger events of the warning level comprise:
Memory capacity was changed to xxx: the node memory capacity changes;
SSD disk capacity was changed to xxx: the SSD capacity of the node changes;
HDD disk capacity was changed to xxx: the node hard disk capacity changes;
Was removed: the node is removed;
Was absent: the node is absent;
Was from xxxx (locationID) to xxxx: the node rack position changes;
ID from xxx to yyy: the node ID changes;
Ambient temperature was xxx over xxx degree: the node temperature exceeds the warning threshold.
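The capacity-change warnings above can be produced by comparing two inventory snapshots of a node. The following sketch assumes snapshots are plain dictionaries with hypothetical keys (`memory_gb`, `ssd_gb`, `hdd_gb`); the message wording follows the event descriptions with word spacing restored, and the `xxx` placeholder is filled with the new value.

```python
# Hypothetical inventory keys mapped to the labels used in the messages.
ASSET_LABELS = {
    "memory_gb": "Memory capacity",
    "ssd_gb": "SSD disk capacity",
    "hdd_gb": "HDD disk capacity",
}


def asset_change_events(previous: dict, current: dict) -> list:
    """Return one warning description per asset whose capacity changed."""
    events = []
    for key, label in ASSET_LABELS.items():
        # Only compare assets reported in both snapshots.
        if key in previous and key in current and previous[key] != current[key]:
            events.append(f"{label} was changed to {current[key]}")
    return events


before = {"memory_gb": 512, "ssd_gb": 800, "hdd_gb": 4000}
after = {"memory_gb": 256, "ssd_gb": 800, "hdd_gb": 4000}
print(asset_change_events(before, after))
```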
The log trigger events of the critical level comprise:
Was fail: the node fails;
Communication was fail: the node communication link fails;
Ambient temperature was xxx over xxx degree: the node temperature exceeds the critical threshold.
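The same temperature message is used at both the warning and the critical level, so the level depends only on which threshold is exceeded. A minimal sketch of that decision follows; the threshold values are placeholders, since the source gives none, and the function name is hypothetical.

```python
from typing import Optional, Tuple

WARNING_THRESHOLD_C = 35.0   # hypothetical value, not from the source
CRITICAL_THRESHOLD_C = 45.0  # hypothetical value, not from the source


def temperature_event(temp_c: float) -> Optional[Tuple[str, str]]:
    """Return (level, description) if a threshold is exceeded, else None."""
    # Check the critical threshold first so a reading above both
    # thresholds is reported once, at the higher level.
    if temp_c > CRITICAL_THRESHOLD_C:
        level, threshold = "critical", CRITICAL_THRESHOLD_C
    elif temp_c > WARNING_THRESHOLD_C:
        level, threshold = "warning", WARNING_THRESHOLD_C
    else:
        return None
    description = f"Ambient temperature was {temp_c:g} over {threshold:g} degree"
    return level, description


for reading in (30, 40, 47.5):
    print(reading, temperature_event(reading))
```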
The above embodiment is only a specific example of the present invention. The scope of patent protection of the present invention includes but is not limited to the above embodiment. Any suitable change or replacement that conforms to the claims of the node fault recording method for a high-end fault-tolerant server of the present invention and is made by a person of ordinary skill in the art shall fall within the scope of patent protection of the present invention.

Claims (7)

1. A node fault recording method for a high-end fault-tolerant server, characterized in that its specific design process is: first, define the LOG levels used by the RMC for fault management of the nodes inside the cabinet, where RMC is the cabinet management controller (Racks Management Controller) and LOG is a log file;
Second, define the LOG content recorded by the RMC for fault management of the nodes inside the cabinet;
Finally, define the trigger conditions of the RMC's fault management LOG for the nodes inside the cabinet.
2. The node fault recording method for a high-end fault-tolerant server according to claim 1, characterized in that the management LOG levels comprise three levels: information (Info), alarm (warning), and critical, wherein Info covers the recovery of a node from a fault state and normal user operations; warning covers changes to a node's asset information, removal of a node, and node temperature exceeding the warning threshold; and critical covers a node fault, node temperature exceeding the critical alarm threshold, and a problem with the node link.
3. The node fault recording method for a high-end fault-tolerant server according to claim 1, characterized in that the LOG content comprises: the level of the LOG, the device type (NODE), and a specific description of the node fault (EventDescription).
4. The node fault recording method for a high-end fault-tolerant server according to claim 2, characterized in that the trigger condition means that the content of the node fault management LOG file is read and, based on that content, each fault is identified as an Info-level, warning-level, or critical-level fault.
5. The node fault recording method for a high-end fault-tolerant server according to claim 4, characterized in that the log trigger events of the Info level comprise: the node recovers from a fault, the BMC address of the node changes, the node is inserted, the node BMC is restarted, the node powers on, and the node powers off.
6. The node fault recording method for a high-end fault-tolerant server according to claim 4, characterized in that the log trigger events of the warning level comprise: the node memory capacity changes, the SSD capacity of the node changes, the node hard disk capacity changes, the node is removed, the node is absent, the node rack position changes, the node ID changes, and the node temperature exceeds the warning threshold.
7. The node fault recording method for a high-end fault-tolerant server according to claim 4, characterized in that the log trigger events of the critical level comprise: the node fails, the node communication link fails, and the node temperature exceeds the critical threshold.
CN201510931667.8A 2015-12-15 2015-12-15 Node fault recording method of high-end fault tolerance server Pending CN105357064A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510931667.8A CN105357064A (en) 2015-12-15 2015-12-15 Node fault recording method of high-end fault tolerance server

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510931667.8A CN105357064A (en) 2015-12-15 2015-12-15 Node fault recording method of high-end fault tolerance server

Publications (1)

Publication Number Publication Date
CN105357064A (en) 2016-02-24

Family

ID=55332940

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510931667.8A Pending CN105357064A (en) 2015-12-15 2015-12-15 Node fault recording method of high-end fault tolerance server

Country Status (1)

Country Link
CN (1) CN105357064A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108616428A (en) * 2018-05-14 2018-10-02 郑州云海信息技术有限公司 A kind of mobile APP implementations of remote management RACK computer rooms
CN111581002A (en) * 2020-04-29 2020-08-25 上海中通吉网络技术有限公司 Automatic fault reporting method, device and equipment for server fault

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7296172B2 (en) * 2004-08-24 2007-11-13 Inventec Corporation Power control and management method for uninterruptible power system and servers
CN104317714A (en) * 2014-10-29 2015-01-28 浪潮电子信息产业股份有限公司 Method for automatically testing stability of rack based on expect
CN104378218A (en) * 2013-08-12 2015-02-25 鸿富锦精密工业(深圳)有限公司 System and method for managing servers in cabinet
CN104809041A (en) * 2015-05-07 2015-07-29 浪潮电子信息产业股份有限公司 Batch test method of whole cabinet server power supply

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
天蝎项目组: ""整机柜服务器解决方案技术规范"", 《百度文库》 *
文档助手1: "历史版本1:log输出级别", 《FINEREPORT帮助文档》 *

Similar Documents

Publication Publication Date Title
CN106557145A (en) Circuit breaking protective system and its method
CN104461747B (en) A kind of distributed task dispatching system
GB201306798D0 (en) Storage management in clustered data processing systems
CN105373899A (en) Server asset management method and apparatus
CN102857371B (en) A kind of dynamic allocation management method towards group system
CN107656705B (en) Computer storage medium and data migration method, device and system
CN103257908A (en) Software and hardware cooperative multi-controller disk array designing method
US9208039B2 (en) System and method for detecting server removal from a cluster to enable fast failover of storage
US9372756B2 (en) Recovery of operational state values for complex event processing based on a time window defined by an event query
TW201635142A (en) Fault tolerant method and system for multiple servers
CN104346264A (en) System and method for processing system event logs
CN112118130B (en) Self-adaptive distributed cache active-standby state information switching method and device
CN104484131A (en) Device and corresponding method for processing data of multi-disk servers
CN107229537A (en) A kind of database real time backup method
CN105357064A (en) Node fault recording method of high-end fault tolerance server
CN106201772A (en) The backup of a kind of operating system based on data center, restoration methods and device
CN102999399A (en) Method and device of automatically restoring storage of JBOD (just bundle of disks) array
CN109474470A (en) One kind is from monitoring method and device
CN106126368A (en) A kind of method of memory failure address resolution under LINUX
CN103309764A (en) Method and device for protection of fault-tolerant mechanism of virtual machine
CN103500140A (en) Method for rapidly learning invalidation of distributed cluster nodes
CN107943615B (en) Data processing method and system based on distributed cluster
CN105323271B (en) Cloud computing system and processing method and device thereof
CN109271270A (en) The troubleshooting methodology, system and relevant apparatus of bottom hardware in storage system
US20140052807A1 (en) Server and method for controlling sharing of fans

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20160224