CN105357064A - Node fault recording method of high-end fault tolerance server - Google Patents
- Publication number
- CN105357064A
- Authority
- CN
- China
- Prior art keywords
- node
- fault
- log
- critical
- rmc
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/069—Management of faults, events, alarms or notifications using logs of notifications; Post-processing of notifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/22—Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
- G06F11/2205—Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using arrangements specific to the hardware being tested
- G06F11/2236—Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using arrangements specific to the hardware being tested to test CPU or processors
- G06F11/2242—Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using arrangements specific to the hardware being tested to test CPU or processors in multi-processor systems, e.g. one processor becoming the test master
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0654—Management of faults, events, alarms or notifications using network fault recovery
Abstract
The invention discloses a node fault recording method for a high-end fault-tolerant server. The design process comprises three steps: first, defining the fault management LOG levels that the RMC applies to the nodes inside a rack, where RMC is the rack management controller (RacksManagementController) and LOG is a log file; second, defining the fault management LOG content that the RMC records for the nodes inside the rack; and finally, defining the conditions that trigger the RMC's fault management LOG for the nodes inside the rack. Compared with the prior art, the method allows the RMC to complete rack-level node fault LOG recording and fault management effectively, so that a user can manage the fault information of the high-end fault-tolerant server as conveniently as that of a single server. The method is practical and easy to popularize.
Description
Technical field
The present invention relates to the technical field of computer servers, and specifically to a practical node fault recording method for a high-end fault-tolerant server.
Background technology
As users' computing demands grow, they require ever higher computing performance from a single computer. A high-end fault-tolerant server is a multi-way server. Compared with a traditional server, it has a considerable advantage in computing performance and reliability, and it is increasingly widely used in fields with stringent requirements for real-time operation, reliability and availability. A high-end fault-tolerant server also gathers multiple computing nodes in one rack. Fault management and fault recording for the nodes of the whole rack are carried out centrally by the RMC, and the RMC manages a large number of nodes, so fault management and fault recording for the nodes of a high-end fault-tolerant server pose new challenges compared with a traditional server.
At present, the specific content of fault management and fault recording for high-end fault-tolerant servers is not clearly defined. If it is handled in the same way as on a traditional server, the large number of nodes makes the management requirements difficult to meet.
Summary of the invention
The technical task of the present invention is to address the above shortcomings by providing a practical node fault recording method for a high-end fault-tolerant server.
A node fault recording method for a high-end fault-tolerant server, whose specific design process is as follows:
First, define the fault management LOG levels that the RMC applies to the nodes inside the rack, where RMC is the rack management controller (RacksManagementController) and LOG is a log file;
Second, define the fault management LOG content that the RMC records for the nodes inside the rack;
Finally, define the trigger conditions of the RMC's fault management LOG for the nodes inside the rack.
The management LOG levels comprise three parts: information (Info), alarm (warning) and critical (critical). Info covers the recovery of a node from a fault state and the user's normal operations; warning covers a change in a node's asset information, the removal of a node, and a node temperature above the warning threshold; critical covers a node fault, a node temperature above the critical alarm threshold, and a problem with a node link.
The LOG content comprises: the level of the LOG, the device type (NODE), and a specific description of the node fault (EventDescription).
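The three fields of the LOG content can be pictured as a small record type. The following is only a minimal sketch under stated assumptions: the class names `Level` and `LogRecord`, the method `to_line`, and the " | " separator are hypothetical and do not come from the patent, which names only the three fields and the three levels.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    """The three LOG levels named in the text."""
    INFO = "Info"
    WARNING = "warning"
    CRITICAL = "critical"


@dataclass(frozen=True)
class LogRecord:
    """One LOG entry holding the three fields listed in the text."""
    level: Level             # level of the LOG
    device_type: str         # device type, "NODE" in the text
    event_description: str   # EventDescription

    def to_line(self) -> str:
        # Hypothetical one-line layout; the patent does not define a format.
        return f"{self.level.value} | {self.device_type} | {self.event_description}"


record = LogRecord(Level.CRITICAL, "NODE", "node communication link failed")
print(record.to_line())
```

Keeping the level as an enumeration rather than a free string means a record cannot be created with a level outside the three that the method defines.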
The trigger condition works by reading the content of the node fault management LOG file and distinguishing, according to that content, faults of the Info level, faults of the warning level, and faults of the critical level.
The log trigger events of the Info level comprise: the node recovers from a fault, the BMC address of the node changes, the node is inserted, the node's BMC is restarted, the node is powered on, and the node is shut down.
The log trigger events of the warning level comprise: the node's memory capacity changes, the node's SSD capacity changes, the node's hard-disk capacity changes, the node is removed, the node is not in place, the node's rack position changes, the node's ID changes, and the node temperature exceeds the warning threshold.
The log trigger events of the critical level comprise: the node breaks down, the node's communication link breaks down, and the node temperature exceeds the critical threshold.
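The assignment of trigger events to levels amounts to a lookup table. The sketch below is illustrative only: the event keys such as `node_recovered` and the function `level_for` are hypothetical names chosen here, while the grouping into Info, warning and critical follows the three lists in the text.

```python
# Hypothetical event keys; the level of each follows the lists in the text.
EVENT_LEVELS = {
    # Info level
    "node_recovered": "Info",
    "bmc_address_changed": "Info",
    "node_inserted": "Info",
    "bmc_restarted": "Info",
    "node_powered_on": "Info",
    "node_powered_off": "Info",
    # warning level
    "memory_capacity_changed": "warning",
    "ssd_capacity_changed": "warning",
    "hdd_capacity_changed": "warning",
    "node_removed": "warning",
    "node_absent": "warning",
    "rack_position_changed": "warning",
    "node_id_changed": "warning",
    "temperature_over_warning_threshold": "warning",
    # critical level
    "node_failed": "critical",
    "communication_link_failed": "critical",
    "temperature_over_critical_threshold": "critical",
}


def level_for(event: str) -> str:
    """Return the LOG level for a trigger event, rejecting unknown events."""
    try:
        return EVENT_LEVELS[event]
    except KeyError:
        raise ValueError(f"unknown trigger event: {event!r}") from None


print(level_for("node_removed"))
```

Rejecting unknown events, rather than defaulting them to some level, is a design choice of this sketch; the text does not say how unlisted events should be treated.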
The node fault recording method for a high-end fault-tolerant server according to the present invention has the following advantages:
The method provides a way to record node faults of a high-end fault-tolerant server and suits the requirements of such servers. The RMC of the high-end fault-tolerant server takes on the fault management and fault LOG recording for all computing nodes and can effectively complete rack-level node fault LOG recording and fault management, so that a user manages the fault information of the high-end fault-tolerant server as conveniently as that of a single server. The method is practical and easy to promote.
Embodiment
The invention is further described below with reference to a specific embodiment.
The invention provides a node fault recording method for a high-end fault-tolerant server, covering both the method of recording node faults and the main content of the fault records. It mainly addresses the centralized, hierarchical management of the nodes of a high-end fault-tolerant server, in which devices such as computing nodes are numerous and the RMC's recording of node faults is therefore relatively complex.
With this method, the RMC of the high-end fault-tolerant server takes on the fault management and fault LOG recording for all computing nodes and can effectively complete rack-level node fault LOG recording and fault management, so that a user manages the fault information of the high-end fault-tolerant server as conveniently as that of a single server.
Its specific design process is as follows:
First, define the fault management LOG levels that the RMC applies to the nodes inside the rack, where RMC is the rack management controller (RacksManagementController) and LOG is a log file;
Second, define the fault management LOG content that the RMC records for the nodes inside the rack;
Finally, define the trigger conditions of the RMC's fault management LOG for the nodes inside the rack.
The management LOG levels comprise three parts: information (Info), alarm (warning) and critical (critical). Info covers the recovery of a node from a fault state and the user's normal operations; warning covers a change in a node's asset information, the removal of a node, and a node temperature above the warning threshold; critical covers a node fault, a node temperature above the critical alarm threshold, and a problem with a node link.
The LOG content comprises: the level of the LOG, the device type (NODE), and a specific description of the node fault (EventDescription).
The trigger condition works by reading the content of the node fault management LOG file and distinguishing, according to that content, faults of the Info level, faults of the warning level, and faults of the critical level.
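Reading the LOG content and separating entries by level could look roughly like the sketch below. It assumes a hypothetical line layout of `level | device type | description`; the patent specifies the three fields but no file format, and the function name `group_by_level` is likewise an assumption.

```python
from collections import defaultdict

LEVELS = ("Info", "warning", "critical")


def group_by_level(lines):
    """Group LOG lines by level.

    Each line is assumed (hypothetically) to look like
    'level | device type | description'.
    """
    groups = defaultdict(list)
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        parts = [p.strip() for p in line.split("|", 2)]
        if len(parts) != 3 or parts[0] not in LEVELS:
            raise ValueError(f"malformed LOG line: {raw!r}")
        level, device_type, description = parts
        groups[level].append((device_type, description))
    return dict(groups)


sample = [
    "Info | NODE | node powered on",
    "critical | NODE | node communication link failed",
    "warning | NODE | node removed",
    "critical | NODE | node failed",
]
faults = group_by_level(sample)
print(faults["critical"])
```

In practice the lines would come from the LOG file itself, for example by passing an open file object to `group_by_level`, since a file iterates line by line.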
The log trigger events of the Info level comprise:
WasOK: the node recovers from a fault;
BMCIPModechangetoStatic: the BMC address of the node changes;
Wasadded: the node is inserted;
BMCreset: the node's BMC is restarted;
Waspoweron: the node is powered on;
Waspoweroff: the node is shut down.
The log trigger events of the warning level comprise:
Memorycapacitywaschangedtoxxx: the node's memory capacity changes;
SSDdiskcapacitywaschangedtoxxx: the node's SSD capacity changes;
HDDdiskcapacitywaschangedtoxxx: the node's hard-disk capacity changes;
Wasremoved: the node is removed;
Wasabsent: the node is not in place;
Wasfromxxxx (locationID) toxxxx: the node's rack position changes;
IDfromxxxtoyyy: the node's ID changes;
Ambienttemperaturewasxxxoverxxxdegree: the node temperature exceeds the warning threshold.
The log trigger events of the critical level comprise:
Wasfail: the node breaks down;
Communicationwasfail: the node's communication link breaks down;
Ambienttemperaturewasxxxoverxxxdegree: the node temperature exceeds the critical threshold.
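The temperature events appear at both the warning and the critical level, distinguished only by which threshold is exceeded. The sketch below illustrates that decision. The function name `temperature_event`, the example threshold values, and the use of whole degrees are assumptions; the message follows the listed template with the xxx placeholders filled in, and the spaces between its words are also an assumption, since the listing shows the template without them.

```python
def temperature_event(temp, warning_threshold, critical_threshold):
    """Return (level, message) for a node temperature, or None if it is normal.

    The thresholds are hypothetical parameters; the text names a warning
    threshold and a critical threshold but gives no values.
    """
    if warning_threshold > critical_threshold:
        raise ValueError("warning threshold must not exceed critical threshold")
    # Check the critical threshold first so the more severe level wins.
    if temp > critical_threshold:
        return ("critical",
                f"Ambient temperature was {temp} over {critical_threshold} degree")
    if temp > warning_threshold:
        return ("warning",
                f"Ambient temperature was {temp} over {warning_threshold} degree")
    return None


# Example values only: warning above 35, critical above 45.
print(temperature_event(40, 35, 45))
print(temperature_event(50, 35, 45))
print(temperature_event(30, 35, 45))
```

A temperature exactly equal to a threshold is treated as normal here, because the text speaks of the temperature being "more than" the threshold.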
The above embodiment is only a specific example of the present invention; the scope of patent protection of the present invention includes, but is not limited to, the above embodiment. Any suitable change or substitution that conforms to the claims of this node fault recording method for a high-end fault-tolerant server and is made by a person of ordinary skill in the art shall fall within the scope of patent protection of the present invention.
Claims (7)
1. A node fault recording method for a high-end fault-tolerant server, characterized in that its specific design process is: first, define the fault management LOG levels that the RMC applies to the nodes inside the rack, where RMC is the rack management controller (RacksManagementController) and LOG is a log file;
Second, define the fault management LOG content that the RMC records for the nodes inside the rack;
Finally, define the trigger conditions of the RMC's fault management LOG for the nodes inside the rack.
2. The node fault recording method for a high-end fault-tolerant server according to claim 1, characterized in that the management LOG levels comprise three parts: information (Info), alarm (warning) and critical (critical), wherein Info covers the recovery of a node from a fault state and the user's normal operations; warning covers a change in a node's asset information, the removal of a node, and a node temperature above the warning threshold; and critical covers a node fault, a node temperature above the critical alarm threshold, and a problem with a node link.
3. The node fault recording method for a high-end fault-tolerant server according to claim 1, characterized in that the LOG content comprises: the level of the LOG, the device type (NODE), and a specific description of the node fault (EventDescription).
4. The node fault recording method for a high-end fault-tolerant server according to claim 2, characterized in that the trigger condition works by reading the content of the node fault management LOG file and distinguishing, according to that content, faults of the Info level, faults of the warning level, and faults of the critical level.
5. The node fault recording method for a high-end fault-tolerant server according to claim 4, characterized in that the log trigger events of the Info level comprise: the node recovers from a fault, the BMC address of the node changes, the node is inserted, the node's BMC is restarted, the node is powered on, and the node is shut down.
6. The node fault recording method for a high-end fault-tolerant server according to claim 4, characterized in that the log trigger events of the warning level comprise: the node's memory capacity changes, the node's SSD capacity changes, the node's hard-disk capacity changes, the node is removed, the node is not in place, the node's rack position changes, the node's ID changes, and the node temperature exceeds the warning threshold.
7. The node fault recording method for a high-end fault-tolerant server according to claim 4, characterized in that the log trigger events of the critical level comprise: the node breaks down, the node's communication link breaks down, and the node temperature exceeds the critical threshold.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510931667.8A CN105357064A (en) | 2015-12-15 | 2015-12-15 | Node fault recording method of high-end fault tolerance server |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105357064A (en) | 2016-02-24 |
Family
ID=55332940
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510931667.8A Pending CN105357064A (en) | 2015-12-15 | 2015-12-15 | Node fault recording method of high-end fault tolerance server |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105357064A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108616428A (en) * | 2018-05-14 | 2018-10-02 | 郑州云海信息技术有限公司 | A kind of mobile APP implementations of remote management RACK computer rooms |
CN111581002A (en) * | 2020-04-29 | 2020-08-25 | 上海中通吉网络技术有限公司 | Automatic fault reporting method, device and equipment for server fault |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7296172B2 (en) * | 2004-08-24 | 2007-11-13 | Inventec Corporation | Power control and management method for uninterruptible power system and servers |
CN104317714A (en) * | 2014-10-29 | 2015-01-28 | 浪潮电子信息产业股份有限公司 | Method for automatically testing stability of rack based on expect |
CN104378218A (en) * | 2013-08-12 | 2015-02-25 | 鸿富锦精密工业(深圳)有限公司 | System and method for managing servers in cabinet |
CN104809041A (en) * | 2015-05-07 | 2015-07-29 | 浪潮电子信息产业股份有限公司 | Batch test method of whole cabinet server power supply |
Non-Patent Citations (2)
Title |
---|
天蝎项目组: ""整机柜服务器解决方案技术规范"", 《百度文库》 * |
文档助手1: "历史版本1:log输出级别", 《FINEREPORT帮助文档》 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106557145A (en) | Circuit breaking protective system and its method | |
CN104461747B (en) | A kind of distributed task dispatching system | |
GB201306798D0 (en) | Storage management in clustered data processing systems | |
CN105373899A (en) | Server asset management method and apparatus | |
CN102857371B (en) | A kind of dynamic allocation management method towards group system | |
CN107656705B (en) | Computer storage medium and data migration method, device and system | |
CN103257908A (en) | Software and hardware cooperative multi-controller disk array designing method | |
US9208039B2 (en) | System and method for detecting server removal from a cluster to enable fast failover of storage | |
US9372756B2 (en) | Recovery of operational state values for complex event processing based on a time window defined by an event query | |
TW201635142A (en) | Fault tolerant method and system for multiple servers | |
CN104346264A (en) | System and method for processing system event logs | |
CN112118130B (en) | Self-adaptive distributed cache active-standby state information switching method and device | |
CN104484131A (en) | Device and corresponding method for processing data of multi-disk servers | |
CN107229537A (en) | A kind of database real time backup method | |
CN105357064A (en) | Node fault recording method of high-end fault tolerance server | |
CN106201772A (en) | The backup of a kind of operating system based on data center, restoration methods and device | |
CN102999399A (en) | Method and device of automatically restoring storage of JBOD (just bundle of disks) array | |
CN109474470A (en) | One kind is from monitoring method and device | |
CN106126368A (en) | A kind of method of memory failure address resolution under LINUX | |
CN103309764A (en) | Method and device for protection of fault-tolerant mechanism of virtual machine | |
CN103500140A (en) | Method for rapidly learning invalidation of distributed cluster nodes | |
CN107943615B (en) | Data processing method and system based on distributed cluster | |
CN105323271B (en) | Cloud computing system and processing method and device thereof | |
CN109271270A (en) | The troubleshooting methodology, system and relevant apparatus of bottom hardware in storage system | |
US20140052807A1 (en) | Server and method for controlling sharing of fans |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 20160224 |