CN102420710A - Method for positioning fault of server cluster system - Google Patents

Info

Publication number
CN102420710A
Authority
CN
China
Prior art keywords
server
signal
cluster system
cluster
fault locating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2011104600595A
Other languages
Chinese (zh)
Inventor
张考华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dawning Information Industry Co Ltd
Original Assignee
Dawning Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dawning Information Industry Co Ltd filed Critical Dawning Information Industry Co Ltd
Priority to CN2011104600595A priority Critical patent/CN102420710A/en
Publication of CN102420710A publication Critical patent/CN102420710A/en
Pending legal-status Critical Current

Abstract

The invention discloses a method for locating a fault in a server cluster system. The server cluster system comprises a first server, a transmission channel, and a second server. The first server and the second server each run in a Unified Extensible Firmware Interface (UEFI) environment, and the signal that the first server sends to the second server through the transmission channel differs from the signal that the second server receives. The method comprises the following steps: 1, the first server sends a first signal to the second server via an Ethernet network card; and 2, the second server sends the first signal back to the first server using the external loopback function of the Ethernet network card. Because the fault is located in the UEFI environment, network diagnosis no longer depends on an operating system environment, and fault location is no longer disturbed by differences between the operating systems of the server nodes in the cluster or by the various applications running on them, so the fault can be located accurately and efficiently. Because the connectivity of the cluster network is debugged in advance in an environment independent of any operating system, the conditions are in place for the subsequent remote deployment and installation of operating systems on each node of the cluster, which greatly increases the deployment speed of the whole cluster.

Description

Method for locating faults in a server cluster system
Technical field
The present invention relates generally to the field of servers and, more specifically, to a method for locating faults in a server cluster system.
Background art
A server is a critical core device in an integrated network system, and it cannot be used apart from a network environment. Present-day server cluster networks typically consist of at least tens, and at most thousands, of servers. In an actual deployment, the operating system cannot be installed by hand on one machine after another; instead, installation relies on a stable and reliable network environment and is performed centrally and automatically by machine-room management software. Consequently, if a network fault occurs before the operating system is installed, the means available for locating and analyzing it are very limited.
When deploying server clusters in machine rooms, we frequently run into network problems of one kind or another. At present, these problems are located essentially within the application environment of an operating system, using the corresponding diagnostic and debugging tools. These tools all depend on the operating system, and although the network protocols are standardized, the tools themselves differ somewhat between operating systems in how they process and parse packets. Together with the influence of other software running under the operating system, this frequently interferes with locating and analyzing the problem.
The prior art provides a method for remotely controlling a server power supply and diagnosing its faults. Through the program interface of a remote management center, the method switches the server power module on and off and checks the operating state, fan speed, temperature, current, and power data of the server power supply, so as to diagnose effectively whether the power supply is damaged. This prior art improves efficiency to some extent.
However, the above prior art can only diagnose problems inside a server and cannot be applied to diagnosing communication faults between the servers of a server cluster system. Moreover, existing diagnostic methods all run under an operating system and cannot be used when no operating system is installed.
Summary of the invention
In view of the above defects of the prior art, the present invention provides a method for locating faults in a server cluster system. The method solves the technical problem of how to locate a fault among the servers of a server cluster system and, in particular, how to diagnose a server on which no operating system is installed.
According to one aspect of the present invention, a method for locating faults in a server cluster system is provided. The server cluster system comprises a first server, a transmission channel, and a second server; the first server and the second server both run in a UEFI environment; and the signal that the first server sends to the second server through the transmission channel differs from the signal that the second server receives. The method is characterized in that it comprises: step S1: the first server sends a first signal to the second server via an Ethernet network card, and the second server sends the first signal back to the first server through the external loopback function of the Ethernet network card; step S2: the first server receives a second signal; and step S3: the fault location in the server cluster system is determined by comparing the first signal with the second signal.
In this method, step S3 comprises: if the first signal is identical to the second signal, the fault is located in the second server.
In this method, step S3 comprises: if the first signal differs from the second signal, the fault is located in the first server or in the transmission channel.
In this method, no operating system or application software is installed on the first server or the second server.
In this method, step S1 comprises: step S11: a first signal is input to a management module, or the management module collects information from a functional module connected to it and uses that information as the first signal; and step S12: the first server, in the UEFI environment, sends the first signal to the second server via the Ethernet network card, and the second server sends the first signal back to the first server through the external loopback function of the Ethernet network card.
In this method, step S1 further comprises at least one of the following steps: querying help information through the management module; generating a first message through the management module and sending the first message to the first server; generating a second message through the management module and sending the second message to the second server; configuring the parameters of the first server through the management module; and configuring the parameters of the second server through the management module.
In this method, the management module is a computer.
In this method, the functional module is the first server.
In this method, the transmission channel is Ethernet.
With the above method, network diagnosis no longer depends on an operating system environment, and fault location is no longer disturbed by differences between the operating systems of the server nodes in the cluster or by the various applications running on those operating systems, so faults are located accurately and efficiently. In addition, because the connectivity of the cluster network is debugged in advance in an environment independent of any operating system, the conditions are in place for the subsequent remote deployment and installation of operating systems on every node of the cluster, which greatly increases the deployment speed of the whole cluster.
Brief description of the drawings
The accompanying drawings provide a further understanding of the present invention and form part of the specification. Together with the embodiments of the invention, they serve to explain the invention and do not limit it. In the drawings:
Fig. 1 is an overall flow chart of the server cluster system fault locating method according to the present invention;
Fig. 2 is a detailed flow chart of the server cluster system fault locating method according to the present invention;
Fig. 3 is a detailed flow chart of an example of the server cluster system fault locating method according to the present invention.
Detailed description of the embodiments
The preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood that the preferred embodiments described here serve only to illustrate and explain the invention and do not limit it.
Fig. 1 is an overall flow chart of the server cluster system fault locating method according to the present invention. The server cluster system to which the method shown in Fig. 1 is applied comprises a first server, a transmission channel, and a second server. The first server and the second server both run in a UEFI (Unified Extensible Firmware Interface) software environment and have not booted into an operating system. The signal that the first server sends to the second server through the transmission channel differs from the signal that the second server receives; that is, the connection formed by the first server, the transmission channel, and the second server is faulty, and the fault must be located by means of the fault locating method. In Fig. 1:
Step S100: the first server sends a first signal to the second server via an Ethernet network card, and the second server sends the first signal back to the first server through the external loopback function of the Ethernet network card. In a real-world server cluster system the software environment is often very complex. Leaving the various applications aside, the operating systems alone may be of many types, and of many versions of the same type, all at once. By contrast, the hardware environment of a server cluster is much simpler and more uniform. UEFI, which replaces the traditional BIOS as the system firmware closely tied to the hardware, likewise comes in relatively few and uniform versions. The present invention exploits the fact that UEFI is more powerful than the traditional BIOS while offering a "cleaner" runtime environment than an operating system: the fault in the server cluster system is located from the UEFI Shell environment.
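The patent does not say what the first signal looks like on the wire. As a minimal sketch, assuming the first signal is carried as the payload of an ordinary Ethernet II frame, the frame layout could be illustrated as follows. The function name `build_test_frame` and the choice of the IEEE 802 Local Experimental EtherType 0x88B5 are assumptions made for illustration; an actual tool would run as a UEFI Shell application rather than as Python.

```python
import struct

# Assumption: 0x88B5 is one of the IEEE 802 "Local Experimental" EtherTypes,
# used here only so the test frame does not collide with a real protocol.
ETHERTYPE_TEST = 0x88B5
MIN_PAYLOAD = 46  # minimum Ethernet payload length in bytes (FCS not included)


def build_test_frame(dst_mac: bytes, src_mac: bytes, payload: bytes) -> bytes:
    """Return an Ethernet II frame (without FCS) carrying the first signal."""
    if len(dst_mac) != 6 or len(src_mac) != 6:
        raise ValueError("MAC addresses must be exactly 6 bytes")
    # Header: destination MAC, source MAC, EtherType in network byte order.
    header = struct.pack("!6s6sH", dst_mac, src_mac, ETHERTYPE_TEST)
    # Pad short payloads with zero bytes up to the Ethernet minimum.
    padded = payload.ljust(MIN_PAYLOAD, b"\x00")
    return header + padded
```

The frame returned here is what the first server would keep as its reference copy for the later comparison with the second signal.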
Step S102: the first server receives a second signal.
Step S104: the fault location in the server cluster system is determined by comparing the first signal with the second signal. If the first signal is identical to the second signal, the fault is located in the second server; if the first signal differs from the second signal, the fault is located in the first server or in the transmission channel. Preferably, the transmission channel is Ethernet.
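The comparison rule of step S104 can be sketched in a few lines. The names `FaultLocation` and `locate_fault` are hypothetical, and representing a signal as a `bytes` value is an assumption. One plausible reading of the rule is that an intact round trip through the external loopback clears the first server and the transmission channel, which leaves the second server as the remaining suspect; the patent itself states only the rule.

```python
from enum import Enum


class FaultLocation(Enum):
    SECOND_SERVER = "second server"
    FIRST_SERVER_OR_CHANNEL = "first server or transmission channel"


def locate_fault(first_signal: bytes, second_signal: bytes) -> FaultLocation:
    """Apply the comparison rule of step S104 to the sent and returned signals."""
    if first_signal == second_signal:
        # The looped-back signal came home unchanged.
        return FaultLocation.SECOND_SERVER
    # The signal was altered somewhere on the way out or back.
    return FaultLocation.FIRST_SERVER_OR_CHANNEL
```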
Fig. 2 is a detailed flow chart of the server cluster system fault locating method according to the present invention. In Fig. 2:
Step S200: a first signal is input to the management module, or the management module collects information from the functional module connected to it and uses that information as the first signal. Preferably, the management module is a computer, more preferably a notebook computer. The functional module is the first server. In one embodiment, instructions are entered into the notebook computer via the keyboard and converted by the notebook computer into the first signal. In another embodiment, the notebook computer is connected to a blade server, collects information from the first server (for example, chassis information, blade information, power supply information, system fan information, low-speed switch module information, high-speed switch module information, storage module information, and so on), and then converts this information into the first signal.
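The patent leaves open how collected information is converted into the first signal. A minimal sketch, assuming the information is available as key/value text and that a deterministic JSON encoding is acceptable, might look like this. The function name `build_first_signal` and the field names in the usage example are hypothetical.

```python
import json
from typing import Mapping


def build_first_signal(info: Mapping[str, str]) -> bytes:
    """Serialize collected module information into a reproducible byte string."""
    # Sorted keys and fixed separators make the encoding deterministic, so the
    # same information always yields the same first signal for comparison.
    text = json.dumps(dict(info), sort_keys=True, separators=(",", ":"))
    return text.encode("utf-8")
```

For example, `build_first_signal({"chassis": "ok", "fan": "ok"})` yields the same bytes regardless of the order in which the fields were collected.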
Step S202: the first server, in the UEFI environment, sends the first signal to the second server via the Ethernet network card, and the second server sends the first signal back to the first server through the external loopback function of the Ethernet network card. No operating system or application software is installed on the first server or the second server.
Step S204: the first server receives a second signal.
Step S206: it is determined whether the first signal is identical to the second signal. If they are identical, the method proceeds to step S208, that is, the fault is located in the second server; if they differ, the method proceeds to step S210, that is, the fault is located in the first server or in the transmission channel.
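The flow of steps S202 to S210 can be simulated without any hardware. The sketch below is an illustration under stated assumptions: the transmission channel is modeled as a function from bytes to bytes, the external loopback simply echoes whatever arrives, and the names `healthy_channel`, `lossy_channel`, `external_loopback`, and `run_diagnosis` are all hypothetical.

```python
from typing import Callable

Channel = Callable[[bytes], bytes]


def healthy_channel(frame: bytes) -> bytes:
    """A transmission channel that delivers the frame unchanged."""
    return frame


def lossy_channel(frame: bytes) -> bytes:
    """A faulty channel that drops the final byte of every frame."""
    return frame[:-1]


def external_loopback(received: bytes) -> bytes:
    """Second server's NIC echoes the received signal back (step S202)."""
    return received


def run_diagnosis(first_signal: bytes, channel: Channel) -> str:
    """Walk through steps S202 to S210 and report the branch taken."""
    arrived = channel(first_signal)      # S202: outbound transmission
    echoed = external_loopback(arrived)  # S202: external loopback
    second_signal = channel(echoed)      # S204: first server receives
    if second_signal == first_signal:    # S206: compare the two signals
        return "S208: fault in second server"
    return "S210: fault in first server or transmission channel"
```

Calling `run_diagnosis(b"\x01\x02\x03", lossy_channel)` takes the S210 branch, because the returned signal is shorter than the one that was sent.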
The method of the present invention provides an effective means of fault diagnosis for a server cluster system before the operating system is installed or when the operating system cannot work properly. This is of great value when deploying large server clusters. Take the Dawning 6000 (Nebulae) project as an example: in a server cluster of that size, if the network can be debugged before the operating system is installed, in coordination with network devices such as switches on which the network segments have been configured, then network connectivity problems that would otherwise surface later can be discovered in good time and the scope of the fault narrowed.
The method also provides the conditions for centralized operating system installation at a later stage. This saves the considerable time that workers would otherwise spend up front installing systems in small batches on the production line, and it greatly shortens the project cycle.
Fig. 3 is a detailed flow chart of an example of the server cluster system fault locating method according to the present invention. As shown in Fig. 3, instructions are sent to the notebook computer (the management module). After the notebook computer parses them, the following operations can be performed: viewing help information, performing outloop (the external loopback described above), generating packets, sending and receiving packets, and configuring the servers connected to the notebook computer.
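The instruction handling of Fig. 3 resembles a small command dispatcher. The following sketch shows one way such a dispatcher could be organized. Only the operation name `outloop` comes from the text; the command names `help`, `genmsg`, `send`, and `config`, their arguments, and their reply strings are assumptions made for illustration.

```python
from typing import Callable, Dict, List


def cmd_help(args: List[str]) -> str:
    return "commands: " + ", ".join(sorted(COMMANDS))


def cmd_outloop(args: List[str]) -> str:
    return "external loopback test requested"


def cmd_genmsg(args: List[str]) -> str:
    target = args[0] if args else "first"
    return f"packet generated for {target} server"


def cmd_send(args: List[str]) -> str:
    target = args[0] if args else "first"
    return f"packet sent to {target} server"


def cmd_config(args: List[str]) -> str:
    if len(args) != 3:
        return "usage: config <first|second> <parameter> <value>"
    server, key, value = args
    return f"{server} server: {key} set to {value}"


# Hypothetical command table of the management module.
COMMANDS: Dict[str, Callable[[List[str]], str]] = {
    "help": cmd_help,
    "outloop": cmd_outloop,
    "genmsg": cmd_genmsg,
    "send": cmd_send,
    "config": cmd_config,
}


def dispatch(line: str) -> str:
    """Parse one instruction line and run the matching operation."""
    parts = line.split()
    if not parts:
        return cmd_help([])
    handler = COMMANDS.get(parts[0])
    if handler is None:
        return f"unknown command: {parts[0]}"
    return handler(parts[1:])
```

A table of handlers keeps each operation independent, so a further operation can be added without touching the parsing code.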
The above are merely preferred embodiments of the present invention and do not limit it. For those skilled in the art, the invention may be modified and varied in various ways. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (9)

1. A method for locating faults in a server cluster system, the server cluster system comprising a first server, a transmission channel, and a second server, the first server and the second server both running in a UEFI environment, and the signal that the first server sends to the second server through the transmission channel differing from the signal that the second server receives, characterized in that the method comprises:
Step S1: the first server sends a first signal to the second server via an Ethernet network card, and the second server sends the first signal back to the first server through the external loopback function of the Ethernet network card;
Step S2: the first server receives a second signal; and
Step S3: the fault location in the server cluster system is determined by comparing the first signal with the second signal.
2. The method for locating faults in a server cluster system according to claim 1, characterized in that step S3 comprises: if the first signal is identical to the second signal, the fault is located in the second server.
3. The method for locating faults in a server cluster system according to claim 1, characterized in that step S3 comprises: if the first signal differs from the second signal, the fault is located in the first server or in the transmission channel.
4. The method for locating faults in a server cluster system according to claim 2 or 3, characterized in that no operating system or application software is installed on the first server or the second server.
5. The method for locating faults in a server cluster system according to claim 4, characterized in that step S1 comprises:
Step S11: a first signal is input to a management module, or the management module collects information from a functional module connected to it and uses that information as the first signal; and
Step S12: the first server, in the UEFI environment, sends the first signal to the second server via the Ethernet network card, and the second server sends the first signal back to the first server through the external loopback function of the Ethernet network card.
6. The method for locating faults in a server cluster system according to claim 5, characterized in that step S1 further comprises at least one of the following steps:
querying help information through the management module;
generating a first message through the management module and sending the first message to the first server;
generating a second message through the management module and sending the second message to the second server;
configuring the parameters of the first server through the management module; and
configuring the parameters of the second server through the management module.
7. The method for locating faults in a server cluster system according to claim 6, characterized in that the management module is a computer.
8. The method for locating faults in a server cluster system according to claim 7, characterized in that the functional module is the first server.
9. The method for locating faults in a server cluster system according to claim 8, characterized in that the transmission channel is Ethernet.
CN2011104600595A 2011-12-31 2011-12-31 Method for positioning fault of server cluster system Pending CN102420710A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011104600595A CN102420710A (en) 2011-12-31 2011-12-31 Method for positioning fault of server cluster system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2011104600595A CN102420710A (en) 2011-12-31 2011-12-31 Method for positioning fault of server cluster system

Publications (1)

Publication Number Publication Date
CN102420710A true CN102420710A (en) 2012-04-18

Family

ID=45944958

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011104600595A Pending CN102420710A (en) 2011-12-31 2011-12-31 Method for positioning fault of server cluster system

Country Status (1)

Country Link
CN (1) CN102420710A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1859411A (en) * 2006-03-18 2006-11-08 华为技术有限公司 Method for detecting communication device link ringback and communication device
CN101821990A (en) * 2007-10-09 2010-09-01 Lm爱立信电话有限公司 Arrangement and method for handling failures in network
US20100281295A1 (en) * 2008-01-07 2010-11-04 Abhay Karandikar Method for Fast Connectivity Fault Management [CFM] of a Service-Network
CN102075988A (en) * 2009-11-24 2011-05-25 中国移动通信集团浙江有限公司 System and method for locating end-to-end voice quality fault in mobile communication network
CN102147763A (en) * 2010-02-05 2011-08-10 中国长城计算机深圳股份有限公司 Method, system and computer for recording weblog

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103973480A (en) * 2014-04-09 2014-08-06 汉柏科技有限公司 Device and method for increasing cloud computing system user response speed
CN103973480B (en) * 2014-04-09 2017-10-31 汉柏科技有限公司 Improve the device and method of cloud computing system user reponding time
CN104468725A (en) * 2014-11-06 2015-03-25 浪潮(北京)电子信息产业有限公司 High-availability cluster software maintaining method, device and system
CN104468725B (en) * 2014-11-06 2017-12-01 浪潮(北京)电子信息产业有限公司 A kind of method, apparatus and system for realizing high-availability cluster software maintenance
CN104812066A (en) * 2015-05-18 2015-07-29 百度在线网络技术(北京)有限公司 Method and device for identifying and positioning faults and server
CN110262968A (en) * 2019-06-10 2019-09-20 天翼电子商务有限公司 Promote method, system, medium and the electronic equipment of application failure location efficiency

Similar Documents

Publication Publication Date Title
CN110380907B (en) Network fault diagnosis method and device, network equipment and storage medium
CN102047683B (en) Dynamic fault analysis for a centrally managed network element in a telecommunications system
CN103457761B (en) Cross-platform command line configuration interface implementation method
CN102571498A (en) Fault injection control method and device
CN105656685A (en) Automatic deployment and operation and maintenance monitoring method based on zabbix system oracle
CN111176939B (en) Multi-node server management system and method based on CPLD
CN102420710A (en) Method for positioning fault of server cluster system
EP3051750B1 (en) Collection adaptor management method and system
WO2017193763A1 (en) Testing method, apparatus and system
CN102325036A (en) Fault diagnosis method for network system, system and device
US8816695B2 (en) Method and system for interoperability testing
CN111052087A (en) Control system, information processing device, and abnormality factor estimation program
CN101938369B (en) Comprehensive network management access management system, management method and network management system applying same
CN103605592A (en) Mechanism of detecting malfunctions of distributed computer system
CN111800299A (en) Operation maintenance system and method of edge cloud
CN105404569A (en) Method for testing remote Power Reset of server
CN112636960B (en) Intranet collaborative maintenance method, system, device, server and storage medium of edge computing equipment
CN107707408B (en) Remote monitoring method and system for digital broadcast transmitter
CN106411643B (en) BMC detection method and device
CN109446002B (en) Jig plate, system and method for grabbing SATA hard disk by server
CN116137603A (en) Link fault detection method and device, storage medium and electronic device
US20160156501A1 (en) Network apparatus with inserted management mechanism, system, and method for management and supervision
US11457374B2 (en) Hub device with diagnostic function and diagnostic method using the same
CN112206453A (en) Fire control system
CN106021649A (en) A model configuration detector used for a virtual circuit verifying platform and a control method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20120418