CN103500133A

CN103500133A - Fault locating method and device

Info

Publication number: CN103500133A
Application number: CN201310425373.9A
Authority: CN
Inventors: 刘通良; 姜广吉; 陈俊杰
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2013-09-17
Filing date: 2013-09-17
Publication date: 2014-01-08
Also published as: WO2015039598A1

Abstract

Embodiments of the invention provide a fault locating method and device. The method includes: monitoring for a hardware abnormality trigger condition while a powered-on device executes a basic input/output system (BIOS) program; when an abnormality trigger condition is detected, collecting abnormality information that includes at least fault information of a central processing unit (CPU); and reporting the abnormality information to a monitoring server through a network. Fault information can thus be viewed and accurately located remotely.

Description

Fault Locating Method and device

Technical field

Embodiments of the present invention relate to computer technology, and in particular to a fault locating method and device.

Background

The reliability of computer systems, especially server products, has always been a major concern. When a server fails, the fault must be detected, located, and eliminated promptly. This requires that fault information can be presented to the user conveniently even when the server has no standard cathode ray tube (CRT) or liquid crystal display (LCD) output. Collecting and locating fault information in the power-on self-test (POST) stage is therefore particularly important.

As shown in Fig. 1, the logic block diagram of a typical computer or server includes a mainboard 100, which comprises: memory 10, CPU 20, north bridge 30, south bridge 40, peripherals 50, and a basic input/output system (BIOS) 60. The mainboard 100 is also connected to peripheral expansion devices 70 (hard disk, video card, etc.).

In the prior art, a device 80 is added to the structure of Fig. 1. As shown in Fig. 2, the device 80 comprises a memory 81, a controller 82, and a display module 83. With the structure of Fig. 2, POST-stage self-test information is displayed through the following steps:

1) Power on and start the computer device, and execute the BIOS 60 program;

2) Send the POST-stage self-test information to the controller 82 in a given data structure;

3) The controller 82 decodes the data and sends the decoded data to the display module 83, which displays the POST-stage self-test information in real time.

This prior art requires an additional controller 82 and display module 83 to display the POST-stage self-test information, which includes fault information. Moreover, the user must be physically present at the device to obtain that information.

Summary of the invention

Embodiments of the present invention provide a fault locating method and device, so that fault information can be viewed and located remotely.

In a first aspect, an embodiment of the present invention provides a fault locating method, comprising:

monitoring for a hardware abnormality trigger condition while a powered-on device executes a basic input/output system (BIOS) program;

when an abnormality trigger condition is detected, collecting abnormality information, the abnormality information including at least fault information of a central processing unit (CPU);

reporting the abnormality information to a monitoring server through a network.

In a first possible implementation of the first aspect, collecting abnormality information when an abnormality trigger condition is detected comprises:

during execution of the BIOS program, if an abnormality trigger condition is detected, triggering the entry function corresponding to each abnormality information trigger source;

collecting the abnormality information by means of the entry function.

According to the first possible implementation of the first aspect, in a second possible implementation, triggering the entry function corresponding to each abnormality information trigger source when an abnormality trigger condition is detected during execution of the BIOS program comprises:

if a CPU abnormality trigger condition is detected, triggering the entry function corresponding to the CPU;

if a memory abnormality trigger condition is detected, triggering the entry function corresponding to the memory;

if a north bridge abnormality trigger condition is detected, triggering the entry function corresponding to the north bridge;

if a south bridge abnormality trigger condition is detected, triggering the entry function corresponding to the south bridge.

According to the first aspect or either of the first and second possible implementations of the first aspect, in a third possible implementation, monitoring for a hardware abnormality trigger condition while the powered-on device executes the BIOS program comprises:

while the powered-on device executes the BIOS program, monitoring whether a system management interrupt (SMI), an abnormal event (Event), or an abnormal message is generated, and if so, determining that an abnormality trigger condition has been detected, wherein the abnormal event or abnormal message is an event or message, generated while executing the BIOS program, that triggers an abnormality.

In a fourth possible implementation of the first aspect, collecting abnormality information when an abnormality trigger condition is detected comprises:

when an abnormality trigger condition is detected, collecting fault information, and collecting the software call stack relationship at the time of the fault and/or the values of the program counter and the program status register, to indicate where the fault occurred.

In a fifth possible implementation of the first aspect, reporting the abnormality information to a monitoring server through a network comprises:

reporting the abnormality information to the monitoring server by means of the Intelligent Platform Management Interface (IPMI) or standard Ethernet.

According to the first aspect or any one of the first to fifth possible implementations of the first aspect, in a sixth possible implementation, before the abnormality information is reported to the monitoring server through the network, the method further comprises:

encapsulating the abnormality information into a capsule, the capsule comprising header information, hardware error information, program run stack information, program counter and program status register information, and trailer information.

In a second aspect, an embodiment of the present invention provides a fault locating device, comprising:

a monitoring driver module, configured to monitor for a hardware abnormality trigger condition while a powered-on device executes a basic input/output system (BIOS) program;

an abnormality information collection module, configured to collect abnormality information when the monitoring driver module detects an abnormality trigger condition, the abnormality information including at least fault information of a central processing unit (CPU);

an information reporting module, configured to report the abnormality information to a monitoring server through a network.

In a first possible implementation of the second aspect, the abnormality information collection module is specifically configured to: during execution of the BIOS program, if an abnormality trigger condition is detected, trigger the entry function corresponding to each abnormality information trigger source; and collect the abnormality information by means of the entry function.

According to the first possible implementation of the second aspect, in a second possible implementation, the abnormality information collection module is further configured to:

if a CPU abnormality trigger condition is detected, trigger the entry function corresponding to the CPU;

According to the second aspect or either of the first and second possible implementations of the second aspect, in a third possible implementation, the monitoring driver module is specifically configured to: while the powered-on device executes the BIOS program, monitor whether a system management interrupt (SMI), an abnormal event (Event), or an abnormal message is generated, and if so, determine that an abnormality trigger condition has been detected, wherein the abnormal event or abnormal message is an event or message, generated while executing the BIOS program, that triggers an abnormality.

In a fourth possible implementation of the second aspect, the abnormality information collection module is specifically configured to: when an abnormality trigger condition is detected, collect fault information, and collect the software call stack relationship at the time of the fault and/or the values of the program counter and the program status register, to indicate where the fault occurred.

With the fault locating method and device of the embodiments of the present invention, abnormality information is reported to a monitoring server, so that operators can view it remotely and can locate and troubleshoot faults on the monitoring server side according to the reported information, which reduces operation and maintenance costs.

Brief description of the drawings

To illustrate the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings needed for describing the embodiments or the prior art are briefly introduced below. Apparently, the drawings described below show some embodiments of the present invention, and those of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a logic block diagram of a typical computer or server;

Fig. 2 is a schematic structural diagram of the prior-art arrangement for displaying POST-stage self-test information;

Fig. 3 is a flowchart of Embodiment 1 of the fault locating method of the present invention;

Fig. 4 is a flowchart of Embodiment 2 of the fault locating method of the present invention;

Fig. 5 is an example of the encapsulation format in Embodiment 2 of the fault locating method of the present invention;

Fig. 6 is a schematic structural diagram of Embodiment 1 of the fault locating device of the present invention;

Fig. 7 is a schematic structural diagram of Embodiment 1 of the fault locating system of the present invention.

Detailed description

To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Apparently, the described embodiments are some rather than all of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

As servers come into widespread use, large numbers of them are deployed in data centers or machine rooms, and their running state is usually monitored remotely from outside the machine room. Fault information in the POST stage therefore needs to be collected and monitored remotely.

Fig. 3 is a flowchart of Embodiment 1 of the fault locating method of the present invention. This embodiment provides a fault locating method that can be performed by a fault locating device; the device can be integrated in a computer or server and implemented in software and/or hardware. As shown in Fig. 3, the method of this embodiment comprises:

Step 301: While a powered-on device executes the BIOS program, monitor for a hardware abnormality trigger condition.

Generally, in the POST stage, the BIOS program is executed to initialize the hardware components of the device, for example the central processing unit (CPU), memory, north bridge, and south bridge in the block diagram of Fig. 1, and to query the working state of those components. When a fault arises in the POST stage, the first thing to check is whether the hardware environment is normal.

In this embodiment, the BIOS program is executed in the POST stage, and a callback function is registered in the BIOS program. If the hardware environment of the device becomes abnormal, for example the CPU fails to initialize, an abnormality trigger condition has been generated. The callback function is then called to trigger the collection of abnormality information, that is, step 302 is performed to collect the abnormality information caused by the abnormality. The abnormality information is produced by subfunctions triggered by the abnormality trigger condition of the hardware environment in the device, and the abnormality trigger condition serves as the input parameter of the callback function.
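The callback registration described for step 301 can be sketched as follows. This is a minimal illustration in Python rather than firmware code, and every name in it (`PostMonitor`, `register_callback`, `raise_trigger`, `collect_abnormality_info`) is a hypothetical stand-in that does not appear in the patent.

```python
from typing import Callable, Dict, List

# All names here are hypothetical; the patent does not define an API.
class PostMonitor:
    """Holds callbacks that run when an abnormality trigger condition is seen."""

    def __init__(self) -> None:
        self._callbacks: List[Callable[[Dict[str, str]], Dict[str, str]]] = []

    def register_callback(self, callback: Callable[[Dict[str, str]], Dict[str, str]]) -> None:
        # Registration happens once, early in the (simulated) POST stage.
        self._callbacks.append(callback)

    def raise_trigger(self, trigger_condition: Dict[str, str]) -> List[Dict[str, str]]:
        # The trigger condition is passed to each callback as its input parameter.
        return [callback(trigger_condition) for callback in self._callbacks]


def collect_abnormality_info(trigger_condition: Dict[str, str]) -> Dict[str, str]:
    # Stand-in for the collection subfunctions that step 302 would call.
    return {"source": trigger_condition["source"], "fault": trigger_condition["detail"]}


monitor = PostMonitor()
monitor.register_callback(collect_abnormality_info)
reports = monitor.raise_trigger({"source": "CPU", "detail": "initialization failed"})
```

Passing the trigger condition as the callback argument mirrors the statement that the abnormality trigger condition is the callback function's input parameter.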

Step 302: When an abnormality trigger condition is detected, collect abnormality information, the abnormality information including at least fault information of the CPU.

Specifically, the BIOS program includes at least one subfunction for collecting abnormality information, and these subfunctions are called by the callback function described above. The fault information covers hardware components such as the CPU, memory, north bridge, and south bridge. For example: the fault information of the CPU, such as Core or Uncore fault information, is used to judge whether the current fault is caused by the CPU and to locate the cause from the fault symptoms; the fault information of the north bridge includes that of the root port (Root Port) and of Peripheral Component Interconnect Express (PCIe) devices, and is used to check for north bridge or PCIe device faults, the PCIe-related register information being especially important when an input/output (IO) error occurs; the fault information of the south bridge is used to check whether a device attached to the south bridge is abnormal; the fault information of the memory includes that of the system management interrupt (SMI) and of double data rate synchronous dynamic random access memory (DDR), for example SMI channel error codes, DIMM ECC errors, or DIMM detection failures.

Step 303: Report the abnormality information to a monitoring server through a network.

The monitoring server receives the abnormality information through Ethernet communication or LPC communication, parses and records it, and saves it to a local storage medium, which includes but is not limited to a hard disk and non-volatile random access memory (NVRAM), to support fault management and long-term maintenance and to serve as important data for subsequent fault locating. At the same time, the monitoring server informs operators in a readable, visual form that a fault has occurred. Maintenance personnel can also query a fault database according to the abnormality information to obtain more detailed fault locating information.
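The server-side handling described above (record, persist, notify, look up a fault database) might look roughly like the sketch below. The log file format, the `FAULT_DATABASE` contents, and the function name `record_abnormality` are assumptions made for illustration; the patent does not specify how the monitoring server stores or presents the data.

```python
import json
import os
import tempfile

# Hypothetical fault database; real entries would come from the vendor.
FAULT_DATABASE = {"CPU_INIT": "check CPU seating and firmware"}


def record_abnormality(store_path: str, info: dict) -> str:
    """Append one report to local storage and return a readable alert line."""
    with open(store_path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(info) + "\n")  # one JSON record per line
    hint = FAULT_DATABASE.get(info.get("code"), "no entry in fault database")
    return f"FAULT from {info.get('source')}: {info.get('code')} ({hint})"


# Usage: a temporary file stands in for the hard disk or NVRAM.
store = os.path.join(tempfile.mkdtemp(), "abnormality.log")
alert = record_abnormality(store, {"source": "CPU", "code": "CPU_INIT"})
with open(store, encoding="utf-8") as handle:
    saved = [json.loads(line) for line in handle]
```

An append-only record keeps earlier reports available for the long-term maintenance the text mentions.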

It should also be noted that operators may learn that a fault has occurred in any manner, for example through an alarm raised by the monitoring server or by watching the monitoring server in real time; no limitation is imposed here.

In the prior art, a computer or server displays POST-stage self-test information through a display module, such as an LCD or VFT display device, and such display modules are mounted on the front panel of the chassis, so operators must be present at the device. In this embodiment, by contrast, the abnormality information is reported to the monitoring server, so that operators can view it remotely through the monitoring server.

By reporting abnormality information to a monitoring server, this embodiment of the present invention enables operators to view the abnormality information remotely and to locate and troubleshoot faults on the monitoring server side according to the reported information, which reduces operation and maintenance costs.

On the basis of the above embodiment, collecting abnormality information when an abnormality trigger condition is detected can be further refined as:

1. During execution of the BIOS program, if an abnormality trigger condition is detected, trigger the entry function corresponding to each abnormality information trigger source;

2. Collect the abnormality information by means of the entry function.

Specifically, the abnormality information trigger source can be understood by those skilled in the art as a hardware component whose state is abnormal, that is, the hardware component for which the abnormality trigger condition mentioned above was detected. The abnormality trigger condition serves as the input parameter of the entry function corresponding to its trigger source.

Specifically, during BIOS startup, if an abnormality trigger condition is detected, triggering the entry function corresponding to each abnormality information trigger source may comprise:

If a CPU abnormality trigger condition is detected, triggering the entry function corresponding to the CPU; if a memory abnormality trigger condition is detected, triggering the entry function corresponding to the memory; if a north bridge abnormality trigger condition is detected, triggering the entry function corresponding to the north bridge; if a south bridge abnormality trigger condition is detected, triggering the entry function corresponding to the south bridge. By analogy, if an abnormality trigger condition of any other hardware component is detected, the entry function corresponding to that hardware component, that is, to that abnormality information trigger source, is triggered; these cases are not enumerated one by one here.
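The per-source entry functions above amount to a dispatch table keyed by trigger source. The following is a minimal sketch under that reading; the key strings, `make_entry_function`, and `trigger_entry_function` are hypothetical names chosen for illustration.

```python
from typing import Callable, Dict


def make_entry_function(source: str) -> Callable[[str], Dict[str, str]]:
    """Build a placeholder entry function for one trigger source."""
    def entry(condition: str) -> Dict[str, str]:
        # A real entry function would read the component's error registers.
        return {"source": source, "condition": condition}
    return entry


# One entry function per abnormality information trigger source.
ENTRY_FUNCTIONS: Dict[str, Callable[[str], Dict[str, str]]] = {
    name: make_entry_function(name)
    for name in ("CPU", "memory", "north_bridge", "south_bridge")
}


def trigger_entry_function(source: str, condition: str) -> Dict[str, str]:
    # The trigger condition is the input parameter of the entry function.
    try:
        entry = ENTRY_FUNCTIONS[source]
    except KeyError:
        raise ValueError(f"no entry function registered for {source!r}") from None
    return entry(condition)


result = trigger_entry_function("memory", "DIMM detection failed")
```

Supporting another hardware component then only requires adding one more key to the table, which matches the "by analogy" extension in the text.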

On this basis, monitoring for a hardware abnormality trigger condition while the powered-on device executes the BIOS program may comprise: while the powered-on device executes the BIOS program, monitoring whether an SMI, an abnormal event (Event), or an abnormal message (Message) is generated, and if so, determining that an abnormality trigger condition has been detected, wherein the abnormal event or abnormal message is an event or message, generated while executing the BIOS program, that triggers an abnormality. In this case, the ways in which collection is triggered according to the detected abnormality trigger condition include the SMI mode, the abnormal Event mode, and the abnormal message mode. How the entry function corresponding to each abnormality information trigger source is called in each mode is described below:

If the detected abnormality trigger condition is of the SMI mode, an SMI is triggered, and the entry function corresponding to each abnormality information trigger source is called in the SMI handler;

If the detected abnormality trigger condition is of the abnormal Event mode, an Event is sent, and the entry function corresponding to each abnormality information trigger source is called in the callback function of the Event;

If the detected abnormality trigger condition is of the abnormal message mode, a message is sent, and the entry function corresponding to each abnormality information trigger source is called in the callback function of the message.
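The three trigger modes differ only in which handler ends up calling the entry functions. A rough sketch of that routing is shown below; `TriggerMode`, the handler names, and `on_trigger` are hypothetical, and ordinary Python functions stand in for the SMI handler and the Event and message callbacks.

```python
from enum import Enum
from typing import Callable, Dict, List, Tuple


class TriggerMode(Enum):
    SMI = "smi"
    EVENT = "event"
    MESSAGE = "message"


def call_entry_functions(sources: List[str]) -> List[str]:
    # Placeholder for calling the entry function of each trigger source.
    return [f"collected:{source}" for source in sources]


def smi_handler(sources: List[str]) -> Tuple[str, List[str]]:
    return ("smi_handler", call_entry_functions(sources))


def event_callback(sources: List[str]) -> Tuple[str, List[str]]:
    return ("event_callback", call_entry_functions(sources))


def message_callback(sources: List[str]) -> Tuple[str, List[str]]:
    return ("message_callback", call_entry_functions(sources))


HANDLERS: Dict[TriggerMode, Callable[[List[str]], Tuple[str, List[str]]]] = {
    TriggerMode.SMI: smi_handler,
    TriggerMode.EVENT: event_callback,
    TriggerMode.MESSAGE: message_callback,
}


def on_trigger(mode: TriggerMode, sources: List[str]) -> Tuple[str, List[str]]:
    # Route the detected condition to the handler for its trigger mode.
    return HANDLERS[mode](sources)


outcome = on_trigger(TriggerMode.EVENT, ["CPU", "memory"])
```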

Collecting abnormality information when an abnormality trigger condition is detected may comprise: when an abnormality trigger condition is detected, collecting fault information, and collecting the software call stack relationship at the time of the fault and/or the values of the program counter and the program status register, to indicate where the fault occurred. Generally, reporting fault information alone is not sufficient for fault locating and troubleshooting, so more precise information is needed to assist in locating the fault. The present invention therefore borrows the concept of the kernel dump from the Linux operating system: when fault information is collected, the software call stack relationship at the time of the fault is also collected and saved as the data basis for precise locating. In addition, the values of the current program counter and program status register are collected and saved, which helps in analyzing the program's execution trail at the time of the abnormality and the correctness of the processor's register state, and preserves the CPU running state at the time of the abnormality.
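As a loose analogy only, the idea of saving the call stack together with a program counter can be shown in Python. Python has no access to hardware registers, so the sketch below uses the interpreter's bytecode offset (`f_lasti`) as a stand-in for the program counter and takes the program status value as a caller-supplied placeholder; `snapshot_context`, `init_memory`, and `post_stage` are hypothetical names. `sys._getframe` is specific to CPython.

```python
import sys
import traceback
from typing import Optional


def snapshot_context(program_status: Optional[int] = None) -> dict:
    """Capture the caller's call chain plus a program-counter stand-in."""
    frame = sys._getframe(1)  # the frame of the function that called us
    call_stack = [f"{entry.name}:{entry.lineno}"
                  for entry in traceback.extract_stack(frame)]  # oldest first
    return {
        "call_stack": call_stack,
        "program_counter": frame.f_lasti,  # bytecode offset, not a CPU register
        "program_status": program_status,  # placeholder; no Python equivalent
    }


def init_memory() -> dict:
    # Pretend memory initialization failed here and capture the context.
    return snapshot_context()


def post_stage() -> dict:
    return init_memory()


context = post_stage()
```

The last two entries of `context["call_stack"]` name `post_stage` and `init_memory`, which is the kind of execution trail the text describes as the data basis for precise locating.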

Further, reporting the abnormality information to the monitoring server through the network may comprise: reporting the abnormality information to the monitoring server by means of the Intelligent Platform Management Interface (IPMI) or standard Ethernet. The abnormality information may also be reported by other communication means.

Fig. 4 is a flowchart of Embodiment 2 of the fault locating method of the present invention. As shown in Fig. 4, on the basis of the above embodiment, the fault locating method of this embodiment may comprise the following steps:

Step 401: While a powered-on device executes the BIOS program, monitor for a hardware abnormality trigger condition.

For this step, refer to step 301 of the embodiment shown in Fig. 3; details are not repeated here.

Step 402: When an abnormality trigger condition is detected, collect abnormality information, the abnormality information including at least fault information of the CPU.

For this step, refer to step 302 of the embodiment shown in Fig. 3; details are not repeated here.

Step 403: Encapsulate the abnormality information into a capsule.

Fig. 5 is an example of the encapsulation format in Embodiment 2 of the fault locating method of the present invention. Referring to Fig. 5, the capsule may include header information, hardware error information, program run stack information, program counter and program status register information, and trailer information. The header information and trailer information are mandatory, while the middle part of the capsule may include any combination of hardware error information, program run stack information, and program counter and program status register information. Here, the hardware includes the CPU, memory, north bridge, south bridge, and so on. The program run stack information includes the stack information of the currently executing function and of the functions it calls internally, the number of which is not limited; when no hardware fault has occurred, the fault must be located and analyzed by examining the program's execution trail. The values of the relevant registers and counters during program execution are also collected, for example the values of the program counter and program status register, in order to analyze whether the environment parameters of the running program are abnormal, for example whether the value of the program pointer register is illegal or whether the stack has overflowed.

Step 404: Report the abnormality information to a monitoring server through a network.

Specifically, after the abnormality information is encapsulated into a capsule or another information format, it is reported to the monitoring server through IPMI, standard Ethernet, or another communication means. The monitoring server parses the received abnormality information according to the corresponding encapsulation format and obtains the hardware, stack, program counter, and other information respectively.
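Steps 403 and 404 can be illustrated with a round trip: encapsulate on the device side, parse on the monitoring server side. The byte layout below is purely an assumption, since the source does not give the field sizes of Fig. 5: a 4-byte header magic plus a section count, type-length-value sections in the middle, and a trailer made of a CRC32 and a 4-byte magic. The names `encapsulate`, `parse_capsule`, and the magic values are hypothetical.

```python
import struct
import zlib
from typing import Dict

HEADER_MAGIC = b"CAPS"   # assumed header marker
TRAILER_MAGIC = b"SPAC"  # assumed trailer marker
SECTION_IDS = {"hardware_error": 1, "stack": 2,
               "program_counter": 3, "program_status": 4}
SECTION_NAMES = {number: name for name, number in SECTION_IDS.items()}


def encapsulate(sections: Dict[str, bytes]) -> bytes:
    """Header and trailer are always present; the middle is any combination."""
    body = b"".join(
        struct.pack("<BH", SECTION_IDS[name], len(payload)) + payload
        for name, payload in sections.items()
    )
    header = HEADER_MAGIC + struct.pack("<B", len(sections))
    trailer = struct.pack("<I", zlib.crc32(body)) + TRAILER_MAGIC
    return header + body + trailer


def parse_capsule(capsule: bytes) -> Dict[str, bytes]:
    """Server side: validate header and trailer, then split the sections."""
    if capsule[:4] != HEADER_MAGIC or capsule[-4:] != TRAILER_MAGIC:
        raise ValueError("missing header or trailer")
    count, body = capsule[4], capsule[5:-8]
    (checksum,) = struct.unpack("<I", capsule[-8:-4])
    if zlib.crc32(body) != checksum:
        raise ValueError("checksum mismatch")
    sections, offset = {}, 0
    for _ in range(count):
        section_id, length = struct.unpack_from("<BH", body, offset)
        offset += 3  # size of the type and length fields
        sections[SECTION_NAMES[section_id]] = body[offset:offset + length]
        offset += length
    return sections


original = {"hardware_error": b"CPU core 0 init failure",
            "program_counter": struct.pack("<Q", 0xFFFF0000)}
capsule = encapsulate(original)
decoded = parse_capsule(capsule)
```

A type-length-value middle section is one simple way to allow "any combination" of the optional fields; the checksum in the trailer is an addition of this sketch, not something the source describes.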

In this embodiment, collecting the fault information, the software call stack relationship at the time of the fault, and the values of the current program counter and program status register provides more detailed fault locating information and further ensures the reliability of fault locating.

The technical solution of the present invention can be used in the product research and development stage, where accurate fault information speeds up fault locating during the development of a computer system or product, reduces research and development costs, and helps ensure product quality. It can also be used in the product operation and maintenance stage, where accurate fault information reduces the difficulty of operation and maintenance.

Fig. 6 is a schematic structural diagram of Embodiment 1 of the fault locating device of the present invention. As shown in Fig. 6, the device of this embodiment comprises: a monitoring driver module 61, an abnormality information collection module 62, and an information reporting module 63.

The monitoring driver module 61 is configured to monitor for a hardware abnormality trigger condition while a powered-on device executes a basic input/output system (BIOS) program. The abnormality information collection module 62 is configured to collect abnormality information when the monitoring driver module detects an abnormality trigger condition, the abnormality information including at least fault information of the CPU. The information reporting module 63 is configured to report the abnormality information to a monitoring server through a network.
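The division of work among the three modules might be wired together as in the sketch below. The class names echo the module names of Fig. 6, but their methods, the status dictionary, and the `send` callable that stands in for the network transport are all assumptions made for illustration.

```python
from typing import Callable, Dict, List


class MonitoringDriverModule:
    def detect(self, hardware_status: Dict[str, bool]) -> List[str]:
        # A component reporting False is treated as an abnormality trigger.
        return [name for name, healthy in hardware_status.items() if not healthy]


class AbnormalityCollectionModule:
    def collect(self, sources: List[str]) -> List[Dict[str, str]]:
        # Placeholder for the per-source fault information collection.
        return [{"source": source, "fault": "abnormal state"} for source in sources]


class InformationReportingModule:
    def __init__(self, send: Callable[[Dict[str, str]], None]) -> None:
        self._send = send  # stand-in for IPMI or Ethernet transport

    def report(self, records: List[Dict[str, str]]) -> None:
        for record in records:
            self._send(record)


# Usage: a list plays the role of the monitoring server's inbox.
received: List[Dict[str, str]] = []
driver = MonitoringDriverModule()
collector = AbnormalityCollectionModule()
reporter = InformationReportingModule(received.append)

status = {"CPU": False, "memory": True, "north_bridge": True, "south_bridge": True}
reporter.report(collector.collect(driver.detect(status)))
```

Injecting the transport as a callable keeps the reporting module independent of whether IPMI, Ethernet, or another communication means is used.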

The fault locating device of this embodiment can be used to carry out the technical solution of the method embodiment shown in Fig. 3; its implementation principle and technical effect are similar and are not repeated here.

In the above embodiment, the abnormality information collection module 62 may be specifically configured to: during execution of the BIOS program, if an abnormality trigger condition is detected, trigger the entry function corresponding to each abnormality information trigger source; and collect the abnormality information by means of the entry function.

On this basis, the abnormality information collection module 62 may be further configured to: if a CPU abnormality trigger condition is detected, trigger the entry function corresponding to the CPU; if a memory abnormality trigger condition is detected, trigger the entry function corresponding to the memory; if a north bridge abnormality trigger condition is detected, trigger the entry function corresponding to the north bridge; if a south bridge abnormality trigger condition is detected, trigger the entry function corresponding to the south bridge.

Further, the monitoring driver module 61 may be specifically configured to: while the powered-on device executes the BIOS program, monitor whether a system management interrupt (SMI), an abnormal event (Event), or an abnormal message is generated, and if so, determine that an abnormality trigger condition has been detected, wherein the abnormal event or abnormal message is an event or message, generated while executing the BIOS program, that triggers an abnormality.

Preferably, the abnormality information collection module 62 may be specifically configured to: when an abnormality trigger condition is detected, collect fault information, and collect the software call stack relationship at the time of the fault and/or the values of the program counter and the program status register, to indicate where the fault occurred.

On this basis, the information reporting module 63 may be specifically configured to report the abnormality information to the monitoring server by means of the Intelligent Platform Management Interface (IPMI) or standard Ethernet.

On this basis, the information reporting module 63 may be further configured to: encapsulate the abnormality information into a capsule, the capsule comprising header information, hardware error information, program run stack information, program counter and program status register information, and trailer information.

The fault locating device of this embodiment can be used to carry out the technical solution of any of the above method embodiments; its implementation principle and technical effect are similar and are not repeated here.

Fig. 7 is a schematic structural diagram of Embodiment 1 of the fault locating system of the present invention. As shown in Fig. 7, the system of this embodiment comprises: a mainboard 100, a fault locating device 110, and a monitoring server 200. The mainboard 100 may adopt the logic block diagram of the typical computer or server shown in Fig. 1. The fault locating device 110 may adopt the structure of the device embodiment shown in Fig. 6 and is integrated in the BIOS 60 on the mainboard 100; accordingly, it can carry out the technical solution of any of the above method embodiments, and its implementation principle and technical effect are similar and are not repeated here. An abnormality information parsing module 210 is integrated in the monitoring server 200 and is configured to parse the abnormality information reported by the information reporting module 113 of the fault locating device 110. The dotted line between the mainboard 100 and the monitoring server 200 denotes a wireless connection; the two communicate through Ethernet communication or LPC communication.

Those of ordinary skill in the art will understand that all or part of the steps of the above method embodiments may be implemented by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes any medium that can store program code, such as ROM, RAM, a magnetic disk, or an optical disc.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements of some or all of the technical features therein, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A fault locating method, characterized by comprising:

reporting the abnormality information to a monitoring server through a network.

2. The method according to claim 1, characterized in that collecting abnormality information when an abnormality trigger condition is detected comprises:

collecting the abnormality information by means of the entry function.

3. The method according to claim 2, characterized in that triggering the entry function corresponding to each abnormality information trigger source when an abnormality trigger condition is detected during execution of the BIOS program comprises:

4. The method according to claim 1, 2, or 3, characterized in that monitoring for a hardware abnormality trigger condition while the powered-on device executes the basic input/output system (BIOS) program comprises:

5. The method according to claim 1, characterized in that collecting abnormality information when an abnormality trigger condition is detected comprises:

6. The method according to claim 1, characterized in that reporting the abnormality information to a monitoring server through a network comprises:

7. The method according to any one of claims 1 to 6, characterized in that, before the abnormality information is reported to the monitoring server through the network, the method further comprises:

8. A fault locating device, characterized by comprising:

9. The device according to claim 8, characterized in that the abnormality information collection module is specifically configured to: during execution of the BIOS program, if an abnormality trigger condition is detected, trigger the entry function corresponding to each abnormality information trigger source; and collect the abnormality information by means of the entry function.

10. The device according to claim 9, characterized in that the abnormality information collection module is further configured to:

11. The device according to claim 8, 9, or 10, characterized in that the monitoring driver module is specifically configured to: while the powered-on device executes the BIOS program, monitor whether a system management interrupt (SMI), an abnormal event (Event), or an abnormal message is generated, and if so, determine that an abnormality trigger condition has been detected, wherein the abnormal event or abnormal message is an event or message, generated while executing the BIOS program, that triggers an abnormality.

12. The device according to claim 8, characterized in that the abnormality information collection module is specifically configured to: when an abnormality trigger condition is detected, collect fault information, and collect the software call stack relationship at the time of the fault and/or the values of the program counter and the program status register, to indicate where the fault occurred.