CN103500133A - Fault locating method and device - Google Patents


Publication number
CN103500133A
Authority
CN
China
Prior art keywords
monitoring
information
trigger
exception
abnormal
Legal status
Pending
Application number
CN201310425373.9A
Other languages
Chinese (zh)
Inventor
刘通良
姜广吉
陈俊杰
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Application filed by Huawei Technologies Co Ltd
Priority to CN201310425373.9A
Publication of CN103500133A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/22 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G06F11/2284 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing by power-on test, e.g. power-on self test [POST]

Abstract

An embodiment of the invention provides a fault locating method and device. The fault locating method comprises: when a device is powered on and executes the basic input/output system (BIOS) program, monitoring for a hardware exception trigger condition; when the exception trigger condition is detected, collecting exception information, the exception information comprising at least fault information of a central processing unit (CPU); and reporting the exception information to a monitoring server over a network. Fault information can thereby be viewed and accurately located remotely.

Description

Fault locating method and device
Technical field
Embodiments of the present invention relate to computer technology, and in particular to a fault locating method and device.
Background
The reliability of computer systems, and of server products in particular, has always been a topic of intense interest. If a server fails, the fault needs to be detected, located, and eliminated promptly. This requires that fault information can be conveniently presented to the user even when the server has no standard cathode ray tube (CRT) or liquid crystal display (LCD) display capability. In this situation, collecting and locating fault information in the power-on self-test (POST) stage becomes extremely important.
Fig. 1 shows the logic block diagram of a typical computer or server. As shown in Fig. 1, mainboard 100 comprises memory 10, CPU 20, northbridge 30, southbridge 40, peripheral 50, and basic input/output system (BIOS) 60. In addition, mainboard 100 is connected to peripheral expansion devices 70 (hard disk, graphics card, etc.).
In the prior art, device 80 is added to the structure of Fig. 1. As shown in Fig. 2, device 80 comprises storage 81, controller 82, and display module 83. The structure shown in Fig. 2 displays POST-stage self-test information through the following steps:
1) the computer device is powered on and executes the BIOS 60 program;
2) the POST-stage self-test information is sent to controller 82 in a certain data structure;
3) controller 82 decodes the data and sends the decoded data to display module 83, which displays the POST-stage self-test information in real time.
This prior art requires an additional controller 82 and display module 83, and displays the POST-stage self-test information, which includes fault information, on display module 83. Obtaining this self-test information requires the user to be physically present at the device.
Summary of the invention
Embodiments of the present invention provide a fault locating method and device, so that fault information can be viewed and located remotely.
In a first aspect, an embodiment of the present invention provides a fault locating method, comprising:
when a device is powered on and executes the basic input/output system (BIOS) program, monitoring for a hardware exception trigger condition;
when the exception trigger condition is detected, collecting exception information, the exception information comprising at least fault information of a central processing unit (CPU);
reporting the exception information to a monitoring server over a network.
In a first possible implementation of the first aspect, collecting exception information when the exception trigger condition is detected comprises:
during execution of the BIOS program, if the exception trigger condition is detected, triggering the entry function corresponding to each exception information trigger source;
collecting the exception information by using the entry function.
According to the first possible implementation of the first aspect, in a second possible implementation, triggering the entry function corresponding to each exception information trigger source if the exception trigger condition is detected during execution of the BIOS program comprises:
if a CPU exception trigger condition is detected, triggering the entry function corresponding to the CPU;
if a memory exception trigger condition is detected, triggering the entry function corresponding to the memory;
if a northbridge exception trigger condition is detected, triggering the entry function corresponding to the northbridge;
if a southbridge exception trigger condition is detected, triggering the entry function corresponding to the southbridge.
According to the first aspect or either of the first and second possible implementations of the first aspect, in a third possible implementation, monitoring for a hardware exception trigger condition when the device is powered on and executes the BIOS program comprises:
when the device is powered on and executes the BIOS program, monitoring whether a system management interrupt (SMI), an exception event (Event), or an exception message is generated, and if so, determining that the exception trigger condition has been detected, wherein the exception event or exception message is an event or message that is generated during execution of the BIOS program and that triggers an exception.
In a fourth possible implementation of the first aspect, collecting exception information when the exception trigger condition is detected comprises:
when the exception trigger condition is detected, collecting fault information, and collecting the software call stack relationship at the time of the fault and/or the values of the program counter and the program status register, so as to indicate the location of the fault information.
In a fifth possible implementation of the first aspect, reporting the exception information to the monitoring server over the network comprises:
reporting the exception information to the monitoring server by using the Intelligent Platform Management Interface (IPMI) or standard Ethernet.
According to the first aspect or any one of the first to fifth possible implementations of the first aspect, in a sixth possible implementation, before the exception information is reported to the monitoring server over the network, the method further comprises:
encapsulating the exception information into a capsule, the capsule comprising header information, hardware error information, program run stack information, program counter information, program status register information, and trailer information.
In a second aspect, an embodiment of the present invention provides a fault locating device, comprising:
a monitoring driver module, configured to monitor for a hardware exception trigger condition when a device is powered on and executes the basic input/output system (BIOS) program;
an exception information collection module, configured to collect exception information when the monitoring driver module detects the exception trigger condition, the exception information comprising at least fault information of a central processing unit (CPU);
an information reporting module, configured to report the exception information to a monitoring server over a network.
In a first possible implementation of the second aspect, the exception information collection module is specifically configured to: during execution of the BIOS program, if the exception trigger condition is detected, trigger the entry function corresponding to each exception information trigger source; and collect the exception information by using the entry function.
According to the first possible implementation of the second aspect, in a second possible implementation, the exception information collection module is further configured to:
if a CPU exception trigger condition is detected, trigger the entry function corresponding to the CPU;
if a memory exception trigger condition is detected, trigger the entry function corresponding to the memory;
if a northbridge exception trigger condition is detected, trigger the entry function corresponding to the northbridge;
if a southbridge exception trigger condition is detected, trigger the entry function corresponding to the southbridge.
According to the second aspect or either of the first and second possible implementations of the second aspect, in a third possible implementation, the monitoring driver module is specifically configured to: when the device is powered on and executes the BIOS program, monitor whether a system management interrupt (SMI), an exception event (Event), or an exception message is generated, and if so, determine that the exception trigger condition has been detected, wherein the exception event or exception message is an event or message that is generated during execution of the BIOS program and that triggers an exception.
In a fourth possible implementation of the second aspect, the exception information collection module is specifically configured to: when the exception trigger condition is detected, collect fault information, and collect the software call stack relationship at the time of the fault and/or the values of the program counter and the program status register, so as to indicate the location of the fault information.
With the fault locating method and device of the embodiments of the present invention, exception information is reported to a monitoring server so that operators can view it remotely. Operators can locate and troubleshoot faults on the monitoring server side according to the reported exception information, which reduces operation and maintenance costs.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a logic block diagram of a typical computer or server;
Fig. 2 is a schematic structural diagram for displaying POST-stage self-test information in the prior art;
Fig. 3 is a flowchart of Embodiment 1 of the fault locating method of the present invention;
Fig. 4 is a flowchart of Embodiment 2 of the fault locating method of the present invention;
Fig. 5 is an example of the encapsulation format in Embodiment 2 of the fault locating method of the present invention;
Fig. 6 is a schematic structural diagram of Embodiment 1 of the fault locating device of the present invention;
Fig. 7 is a schematic structural diagram of Embodiment 1 of the fault locating system of the present invention.
Detailed description of the embodiments
To make the purpose, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some, rather than all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
With the widespread use of servers, large numbers of servers are deployed in data centers or equipment rooms, and their running status is usually monitored remotely from outside the equipment room. POST-stage fault information therefore needs to be collected and monitored remotely.
Fig. 3 is a flowchart of Embodiment 1 of the fault locating method of the present invention. This embodiment provides a fault locating method that can be executed by a fault locating device; the device can be integrated in a computer or server and implemented in software and/or hardware. As shown in Fig. 3, the method of this embodiment comprises:
Step 301: when a device is powered on and executes the BIOS program, monitor for a hardware exception trigger condition.
Generally, in the POST stage the BIOS program is executed to initialize the hardware components of the device, for example the central processing unit (CPU), memory, northbridge, and southbridge in the block diagram of Fig. 1, and to query the working state of those hardware components. When a fault occurs in the POST stage, the first thing to check is whether the hardware environment is normal.
In this embodiment, the BIOS program is executed in the POST stage, and a callback function is registered in the BIOS program. If the hardware environment of the device becomes abnormal, for example the CPU fails to initialize, an exception trigger condition has been generated. The callback function is then called to trigger the collection of exception information, that is, step 302 is performed to collect the exception information caused by the abnormal hardware environment of the device. The exception information is generated by sub-functions triggered by the exception trigger condition, and the exception trigger condition serves as the input parameter of the callback function.
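The callback registration described above can be illustrated with a minimal Python sketch. Real BIOS code would be firmware written in C or assembly; the class name `BiosPostMonitor`, the function `collect_exception_info`, and the condition string are all hypothetical and are used only to show the control flow, in which the trigger condition is passed to the registered callback as its input parameter.

```python
from typing import Callable, Dict, List

# Hypothetical names throughout; the patent does not name these functions.
ExceptionCallback = Callable[[str], Dict[str, str]]


class BiosPostMonitor:
    """Toy model of a BIOS program that registers a callback for hardware exceptions."""

    def __init__(self) -> None:
        self._callbacks: List[ExceptionCallback] = []
        self.collected: List[Dict[str, str]] = []

    def register_callback(self, callback: ExceptionCallback) -> None:
        self._callbacks.append(callback)

    def raise_trigger_condition(self, condition: str) -> None:
        # The trigger condition is passed to each callback as its input parameter.
        for callback in self._callbacks:
            self.collected.append(callback(condition))


def collect_exception_info(condition: str) -> Dict[str, str]:
    # Stand-in for the sub-functions that gather the actual fault information.
    return {"trigger": condition, "source": condition.split("_")[0]}


monitor = BiosPostMonitor()
monitor.register_callback(collect_exception_info)
monitor.raise_trigger_condition("CPU_INIT_EXCEPTION")
```

In this sketch the callback is registered once, before any fault occurs, so collection starts as soon as a trigger condition is raised.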
Step 302: when the exception trigger condition is detected, collect exception information, the exception information comprising at least fault information of the CPU.
Specifically, the BIOS program comprises at least one sub-function for collecting exception information, and those sub-functions are called by the callback function described above. The fault information comprises fault information of hardware components such as the CPU, memory, northbridge, and southbridge. For example, the Core or Uncore fault information of the CPU is used to judge whether the current fault is caused by the CPU and to locate the cause of the fault from the fault symptoms. The fault information of the northbridge comprises fault information of the root port and of Peripheral Component Interconnect Express (PCIe) devices, and is used to check whether a northbridge or PCIe device fault exists; checking the relevant PCIe register information is especially important when an input/output (I/O) error occurs. The fault information of the southbridge is used to check whether a device attached to the southbridge is abnormal. The fault information of the memory comprises system management interrupt (SMI) and double data rate (DDR) synchronous dynamic random access memory fault information, for example an SMI channel error code, a DIMM ECC error, or a DIMM detection failure.
Step 303: report the exception information to a monitoring server over a network.
The monitoring server receives the exception information through Ethernet communication or LPC communication, parses and records it, and saves it to a local storage medium, which includes but is not limited to a hard disk and non-volatile random access memory (NVRAM), to facilitate fault management and long-term maintenance and to serve as important data for subsequent fault locating. At the same time, the monitoring server informs operators in a readable, visual form that a fault has occurred. Maintenance personnel can also query a fault database according to the exception information to obtain more detailed fault locating information.
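The server-side handling described above (record, save locally, present in readable form) can be sketched as follows. This is only an illustration under assumptions: the JSON-lines file format, the field names `device`, `source`, and `detail`, and the function names are hypothetical and are not specified by the patent.

```python
import json
import os
import tempfile
from typing import Dict, List


def record_exception_info(log_path: str, info: Dict[str, str]) -> None:
    """Append one parsed exception record to a local log file (one JSON object per line)."""
    with open(log_path, "a", encoding="utf-8") as log_file:
        log_file.write(json.dumps(info, sort_keys=True) + "\n")


def load_records(log_path: str) -> List[Dict[str, str]]:
    """Read all stored records back, for example for later fault analysis."""
    with open(log_path, "r", encoding="utf-8") as log_file:
        return [json.loads(line) for line in log_file if line.strip()]


def format_alert(info: Dict[str, str]) -> str:
    """Render a record in a human-readable form for the operator."""
    return f"FAULT from {info['device']}: {info['source']} - {info['detail']}"


# Hypothetical sample record; a temporary directory stands in for the local storage medium.
log_path = os.path.join(tempfile.mkdtemp(), "post_faults.jsonl")
sample = {"device": "server-01", "source": "CPU", "detail": "core initialization error"}
record_exception_info(log_path, sample)
records = load_records(log_path)
alert = format_alert(records[0])
```

An append-only log keeps earlier records intact, which suits the long-term maintenance purpose mentioned in the text.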
It should also be noted that operators may learn that a fault has occurred in any manner, for example through an alarm raised by the monitoring server or through real-time observation by the operators; this is not limited here.
In the prior art, a computer or server displays POST-stage self-test information on a display module, for example a display device such as an LCD or VFT, and such display modules are mounted on the front panel of the chassis, so operators must be physically present at the device. In this embodiment, by contrast, exception information is reported to the monitoring server, so operators can view it remotely through the monitoring server.
In this embodiment of the present invention, exception information is reported to the monitoring server so that operators can view it remotely. Operators can locate and troubleshoot faults on the monitoring server side according to the reported exception information, which reduces operation and maintenance costs.
On the basis of the above embodiment, collecting exception information when the exception trigger condition is detected can be further refined as:
1. during execution of the BIOS program, if the exception trigger condition is detected, triggering the entry function corresponding to each exception information trigger source;
2. collecting the exception information by using the entry function.
Specifically, a person skilled in the art can understand an exception information trigger source as a hardware component in an abnormal state, that is, the hardware component for which the exception trigger condition mentioned above was detected. The exception trigger condition is passed as the input parameter to the entry function of its corresponding exception information trigger source.
Specifically, during startup of the BIOS chip, if the exception trigger condition is detected, triggering the entry function corresponding to each exception information trigger source can comprise:
if a CPU exception trigger condition is detected, triggering the entry function corresponding to the CPU; if a memory exception trigger condition is detected, triggering the entry function corresponding to the memory; if a northbridge exception trigger condition is detected, triggering the entry function corresponding to the northbridge; if a southbridge exception trigger condition is detected, triggering the entry function corresponding to the southbridge; and so on. If an exception trigger condition of any other hardware component is detected, the entry function corresponding to that hardware component, that is, to that exception information trigger source, is triggered. These cases are not enumerated here.
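The per-source dispatch described above can be sketched as a lookup table from trigger source to entry function. The table `ENTRY_FUNCTIONS`, the factory `make_entry_function`, and the condition string are hypothetical names chosen for illustration; a real entry function would read the error registers of the component in question.

```python
from typing import Callable, Dict

ExceptionInfo = Dict[str, str]
EntryFunction = Callable[[str], ExceptionInfo]


def make_entry_function(source: str) -> EntryFunction:
    """Build a placeholder entry function for one exception information trigger source."""
    def entry(condition: str) -> ExceptionInfo:
        # A real entry function would read the component's error registers here.
        return {"source": source, "condition": condition}
    return entry


# One entry function per trigger source; further hardware components can be added.
ENTRY_FUNCTIONS: Dict[str, EntryFunction] = {
    source: make_entry_function(source)
    for source in ("CPU", "MEMORY", "NORTHBRIDGE", "SOUTHBRIDGE")
}


def trigger_entry_function(source: str, condition: str) -> ExceptionInfo:
    """Call the entry function of the trigger source, passing the condition as its input."""
    if source not in ENTRY_FUNCTIONS:
        raise KeyError(f"no entry function registered for {source}")
    return ENTRY_FUNCTIONS[source](condition)


info = trigger_entry_function("MEMORY", "DIMM_DETECT_FAILED")
```

A table keeps the "and so on" case from the text simple: supporting another hardware component only requires one more entry.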
On the basis of the above, monitoring for a hardware exception trigger condition when the device is powered on and executes the BIOS program can comprise: when the device is powered on and executes the BIOS program, monitoring whether an SMI, an exception event (Event), or an exception message (Message) is generated, and if so, determining that the exception trigger condition has been detected, wherein the exception event or exception message is an event or message that is generated during execution of the BIOS program and that triggers an exception. Accordingly, there are three modes of triggering exception information collection according to the detected exception trigger condition: SMI mode, exception Event mode, and exception message mode. How the entry function corresponding to each exception information trigger source is called in each mode is described below:
if the detected exception trigger condition is in SMI mode, an SMI is triggered, and the entry function corresponding to each exception information trigger source is called in the SMI handler;
if the detected exception trigger condition is in exception Event mode, an Event is sent, and the entry function corresponding to each exception information trigger source is called in the callback function of the Event;
if the detected exception trigger condition is in exception message mode, a message is sent, and the entry function corresponding to each exception information trigger source is called in the callback function of the message.
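The three trigger modes above differ only in where the entry functions are called from. A minimal sketch, assuming hypothetical handler names and a placeholder `call_entry_functions` helper, could look like this:

```python
from enum import Enum
from typing import Callable, Dict, List


class TriggerMode(Enum):
    SMI = "smi"
    EVENT = "event"
    MESSAGE = "message"


def call_entry_functions(sources: List[str]) -> List[str]:
    """Placeholder for calling the entry function of each trigger source."""
    return [f"{source} info" for source in sources]


def smi_handler(sources: List[str]) -> Dict[str, object]:
    # SMI mode: the entry functions are called inside the SMI handler.
    return {"called_in": "SMI handler", "collected": call_entry_functions(sources)}


def event_callback(sources: List[str]) -> Dict[str, object]:
    # Event mode: the entry functions are called in the Event's callback.
    return {"called_in": "Event callback", "collected": call_entry_functions(sources)}


def message_callback(sources: List[str]) -> Dict[str, object]:
    # Message mode: the entry functions are called in the message's callback.
    return {"called_in": "Message callback", "collected": call_entry_functions(sources)}


HANDLERS: Dict[TriggerMode, Callable[[List[str]], Dict[str, object]]] = {
    TriggerMode.SMI: smi_handler,
    TriggerMode.EVENT: event_callback,
    TriggerMode.MESSAGE: message_callback,
}


def on_trigger_condition(mode: TriggerMode, sources: List[str]) -> Dict[str, object]:
    """Route a detected trigger condition to the handler for its mode."""
    return HANDLERS[mode](sources)


result = on_trigger_condition(TriggerMode.EVENT, ["CPU", "NORTHBRIDGE"])
```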
Collecting exception information when the exception trigger condition is detected can comprise: when the exception trigger condition is detected, collecting fault information, and collecting the software call stack relationship at the time of the fault and/or the values of the program counter and the program status register, so as to indicate the location of the fault information. Generally, reporting the fault information alone is not sufficient for fault locating and troubleshooting, so more precise information is needed to assist in locating the fault. The present invention therefore borrows the idea of the kernel dump from the Linux operating system: when fault information is collected, the software call stack relationship at the time of the fault is also collected and saved as the data basis for precise locating. In addition, the values of the current program counter and program status register are collected and saved. This helps in analyzing the program's execution trail at the time of the exception and the correctness of the processor's register state, and preserves the CPU running state at the time of the exception.
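As a rough analogy only, the idea of saving the call stack together with the fault can be shown in Python using the standard `traceback` module. Firmware would instead walk the hardware stack and read the CPU's program counter and status registers; here the line number of the faulting call merely stands in for a program counter value, and all function names are hypothetical.

```python
import traceback
from typing import Dict


def snapshot_fault_context(fault: str) -> Dict[str, object]:
    """Capture the call chain that led to the fault, oldest caller first."""
    stack = traceback.extract_stack()[:-1]  # drop this helper's own frame
    return {
        "fault": fault,
        "call_stack": [frame.name for frame in stack],
        # Line number of the faulting call, a loose stand-in for the program counter.
        "fault_line": stack[-1].lineno,
    }


def init_memory() -> Dict[str, object]:
    # Hypothetical POST step that detects a fault and snapshots its context.
    return snapshot_fault_context("DIMM detection failed")


def post_stage() -> Dict[str, object]:
    return init_memory()


context = post_stage()
```

The last entries of `call_stack` show which function detected the fault and who called it, which is the kind of execution trail the text refers to.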
Further, reporting the exception information to the monitoring server over the network can comprise: reporting the exception information to the monitoring server by using the Intelligent Platform Management Interface (IPMI) or standard Ethernet. The exception information can also be reported through other communication modes.
Fig. 4 is a flowchart of Embodiment 2 of the fault locating method of the present invention. As shown in Fig. 4, on the basis of the above embodiment, the fault locating method of this embodiment can comprise the following steps:
Step 401: when a device is powered on and executes the BIOS program, monitor for a hardware exception trigger condition.
For this step, refer to step 301 of the embodiment shown in Fig. 3; details are not repeated here.
Step 402: when the exception trigger condition is detected, collect exception information, the exception information comprising at least fault information of the CPU.
For this step, refer to step 302 of the embodiment shown in Fig. 3; details are not repeated here.
Step 403: encapsulate the exception information into a capsule.
Fig. 5 is an example of the encapsulation format in Embodiment 2 of the fault locating method of the present invention. As shown in Fig. 5, the capsule can comprise header information, hardware error information, program run stack information, program counter information, program status register information, and trailer information. The header information and trailer information are mandatory, while the middle part of the capsule can comprise any combination of hardware error information, program run stack information, program counter information, and program status register information. Here, the hardware includes the CPU, memory, northbridge, southbridge, and so on. The program run stack information comprises the stack information of the currently executing function and of the functions called within it, where the number of internally called functions is not limited; when no hardware fault has occurred, the fault must be located by examining the program's execution trail. The values of the relevant registers and counters during program execution, for example the program counter and program status register, are collected to analyze whether the environment parameters of the running program are abnormal, for example whether the value of the program pointer register is illegal or whether the stack has overflowed.
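The capsule structure (mandatory header and trailer, with any combination of sections in between) can be sketched with Python's `struct` module. The patent does not define a byte layout, so everything below is an assumption made for illustration: the magic values `FLCP` and `PCLF`, the type codes, the type-length-value section encoding, and the CRC32 checksum in the trailer.

```python
import struct
import zlib
from typing import Dict

# Hypothetical layout, not taken from the patent:
#   header : 4-byte magic, 1-byte section count, 4-byte body length
#   body   : repeated (1-byte type, 2-byte length, payload) sections
#   trailer: 4-byte CRC32 of header + body, 4-byte end magic
HEADER_MAGIC = b"FLCP"
TRAILER_MAGIC = b"PCLF"
SECTION_TYPES = {
    "hardware_error": 1,
    "stack": 2,
    "program_counter": 3,
    "program_status": 4,
}


def build_capsule(sections: Dict[str, bytes]) -> bytes:
    """Wrap any combination of sections between a mandatory header and trailer."""
    body = b""
    for name, payload in sections.items():
        body += struct.pack("<BH", SECTION_TYPES[name], len(payload)) + payload
    header = HEADER_MAGIC + struct.pack("<BI", len(sections), len(body))
    trailer = struct.pack("<I", zlib.crc32(header + body)) + TRAILER_MAGIC
    return header + body + trailer


capsule = build_capsule({
    "hardware_error": b"CPU core MCE",
    "program_counter": struct.pack("<Q", 0xFFFF0000),
})
```

Because each section carries its own type and length, a capsule with only some of the optional sections, or with none, is still well formed.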
Step 404: report the exception information to the monitoring server over the network.
Specifically, after the exception information is encapsulated into a capsule or another information format, it is reported to the monitoring server through IPMI, standard Ethernet, or another communication mode. The monitoring server parses the received exception information according to the corresponding encapsulation format and obtains the hardware, stack, program counter, and other information separately.
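Parsing on the monitoring server side can be sketched in the same spirit. The byte layout below is an assumption, not the patent's format: a 4-byte magic `FLCP`, a 1-byte section count, a 4-byte body length, type-length-value sections, and a trailer holding a CRC32 followed by the magic `PCLF`. The sample capsule is built inline so that the example is self-contained.

```python
import struct
import zlib
from typing import Dict

# Hypothetical capsule layout assumed for this sketch.
HEADER_MAGIC = b"FLCP"
TRAILER_MAGIC = b"PCLF"
HEADER_SIZE = 9  # 4-byte magic + 1-byte section count + 4-byte body length
SECTION_NAMES = {1: "hardware_error", 2: "stack", 3: "program_counter", 4: "program_status"}


def parse_capsule(capsule: bytes) -> Dict[str, bytes]:
    """Split a received capsule into its named sections after validating it."""
    if capsule[:4] != HEADER_MAGIC or capsule[-4:] != TRAILER_MAGIC:
        raise ValueError("missing header or trailer")
    count, body_len = struct.unpack_from("<BI", capsule, 4)
    body_end = HEADER_SIZE + body_len
    (crc,) = struct.unpack_from("<I", capsule, body_end)
    if crc != zlib.crc32(capsule[:body_end]):
        raise ValueError("checksum mismatch")
    sections: Dict[str, bytes] = {}
    offset = HEADER_SIZE
    for _ in range(count):
        kind, length = struct.unpack_from("<BH", capsule, offset)
        offset += 3
        sections[SECTION_NAMES[kind]] = capsule[offset:offset + length]
        offset += length
    return sections


# Build a one-section sample capsule by hand and parse it.
payload = b"DIMM ECC error"
body = struct.pack("<BH", 1, len(payload)) + payload
header = HEADER_MAGIC + struct.pack("<BI", 1, len(body))
sample = header + body + struct.pack("<I", zlib.crc32(header + body)) + TRAILER_MAGIC
parsed = parse_capsule(sample)
```

Validating the header, trailer, and checksum before reading the sections lets the server reject a truncated or corrupted report instead of recording misleading fault data.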
In this embodiment, collecting the fault information, the software call stack relationship at the time of the fault, and the values of the current program counter and program status register provides more detailed fault locating information and further ensures the reliability of fault locating.
The technical solution of the present invention can be used in the product research and development stage, where accurate fault information speeds up fault locating during development of computer systems and products, reduces R&D costs, and ensures product quality. It can also be used in the product operation and maintenance stage, where accurate fault information reduces the difficulty of operation and maintenance.
Fig. 6 is a schematic structural diagram of Embodiment 1 of the fault locating device of the present invention. As shown in Fig. 6, the device of this embodiment comprises: monitoring driver module 61, exception information collection module 62, and information reporting module 63.
Monitoring driver module 61 is configured to monitor for a hardware exception trigger condition when a device is powered on and executes the basic input/output system (BIOS) program. Exception information collection module 62 is configured to collect exception information when the monitoring driver module detects the exception trigger condition, the exception information comprising at least fault information of the CPU. Information reporting module 63 is configured to report the exception information to a monitoring server over a network.
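The division of labor among the three modules can be sketched as three small classes wired together. This is a toy model under assumptions: the class and method names are hypothetical, the event strings are invented, and a Python list stands in for the network link to the monitoring server.

```python
from typing import Dict, List, Optional


class MonitoringDriverModule:
    """Stands in for module 61: watches POST events for an exception trigger condition."""

    TRIGGERS = ("SMI", "EVENT", "MESSAGE")

    def detect(self, post_event: str) -> Optional[str]:
        return post_event if post_event in self.TRIGGERS else None


class ExceptionInfoCollectionModule:
    """Stands in for module 62: collects exception information, including CPU fault info."""

    def collect(self, condition: str) -> Dict[str, str]:
        return {"condition": condition, "cpu_fault": "placeholder CPU fault info"}


class InfoReportingModule:
    """Stands in for module 63: reports exception information to the monitoring server."""

    def __init__(self) -> None:
        self.sent: List[Dict[str, str]] = []  # stands in for the network link

    def report(self, info: Dict[str, str]) -> None:
        self.sent.append(info)


class FaultLocatingDevice:
    def __init__(self) -> None:
        self.monitor = MonitoringDriverModule()
        self.collector = ExceptionInfoCollectionModule()
        self.reporter = InfoReportingModule()

    def handle_post_event(self, post_event: str) -> bool:
        """Return True if the event was a trigger condition and was reported."""
        condition = self.monitor.detect(post_event)
        if condition is None:
            return False
        self.reporter.report(self.collector.collect(condition))
        return True


device = FaultLocatingDevice()
handled = [device.handle_post_event(event) for event in ("MEMORY_INIT_OK", "SMI")]
```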
The fault locator of the present embodiment, can be for the technical scheme of embodiment of the method shown in execution graph 1, its realize principle and technique effect similar, repeat no more herein.
In the above-described embodiments, abnormal information acquisition module 62 can, specifically in carrying out described bios program process, if monitor the exception-triggered condition, trigger the entrance function that each abnormal information trigger source is corresponding; And, utilize described entrance function acquisition abnormity information.
On the basis of the above, abnormal information acquisition module 62 can also for: if, while monitoring CPU exception-triggered condition, trigger the entrance function that CPU is corresponding; If while monitoring the memory abnormal trigger condition, trigger the entrance function that internal memory is corresponding; If, while monitoring north bridge exception-triggered condition, trigger the entrance function that north bridge is corresponding; If, while monitoring south bridge exception-triggered condition, trigger the entrance function that south bridge is corresponding.
Further, monitoring driving module 61 can be specifically for when electrifying startup equipment be carried out bios program, whether monitor generation system management interrupt SMI, anomalous event Event or unexpected message, if, determine the exception-triggered condition that monitors, wherein, described anomalous event or unexpected message trigger abnormal event or message for carrying out the meeting generated in described bios program.
Preferably, abnormal information acquisition module 62 can be specifically for when monitoring the exception-triggered condition, gather failure message, and the software transfer storehouse relation when gathering fault and occurring and/or the numerical value of programmable counter and program status register, to indicate the position of described failure message.
On the basis of the above, information reporting module 63 can report described abnormal information to monitoring server specifically for adopting IPMI IPMI or standard ethernet mode.
On the basis of the above, information reporting module 63 can also for: encapsulate described abnormal information and become capsule, described capsule comprises header information, hardware error message, program operation stack information, programmable counter, program status register information and trailer information.
The fault locator of the present embodiment, can be for carrying out the technical scheme of above-mentioned either method embodiment, its realize principle and technique effect similar, repeat no more herein.
The structural representation that Fig. 7 is fault location system embodiment mono-of the present invention, as shown in Figure 7, the system of the present embodiment comprises: mainboard 100, fault locator 110 and monitoring server 200.Wherein, mainboard 100 can adopt the logic diagram of the normatron shown in Fig. 1 or server; Fault locator 110 can adopt the structure of Fig. 6 shown device embodiment, is integrated in the BIOS60 in mainboard 100, and it can carry out the technical scheme of above-mentioned either method embodiment accordingly, its realize principle and technique effect similar, repeat no more herein; Integrated abnormal information parsing module 210 in monitoring server 200, the abnormal information that this abnormal information parsing module 210 reports for resolve fault locating device 110 information reporting modules 113; 200 dotted lines of mainboard 100 and monitoring server mean wireless connections, and the two is by ethernet communication or LPC communication.
A person of ordinary skill in the art will understand that all or some of the steps of the above method embodiments may be implemented by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs the steps of the above method embodiments. The storage medium includes any medium that can store program code, such as ROM, RAM, a magnetic disk, or an optical disc.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (12)

1. A fault locating method, characterized by comprising:
monitoring for a hardware exception trigger condition while a device, on power-on startup, executes a basic input/output system (BIOS) program;
collecting abnormal information when an exception trigger condition is detected, the abnormal information comprising at least fault information of a central processing unit (CPU); and
reporting the abnormal information to a monitoring server over a network.
2. The method according to claim 1, characterized in that collecting abnormal information when an exception trigger condition is detected comprises:
during execution of the BIOS program, if an exception trigger condition is detected, triggering the entry function corresponding to each abnormal information trigger source; and
collecting the abnormal information by means of the entry function.
3. The method according to claim 2, characterized in that triggering, during execution of the BIOS program, the entry function corresponding to each abnormal information trigger source if an exception trigger condition is detected comprises:
if a CPU exception trigger condition is detected, triggering the entry function corresponding to the CPU;
if a memory exception trigger condition is detected, triggering the entry function corresponding to the memory;
if a north bridge exception trigger condition is detected, triggering the entry function corresponding to the north bridge;
if a south bridge exception trigger condition is detected, triggering the entry function corresponding to the south bridge.
4. The method according to claim 1, 2, or 3, characterized in that monitoring for a hardware exception trigger condition while the device, on power-on startup, executes the BIOS program comprises:
while the device, on power-on startup, executes the BIOS program, monitoring whether a system management interrupt (SMI), an abnormal event (Event), or an abnormal message occurs, and if so, determining that an exception trigger condition has been detected, wherein the abnormal event or abnormal message is an event or message that is generated during execution of the BIOS program and that triggers an exception.
5. The method according to claim 1, characterized in that collecting abnormal information when an exception trigger condition is detected comprises:
when an exception trigger condition is detected, collecting the fault information together with the software call stack relationship at the time the fault occurred and/or the values of the program counter and the program status register, so as to indicate the location of the fault.
6. The method according to claim 1, characterized in that reporting the abnormal information to the monitoring server over the network comprises:
reporting the abnormal information to the monitoring server by means of the Intelligent Platform Management Interface (IPMI) or standard Ethernet.
7. The method according to any one of claims 1 to 6, characterized in that, before reporting the abnormal information to the monitoring server over the network, the method further comprises:
encapsulating the abnormal information into a capsule, the capsule comprising header information, a hardware error message, program running stack information, program counter information, program status register information, and trailer information.
8. A fault locating device, characterized by comprising:
a monitoring driving module, configured to monitor for a hardware exception trigger condition while a device, on power-on startup, executes a basic input/output system (BIOS) program;
an abnormal information acquisition module, configured to collect abnormal information when the monitoring driving module detects an exception trigger condition, the abnormal information comprising at least fault information of a central processing unit (CPU); and
an information reporting module, configured to report the abnormal information to a monitoring server over a network.
9. The device according to claim 8, characterized in that the abnormal information acquisition module is specifically configured to, during execution of the BIOS program, trigger the entry function corresponding to each abnormal information trigger source if an exception trigger condition is detected, and to collect the abnormal information by means of the entry function.
10. The device according to claim 9, characterized in that the abnormal information acquisition module is further configured to:
if a CPU exception trigger condition is detected, trigger the entry function corresponding to the CPU;
if a memory exception trigger condition is detected, trigger the entry function corresponding to the memory;
if a north bridge exception trigger condition is detected, trigger the entry function corresponding to the north bridge;
if a south bridge exception trigger condition is detected, trigger the entry function corresponding to the south bridge.
11. The device according to claim 8, 9, or 10, characterized in that the monitoring driving module is specifically configured to, while the device, on power-on startup, executes the BIOS program, monitor whether a system management interrupt (SMI), an abnormal event (Event), or an abnormal message occurs, and if so, determine that an exception trigger condition has been detected, wherein the abnormal event or abnormal message is an event or message that is generated during execution of the BIOS program and that triggers an exception.
12. The device according to claim 8, characterized in that the abnormal information acquisition module is specifically configured to, when an exception trigger condition is detected, collect the fault information together with the software call stack relationship at the time the fault occurred and/or the values of the program counter and the program status register, so as to indicate the location of the fault.
CN201310425373.9A 2013-09-17 2013-09-17 Fault locating method and device Pending CN103500133A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310425373.9A CN103500133A (en) 2013-09-17 2013-09-17 Fault locating method and device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201310425373.9A CN103500133A (en) 2013-09-17 2013-09-17 Fault locating method and device
PCT/CN2014/086684 WO2015039598A1 (en) 2013-09-17 2014-09-17 Fault locating method and device

Publications (1)

Publication Number Publication Date
CN103500133A true CN103500133A (en) 2014-01-08

Family

ID=49865348

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310425373.9A Pending CN103500133A (en) 2013-09-17 2013-09-17 Fault locating method and device

Country Status (2)

Country Link
CN (1) CN103500133A (en)
WO (1) WO2015039598A1 (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0606771A2 (en) * 1993-01-07 1994-07-20 International Business Machines Corporation Method and apparatus for providing enhanced data verification in a computer system
CN1506821A (en) * 2002-12-11 2004-06-23 联想(北京)有限公司 Detection and display method and device for computer self-test information
CN1731353A (en) * 2004-08-04 2006-02-08 英业达股份有限公司 Method for real-time presentation of solution for error condition of computer device
CN1869947A (en) * 2005-05-24 2006-11-29 乐金电子(昆山)电脑有限公司 Auto-diagnostic system of personal computer
CN101192181A (en) * 2006-11-22 2008-06-04 英业达股份有限公司 Power-on self-detection method
CN102402473A (en) * 2011-10-28 2012-04-04 武汉供电公司变电检修中心 Computer hardware and software fault diagnosis and repair system
CN102609350A (en) * 2012-02-15 2012-07-25 浪潮电子信息产业股份有限公司 Server memory failure alarm method
CN102708015A (en) * 2012-05-15 2012-10-03 江苏中科梦兰电子科技有限公司 Debugging method based on diagnosis of CPU (central processing unit) non-maskable interrupt system problems

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100371903C (en) * 2004-09-09 2008-02-27 英业达股份有限公司 Alarming system and method for intelligent platform event
JP5509568B2 (en) * 2008-10-03 2014-06-04 富士通株式会社 Computer apparatus, processor diagnosis method, and processor diagnosis control program
CN103500133A (en) * 2013-09-17 2014-01-08 华为技术有限公司 Fault locating method and device

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015039598A1 (en) * 2013-09-17 2015-03-26 华为技术有限公司 Fault locating method and device
WO2015188619A1 (en) * 2014-06-09 2015-12-17 中兴通讯股份有限公司 Physical host fault detection method and apparatus, and virtual machine management method and system
CN104991801A (en) * 2015-07-06 2015-10-21 青岛海信宽带多媒体技术有限公司 Bootloader debugging information acquisition method, device and system
CN105183575A (en) * 2015-08-24 2015-12-23 浪潮(北京)电子信息产业有限公司 Processor fault diagnosis method, device and system
WO2017059721A1 (en) * 2015-10-09 2017-04-13 中兴通讯股份有限公司 Information storage method, device and server
CN105808398A (en) * 2016-03-08 2016-07-27 浪潮电子信息产业股份有限公司 Method for rapidly analyzing and positioning hardware exceptions
TWI582586B (en) * 2016-06-01 2017-05-11 神雲科技股份有限公司 Method For Outputting Information Related To Machine Check Exception of Computer System
CN106227672A (en) * 2016-08-10 2016-12-14 中车株洲电力机车研究所有限公司 A kind of built-in application program fault catches and processing method
CN106227672B (en) * 2016-08-10 2019-07-09 中车株洲电力机车研究所有限公司 A kind of built-in application program failure captures and processing method
CN106789306A (en) * 2016-12-30 2017-05-31 深圳市风云实业有限公司 Restoration methods and system are collected in communication equipment software fault detect
CN108628726A (en) * 2017-03-22 2018-10-09 比亚迪股份有限公司 CPU state information recording method and device
CN108628726B (en) * 2017-03-22 2021-02-23 比亚迪股份有限公司 CPU state information recording method and device
CN107168815A (en) * 2017-05-19 2017-09-15 郑州云海信息技术有限公司 A kind of method for collecting hardware error message
CN108376107A (en) * 2018-03-01 2018-08-07 郑州云海信息技术有限公司 A kind of method, apparatus, equipment and the storage medium of server failure detection
CN108287775A (en) * 2018-03-01 2018-07-17 郑州云海信息技术有限公司 A kind of method, apparatus, equipment and the storage medium of server failure detection
CN110097683A (en) * 2018-07-20 2019-08-06 深圳怡化电脑股份有限公司 A kind of equipment self-inspection method, apparatus, ATM and storage medium
CN109086155A (en) * 2018-07-27 2018-12-25 郑州云海信息技术有限公司 Server failure localization method, device, equipment and computer readable storage medium
CN109522057A (en) * 2018-11-27 2019-03-26 无锡睿勤科技有限公司 A kind of equipment starting method and equipment
CN110008056A (en) * 2019-03-28 2019-07-12 联想(北京)有限公司 EMS memory management process, device, electronic equipment and computer readable storage medium
CN110289981A (en) * 2019-05-14 2019-09-27 中山大学 A kind of high-performance calculation Internet monitoring method and system

Also Published As

Publication number Publication date
WO2015039598A1 (en) 2015-03-26

Similar Documents

Publication Publication Date Title
CN103500133A (en) Fault locating method and device
TWI229796B (en) Method and system to implement a system event log for system manageability
CN105938450B (en) The method and system that automatic debugging information is collected
CN104850485A (en) BMC based method and system for remote diagnosis of server startup failure
CN104639380A (en) Server monitoring method
CN104320308B (en) A kind of method and device of server exception detection
US20140019814A1 (en) Error framework for a microprocesor and system
US10592376B2 (en) Real-time hierarchical protocol decoding
CN102722431B (en) process monitoring method and device
JP2018116679A (en) Bus hang detection
TW201415213A (en) Self-test system and method thereof
JP2017091077A (en) Pseudo-fault generation program, generation method, and generator
CN106021066A (en) Fault information detection method and electronic device
CN109542752A (en) A kind of system and method for server PCIe device failure logging
CN102681928B (en) Abnormal information output system of computer system
CN107247505B (en) Cloud server power supply blackbox design method easy to view
CN101639816B (en) Real-time tracking system of bus and corresponding tracking and debugging method
JP5689783B2 (en) Computer, computer system, and failure information management method
US20120054376A1 (en) Real-time usb class level decoding
US8726102B2 (en) System and method for handling system failure
US8516311B2 (en) System and method for testing peripheral component interconnect express switch
CN104484260A (en) Simulation monitoring circuit based on GJB289 bus interface SoC (system on a chip)
CN108287780A (en) A kind of device and method of monitoring server CPLD states
CN107291596A (en) A kind of computer glitch maintenance system based on internet
CN103914362A (en) Serial port self-detection method, circuit and device

Legal Events

Date Code Title Description
PB01 Publication
C06 Publication
SE01 Entry into force of request for substantive examination
C10 Entry into substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20140108
