WO2015196365A1 - Fault processing method, related apparatus, and computer - Google Patents


Info

Publication number
WO2015196365A1
Authority
WO
WIPO (PCT)
Prior art keywords
error data
fault
computer
processor
management controller
Prior art date
Application number
PCT/CN2014/080618
Other languages
English (en)
French (fr)
Inventor
宋刚
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority to CN201710454179.1A priority Critical patent/CN107357671A/zh
Priority to EP17199084.9A priority patent/EP3355197B1/en
Priority to PCT/CN2014/080618 priority patent/WO2015196365A1/zh
Priority to BR112016022329A priority patent/BR112016022329B1/pt
Priority to SG11201607545PA priority patent/SG11201607545PA/en
Priority to DK14896215.2T priority patent/DK3121726T3/en
Priority to EP14896215.2A priority patent/EP3121726B1/en
Application filed by 华为技术有限公司
Priority to JP2016562222A priority patent/JP6333410B2/ja
Priority to KR1020167027222A priority patent/KR101944874B1/ko
Priority to AU2014399227A priority patent/AU2014399227B2/en
Priority to CA2942045A priority patent/CA2942045C/en
Priority to ES14896215.2T priority patent/ES2667322T3/es
Priority to NO14896215A priority patent/NO3121726T3/no
Priority to CN201480056020.9A priority patent/CN105659215B/zh
Publication of WO2015196365A1 publication Critical patent/WO2015196365A1/zh
Priority to ZA2016/06180A priority patent/ZA201606180B/en
Priority to US15/385,701 priority patent/US10353763B2/en
Priority to US16/509,218 priority patent/US20190332453A1/en
Priority to US17/187,111 priority patent/US11360842B2/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079Root cause analysis, i.e. error or fault diagnosis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0751Error or fault detection not based on redundancy
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0751Error or fault detection not based on redundancy
    • G06F11/0754Error or fault detection not based on redundancy by exceeding limits
    • G06F11/0757Error or fault detection not based on redundancy by exceeding limits by exceeding a time limit, i.e. time-out, e.g. watchdogs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0766Error or fault reporting or storing
    • G06F11/0772Means for error signaling, e.g. using interrupts, exception flags, dedicated error registers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0766Error or fault reporting or storing
    • G06F11/0778Dumping, i.e. gathering error/state information after a fault for later diagnosis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793Remedial or corrective actions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/34Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/34Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3466Performance evaluation by tracing or monitoring
    • G06F11/3476Data logging

Definitions

  • Embodiments of the present invention relate to computer technologies, and in particular to a fault processing method, a related apparatus, and a computer.
  • Background
  • Computer faults typically include software faults, hardware faults, operational (configuration) faults, and other faults. Hardware faults, such as faults in memory, processors, or input/output (I/O) devices, are the most difficult to handle: they are hard to reproduce, diagnosis relies mainly on manual experience, and locating the problem after an error often requires repeatedly reseating or replacing components.
  • Computer faults are mainly handled as follows: when an uncorrectable error occurs in the system, the processor records the error data and notifies the operating system (OS); after receiving the notification, the OS fetches the error data recorded by the processor and prints it so that the user can analyze, locate, and recover from the fault.
  • Embodiments of the present invention provide a fault processing method, a related apparatus, and a computer, which can obtain the error data in the computer after a serious uncorrectable error occurs and causes the computer to crash.
  • An embodiment of the present invention provides a computer, including a processor and a baseboard management controller.
  • The baseboard management controller is configured to send a read request message to the processor when determining that the computer has crashed, where the read request message requests reading of first error data recorded by the processor;
  • the processor is configured to receive the read request message and send a read response message to the baseboard management controller;
  • the baseboard management controller is configured to receive the read response message returned by the processor and obtain, according to the read response message, the first error data recorded by the processor.
  • The processor is further configured to acquire the first error data and record the first error data;
  • that the baseboard management controller is configured to determine that the computer has crashed is specifically: the baseboard management controller is configured to receive a severe fault event indication sent by the processor, where the severe fault event indication is sent by the processor when it acquires the first error data and the first error data is of a serious uncorrectable error type; and if, within a preset waiting time starting from receipt of the severe fault event indication, at least part of the first error data sent by the processor has not been received, the baseboard management controller determines that the computer has crashed.
  • That the baseboard management controller is configured to obtain, according to the read response message, the first error data recorded by the processor is specifically: when the read response message carries the first error data, the baseboard management controller is configured to obtain the first error data recorded by the processor from the read response message.
  • That the baseboard management controller is configured to obtain, according to the read response message, the first error data recorded by the processor is specifically: when the read response message carries a read failure indication, the baseboard management controller is configured to instruct a hot restart module of the computer or a user to perform a hot restart of the computer, where the read failure indication indicates that reading the first error data from the processor failed, so that when the computer is hot restarted the processor executes a fault collection instruction of the basic input/output system of the computer, acquires the first error data according to that instruction, and sends the first error data to the baseboard management controller; and the baseboard management controller is configured to receive the first error data sent by the processor.
  • The baseboard management controller is further configured to parse the first error data according to a fault parsing mechanism, to obtain fault parsing information of the first error data.
  • The baseboard management controller is further configured to analyze the fault parsing information of the first error data according to a preset fault processing mechanism, to obtain a fault handling suggestion.
  • The baseboard management controller is further configured to, before determining that the computer has crashed, receive second error data sent by the processor and parse the second error data according to the fault parsing mechanism to obtain fault parsing information of the second error data, where the second error data is error data generated within a preset time before the computer generates the first error data;
  • that the baseboard management controller is configured to analyze the fault parsing information of the first error data according to the preset fault processing mechanism to obtain the fault handling suggestion includes: the baseboard management controller is configured to analyze, according to the preset fault processing mechanism, the fault parsing information of the second error data and the fault parsing information of the first error data, to obtain the fault handling suggestion.
  • An embodiment of the present invention provides a fault processing method for a computer that includes a baseboard management controller and a processor, the method including:
  • when determining that the computer has crashed, the baseboard management controller sends a read request message to the processor, where the read request message requests reading of first error data recorded by the processor; and the baseboard management controller receives a read response message returned by the processor and obtains, according to the read response message, the first error data recorded by the processor.
  • The baseboard management controller receives a severe fault event indication sent by the processor, where the severe fault event indication is sent by the processor when it acquires the first error data and the first error data is of a serious uncorrectable error type; and if at least part of the first error data sent by the processor is not received within a preset waiting time starting from receipt of the severe fault event indication, the baseboard management controller determines that the computer has crashed.
  • That the baseboard management controller receives the read response message returned by the processor and obtains, according to the read response message, the first error data recorded by the processor includes: when the read response message carries the first error data, the baseboard management controller obtains the first error data recorded by the processor from the read response message.
  • That the baseboard management controller receives the read response message returned by the processor and obtains, according to the read response message, the first error data recorded by the processor includes: when the read response message carries a read failure indication, the baseboard management controller instructs a hot restart module of the computer or a user to perform a hot restart of the computer, so that when the computer is hot restarted the processor executes a fault collection instruction of the basic input/output system of the computer and acquires the first error data according to that instruction.
  • After the baseboard management controller obtains, according to the read response message, the first error data recorded by the processor, the method further includes: the baseboard management controller parses the first error data according to a fault parsing mechanism, to obtain fault parsing information of the first error data.
  • The method further includes: the baseboard management controller analyzes the fault parsing information of the first error data according to a preset fault processing mechanism, to obtain a fault handling suggestion.
  • Before the baseboard management controller determines that the computer has crashed, the method further includes: the baseboard management controller receives second error data sent by the processor, where the second error data is error data generated within a preset time before the computer generates the first error data;
  • that the baseboard management controller analyzes the fault parsing information of the first error data according to the preset fault processing mechanism to obtain the fault handling suggestion is specifically:
  • the baseboard management controller parses the second error data according to the fault parsing mechanism to obtain fault parsing information of the second error data, and analyzes, according to the preset fault processing mechanism, the fault parsing information of the second error data and the fault parsing information of the first error data, to obtain the fault handling suggestion.
  • An embodiment of the present invention provides a baseboard management controller, including: a sending unit, configured to send a read request message to a processor when it is determined that a computer has crashed, where the read request message requests reading of first error data recorded by the processor;
  • a receiving unit, configured to receive a read response message returned by the processor and obtain, according to the read response message, the first error data recorded by the processor.
  • The baseboard management controller further includes: a determining unit, configured to receive a severe fault event indication sent by the processor, where the severe fault event indication is sent by the processor when it acquires the first error data and the first error data is of a serious uncorrectable error type; and if at least part of the first error data sent by the processor is not received within a preset waiting time starting from receipt of the severe fault event indication, determine that the computer has crashed.
  • That the receiving unit receives the read response message returned by the processor and obtains, according to the read response message, the first error data recorded by the processor includes: when the read response message carries the first error data, the receiving unit obtains the first error data recorded by the processor from the read response message.
  • That the receiving unit receives the read response message returned by the processor and obtains, according to the read response message, the first error data recorded by the processor includes:
  • when the read response message carries a read failure indication, the receiving unit instructs a hot restart unit of the computer or a user to perform a hot restart of the computer, so that when the computer is hot restarted the processor executes a fault collection instruction of the basic input/output system of the computer, acquires the first error data according to that instruction, and sends the first error data to the receiving unit, where the read failure indication indicates that reading the first error data from the processor failed; and the receiving unit receives the first error data sent by the processor.
  • The baseboard management controller further includes: a fault processing unit, configured to parse the first error data according to a fault parsing mechanism, to obtain fault parsing information of the first error data.
  • The fault processing unit is further configured to analyze the fault parsing information of the first error data according to a preset fault processing mechanism, to obtain a fault handling suggestion.
  • The receiving unit is further configured to receive second error data sent by the processor, and the fault processing unit is further configured to parse the second error data according to the fault parsing mechanism to obtain fault parsing information of the second error data, where the second error data is error data generated within a preset time before the computer generates the first error data;
  • that the fault processing unit is configured to analyze the fault parsing information of the first error data according to the preset fault processing mechanism to obtain the fault handling suggestion includes: the fault processing unit is configured to analyze, according to the preset fault processing mechanism, the fault parsing information of the second error data and the fault parsing information of the first error data, to obtain the fault handling suggestion.
  • An embodiment of the present invention provides a baseboard management controller, where the baseboard management controller includes a processor, a memory, a bus, and a communication interface;
  • the memory is configured to store computer-executable instructions;
  • the processor is connected to the memory through the bus, and when the baseboard management controller is running, the processor executes the computer-executable instructions stored in the memory.
  • An embodiment of the present invention provides a computer-readable medium including computer-executable instructions; when a processor of a computer executes the computer-executable instructions, the computer performs the fault processing method described in the second aspect or in any possible implementation of the second aspect.
  • The baseboard management controller in the computer may send a read request message to the processor in the computer when determining that the computer has crashed, where the read request message requests reading of the first error data recorded by the processor.
  • FIG. 1 is a schematic diagram of a computer according to an embodiment of the present invention.
  • FIG. 2 is a schematic diagram of still another computer according to an embodiment of the present invention.
  • FIG. 3 is a flowchart of a fault processing method according to an embodiment of the present invention.
  • FIG. 4 is a flowchart of a fault processing method according to another embodiment of the present invention.
  • FIG. 5 is a schematic diagram of a baseboard management controller according to an embodiment of the present invention.
  • FIG. 6 is a schematic structural diagram of another baseboard management controller according to an embodiment of the present invention.
  • Embodiments of the present invention provide a fault processing method, a related apparatus, and a computer, which can obtain the error data in the computer after a serious uncorrectable error occurs and causes the computer to crash.
  • FIG. 1 is a schematic diagram of a computer according to an embodiment of the present invention.
  • The computer includes a processor 11 and a baseboard management controller (BMC) 12.
  • The baseboard management controller 12 is configured to send a read request message to the processor 11 when determining that the computer has crashed, where the read request message requests reading of the first error data recorded by the processor 11 and the first error data is error data generated in the computer.
  • The first error data may be all of the error data generated in the computer or only part of it; for example, the first error data may be the error data generated within 2 seconds before the computer crashes. The embodiments of the present invention impose no limitation in this respect.
  • The processor 11 is configured to receive the read request message and send a read response message to the baseboard management controller 12. At this point the computer has crashed and the processor cannot execute any computer instruction, but the processor can still receive and respond to the read request message.
  • The baseboard management controller 12 is configured to receive the read response message returned by the processor 11 and obtain, according to the read response message, the first error data recorded by the processor 11.
  • The processor 11 may record the first error data in its own register, and the baseboard management controller 12 may send the read request message to the processor 11 using the address of that register in order to acquire the first error data from the register. Although the computer has crashed and cannot run computer instructions, the register of the processor 11 can still return a read response message in response to the read request message, for example one that returns the first error data, so that the baseboard management controller 12 can obtain the first error data according to the read response message.
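  • The register-addressed read described above can be pictured with a small simulation. This is only a sketch under stated assumptions: the class names, the `ReadResponse` fields, and the register addresses are hypothetical, and the real transport between the baseboard management controller and the processor is not specified in this text.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ReadResponse:
    """Read response message: carries data on success, a failure flag otherwise."""
    ok: bool
    data: Optional[int] = None


@dataclass
class ProcessorRegisters:
    """Stand-in for the processor's error registers (addresses are hypothetical)."""
    values: Dict[int, int] = field(default_factory=dict)

    def handle_read_request(self, address: int) -> ReadResponse:
        # The register file answers by itself; no instruction execution is needed,
        # which is why the read can still work after the computer has crashed.
        if address in self.values:
            return ReadResponse(ok=True, data=self.values[address])
        return ReadResponse(ok=False)


class BaseboardManagementController:
    def __init__(self, registers: ProcessorRegisters, error_addresses: List[int]):
        self.registers = registers
        self.error_addresses = error_addresses

    def collect_first_error_data(self) -> Dict[int, int]:
        """Send one read request per register address and keep successful replies."""
        collected: Dict[int, int] = {}
        for address in self.error_addresses:
            response = self.registers.handle_read_request(address)
            if response.ok:
                collected[address] = response.data
        return collected
```

  • In this sketch a register that cannot be read is simply skipped; how a read failure is actually signalled and handled is described separately in the text.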
  • The first error data may include one or more pieces of error data, which is not limited herein.
  • The baseboard management controller 12 may send a read request message to the processor 11 when determining that the computer has crashed, where the read request message requests reading of the first error data recorded by the processor 11, then receive the read response message returned by the processor 11 and obtain, according to the read response message, the first error data recorded by the processor 11.
  • This embodiment of the invention does not rely on the operating system: the baseboard management controller alone acquires the error data in the computer after the computer crashes. This solves the prior-art problem that error data cannot be acquired once a serious uncorrectable error has crashed the system.
  • Uncorrectable errors caused by computer faults can be classified into catastrophic errors, fatal errors, and recoverable errors. Catastrophic and fatal errors are the most serious and may cause a blue screen, a purple screen, or even a crash (such as a black screen or a hang). The computer can therefore be monitored for catastrophic or fatal errors, such as an internal error (IERR, a catastrophic error) or a machine check error (MCERR, a fatal error). If a catastrophic or fatal error occurs and the computer cannot run basic input/output system (BIOS) instructions or operating system (OS) instructions, it can be confirmed that the computer has crashed.
  • The processor 11 may be further configured to acquire the first error data and record it. For example, the processor 11 may generate or receive the first error data and record it in a cache of the computer, in a register of the processor 11, or in another module with storage capability. After the processor 11 obtains the first error data, if the computer has not crashed, the processor 11 may send the first error data to the baseboard management controller 12. For example, an error collection instruction of the basic input/output system is preconfigured in the computer; when the processor 11 executes this instruction, it sends the first error data to the baseboard management controller 12 according to the instruction.
  • If the first error data is of a serious uncorrectable error type, the processor 11 may further send a severe fault event indication to notify the baseboard management controller 12 that a catastrophic or fatal error that may cause a crash has occurred. That the first error data is of a serious uncorrectable error type means that the first error data is a catastrophic error or a fatal error.
  • The baseboard management controller 12 may be configured to receive the severe fault event indication sent by the processor 11; if at least part of the first error data sent by the processor 11 is not received within a preset waiting time starting from receipt of the severe fault event indication, the baseboard management controller 12 may determine that the computer has crashed.
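  • The timeout rule above (a severe fault event indication arrives, then no error data within the preset waiting time) might be sketched as follows. The class and method names are hypothetical, and the injectable clock is an assumption added only so the logic can be exercised without real delays.

```python
import time
from typing import Callable, Optional


class CrashDetector:
    """Decides that the computer has crashed if no error data follows an indication in time."""

    def __init__(self, wait_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.wait_seconds = wait_seconds      # the preset waiting time
        self.clock = clock                    # injectable for testing (assumption)
        self.indication_time: Optional[float] = None
        self.data_received = False

    def on_severe_fault_indication(self) -> None:
        # Start the waiting period when the severe fault event indication arrives.
        self.indication_time = self.clock()
        self.data_received = False

    def on_error_data(self, chunk: bytes) -> None:
        # Receiving at least part of the first error data shows the processor still runs.
        if chunk:
            self.data_received = True

    def computer_crashed(self) -> bool:
        if self.indication_time is None or self.data_received:
            return False
        return self.clock() - self.indication_time >= self.wait_seconds
```

  • A monotonic clock is used by default so that wall-clock adjustments cannot shorten or extend the waiting time.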
  • The baseboard management controller 12 may also determine that the computer has crashed according to an indication from the user. For example, the user may notify the baseboard management controller 12 on noticing that the computer has crashed, and the baseboard management controller 12 may determine from this indication that the computer has crashed and start acquiring the first error data.
  • The processor 11 may, according to the read request message, carry the first error data in the read response message and return it to the baseboard management controller 12; the baseboard management controller 12 may then obtain the first error data recorded by the processor 11 from the read response message.
  • The baseboard management controller 12 may fail to read the first error data, in which case the read response message carries a read failure indication, which indicates that reading the first error data from the processor 11 failed. The baseboard management controller 12 may then instruct a hot restart module of the computer or a user to perform a hot restart of the computer, so that when the computer is hot restarted the processor 11 executes a fault collection instruction of the basic input/output system of the computer, acquires the first error data according to that instruction, and sends it to the baseboard management controller 12. The baseboard management controller 12 receives the first error data sent by the processor 11, which completes acquisition of the first error data.
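  • The two acquisition paths (a direct read, then a hot restart when the read fails) can be summarised in a short sketch. The three callables are hypothetical placeholders for the register read, the hot restart request, and the receipt of data collected by the basic input/output system; none of them is an interface defined in this text.

```python
from typing import Callable, Optional


def obtain_first_error_data(
    read_registers: Callable[[], Optional[bytes]],
    request_hot_restart: Callable[[], None],
    receive_from_bios: Callable[[], bytes],
) -> bytes:
    """Return the first error data, falling back to a hot restart on read failure."""
    data = read_registers()
    if data is not None:
        # The read response message carried the first error data.
        return data
    # The read response message carried a read failure indication: request a hot
    # restart so that the BIOS fault collection instruction gathers and sends the data.
    request_hot_restart()
    return receive_from_bios()
```

  • The fallback relies on a hot restart specifically because, as the text explains, register contents survive a hot restart but not a cold one.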
  • A computer restart can be either a hot restart or a cold restart.
  • A cold restart powers the computer off and initializes it, so information may be lost; for example, the information saved in the processor's registers is lost after a cold restart. Pressing the power switch performs a cold restart. A hot restart differs in that the computer is not powered off and is not initialized, so the information stored in the processor's registers is not lost. Clicking "Restart" in the Start menu, or restarting the computer according to the normal procedure, is a hot restart. In this embodiment and the subsequent embodiments of the present invention, a hot restart of the computer has this meaning.
  • The baseboard management controller 12 may be further configured to send a clear data message to the processor 11 after acquiring the first error data, instructing the processor 11 to delete the first error data that it has recorded and thereby avoid wasting storage resources.
  • The baseboard management controller 12 is further configured to, after receiving the severe fault event indication sent by the processor 11, send an alarm message to a fault alarm module of the computer or perform a printing operation, so that the user is notified of the severe fault event and learns of the computer fault in time.
  • (3) Analysis, location and processing of faults
  • The baseboard management controller 12 can keep a complete fault record, automatically locate the fault source, and give fault handling suggestions, which helps the fault to be handled and recovered from in time.
  • The first error data recorded by the processor 11 is usually information represented by "0" and "1". The baseboard management controller 12 may therefore be further configured to parse the first error data according to a fault parsing mechanism to obtain fault parsing information of the first error data. The fault parsing information may include the generation time of each piece of error data in the first error data, which module collected the error data, and which processor and which core the error data comes from.
  • The baseboard management controller 12 can parse the first error data in binary form according to Intel's fault code definition to obtain the fault parsing information.
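  • Parsing binary error data into named fields amounts to masking and shifting bit ranges. The sketch below uses an invented 64-bit layout purely for illustration; it is an assumption and is not Intel's fault code definition, whose actual field positions are not given in this text.

```python
from typing import Dict

# Hypothetical record layout (illustrative only, not Intel's definition):
#   bits  0-15  error code
#   bits 16-23  core that reported the error
#   bits 24-31  processor that reported the error
#   bits 32-63  generation time, in seconds


def parse_error_record(record: int) -> Dict[str, int]:
    """Turn one binary error record into fault parsing information."""
    return {
        "error_code": record & 0xFFFF,
        "core_id": (record >> 16) & 0xFF,
        "processor_id": (record >> 24) & 0xFF,
        "timestamp": (record >> 32) & 0xFFFFFFFF,
    }
```

  • With a real fault code definition, only the masks and shifts in this function would change; the rest of the flow stays the same.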
  • the fault resolution information can be provided not only to the maintenance personnel or the user to understand the fault condition, but also for subsequent fault location, analysis and processing.
  • the baseboard management controller 12 is further configured to analyze the fault resolution information of the first error data according to a preset fault handling mechanism, to obtain a fault handling suggestion.
  • the fault handling suggestion may include fault location information and/or processing suggestion information, so that the user or the maintenance personnel can handle the computer according to the fault handling suggestion to recover the computer.
  • the first error data may be only the error data generated within a short period of time before the computer crashes; for example, the first error data is the error data generated within 0.5 seconds before the computer crashes. Therefore, to improve the accuracy of fault location and analysis, the fault resolution information of more error data can be analyzed.
  • the baseboard management controller 12 may further receive, before determining that the computer has crashed, second error data sent by the processor 11; the second error data is different from the first error data and is the error data generated by the computer within a preset time before the first error data is generated. The baseboard management controller 12 may parse the second error data according to the fault resolution mechanism to obtain fault resolution information of the second error data, and analyze the fault resolution information of the second error data together with the fault resolution information of the first error data according to the preset fault handling mechanism, to obtain the fault handling suggestion.
  • for example, the first error data may be the error data generated within 0.5 seconds before the computer crashes, and the second error data may be the error data generated from 5 seconds before the crash up to 0.5 seconds before the crash; the baseboard management controller 12 may then analyze, according to the preset fault handling mechanism, the fault resolution information of the error data generated within the 5 seconds before the crash, and obtain the fault handling suggestion.
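The 0.5-second and 5-second windows in this example can be sketched as a simple partition of timestamped records. The record shape `(timestamp, payload)` and the function name are assumptions made for this illustration only.

```python
def split_by_window(records, crash_time, first_window=0.5, total_window=5.0):
    """Partition records into (second_error_data, first_error_data).

    records: iterable of (timestamp_in_seconds, payload) tuples (assumed shape).
    Records older than total_window before the crash are ignored.
    """
    first, second = [], []
    for timestamp, payload in sorted(records):
        age = crash_time - timestamp
        if 0 <= age <= first_window:
            first.append((timestamp, payload))    # just before the crash
        elif first_window < age <= total_window:
            second.append((timestamp, payload))   # earlier, already reported
    return second, first
```

Analyzing `second + first` in time order gives the controller the whole 5-second history rather than only the last half second.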
  • the baseboard management controller 12 may be further configured to print the fault resolution information of the first error data or the fault handling suggestion, so that the user or the maintenance personnel can handle the computer fault according to the printed information.
  • the baseboard management controller 12 may further save at least one of the fault resolution information of the first error data, the fault resolution information of the second error data, the first error data, and the second error data to a fault information library of the computer, thereby obtaining a fault record of the computer that assists in locating and recovering from subsequent faults.
  • for example, the baseboard management controller 12 may save the fault resolution information of the first error data and the fault resolution information of the second error data to the fault information library, so that the fault information library stores complete error data and can provide a complete fault record.
  • the fault information library may be provided inside the baseboard management controller 12 or outside the baseboard management controller 12.
  • the computer described in the embodiments can locate, analyze and process its own faults; this capability can be used in different application scenarios and in different ways.
  • for example, a system may include a plurality of computers according to the embodiments of the present invention, each of which has the capability of fault location, analysis, and processing.
  • alternatively, the baseboard management controller of a computer may collect the error data and perform unified fault location, analysis and processing for all the computers in the system; or the baseboard management controllers of the plurality of computers in the system may report the error data they have obtained to a management device, such as a management server, and the management device performs unified fault location, analysis, and processing for all the computers in the system by using the method described in the embodiments of the present invention.
  • the embodiment of the present invention does not need the operating system: the error data in the computer is obtained after the computer crashes through the baseboard management controller 12 alone, which solves the prior-art problem that the error data in the computer cannot be obtained after the system crashes due to a serious uncorrectable error.
  • the embodiment may further record a complete fault record in the fault information library, parse the first error data, analyze the fault resolution information of the first error data according to the preset fault handling mechanism, locate the fault source, and give handling suggestions.
  • FIG. 2 is a schematic structural diagram of a computer according to an embodiment of the present invention.
  • the computer is composed of a processor 11 and a baseboard management controller 12; the processor 11 may include a recording module 21, a storage module 22, and an instruction execution module 23.
  • the recording module 21 may specifically be a Machine Check Architecture (MCA), which is responsible for the internal function modules of the processor 11, and/or Advanced Error Reporting (AER).
  • the storage module 22 may be a register of the MCA and/or a register of the AER; the register of the MCA and the register of the AER may be located inside the processor 11.
  • the instruction execution module 23 may be a core of the processor 11, which executes the instructions of the basic input/output system (BIOS) and the instructions of the operating system.
  • the recording module 21 may be configured to acquire error data in the computer, for example, error data generated by faults of the internal function modules of the processor 11, or error data generated by faults of I/O devices; the error data in the computer includes, but is not limited to, the first error data and the second error data in the embodiment of the present invention.
  • the recording module 21 may record the acquired error data in the storage module 22. Specifically, if the error data in the computer is acquired by the MCA, the MCA may record the error data in the register of the MCA; if the error data is acquired by the AER, the AER may record the error data in the register of the AER. The range of error data acquired by the MCA or the AER may be set by the BIOS by configuring the corresponding registers.
  • optionally, when recording the error data to the corresponding register, the MCA or the AER may also store the address of the register holding the error data in a first register, so that the instruction execution module 23 can, according to the error collection instruction of the basic input/output system, obtain the address recorded in the first register and then acquire the error data in the computer from the register at that address.
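The indirection through the first register can be pictured with a small sketch. It is illustrative only: the dictionary-based register file, the address values, and the function name `collect_error_data` are assumptions made here, not part of the embodiment.

```python
def collect_error_data(register_file, first_register):
    """Follow the addresses kept in the first register to the error data.

    register_file: dict mapping a register address to its content (assumed model).
    first_register: list of addresses of the registers that hold error data.
    """
    collected = []
    for address in first_register:
        # Only registers whose address was recorded are visited, so the
        # collector does not have to scan the whole register file.
        collected.append((address, register_file[address]))
    return collected
```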
  • the recording module 21 may also trigger a System Management Interrupt (SMI) when acquiring error data in the computer; the system management interrupt is used to trigger the instruction execution module 23 to execute the error collection instruction of the basic input/output system.
  • if the computer has not crashed, the instruction execution module 23 may acquire the error data in the computer from the storage module 22 according to the error collection instruction of the basic input/output system and send it to the baseboard management controller 12; if the computer has crashed, the instruction execution module 23 cannot execute any computer instruction. The error collection instruction of the basic input/output system can be pre-configured in the memory that stores the basic input/output system instructions.
  • the second error data is error data generated within a preset time before the computer generates the first error data, so the recording module 21 acquires the second error data before it acquires the first error data. When acquiring the second error data, the recording module 21 may, on the one hand, record the second error data to the storage module 22 and, on the other hand, trigger a system management interrupt. If the computer has not crashed, the instruction execution module 23 may execute the error collection instruction of the basic input/output system in response to the system management interrupt and, according to that instruction, obtain the second error data from the storage module 22 and send it to the baseboard management controller 12. Optionally, the instruction execution module 23 can send the second error data to the baseboard management controller 12 through the Intelligent Platform Management Interface (IPMI) standard, and the baseboard management controller 12 can receive the second error data sent by the instruction execution module 23 through the IPMI standard.
  • the second error data may be acquired in parts: the recording module 21 may trigger the system management interrupt each time it acquires part of the second error data, and accordingly the instruction execution module 23 may execute the error collection instruction of the basic input/output system a plurality of times, to send the second error data to the baseboard management controller 12 in multiple transfers.
  • optionally, the instruction execution module 23 may also execute a deletion instruction of the operating system and, according to that instruction, delete the second error data recorded by the recording module 21. In this way the instruction execution module 23 deletes from the storage module 22 the error data that has already been sent to the baseboard management controller 12, which avoids sending the same error data to the baseboard management controller 12 repeatedly.
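The "send, then delete" behaviour that prevents repeated transfers can be sketched as follows. The dictionary standing in for the storage module 22 and the `send` callback standing in for the IPMI transfer are assumptions for illustration.

```python
def forward_and_clear(storage, send):
    """Send every stored record once, then remove it from storage.

    storage: dict of register name -> error data (assumed model of module 22).
    send: callable that delivers one record to the management controller.
    Returns the names of the registers that were forwarded.
    """
    forwarded = []
    for name in list(storage):       # copy the keys: we delete while iterating
        send(storage[name])
        del storage[name]            # cleared, so a later SMI cannot resend it
        forwarded.append(name)
    return forwarded
```

Calling the function again after a second interrupt only forwards records that arrived in the meantime.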
  • when the recording module 21 acquires the first error data, the system management interrupt may also be triggered; further, if the first error data belongs to a serious uncorrectable error type, the recording module 21 may also trigger a severe fault event indication to notify the baseboard management controller 12 that the computer has generated a catastrophic error or a fatal error that may cause a crash. When the first error data does belong to a serious uncorrectable error type and the computer crashes, the instruction execution module 23 cannot execute any computer instruction; even if the recording module 21 triggers a system management interrupt, the instruction execution module 23 cannot execute the error collection instruction of the basic input/output system.
  • if the baseboard management controller 12 does not receive at least part of the first error data sent by the processor 11 within a preset waiting time counted from receiving the severe fault event indication, it may determine that the computer has crashed.
  • for example, the recording module 21 triggers the severe fault event indication by changing the level of the pin CATERR_N or ERROR_N, and the baseboard management controller 12 receives the severe fault event indication by receiving the level signal of the pin CATERR_N or ERROR_N.
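The crash determination described above is essentially a timeout on the severe fault event indication. A minimal sketch follows; the class name, the method names, and the practice of passing the current time in explicitly are assumptions made so that the logic can be exercised without real hardware.

```python
class CrashDetector:
    """Decide that the computer has crashed when no first error data
    arrives within a preset waiting time after the severe fault indication."""

    def __init__(self, wait_seconds):
        self.wait_seconds = wait_seconds
        self.indicated_at = None   # time the severe fault indication arrived
        self.got_data = False      # any first error data seen inside the window

    def on_severe_fault_indication(self, now):
        self.indicated_at = now
        self.got_data = False

    def on_error_data(self, now):
        # Data counts only if it arrives inside the waiting window.
        if self.indicated_at is not None and now - self.indicated_at <= self.wait_seconds:
            self.got_data = True

    def crashed(self, now):
        if self.indicated_at is None or self.got_data:
            return False
        return now - self.indicated_at > self.wait_seconds
```

Once `crashed` returns true, the controller would move on to reading the error registers itself instead of waiting for the processor.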
  • after determining that the computer has crashed, the baseboard management controller 12 may send a read request message to the recording module 21, where the read request message is used to request to read the first error data; the recording module 21 can still receive the read request message after the computer crashes, and sends a read response message to the baseboard management controller 12; the baseboard management controller 12 can thus receive the read response message and obtain, according to the read response message, the first error data recorded by the processor 11.
  • specifically, the baseboard management controller 12 may traverse the register of the MCA or the register of the AER through a Platform Environment Control Interface (PECI) bus, to read the first error data from the register of the MCA or the register of the AER. If the read succeeds, the first error data is carried in the read response message returned by the register of the MCA or the register of the AER, and the baseboard management controller 12 may acquire the first error data from that message.
  • if the baseboard management controller 12 fails to read data from the register of the MCA or the register of the AER, the read response message returned by the register carries a read failure indication, such as garbled data; the baseboard management controller 12 may then instruct the warm restart module of the computer, or the user, to perform a warm restart of the computer, so that the instruction execution module 23 executes a fault collection instruction of the basic input/output system during the warm restart, traverses the register of the MCA or the register of the AER according to that instruction to obtain the first error data, and sends the data to the baseboard management controller 12 through the IPMI standard. The baseboard management controller 12 may then receive the first error data sent by the instruction execution module 23.
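The two acquisition paths (direct register read, then warm restart as a fallback) can be summarised in a short sketch. Both callables are hypothetical stand-ins: `read_registers` models the register traversal and returns an `(ok, data)` pair, and `warm_restart_and_collect` models the restart followed by collection through the basic input/output system.

```python
def acquire_first_error_data(read_registers, warm_restart_and_collect):
    """Return (data, path) where path names the route that produced the data."""
    ok, data = read_registers()
    if ok:
        # The read response carried the first error data.
        return data, "register-read"
    # The read response carried a read failure indication: fall back to a
    # warm restart so that firmware can collect and send the data.
    return warm_restart_and_collect(), "warm-restart"
```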
  • in the embodiment of the present invention, the baseboard management controller 12 cooperates with the processor 11 to obtain the error data in the computer after the computer crashes, which solves the prior-art problem that the error data in the computer cannot be obtained after the system crashes due to a serious uncorrectable error.
  • the embodiment of the present invention provides a fault processing method for the computer shown in FIG. 1 or FIG. 2, the computer includes a baseboard management controller and a processor, and the method includes:
  • the baseboard management controller sends a read request message to the processor when determining that the computer is dead, the read request message is used to request to read the first error data recorded by the processor.
  • the processor may acquire the first error data and record the first error data.
  • the baseboard management controller when determining that the computer is dead, may send a read request message to the processor to read the first erroneous data recorded by the processor.
  • the processor may receive and respond to the read request message, so that the baseboard management controller may acquire the first error data;
  • the processor may record the first error data in its own register, and the baseboard management controller may send the read request message to a register of the processor, where the register of the processor may The read request message is received and a read response message is returned.
  • the first erroneous data may include one or more erroneous data, which is not limited herein.
  • the baseboard management controller may determine that the computer has crashed in a number of ways. For details, refer to the first embodiment or the second embodiment.
  • the baseboard management controller receives a read response message returned by the processor, and obtains the first error data recorded by the processor according to the read response message.
  • the read response message may carry the first error data, in which case the baseboard management controller may obtain the first error data recorded by the processor from the read response message; if the baseboard management controller fails to read the data from the processor, the read response message may carry a read failure indication, and the baseboard management controller may obtain the first error data by other means. For example, a fault collection instruction of the basic input/output system may be configured in the computer in advance.
  • when the read response message carries the read failure indication, the baseboard management controller may instruct the warm restart module of the computer, or the user, to perform a warm restart of the computer, so that the processor executes the fault collection instruction of the basic input/output system of the computer during the warm restart, acquires the first error data according to that instruction, and sends the first error data to the baseboard management controller.
  • the baseboard management controller may receive the first error data sent by the processor, completing the acquisition of the first error data.
  • in the embodiment of the present invention, the baseboard management controller of the computer may send a read request message to the processor of the computer when the computer is determined to have crashed, where the read request message is used to request to read the first error data recorded by the processor; the baseboard management controller then receives a read response message returned by the processor and obtains the first error data recorded by the processor according to the read response message.
  • the embodiment of the invention does not need the operating system: the error data in the computer is obtained after the computer crashes through the baseboard management controller alone, which solves the prior-art problem that the error data in the computer cannot be obtained after the system crashes due to a serious uncorrectable error.
  • the embodiment of the present invention provides a fault processing method for the computer shown in FIG. 1 or FIG. 2, the computer includes a baseboard management controller and a processor, and the method includes:
  • the baseboard management controller receives a severe fault event indication sent by the processor, where the severe fault event indication is sent by the processor when it acquires the first error data and the first error data belongs to a serious uncorrectable error type.
  • the baseboard management controller sends an alarm message to the fault alarm module of the computer or performs a printing operation, to notify the user of the severe fault event. After receiving the severe fault event indication sent by the processor, the baseboard management controller may trigger the fault alarm sensor by means of an alarm message, or perform a printing operation, to notify the user that the computer has generated a serious fault that may cause a crash.
  • S402 is an optional step.
  • Step S403: if, within a preset waiting time after receiving the severe fault event indication, the baseboard management controller does not receive at least part of the first error data sent by the processor, it determines that the computer has crashed, and step S404 is performed.
  • if the computer has not crashed, the processor may execute the error collection instruction of the basic input/output system and, according to that instruction, acquire the first error data and send it to the baseboard management controller.
  • the baseboard management controller sends a read request message to the processor, where the read request message is used to request to read the first error data recorded by the processor.
  • the baseboard management controller may acquire the first error data from the processor to implement acquisition of erroneous data in the computer after the computer crashes.
  • the baseboard management controller receives a read response message returned by the processor, and obtains the first error data recorded by the processor according to the read response message.
  • the baseboard management controller may obtain the first error data recorded by the processor according to the read response message by the method described in S405a or the method described in S405b.
  • S405a: if the first error data is carried in the read response message, the baseboard management controller obtains the first error data recorded by the processor from the read response message.
  • S405b: if the read response message carries a read failure indication, the baseboard management controller instructs the warm restart module of the computer, or the user, to perform a warm restart of the computer, so that the processor executes a fault collection instruction of the basic input/output system of the computer during the warm restart, acquires the first error data according to that instruction, and sends the data to the baseboard management controller; the baseboard management controller receives the first error data sent by the processor.
  • the fault collection instruction of the basic input/output system may be pre-configured in the computer; the read response message carries the read failure indication when the baseboard management controller fails to read the first error data from the processor.
  • the baseboard management controller parses the first error data according to a fault resolution mechanism to obtain fault resolution information of the first error data.
  • the fault resolution information of the first error data may include the generation time of each piece of error data in the first error data, who collected the error data, which processor and which core the error data comes from, what type of error it is, and the like; the fault resolution information can be provided to the maintenance personnel or the user to understand the fault condition, and can also be used for subsequent fault location, analysis and processing.
  • the baseboard management controller analyzes the fault resolution information of the first error data according to a preset fault handling mechanism, and obtains the fault handling suggestion.
  • the fault handling suggestion may be fault location information or processing suggestion information.
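One way to picture a preset fault handling mechanism is an ordered rule table applied to the fault resolution information. The sketch below is an assumption for illustration only: the record fields (`source`, `unit`, `uncorrected`), the rules, and the advice strings are hypothetical and are not taken from the embodiment.

```python
def suggest(resolutions, rules, default_advice="escalate to maintenance personnel"):
    """Return the first matching rule's advice plus the fault location.

    resolutions: list of dicts describing parsed error records (assumed shape).
    rules: ordered list of (predicate, advice) pairs, the preset mechanism.
    """
    for item in resolutions:
        for matches, advice in rules:
            if matches(item):
                return {"location": item["source"], "advice": advice}
    return {"location": None, "advice": default_advice}


# Hypothetical rule table: uncorrected errors mapped to a repair action.
EXAMPLE_RULES = [
    (lambda r: r["uncorrected"] and r["unit"] == "memory", "replace the memory module"),
    (lambda r: r["uncorrected"] and r["unit"] == "pcie", "reseat or replace the PCIe card"),
]
```

The returned dictionary carries both kinds of suggestion named in the text: where the fault is, and what to do about it.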
  • the baseboard management controller may print out the fault handling suggestion, or print it together with the fault resolution information of the first error data, so that the user or the maintenance personnel can handle the computer according to the printed information to recover the computer.
  • the embodiment of the invention does not need the operating system: the error data in the computer is obtained after the computer crashes through the baseboard management controller alone, which solves the prior-art problem that the error data in the computer cannot be obtained after the system crashes due to a serious uncorrectable error.
  • the embodiment may further parse the first error data, analyze the fault resolution information of the first error data according to the preset fault handling mechanism, locate the fault source, and give handling suggestions.
  • in step S407, only the fault resolution information of the first error data is analyzed to obtain a fault handling suggestion, and the first error data may be only the error data generated within a short period of time before the computer crashes, for example within 0.5 seconds before the computer crashes. Therefore, to improve the accuracy of fault location and analysis, the fault resolution information of more error data can be analyzed.
  • before determining that the computer has crashed, the baseboard management controller may further receive second error data sent by the processor, where the second error data is the error data generated within a preset time before the computer generates the first error data.
  • accordingly, step S407 may be: the baseboard management controller parses the second error data according to the fault resolution mechanism to obtain fault resolution information of the second error data, and analyzes the fault resolution information of the second error data and the fault resolution information of the first error data according to the preset fault handling mechanism, to obtain the fault handling suggestion.
  • by analyzing the fault resolution information of both the second error data and the first error data, the baseboard management controller improves the accuracy of fault location and analysis.
  • the baseboard management controller may further save at least one of the fault resolution information of the first error data, the fault resolution information of the second error data, the first error data, and the second error data to the fault information library of the computer. For example, it may save the fault resolution information of the first error data and the fault resolution information of the second error data to the fault information library, or save the first error data and the second error data to the fault information library, so that a complete fault record is recorded in the fault information library.
  • the baseboard management controller may further send a clear data message to the processor, to instruct the processor to delete the first error data recorded by itself, to avoid waste of storage resources.
  • for the specific interaction with the processor and the fault processing performed by the baseboard management controller in the third embodiment or the fourth embodiment of the present invention, refer to the baseboard management controller according to the first embodiment or the second embodiment of the present invention.
  • Embodiments of the present invention provide a baseboard management controller for a computer including the baseboard management controller and a processor, for example, for a computer as described in FIG. 1 or 2, as shown in FIG.
  • the baseboard management controller may include a sending unit and a receiving unit;
  • the sending unit is configured to send, when it is determined that the computer has crashed, a read request message to the processor, where the read request message is used to request to read the first error data recorded by the processor; because the computer has already crashed, the processor is unable to execute any computer instructions, but the processor can receive and respond to the read request message;
  • the receiving unit is configured to receive a read response message returned by the processor and, according to the read response message, obtain the first error data recorded by the processor.
  • for example, the receiving unit may obtain the first error data recorded by the processor from the read response message when the first error data is carried in the read response message; alternatively, when the read response message carries a read failure indication, the receiving unit may instruct the warm restart unit of the computer, or the user, to perform a warm restart of the computer, so that the processor executes a fault collection instruction of the basic input/output system of the computer during the warm restart, acquires the first error data according to that instruction, and sends the first error data to the receiving unit, where the read failure indication indicates that reading the first error data from the processor failed; the receiving unit then receives the first error data sent by the processor.
  • the receiving unit may further send a clear data message to the processor after acquiring the first error data, to instruct the processor to delete the first error data recorded by itself.
  • the baseboard management controller may further include a determining unit, configured to receive a severe fault event indication sent by the processor, where the severe fault event indication is sent by the processor when it acquires the first error data and the first error data belongs to a serious uncorrectable error type; and, if at least part of the first error data sent by the processor is not received within a preset waiting time counted from receiving the severe fault event indication, to determine that the computer has crashed.
  • the baseboard management controller may further include a fault alarm unit, configured to send an alarm message to the fault alarm module of the computer or perform a printing operation after the determining unit receives the severe fault event indication sent by the processor, to notify the user of the severe fault event.
  • the baseboard management controller may further include a fault processing unit, configured to parse the first error data according to the fault resolution mechanism to obtain fault resolution information of the first error data.
  • the fault resolution information of the first error data may include the generation time of each piece of error data in the first error data, who collected the error data, which processor and which core the error data comes from, what type of error it is, and the like; the fault resolution information can be provided to the maintenance personnel or the user to understand the fault condition, and can also be used for subsequent fault location, analysis and processing.
  • the fault processing unit may further analyze the fault resolution information of the first error data according to a preset fault handling mechanism, to obtain a fault handling suggestion.
  • the fault handling suggestion may be fault location information or processing suggestion information, so that the user or the maintenance personnel can handle the computer according to the fault handling suggestion to recover the computer.
  • if the fault processing unit analyzes only the fault resolution information of the first error data to obtain a fault handling suggestion, the first error data may be only the error data generated within a short period of time before the computer crashes, for example within 0.5 seconds before the computer crashes. Therefore, to improve the accuracy of fault location and analysis, the fault processing unit can analyze the fault resolution information of more error data.
  • the receiving unit is further configured to receive second error data sent by the processor; the second error data is the error data generated within a preset time before the computer generates the first error data;
  • the fault processing unit may parse the second error data according to the fault resolution mechanism to obtain fault resolution information of the second error data, and analyze the fault resolution information of the second error data and the fault resolution information of the first error data according to the preset fault handling mechanism, to obtain the fault handling suggestion.
  • the fault processing unit is further configured to print fault resolution information of the first error data or the fault handling suggestion.
  • the fault processing unit is further configured to: the fault resolution information of the first error data, the fault resolution information of the second error data, the first error data, and the second error data At least one type of fault information stored in the computer; for example, saving fault resolution information of the first error data and a fault resolution letter of the second error data to the fault information stock, or An error data and the second error data are saved to the fault information stock to record a complete fault record in the fault log library.
  • the substrate management controller in the embodiment of the present invention may refer to the substrate management controller according to the first embodiment or the second embodiment of the present invention to interact with the processor and perform fault processing.
  • the sending unit may send a read request message to the processor of the computer when determining that the computer is dead, the read request message is used to request to read the first record of the processor.
  • the receiving unit may receive a read response message returned by the processor, and obtain the first error data recorded by the processor according to the read response message.
  • the embodiment of the invention does not need to utilize the operating system, and only realizes the acquisition of the erroneous data in the computer after the computer crashes through the substrate management controller, and solves the problem that the computer cannot be acquired after the system crashes due to the serious uncorrectable error of the computer in the prior art. The problem with the wrong data.
  • the embodiment of the present invention provides a computer readable medium, including a computer executing instruction, when the processor of the computer executes the computer execution instruction, the computer may perform the fault processing method described in Embodiment 3 or Embodiment 4. .
  • FIG. 6 shows a baseboard management controller according to an embodiment of the present invention. The baseboard management controller may include a processor 601, a memory 602, a system bus 604, and a communication interface 605.
  • The processor 601, the memory 602, and the communication interface 605 are connected by the system bus 604 and communicate with each other over it.
  • The processor 601 may be a single-core or multi-core central processing unit, an application-specific integrated circuit, or one or more integrated circuits configured to implement embodiments of the present invention.
  • The memory 602 may be a high-speed RAM or a non-volatile memory, for example at least one disk memory.
  • The memory 602 is configured to store computer-executable instructions 603. Specifically, the computer-executable instructions 603 may include program code.
  • When the baseboard management controller runs, the processor 601 runs the computer-executable instructions 603 and may perform the method flow of the fault handling method described in Embodiment 3 or Embodiment 4.
  • A person of ordinary skill in the art will understand that aspects of the present invention, or possible implementations of those aspects, may be embodied as a system, a method, or a computer program product. Accordingly, they may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, and so on), or an embodiment combining software and hardware, all of which are referred to herein as a "circuit", "module", or "system". Furthermore, aspects of the invention, or possible implementations of those aspects, may take the form of a computer program product, meaning computer-readable program code stored on a computer-readable medium.
  • The computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium.
  • The computer-readable storage medium includes, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, device, or apparatus, or any suitable combination of the foregoing, such as random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, or portable read-only memory (CD-ROM).
  • A processor in the computer reads the computer-readable program code stored in the computer-readable medium, so that the processor can perform the functions and actions specified in each step, or combination of steps, of a flowchart, and produces an apparatus that implements the functions and actions specified in each block, or combination of blocks, of a block diagram.
  • The computer-readable program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer.
  • It should also be noted that, in some alternative implementations, the functions noted in the steps of a flowchart or the blocks of a block diagram may not occur in the order shown.
  • For example, depending on the functions involved, two steps or two blocks shown in succession may in fact be executed substantially simultaneously, or the blocks may sometimes be executed in the reverse order.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computer Hardware Design (AREA)
  • Debugging And Monitoring (AREA)
  • Retry When Errors Occur (AREA)
  • Hardware Redundancy (AREA)
  • Stored Programmes (AREA)

Abstract

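Embodiments provide a fault handling method, a related apparatus, and a computer. When a baseboard management controller (12) in the computer determines that the computer has hung, it can send a read request message to a processor (11) in the computer, where the read request message requests to read first error data recorded by the processor (11); it then receives a read response message returned by the processor (11) and obtains, according to the read response message, the first error data recorded by the processor (11). The embodiments do not rely on the operating system: the baseboard management controller (12) alone obtains the error data in the computer after the computer hangs, which solves the prior-art problem that error data in a computer cannot be obtained once a severe uncorrectable error has caused the system to hang.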

Description

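Fault handling method, related apparatus, and computer
Technical Field
Embodiments of the present invention relate to computer technologies, and in particular to a fault handling method, a related apparatus, and a computer.
Background
With the large-scale development of information technology, computers are widely used in every field. Computer faults generally include software faults, hardware faults, operation (configuration) faults, and other faults. Hardware faults are hard to reproduce, are diagnosed mainly through human experience, are hard to locate when an error occurs, and often require parts to be re-seated or replaced repeatedly; they are therefore usually the hardest to handle. Examples are faults in memory, processors, and input/output (I/O) devices.
Typically, a hardware fault causes the computer to generate an uncorrectable error. An uncorrectable error may interrupt the services running on the computer, reduce its uptime, and even bring the machine down. In the prior art, computer faults are mainly handled as follows: when an uncorrectable error occurs in the system, the processor records the error data and notifies the operating system (OS); after receiving the notification, the OS captures the error data recorded by the processor and prints it so that a user can analyze, locate, and recover from the fault.
The prior art thus depends on the OS to capture error data. However, once a severe uncorrectable error causes the computer to hang (in the present invention, a computer hang means that the screen goes black, input devices such as the mouse or keyboard accept no input, and the processor cannot run computer instructions), the OS can no longer work and cannot capture the error data in the computer, which makes the fault difficult to analyze, handle, and recover from.
Summary
Embodiments of the present invention provide a fault handling method, a related apparatus, and a computer that can obtain the error data in a computer after a severe uncorrectable error has caused the computer to hang.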
第一方面, 本发明实施例提出了一种计算机, 包括处理器和基板管理控 制器, 所述基板管理控制器用于在确定所述计算机死机时, 向所述处理器发 送读请求消息, 所述读请求消息用于请求读取所述处理器记录的第一错误数 据;
所述处理器用于接收所述读请求消息, 并向所述基板管理控制器发送读 响应消息;
所述基板管理控制器用于接收所述处理器返回的所述读响应消息,并根 据所述读响应消息, 获得所述处理器记录的所述第一错误数据。
结合第一方面, 在第一种可能的实现方式中, 所述处理器还用于获取所 述第一错误数据, 并记录所述第一错误数据;
则所述基板管理控制器用于确定所述计算机死机具体为: 所述基板管理 控制器用于接收所述处理器发送的严重故障事件指示, 所述严重故障事件指 示是所述处理器在获取到所述第一错误数据并且所述第一错误数据属于严 重的不可纠正错误类型时发送的; 如果从接收到所述严重故障事件指示开 始, 在预设等待时间内, 未接收到所述处理器发送的至少部分所述第一错误 数据, 则所述基板管理控制器用于确定所述计算机死机。
结合第一方面或第一方面的第一种可能的实现方式,在第二种可能的实 现方式中, 所述基板管理控制器用于根据所述读响应消息, 获得所述处理器 记录的所述第一错误数据具体为: 当所述读响应消息中携带所述第一错误数 据时, 所述基板管理器用于从所述读响应消息中获得所述处理器记录的所述 第一错误数据。
结合第一方面或第一方面的第一种可能的实现方式,在第三种可能的实 现方式中, 所述基板管理控制器用于根据所述读响应消息, 获得所述处理器 记录的所述第一错误数据具体为: 当所述读响应消息中携带读失败指示时, 所述基板管理控制器用于指示所述计算机的热重启模块或者用户对所述计 算机进行热重启; 其中, 所述读失败指示用于指示从所述处理器中读取所述 第一错误数据失败, 以使得所述处理器在所述计算机热重启时, 执行所述计 算机的基本输入输出系统的故障收集指令,根据所述基本输入输出系统的故 障收集指令, 获取所述第一错误数据, 并发送给所述基板管理控制器; 所述 基板管理控制器用于接收所述处理器发送的所述第一错误数据。
结合第一方面或第一方面的第一至第三任一可能的实现方式,在第四种 可能的实现方式中, 所述基板管理控制器还用于根据故障解析机制, 对所述 第一错误数据进行解析 , 得到所述第一错误数据的故障解析信息。
结合第一方面的第四种可能的实现方式, 在第五种可能的实现方式中, 所述基板管理控制器还用于根据预设的故障处理机制,对所述第一错误数据 的故障解析信息进行分析, 得到故障处理建议。
结合第一方面的第五种可能的实现方式, 在第六种可能的实现方式中, 所述基板管理控制器在确定所述计算机死机之前,还用于接收所述处理器发 送的第二错误数据, 并根据所述故障解析机制, 对所述第二错误数据进行解 析, 得到所述第二错误数据的故障解析信息, 其中, 所述第二错误数据为所 述计算机产生所述第一错误数据之前预设时间内产生的错误数据;
则, 所述基板管理控制器用于根据预设的故障处理机制, 对所述第一错 误数据的故障解析信息进行分析, 得到故障处理建议包括: 所述基板管理控 制器用于根据所述预设的故障处理机制,对所述第二错误数据的故障解析信 息和所述第一错误数据的故障解析信息进行分析, 得到所述故障处理建议。
第二方面, 本发明实施例提出了一种故障处理方法, 用于包括基板管理 控制器和处理器的计算机, 该方法包括:
所述基板管理控制器在确定所述计算机死机时, 向所述处理器发送读请 求消息, 所述读请求消息用于请求读取所述处理器记录的第一错误数据; 所述基板管理控制器接收所述处理器返回的读响应消息,并根据所述读 响应消息, 获得所述处理器记录的所述第一错误数据。
结合第二方面, 在第一种可能的实现方式中, 所述基板管理控制器接收 所述处理器发送的严重故障事件指示, 所述严重故障事件指示是所述处理器 在获取到所述第一错误数据并且所述第一错误数据属于严重的不可糾正错 误类型时发送的; 如果从接收到所述严重故障事件指示开始, 在预设等待时 间内, 未接收到所述处理器发送的至少部分所述第一错误数据, 则确定所述 计算机死机。
结合第二方面或第二方面的第一种可能的实现方式,在第二种可能的实 现方式中, 所述基板管理控制器接收所述处理器返回的读响应消息, 并根据 所述读响应消息, 获得所述处理器记录的所述第一错误数据包括: 所述基板 管理控制器在所述读响应消息中携带所述第一错误数据时,从所述读响应消 息中获得所述处理器记录的所述第一错误数据。
结合第二方面或第二方面的第一种可能的实现方式,在第三种可能的实 现方式中, 所述基板管理控制器接收所述处理器返回的读响应消息, 并根据 所述读响应消息, 获得所述处理器记录的所述第一错误数据包括: 所述基板 管理控制器在所述读响应消息中携带读失败指示时,指示所述计算机的热重 启模块或者用户对所述计算机进行热重启, 以使得所述处理器在所述计算机 热重启时, 执行所述计算机的基本输入输出系统的故障收集指令, 根据所述 基本输入输出系统的故障收集指令, 获取所述第一错误数据, 并发送给所述 基板管理控制器; 其中, 所述读失败指示用于指示从所述处理器中读取所述 第一错误数据失败; 所述基板管理控制器接收所述处理器发送的所述第一错 误数据。
结合第二方面或第二方面的第一至第三任一可能的实现方式,在第四种 可能的实现方式中, 在所述基板管理控制器根据所述读响应消息, 获得所述 处理器记录的所述第一错误数据之后, 所述方法还包括: 所述基板管理控制 器根据故障解析机制, 对所述第一错误数据进行解析, 得到所述第一错误数 据的故障解析信息。
结合第二方面的第四种可能的实现方式, 在第五种可能的实现方式中, 所述方法还包括: 所述基板管理控制器根据预设的故障处理机制, 对所述第 一错误数据的故障解析信息进行分析, 得到故障处理建议。
结合第二方面的第五种可能的实现方式, 在第六种可能的实现方式中, 在所述基板管理控制器确定所述计算机死机之前, 所述方法还包括: 所述基 板管理控制器接收所述处理器发送的第二错误数据; 其中, 所述第二错误数 据为所述计算机产生所述第一错误数据之前预设时间内产生的错误数据; 则, 所述基板管理控制器根据预设的故障处理机制, 对所述第一错误数 据的故障解析信息进行分析, 得到故障处理建议包括:
所述基板管理控制器根据所述故障解析机制,对所述第二错误数据进行 解析, 得到所述第二错误数据的故障解析信息, 并根据所述预设的故障处理 机制,对所述第二错误数据的故障解析信息和所述第一错误数据的故障解析 信息进行分析, 得到所述故障处理建议。
第三方面, 本发明实施例提出了一种基板管理控制器, 包括: 发送单元, 用于在确定所述计算机死机时, 向所述处理器发送读请求消 息, 所述读请求消息用于请求读取所述处理器记录的第一错误数据;
接收单元, 用于接收所述处理器返回的读响应消息, 并根据所述读响应 消息, 获得所述处理器记录的所述第一错误数据。
结合第三方面, 在第一种可能的实现方式中, 所述基板管理控制器还包 括: 确定单元, 用于接收所述处理器发送的严重故障事件指示, 所述严重故 障事件指示是所述处理器在获取到所述第一错误数据并且所述第一错误数 据属于严重的不可纠正错误类型时发送的; 如果从接收到所述严重故障事件 指示开始, 在预设等待时间内, 未接收到所述处理器发送的至少部分所述第 一错误数据, 则确定所述计算机死机。
结合第三方面或第三方面的第一种可能的实现方式,在第二种可能的实 现方式中, 所述接收单元接收所述处理器返回的读响应消息, 并根据所述读 响应消息, 获得所述处理器记录的所述第一错误数据包括: 所述接收单元在 所述读响应消息中携带所述第一错误数据时,从所述读响应消息中获得所述 处理器记录的所述第一错误数据。
结合第三方面或第三方面的第一种可能的实现方式,在第三种可能的实 现方式中, 所述接收单元接收所述处理器返回的读响应消息, 并根据所述读 响应消息, 获得所述处理器记录的所述第一错误数据包括:
所述接收单元在所述读响应消息中携带读失败指示时,指示所述计算机 的热重启单元或者用户对所述计算机进行热重启, 以使得所述处理器在所述 计算机热重启时, 执行所述计算机的基本输入输出系统的故障收集指令, 根 据所述基本输入输出系统的故障收集指令, 获取所述第一错误数据, 并发送 给所述接收单元; 其中, 所述读失败指示用于指示从所述处理器中读取所述 第一错误数据失败; 所述接收单元接收所述处理器发送的所述第一错误数 据。
结合第三方面或者第三方面的第一至第三任一可能的实现方式,在第四 中可能的实现方式中, 所述基板管理控制器还包括: 故障处理单元, 用于根 据故障解析机制, 对所述第一错误数据进行解析, 得到所述第一错误数据的 故障解析信息。
结合第三方面的第四种可能的实现方式, 在第五种可能的实现方式中, 故障解析信息进行分析, 得到故障处理建议。
结合第三方面的第五种可能的实现方式, 在第六种可能的实现方式中, 所述接收单元还用于接收所述处理器发送的第二错误数据; 所述故障处理单 元还用于根据所述故障解析机制, 对所述第二错误数据进行解析, 得到所述 第二错误数据的故障解析信息; 其中, 所述第二错误数据为所述计算机产生 所述第一错误数据之前预设时间内产生的错误数据; 则, 所述故障处理单元 用于根据预设的故障处理机制,对所述第一错误数据的故障解析信息进行分 析, 得到故障处理建议包括: 所述故障处理单元根据所述预设的故障处理机 制,对所述第二错误数据的故障解析信息和所述第一错误数据的故障解析信 息进行分析, 得到所述故障处理建议。
第四方面, 本发明实施例提出了一种基板管理控制器, 所述基板管理控 制器包括处理器、 存储器、 总线和通信接口;
所述存储器用于存储计算机执行指令, 所述处理器与所述存储器通过所 述总线连接, 当所述基板管理控制器运行时, 所述处理器执行所述存储器存 储的所述计算机执行指令, 以使所述基板管理控制器执行第二方面所述的故 障处理方法, 或者第二方面任一可能的实现方式所述的故障处理方法。
第五方面, 本发明实施例提出了一种计算机可读介质, 包括计算机执行 指令, 以供计算机的处理器执行所述计算机执行指令时, 所述计算机执行第 二方面所述的故障处理方法, 或者第二方面任一可能的实现方式所述的故障 处理方法。
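In the embodiments of the present invention, when the baseboard management controller in a computer determines that the computer has hung, it can send a read request message to the processor in the computer, where the read request message requests to read first error data recorded by the processor; it then receives a read response message returned by the processor and obtains, according to the read response message, the first error data recorded by the processor. This approach does not rely on the operating system: the baseboard management controller alone obtains the error data in the computer after the computer hangs, which solves the prior-art problem that error data in a computer cannot be obtained once a severe uncorrectable error has caused the system to hang.
Brief Description of the Drawings
To describe the technical solutions of the embodiments of the present invention more clearly, the accompanying drawings needed for describing the prior art or the embodiments are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of a computer according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of another computer according to an embodiment of the present invention;
FIG. 3 is a flowchart of a fault handling method according to an embodiment of the present invention;
FIG. 4 is a flowchart of another fault handling method according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a baseboard management controller according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of another baseboard management controller according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention provide a fault handling method, a related apparatus, and a computer that can obtain the error data in a computer after a severe uncorrectable error has caused the computer to hang.
Note that the terms "first" and "second" in the specification, claims, and drawings of the present invention distinguish similar objects and do not necessarily describe a particular order or sequence. It should be understood that data so termed are interchangeable where appropriate. In the specification, claims, and drawings, a computer hang means that the screen goes black, the processor cannot run computer instructions, and input devices such as the mouse or keyboard accept no input.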
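Embodiment 1
FIG. 1 is a schematic diagram of a computer according to an embodiment of the present invention. The computer includes a processor 11 and a baseboard management controller (BMC) 12.
The BMC 12 is configured to send a read request message to the processor 11 when it determines that the computer has hung, where the read request message requests to read first error data recorded by the processor 11. The first error data is error data generated in the computer; it may be all of the error data generated in the computer or only part of it. For example, the first error data may be the error data generated within 2 seconds before the computer hung. This embodiment imposes no limitation here.
The processor 11 is configured to receive the read request message and send a read response message to the BMC 12. Although the computer has hung at this point and the processor cannot execute any computer instruction, the processor can still receive and respond to the read request message.
The BMC 12 is configured to receive the read response message returned by the processor 11 and to obtain, according to the read response message, the first error data recorded by the processor 11.
For example, the processor 11 may record the first error data in its own registers, and the BMC 12 may use the addresses of those registers to send the read request message to the processor 11 so as to fetch the first error data from the registers. Although the computer has hung and cannot run computer instructions, the registers of the processor 11 can still respond to the read request message and return a read response message, for example one that returns the first error data, so that the BMC 12 can obtain the first error data according to the read response message. Note that in this embodiment the first error data may comprise one or more pieces of error data; no limitation is imposed here.
In this embodiment, when the BMC 12 determines that the computer has hung, it can send a read request message to the processor 11, where the read request message requests to read the first error data recorded by the processor 11; it then receives the read response message returned by the processor 11 and obtains the first error data according to it. This embodiment does not rely on the operating system: the BMC alone obtains the error data in the computer after the computer hangs, which solves the prior-art problem that error data in a computer cannot be obtained once a severe uncorrectable error has caused the system to hang.
This embodiment is described in detail below.
(1) Determining that the computer has hung
Generally, the uncorrectable errors caused by computer faults can be divided into catastrophic errors, fatal errors, and recoverable errors. Catastrophic and fatal errors are the most severe and may cause a blue screen, a purple screen, or even a hang (for example a black screen or a freeze). The catastrophic or fatal errors in the computer can therefore be monitored, for example the internal error (IERR, a catastrophic error) or the machine check error (MCERR, a fatal error). When a catastrophic or fatal error occurs and the computer can run neither the instructions of the basic input/output system (BIOS) nor the instructions of the operating system (OS), it can be determined that the computer has hung.
Specifically, the processor 11 may further be configured to obtain the first error data and record it. For example, the processor 11 may generate or receive the first error data and record it in a cache of the computer, in a register of the processor 11, or in another module that has storage capability. On the one hand, after the processor 11 obtains the first error data, if the computer has not hung, the processor 11 may send the first error data to the BMC 12. For example, an error collection instruction of the BIOS is configured in the computer in advance; if the computer has not hung, the processor 11 executes that instruction and, according to it, sends the first error data to the BMC 12; if the computer has hung, the processor 11 cannot execute any computer instruction. On the other hand, after the processor 11 obtains the first error data, if the first error data is of a severe uncorrectable error type, the processor 11 may also send a severe fault event indication to notify the BMC 12 that the computer has generated a catastrophic or fatal error that may cause a hang. Here, the first error data being of a severe uncorrectable error type means that it is a catastrophic error or a fatal error. The BMC 12 may then be configured to receive the severe fault event indication sent by the processor 11; if, counting from receipt of the indication, the BMC 12 has not received even part of the first error data from the processor 11 within a preset waiting time, the BMC 12 may determine that the computer has hung.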
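The timeout rule in the preceding paragraph can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patent's implementation: the class name `HangDetector`, its method names, and the explicit `now` timestamps are hypothetical, and the source only says that the waiting time is "preset" without giving a value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HangDetector:
    """Hypothetical BMC-side helper: decides whether the host has hung."""
    wait_seconds: float                      # preset waiting time (value is an assumption)
    indication_time: Optional[float] = None  # when the severe fault event indication arrived
    data_received: bool = False              # has any part of the first error data arrived?

    def on_severe_fault_indication(self, now: float) -> None:
        # Start (or restart) the waiting window.
        self.indication_time = now
        self.data_received = False

    def on_error_data(self, now: float) -> None:
        # Only data that arrives inside the waiting window counts.
        if self.indication_time is None:
            return
        if now - self.indication_time <= self.wait_seconds:
            self.data_received = True

    def is_hung(self, now: float) -> bool:
        # Hung = indication seen, window elapsed, and no error data arrived in it.
        if self.indication_time is None:
            return False
        window_over = now - self.indication_time >= self.wait_seconds
        return window_over and not self.data_received
```

A caller would feed the detector timestamps from its own clock, for example `detector.on_severe_fault_indication(time.monotonic())`, and poll `is_hung` before issuing the read request.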
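In addition, the BMC 12 may determine that the computer has hung according to a user instruction. For example, a user who finds that the computer has hung may notify the BMC 12, and the BMC 12 may determine, according to the user's instruction, that the computer has hung and thus start obtaining the first error data.
(2) Obtaining the first error data
On receiving the read request message, the processor 11 may, according to it, carry the first error data in the read response message and return that message to the BMC 12. In this case the BMC 12 has read the data successfully and can obtain the first error data recorded by the processor 11 from the read response message.
However, when certain hardware faults trigger an uncorrectable error that hangs the computer, the BMC 12 may fail to read the first error data. The read response message then carries a read failure indication, which indicates that reading the first error data from the processor 11 has failed. The BMC 12 may then instruct a warm reboot module of the computer, or a user, to warm-reboot the computer, so that during the warm reboot the processor 11 executes a fault collection instruction of the computer's BIOS, obtains the first error data according to that instruction, and sends it to the BMC 12. The BMC 12 can receive the first error data sent by the processor 11, which completes the acquisition of the first error data.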
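The two outcomes of the read (data returned, or a read failure indication followed by a warm reboot) can be sketched as below. The `ReadResponse` type and the two callbacks are hypothetical names introduced only for illustration; the source does not define a message format or an interface for the warm reboot module.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ReadResponse:
    """Hypothetical shape of the read response message."""
    read_failed: bool                   # True when the message carries a read failure indication
    error_data: Optional[bytes] = None  # first error data when the read succeeded


def obtain_first_error_data(
    response: ReadResponse,
    request_warm_reboot: Callable[[], None],
    receive_from_bios: Callable[[], bytes],
) -> bytes:
    """Return the first error data, falling back to the warm-reboot path."""
    if not response.read_failed and response.error_data is not None:
        # Success path: the data travels inside the read response message.
        return response.error_data
    # Failure path: ask for a warm reboot, then wait for the data that the
    # BIOS fault collection instruction sends after the reboot.
    request_warm_reboot()
    return receive_from_bios()
```

A warm reboot is used on the failure path because, as the description notes, it keeps the contents of the processor registers, whereas a cold reboot would lose them.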
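Note that a computer reboot may be a warm reboot or a cold reboot. A cold reboot powers the computer off and initializes it, and information may be lost afterwards; for example, the information held in the processor's registers is lost after a cold reboot. Pressing the restart power switch is a cold reboot. A warm reboot, by contrast, neither powers the computer off nor initializes it, and the information held in the processor's registers is not lost. Choosing "Restart" from the Start menu, which shuts the computer down and starts it again through the normal procedure, is a warm reboot. In this embodiment and in the subsequent embodiments, a warm reboot of the computer always has this meaning.
In addition, after obtaining the first error data, the BMC 12 may send a clear-data message to the processor 11 to instruct the processor 11 to delete the first error data it has recorded, which avoids wasting storage resources.
Optionally, after receiving the severe fault event indication sent by the processor 11, the BMC 12 may send an alarm message to a fault alarm module of the computer or perform a print operation, so that the user is notified of the severe fault alarm event and learns of the computer fault in time.
(3) Fault analysis, location, and handling
In the prior art, usually only the error data generated while the computer has not hung can be printed; there is no complete fault record, and fault analysis, location, and handling rely entirely on manual work. In this embodiment, the BMC 12 can keep a complete fault record and can also locate the fault source automatically and give a fault handling suggestion, which helps faults to be handled and recovered from promptly. The specific solution is as follows:
The first error data recorded by the processor 11 is usually information expressed as "0"s and "1"s. The BMC 12 may therefore be configured to parse the first error data according to a fault parsing mechanism and obtain fault parsing information of the first error data. This information may include, for each piece of error data in the first error data, when it was generated, who collected it, which processor and which core it came from, and what kind of error it is. For example, on an x86 computer, the BMC 12 may parse the binary first error data according to Intel's fault code definitions to obtain the fault parsing information. The fault parsing information can be given to maintenance personnel or users so that they understand the fault, and it can also be used for subsequent fault location, analysis, and handling.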
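As a rough illustration of what such a fault parsing mechanism might do on x86, the sketch below decodes a 64-bit machine-check status value into named fields and attaches the metadata the description lists (time, collector, processor, core). The bit positions follow the IA32_MCi_STATUS layout as documented in Intel's Software Developer's Manual to the best of my knowledge and should be checked against the manual; the function name and the metadata parameters are assumptions, and the patent does not say which registers or fields its parser uses.

```python
def parse_mci_status(raw: int, *, timestamp: float, collector: str,
                     processor: int, core: int) -> dict:
    """Decode a raw 64-bit status value into fault parsing information."""

    def bit(n: int) -> bool:
        return bool((raw >> n) & 1)

    return {
        # Metadata named in the description (supplied by the caller).
        "time": timestamp,
        "collected_by": collector,
        "processor": processor,
        "core": core,
        # Status flags (assumed IA32_MCi_STATUS bit positions).
        "valid": bit(63),             # VAL: register holds a valid error
        "overflow": bit(62),          # OVER: an earlier error was overwritten
        "uncorrected": bit(61),       # UC: error was not corrected by hardware
        "context_corrupt": bit(57),   # PCC: processor context may be corrupt
        # Error codes.
        "mca_error_code": raw & 0xFFFF,              # bits 15:0
        "model_specific_code": (raw >> 16) & 0xFFFF,  # bits 31:16
    }
```

The resulting dictionary is the kind of human-readable record that could be shown to maintenance personnel or passed to a later analysis step.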
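The BMC 12 may further be configured to analyze the fault parsing information of the first error data according to a preset fault handling mechanism and obtain a fault handling suggestion. The fault handling suggestion may include fault location information and/or handling suggestion information, so that a user or a fault maintenance engineer can handle the computer according to the suggestion and restore it. Further, the first error data may be only the error data generated in a very short period before the computer hung, for example within 0.5 seconds before the hang; to improve the accuracy of fault location and analysis, the fault parsing information of more error data can be analyzed. Specifically, before determining that the computer has hung, the BMC 12 may also receive second error data sent by the processor 11. The second error data differs from the first error data: it is the error data generated within a preset time before the computer generated the first error data. The BMC 12 may parse the second error data according to the fault parsing mechanism to obtain fault parsing information of the second error data, and then analyze the fault parsing information of both the second and the first error data according to the preset fault handling mechanism to obtain the fault handling suggestion. For example, if the first error data is the error data generated within 0.5 seconds before the hang and the preset time is 4.5 seconds, the second error data may be the error data generated from 5 seconds to 0.5 seconds before the hang; the BMC 12 can then analyze the fault parsing information of the error data from the full 5 seconds before the hang to obtain the fault handling suggestion.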
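The window arithmetic in this example (0.5 s of first error data plus a 4.5 s preset time gives a 5 s analysis window) can be made concrete with a short sketch. The function name, the `(timestamp, payload)` record shape, and the choice to treat the window boundaries as inclusive on the earlier side are assumptions made for illustration only.

```python
from typing import List, Tuple

Record = Tuple[float, str]  # (timestamp in seconds, payload); hypothetical shape


def split_by_window(records: List[Record], hang_time: float,
                    first_window: float = 0.5,
                    preset_time: float = 4.5) -> Tuple[List[Record], List[Record]]:
    """Split records into (first error data, second error data) by timestamp."""
    first_start = hang_time - first_window     # start of the first-error-data window
    second_start = first_start - preset_time   # start of the second-error-data window
    first = [r for r in records if first_start <= r[0] <= hang_time]
    second = [r for r in records if second_start <= r[0] < first_start]
    return first, second


def records_for_analysis(records: List[Record], hang_time: float) -> List[Record]:
    """All records from the combined window, oldest first."""
    first, second = split_by_window(records, hang_time)
    return sorted(second + first)
```

With the default values, `records_for_analysis` returns everything from the 5 seconds before the hang, which is the set the description says is analyzed to produce the fault handling suggestion.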
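Further, the BMC 12 may print the fault parsing information of the first error data or the fault handling suggestion, so that a user or maintenance engineer can handle the computer's fault according to the printed information.
Further, the BMC 12 may save at least one of the fault parsing information of the first error data, the fault parsing information of the second error data, the first error data, and the second error data to a fault information library of the computer. This yields a fault record for the computer that helps with locating and recovering from later faults. For example, the BMC 12 may save the fault parsing information of both the first and the second error data to the fault information library, so that the library holds complete error data and can provide a complete fault record. In this embodiment, the fault information library may be located inside the BMC 12 or outside it.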
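One simple way to picture the fault information library is as an append-only list of serialized records, each holding whichever of the four items is available. The sketch below assumes JSON as the storage format and hex strings for raw data; neither choice comes from the source, and all names are hypothetical.

```python
import json
from typing import List, Optional


def build_fault_record(first_info: Optional[dict] = None,
                       second_info: Optional[dict] = None,
                       first_raw: Optional[bytes] = None,
                       second_raw: Optional[bytes] = None) -> str:
    """Serialize at least one of the four items into a single JSON line."""
    record = {}
    if first_info is not None:
        record["first_info"] = first_info
    if second_info is not None:
        record["second_info"] = second_info
    if first_raw is not None:
        record["first_raw"] = first_raw.hex()    # raw bytes stored as hex text
    if second_raw is not None:
        record["second_raw"] = second_raw.hex()
    if not record:
        raise ValueError("a fault record needs at least one item")
    return json.dumps(record, sort_keys=True)


def save_to_library(library: List[str], entry: str) -> None:
    """Append one serialized record to the (in-memory) fault information library."""
    library.append(entry)
```

In a real BMC the list would be replaced by persistent storage, either inside the controller or outside it, as the description allows.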
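Note that in practice, faults may be located, analyzed, and handled in different ways depending on the application scenario. In a scenario with more than one machine, for example, the system may include several computers as described in this embodiment, each of which may be capable of fault location, analysis, and handling. The BMC of one of these computers (for example a master computer) can then collect error data from the BMCs of the other computers and perform unified fault location, analysis, and handling for all computers in the system. Alternatively, the BMCs of the computers may report the error data they obtain to a management device in the system (such as a management server), and the management device performs unified fault location, analysis, and handling for all computers in the system in the manner described in this embodiment.
This embodiment does not rely on the operating system: the BMC 12 alone obtains the error data in the computer after the computer hangs, which solves the prior-art problem that error data in a computer cannot be obtained once a severe uncorrectable error has caused the system to hang. In addition, the BMC 12 can record a complete fault record in the fault information library, parse the first error data, and analyze the resulting fault parsing information according to the preset fault handling mechanism to locate the fault source and give a handling suggestion.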
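Embodiment 2
To better explain the present invention, numerous specific details are given in the detailed description below. A person skilled in the art will understand that the invention can also be practiced without some of these details. This embodiment describes in detail, with reference to FIG. 2, the structure and functions of the processor 11 and the BMC 12 of Embodiment 1.
FIG. 2 is a schematic structural diagram of a computer according to an embodiment of the present invention. The computer consists of a processor 11 and a BMC 12. The processor 11 may include a recording module 21, a storage module 22, and an instruction execution module 23. The recording module 21 may specifically be the Machine Check Architecture (MCA), which is responsible for the internal functional modules of the processor 11, and/or the Advanced Error Reporting (AER) mechanism of the PCIe specification, which is responsible for the computer's input/output devices. Correspondingly, the storage module 22 may be the MCA registers and/or the AER registers, which may be located inside the processor 11. The instruction execution module 23 may be a core of the processor 11 and executes BIOS instructions and operating system instructions.
The recording module 21 may obtain the error data in the computer, for example by generating the error data that results when an internal functional module of the processor 11 fails, or by receiving the error data that results when an I/O device fails. The error data in the computer includes, but is not limited to, the first error data and the second error data of this embodiment. The recording module 21 may record the obtained error data in the storage module 22. Specifically, error data obtained by the MCA may be recorded in the MCA registers, and error data obtained by AER may be recorded in the AER registers; the range of error data that the MCA or AER captures can be set by having the BIOS configure the corresponding registers. Optionally, when or after recording error data in the corresponding register, the MCA or AER may also store the address of that register in a first register, so that the instruction execution module 23 can later, according to the BIOS error collection instruction, use the address recorded in the first register to obtain the error data.
The recording module 21 may also trigger a system management interrupt (SMI) when it obtains error data in the computer. The SMI triggers the instruction execution module 23 to execute the BIOS error collection instruction. If the computer has not hung, the instruction execution module 23 can, according to that instruction, fetch the error data from the storage module 22 and send it to the BMC 12; if the computer has hung, the instruction execution module 23 cannot execute any computer instruction. The BIOS error collection instruction may be configured in advance in the memory that stores the BIOS instructions.
As Embodiment 1 shows, the second error data is the error data generated within a preset time before the computer generated the first error data, so the recording module 21 obtains the second error data first and the first error data afterwards. When the recording module 21 obtains the second error data, it may record the data in the storage module 22 and also trigger an SMI. If the computer has not hung, the instruction execution module 23 may, in response to the SMI, execute the BIOS error collection instruction and, according to it, fetch the second error data from the storage module 22 and send it to the BMC 12. Optionally, the instruction execution module 23 may send the second error data to the BMC 12 using the Intelligent Platform Management Interface (IPMI) standard, and the BMC 12 may receive it using the same standard. Note that when the second error data comprises several pieces of error data and the recording module 21 obtains it over several occasions, the recording module 21 may trigger the SMI each time it obtains part of the second error data; correspondingly, the instruction execution module 23 may execute the BIOS error collection instruction several times and send the second error data to the BMC 12 in several batches. Optionally, after sending the second error data to the BMC 12, the instruction execution module 23 may execute a delete instruction of the operating system and, according to it, delete the second error data that the recording module 21 saved. In other words, the instruction execution module 23 can remove from the storage module 22 any error data that has already been sent to the BMC 12, which avoids sending the same error data to the BMC 12 twice.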
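The send-then-delete behavior at the end of this paragraph is what prevents duplicates when the collection instruction runs several times. A minimal sketch, assuming the storage module can be modeled as a Python list and that `send` is a hypothetical callback standing in for the IPMI transfer:

```python
from typing import Callable, List


def flush_pending(storage: List[bytes], send: Callable[[bytes], None]) -> int:
    """Send every stored record to the BMC, deleting each one after it is sent.

    Returns the number of records sent in this run.
    """
    sent = 0
    while storage:
        record = storage[0]
        send(record)      # hand the record to the BMC first
        storage.pop(0)    # delete only after the send, so nothing is lost or resent
        sent += 1
    return sent
```

Because sent records are removed, a second run triggered by a later interrupt forwards only the records that arrived in between.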
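After obtaining the second error data, the recording module 21 may also trigger the SMI if it obtains the first error data. Further, if the first error data is of a severe uncorrectable error type, that is, a catastrophic or fatal error, the recording module 21 may also trigger a severe fault event indication to notify the BMC 12 that the computer has generated a catastrophic or fatal error that may cause a hang. If the first error data really is of a severe uncorrectable error type and the computer has hung, the instruction execution module 23 cannot execute computer instructions: even though the recording module 21 has triggered an SMI, the instruction execution module 23 cannot execute the BIOS error collection instruction and cannot fetch the first error data from the storage module 22 for the BMC 12. Therefore, if, counting from receipt of the severe fault event indication, the BMC 12 has not received even part of the first error data from the processor 11 within the preset waiting time, it can determine that the computer has hung. Specifically, the recording module 21 may trigger the severe fault event indication by changing the level of the CATERR_N or ERROR_N pin, and the BMC 12 may receive the indication by receiving the level signal of the CATERR_N or ERROR_N pin.
When the BMC 12 determines that the computer has hung, it may send a read request message to the recording module 21 requesting to read the first error data. The recording module 21 can still receive the read request message after the hang and send a read response message to the BMC 12, so the BMC 12 can receive the read response message and, according to it, obtain the first error data recorded by the processor 11. Specifically, the BMC 12 may traverse the MCA registers or the AER registers over the Platform Environment Control Interface (PECI) bus to read the first error data from them. If the read succeeds, the read response message returned by the MCA or AER registers carries the first error data, and the BMC 12 obtains it. If the read fails, the read response message carries a read failure indication, for example garbled data; the BMC 12 may then instruct the warm reboot module of the computer, or a user, to warm-reboot the computer, so that during the warm reboot the instruction execution module 23 executes the BIOS fault collection instruction, traverses the MCA or AER registers according to it, obtains the first error data, and sends it to the BMC 12 using the IPMI standard. The BMC 12 can receive the first error data sent by the fault collection instruction.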
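The register traversal described above can be pictured as a loop over a list of register addresses that stops at the first failed read. The sketch deliberately does not model the PECI protocol itself: `read_register` is a hypothetical callback that returns the register value or `None` for a read failure indication, and the addresses in the usage note are placeholders, not real register addresses.

```python
from typing import Callable, Dict, Iterable, Optional


def traverse_registers(
    addresses: Iterable[int],
    read_register: Callable[[int], Optional[int]],
) -> Optional[Dict[int, int]]:
    """Read every register; return None as soon as any read fails.

    A None result tells the caller to fall back to the warm-reboot path.
    """
    values: Dict[int, int] = {}
    for address in addresses:
        value = read_register(address)
        if value is None:          # read failure indication
            return None
        values[address] = value
    return values
```

For instance, with a dictionary standing in for the hardware, `traverse_registers([0x10, 0x14], fake.get)` returns both values when the dictionary holds both addresses and `None` when either is missing.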
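In this embodiment, the BMC 12 cooperates with the processor 11 to obtain the error data in the computer after the computer hangs, which solves the prior-art problem that error data in a computer cannot be obtained once a severe uncorrectable error has caused the system to hang.
Embodiment 3
This embodiment provides a fault handling method for the computer shown in FIG. 1 or FIG. 2, which includes a BMC and a processor. The method includes:
S301: When the BMC determines that the computer has hung, it sends a read request message to the processor, where the read request message requests to read first error data recorded by the processor.
The processor may obtain the first error data and record it. When the BMC determines that the computer has hung, it may send a read request message to the processor to read the first error data recorded by the processor. Although the computer has hung at this point and the processor cannot execute any computer instruction, the processor can still receive and respond to the read request message, so the BMC can obtain the first error data. For example, the processor may record the first error data in its own registers; the BMC may then send the read request message to those registers, which can receive it and return a read response message. In this embodiment the first error data may comprise one or more pieces of error data; no limitation is imposed here.
The BMC can determine that the computer has hung in several ways; see Embodiment 1 or Embodiment 2 for details, which are not repeated here.
S302: The BMC receives the read response message returned by the processor and obtains, according to the read response message, the first error data recorded by the processor.
If the BMC reads the data from the processor successfully, the read response message may carry the first error data, and the BMC can obtain the first error data recorded by the processor from the read response message. If the read fails, the read response message may carry a read failure indication, and the BMC can obtain the first error data in another way. For example, a BIOS fault collection instruction may be configured in the computer in advance; when the read response message carries a read failure indication, the BMC may instruct the warm reboot module of the computer, or a user, to warm-reboot the computer, so that during the warm reboot the processor executes the fault collection instruction of the computer's BIOS, obtains the first error data according to it, and sends it to the BMC. The BMC then completes the acquisition by receiving the first error data sent by the processor.
In this embodiment, when the BMC of a computer determines that the computer has hung, it can send a read request message to the processor of the computer requesting to read the first error data recorded by the processor, receive the read response message returned by the processor, and obtain the first error data according to it. This embodiment does not rely on the operating system: the BMC alone obtains the error data in the computer after the computer hangs, which solves the prior-art problem that error data in a computer cannot be obtained once a severe uncorrectable error has caused the system to hang.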
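Embodiment 4
This embodiment provides a fault handling method for the computer shown in FIG. 1 or FIG. 2, which includes a BMC and a processor. The method includes:
S401: The BMC receives a severe fault event indication sent by the processor. The processor sends the indication when it obtains first error data and the first error data is of a severe uncorrectable error type.
S402: The BMC sends an alarm message to a fault alarm module of the computer or performs a print operation to notify the user of the severe fault alarm event.
After receiving the severe fault event indication sent by the processor, the BMC may trigger a fault alarm sensor through the alarm message or perform a print operation, to notify the user that the computer has generated a severe fault that may lead to a hang. In this embodiment, S402 is an optional step.
S403: If, counting from receipt of the severe fault event indication, the BMC has not received even part of the first error data from the processor within a preset waiting time, the BMC determines that the computer has hung and performs step S404.
After the processor obtains the first error data, if the computer has not hung, the processor can execute the BIOS error collection instruction and, according to it, send the first error data to the BMC; if the computer has hung, the processor cannot execute any computer instruction. Therefore, if the BMC has not received even part of the first error data from the processor within the preset waiting time after receiving the severe fault event indication, it can determine that the computer has hung.
S404: The BMC sends a read request message to the processor, where the read request message requests to read the first error data recorded by the processor.
After determining that the computer has hung, the BMC can request the first error data from the processor, and thereby obtain the error data in the computer after the hang.
S405: The BMC receives the read response message returned by the processor and obtains, according to the read response message, the first error data recorded by the processor.
The BMC may obtain the first error data according to the read response message in the manner of S405a or in the manner of S405b.
S405a: If the read response message carries the first error data, the BMC obtains the first error data recorded by the processor from the read response message.
A read response message that carries the first error data shows that the BMC has read the first error data from the processor successfully, and the BMC can obtain the first error data from the message.
S405b: If the read response message carries a read failure indication, which indicates that reading the first error data from the processor has failed, the BMC instructs the warm reboot module of the computer, or a user, to warm-reboot the computer, so that during the warm reboot the processor executes the fault collection instruction of the computer's BIOS, obtains the first error data according to it, and sends it to the BMC; the BMC receives the first error data sent by the processor.
The BIOS fault collection instruction may be configured in the computer in advance. When the BMC fails to read the first error data from the processor, the read response message carries a read failure indication, and the BMC instructs the warm reboot module of the computer, or a user, to warm-reboot the computer, so that during the warm reboot the processor executes the fault collection instruction of the computer's BIOS, obtains the first error data according to it, and sends it to the BMC.
S406: The BMC parses the first error data according to a fault parsing mechanism and obtains fault parsing information of the first error data.
After the BMC obtains the first error data, it parses the data according to the fault parsing mechanism and obtains the fault parsing information of the first error data. This information may include, for each piece of error data in the first error data, when it was generated, who collected it, which processor and which core it came from, and what kind of error it is. The fault parsing information can be given to maintenance personnel or users so that they understand the fault, and it can also be used for subsequent fault location, analysis, and handling.
S407: The BMC analyzes the fault parsing information of the first error data according to a preset fault handling mechanism and obtains a fault handling suggestion.
The fault handling suggestion may be fault location information or handling suggestion information, so that a user or maintenance engineer can handle the computer according to the suggestion and restore it.
S408: The BMC prints the fault handling suggestion.
After obtaining the fault handling suggestion, the BMC may print it, or print it together with the fault parsing information of the first error data, so that a user or maintenance engineer can handle the computer according to the printed information and restore it.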
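Steps S401 to S408 can be strung together as one control flow. The sketch below is only an outline under stated assumptions: every callback (`data_arrived_in_window`, `read_from_processor`, `warm_reboot_and_collect`, `parse`, `analyze`, `notify`) is a hypothetical stand-in for a mechanism the description leaves open, and the function is assumed to be called once a severe fault event indication has been received (S401).

```python
from typing import Callable, Optional


def run_fault_flow(
    data_arrived_in_window: Callable[[], bool],
    read_from_processor: Callable[[], Optional[bytes]],
    warm_reboot_and_collect: Callable[[], bytes],
    parse: Callable[[bytes], dict],
    analyze: Callable[[dict], str],
    notify: Callable[[str], None],
) -> Optional[str]:
    """Run S402 to S408; return the suggestion, or None if the host did not hang."""
    notify("severe fault event indication received")  # S402 (optional alarm/print)
    if data_arrived_in_window():                       # S403: data came in time, no hang
        return None
    data = read_from_processor()                       # S404 and S405a: read request/response
    if data is None:                                   # S405b: read failure indication
        data = warm_reboot_and_collect()
    info = parse(data)                                 # S406: fault parsing information
    suggestion = analyze(info)                         # S407: fault handling suggestion
    notify(suggestion)                                 # S408: print the suggestion
    return suggestion
```

Passing the mechanisms in as callbacks keeps the outline independent of any particular bus, message format, or analysis rule set.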
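This embodiment does not rely on the operating system: the BMC alone obtains the error data in the computer after the computer hangs, which solves the prior-art problem that error data in a computer cannot be obtained once a severe uncorrectable error has caused the system to hang. In addition, the BMC can parse the first error data and analyze the resulting fault parsing information according to the preset fault handling mechanism to locate the fault source and give a handling suggestion.
In step S407, only the fault parsing information of the first error data is analyzed to obtain the fault handling suggestion, and the first error data may be only the error data generated in a very short period before the computer hung, for example within 2 seconds before the hang. To improve the accuracy of fault location and analysis, the fault parsing information of more error data can therefore be analyzed.
Before step S403, the BMC may also receive second error data sent by the processor, where the second error data is the error data generated within a preset time before the computer generated the first error data.
Step S407 may then be: the BMC parses the second error data according to the fault parsing mechanism to obtain fault parsing information of the second error data, and analyzes the fault parsing information of both the second and the first error data to obtain the fault handling suggestion.
In this embodiment, the BMC can analyze the fault parsing information of both the second and the first error data to obtain the fault handling suggestion, which improves the accuracy of fault location and analysis.
Optionally, after step S405, the BMC may save at least one of the fault parsing information of the first error data, the fault parsing information of the second error data, the first error data, and the second error data to a fault information library of the computer. For example, it may save the fault parsing information of both the first and the second error data, or the first and the second error data themselves, so that the fault information library holds a complete fault record.
Optionally, after step S405, the BMC may send a clear-data message to the processor to instruct the processor to delete the first error data it has recorded, which avoids wasting storage resources.
For how the BMC in Embodiment 3 or Embodiment 4 interacts with the processor and performs fault handling, refer to the BMC described in Embodiment 1 or Embodiment 2.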
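Embodiment 5
This embodiment provides a BMC for use in a computer that includes the BMC and a processor, for example the computer shown in FIG. 1 or FIG. 2. As shown in FIG. 5, the BMC may include a sending unit and a receiving unit.
The sending unit is configured to send a read request message to the processor when it is determined that the computer has hung, where the read request message requests to read first error data recorded by the processor. Although the computer has hung and the processor cannot execute any computer instruction, the processor can still receive and respond to the read request message.
The receiving unit is configured to receive the read response message returned by the processor and to obtain, according to the read response message, the first error data recorded by the processor. For example, when the read response message carries the first error data, the receiving unit may obtain the first error data from the message. As another example, when the read response message carries a read failure indication, which indicates that reading the first error data from the processor has failed, the receiving unit may instruct a warm reboot unit of the computer, or a user, to warm-reboot the computer, so that during the warm reboot the processor executes the fault collection instruction of the computer's BIOS, obtains the first error data according to it, and sends it to the receiving unit; the receiving unit receives the first error data sent by the processor. Optionally, after obtaining the first error data, the receiving unit may send a clear-data message to the processor to instruct the processor to delete the first error data it has recorded, which avoids wasting storage resources.
Optionally, the BMC may further include a determining unit configured to receive a severe fault event indication sent by the processor, where the processor sends the indication when it obtains the first error data and the first error data is of a severe uncorrectable error type; if, counting from receipt of the indication, not even part of the first error data has been received from the processor within a preset waiting time, the determining unit determines that the computer has hung.
Optionally, the BMC may further include a fault alarm unit configured to, after the determining unit receives the severe fault event indication sent by the processor, send an alarm message to a fault alarm unit of the computer or perform a print operation to notify the user of the severe fault alarm event.
Optionally, the BMC may further include a fault handling unit configured to parse the first error data according to a fault parsing mechanism and obtain fault parsing information of the first error data. This information may include, for each piece of error data in the first error data, when it was generated, who collected it, which processor and which core it came from, and what kind of error it is. The fault parsing information can be given to maintenance personnel or users so that they understand the fault, and it can also be used for subsequent fault location, analysis, and handling. The fault handling unit may further analyze the fault parsing information of the first error data according to a preset fault handling mechanism and obtain a fault handling suggestion. The fault handling suggestion may be fault location information or handling suggestion information, so that a user or maintenance engineer can handle the computer according to the suggestion and restore it.
The fault handling unit analyzes only the fault parsing information of the first error data to obtain the fault handling suggestion, and the first error data may be only the error data generated in a very short period before the computer hung, for example within 0.8 seconds before the hang. To improve the accuracy of fault location and analysis, the fault handling unit may therefore analyze the fault parsing information of more error data. Specifically, the receiving unit is further configured to receive second error data sent by the processor, where the second error data is the error data generated within a preset time before the computer generated the first error data. The second error data can then be parsed according to the fault parsing mechanism to obtain its fault parsing information, and the fault parsing information of both the second and the first error data can be analyzed according to the preset fault handling mechanism to obtain the fault handling suggestion.
Optionally, the fault handling unit is further configured to print the fault parsing information of the first error data or the fault handling suggestion.
Optionally, the fault handling unit is further configured to save at least one of the fault parsing information of the first error data, the fault parsing information of the second error data, the first error data, and the second error data to a fault information library of the computer; for example, it saves the fault parsing information of both the first and the second error data, or the first and the second error data themselves, so that the fault information library holds a complete fault record.
For how the BMC in this embodiment interacts with the processor and performs fault handling, refer to the BMC described in Embodiment 1 or Embodiment 2.
In this embodiment, when it is determined that the computer has hung, the sending unit can send a read request message to the processor of the computer requesting to read the first error data recorded by the processor, and the receiving unit can receive the read response message returned by the processor and obtain the first error data according to it. This embodiment does not rely on the operating system: the BMC alone obtains the error data in the computer after the computer hangs, which solves the prior-art problem that error data in a computer cannot be obtained once a severe uncorrectable error has caused the system to hang.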
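An embodiment of the present invention provides a computer-readable medium that includes computer-executable instructions; when a processor of a computer executes these instructions, the computer can perform the fault handling method described in Embodiment 3 or Embodiment 4.
FIG. 6 shows a BMC according to an embodiment of the present invention. The BMC may include:
a processor 601, a memory 602, a system bus 604, and a communication interface 605. The processor 601, the memory 602, and the communication interface 605 are connected by the system bus 604 and communicate with each other over it.
The processor 601 may be a single-core or multi-core central processing unit, an application-specific integrated circuit, or one or more integrated circuits configured to implement embodiments of the present invention.
The memory 602 may be a high-speed RAM or a non-volatile memory, for example at least one disk memory.
The memory 602 is configured to store computer-executable instructions 603. Specifically, the computer-executable instructions 603 may include program code.
When the BMC runs, the processor 601 runs the computer-executable instructions 603 and can perform the method flow of the fault handling method described in Embodiment 3 or Embodiment 4.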
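A person of ordinary skill in the art will understand that aspects of the present invention, or possible implementations of those aspects, may be embodied as a system, a method, or a computer program product. Accordingly, they may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, and so on), or an embodiment combining software and hardware, all of which are referred to herein as a "circuit", "module", or "system". Furthermore, aspects of the invention, or possible implementations of those aspects, may take the form of a computer program product, meaning computer-readable program code stored on a computer-readable medium.
The computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. The computer-readable storage medium includes, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, device, or apparatus, or any suitable combination of the foregoing, such as random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, or portable read-only memory (CD-ROM).
A processor in the computer reads the computer-readable program code stored in the computer-readable medium, so that the processor can perform the functions and actions specified in each step, or combination of steps, of a flowchart, and produces an apparatus that implements the functions and actions specified in each block, or combination of blocks, of a block diagram.
The computer-readable program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer. It should also be noted that, in some alternative implementations, the functions noted in the steps of a flowchart or the blocks of a block diagram may not occur in the order shown. For example, depending on the functions involved, two steps or two blocks shown in succession may in fact be executed substantially simultaneously, or the blocks may sometimes be executed in the reverse order.
A person of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in the embodiments disclosed herein can be implemented in electronic hardware or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and the design constraints of the technical solution. A skilled person may implement the described functions differently for each specific application, but such implementations should not be considered beyond the scope of the present invention.
The foregoing describes only specific implementations of the present invention, and the protection scope of the present invention is not limited to them. Any variation or replacement that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. The protection scope of the present invention shall therefore be subject to the protection scope of the claims.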

Claims

权 利 要求
1、 一种计算机, 包括处理器和基板管理控制器, 其特征在于, 所述基板管理控制器用于在确定所述计算机死机时, 向所述处理器发送 读请求消息, 所述读请求消息用于请求读取所述处理器记录的第一错误数 据;
所述处理器用于接收所述读请求消息, 并向所述基板管理控制器发送读 响应消息;
所述基板管理控制器用于接收所述处理器返回的所述读响应消息,并根 据所述读响应消息, 获得所述处理器记录的所述第一错误数据。
2、 根据权利要求 1 所述的计算机, 其特征在于, 所述处理器还用于获 取所述第一错误数据, 并记录所述第一错误数据;
则所述基板管理控制器用于确定所述计算机死机具体为:
所述基板管理控制器用于接收所述处理器发送的严重故障事件指示, 所 述严重故障事件指示是所述处理器在获取到所述第一错误数据并且所述第 一错误数据属于严重的不可纠正错误类型时发送的;
如果从接收到所述严重故障事件指示开始, 在预设等待时间内, 未接收 到所述处理器发送的至少部分所述第一错误数据, 则所述基板管理控制器用 于确定所述计算机死机。
3、 根据权利要求 1或 2所述的计算机, 其特征在于, 所述基板管理控 制器用于根据所述读响应消息, 获得所述处理器记录的所述第一错误数据具 体为: 当所述读响应消息中携带所述第一错误数据时, 所述基板管理器用于 从所述读响应消息中获得所述处理器记录的所述第一错误数据。
4、 根据权利要求 1或 2所述的计算机, 其特征在于, 所述基板管理控 制器用于根据所述读响应消息 , 获得所述处理器记录的所述第一错误数据具 体为:
当所述读响应消息中携带读失败指示时,所述基板管理控制器用于指示 所述计算机的热重启模块或者用户对所述计算机进行热重启; 其中, 所述读 失败指示用于指示从所述处理器中读取所述第一错误数据失败, 以使得所述 处理器在所述计算机热重启时,执行所述计算机的基本输入输出系统的故障 收集指令, 根据所述基本输入输出系统的故障收集指令, 获取所述第一错误 数据, 并发送给所述基板管理控制器;
所述基板管理控制器用于接收所述处理器发送的所述第一错误数据。
5、 根据权利要求 1-4任一所述的计算机, 其特征在于, 所述基板管理 控制器在根据所述读响应消息 , 获得所述处理器记录的所述第一错误数据之 后, 还用于向所述处理器发送清除数据消息, 以指示所述处理器删除自身记 录的所述第一错误数据。
6、 根据权利要求 2所述的计算机, 其特征在于, 所述基板管理控制器 还用于在接收所述处理器发送的严重故障事件指示后, 向所述计算机的故障 告警模块发送告警消息或进行打印操作, 以将所述严重故障告警事件通知用 户。
7、 根据权利要求 1-6任一所述的计算机, 其特征在于, 所述基板管理 控制器还用于根据故障解析机制, 对所述第一错误数据进行解析, 得到所述 第一错误数据的故障解析信息。
8、 根据权利要求 7所述的计算机, 其特征在于, 所述基板管理控制器 还用于根据预设的故障处理机制,对所述第一错误数据的故障解析信息进行 分析, 得到故障处理建议。
9、 根据权利要求 8所述的计算机, 其特征在于, 所述基板管理控制器 在确定所述计算机死机之前, 还用于接收所述处理器发送的第二错误数据, 并根据所述故障解析机制, 对所述第二错误数据进行解析, 得到所述第二错 误数据的故障解析信息, 其中, 所述第二错误数据为所述计算机产生所述第 一错误数据之前预设时间内产生的错误数据;
则, 所述基板管理控制器用于根据预设的故障处理机制, 对所述第一错 误数据的故障解析信息进行分析, 得到故障处理建议包括:
所述基板管理控制器用于根据所述预设的故障处理机制,对所述第二错 误数据的故障解析信息和所述第一错误数据的故障解析信息进行分析,得到 所述故障处理建议。
10、 根据权利要求 7-9任一项所述的计算机, 其特征在于, 所述基板管 理控制器还用于打印所述第一错误数据的故障解析信息或所述故障处理建 议。
11、 根据权利要求 7-9任一项所述的计算机, 其特征在于, 所述基板管 理控制器还用于将所述第一错误数据的故障解析信息、所述第二错误数据的 故障解析信息、所述第一错误数据和所述第二错误数据中的至少一种保存到 所述计算机的故障信息库。
12、 一种故障处理方法, 用于包括基板管理控制器和处理器的计算机, 其特征在于, 所述方法包括:
所述基板管理控制器在确定所述计算机死机时, 向所述处理器发送读请 求消息, 所述读请求消息用于请求读取所述处理器记录的第一错误数据; 所述基板管理控制器接收所述处理器返回的读响应消息,并根据所述读 响应消息, 获得所述处理器记录的所述第一错误数据。
13、 根据权利要求 12所述的方法, 其特征在于, 所述方法还包括: 所述基板管理控制器接收所述处理器发送的严重故障事件指示,所述严 重故障事件指示是所述处理器在获取到所述第一错误数据并且所述第一错 误数据属于严重的不可纠正错误类型时发送的; 如果从接收到所述严重故障 事件指示开始, 在预设等待时间内, 未接收到所述处理器发送的至少部分所 述第一错误数据, 则确定所述计算机死机。
14、 根据权利要求 12或 13所述的方法, 其特征在于, 所述基板管理控 制器接收所述处理器返回的读响应消息, 并根据所述读响应消息, 获得所述 处理器记录的所述第一错误数据包括: 所述基板管理控制器在所述读响应消息中携带所述第一错误数据时,从 所述读响应消息中获得所述处理器记录的所述第一错误数据。
15、 根据权利要求 12或 13所述的方法, 其特征在于, 所述基板管理控 制器接收所述处理器返回的读响应消息, 并根据所述读响应消息, 获得所述 处理器记录的所述第一错误数据包括:
所述基板管理控制器在所述读响应消息中携带读失败指示时,指示所述 计算机的热重启模块或者用户对所述计算机进行热重启, 以使得所述处理器 在所述计算机热重启时,执行所述计算机的基本输入输出系统的故障收集指 令, 根据所述基本输入输出系统的故障收集指令, 获取所述第一错误数据, 并发送给所述基板管理控制器; 其中, 所述读失败指示用于指示从所述处理 器中读取所述第一错误数据失败;
所述基板管理控制器接收所述处理器发送的所述第一错误数据。
16、 根据权利要求 13所述的方法, 其特征在于, 在所述基板管理控制 器接收所述处理器发送的严重故障事件指示后, 所述方法还包括:
所述基板管理控制器向所述计算机的故障告警模块发送告警消息或进 行打印操作, 以将所述严重故障告警事件通知用户。
17、 根据权利要求 12-16任一所述的方法, 其特征在于, 在所述基板管 理控制器根据所述读响应消息, 获得所述处理器记录的所述第一错误数据之 后, 所述方法还包括: 所述基板管理控制器根据故障解析机制, 对所述第一 错误数据进行解析, 得到所述第一错误数据的故障解析信息。
18、 根据权利要求 17所述的方法, 其特征在于, 所述方法还包括: 所 述基板管理控制器根据预设的故障处理机制,对所述第一错误数据的故障解 析信息进行分析, 得到故障处理建议。
19、 根据权利要求 18所述的方法, 其特征在于, 在所述基板管理控制 器确定所述计算机死机之前, 所述方法还包括: 所述基板管理控制器接收所 述处理器发送的第二错误数据; 其中, 所述第二错误数据为所述计算机产生 所述第一错误数据之前预设时间内产生的错误数据;
则, 所述基板管理控制器根据预设的故障处理机制, 对所述第一错误数 据的故障解析信息进行分析, 得到故障处理建议包括:
所述基板管理控制器根据所述故障解析机制,对所述第二错误数据进行 解析, 得到所述第二错误数据的故障解析信息, 并根据所述预设的故障处理 机制,对所述第二错误数据的故障解析信息和所述第一错误数据的故障解析 信息进行分析, 得到所述故障处理建议。
20、 根据权利要求 17-19任一项所述的方法, 其特征在于, 所述方法还 包括: 所述基板管理控制器打印所述第一错误数据的故障解析信息或所述故 障处理建议。
21、 根据权利要求 17-19任一项所述的方法, 其特征在于, 所述方法还 包括: 所述基板管理控制器将所述第一错误数据的故障解析信息、 所述第二 错误数据的故障解析信息、所述第一错误数据和所述第二错误数据中的至少 一种保存到所述计算机的故障信息库。
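22. A baseboard management controller, comprising:
a sending unit, configured to send, when it is determined that the computer has crashed, a read request message to the processor, wherein the read request message is used to request reading of first error data recorded by the processor; and
a receiving unit, configured to receive a read response message returned by the processor, and obtain, according to the read response message, the first error data recorded by the processor.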
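23. The baseboard management controller according to claim 22, further comprising:
a determining unit, configured to receive a serious fault event indication sent by the processor, wherein the serious fault event indication is sent by the processor when the processor has acquired the first error data and the first error data is of a serious uncorrectable error type; and if, within a preset waiting time starting from receipt of the serious fault event indication, at least part of the first error data sent by the processor is not received, determine that the computer has crashed.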
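24. The baseboard management controller according to claim 22 or 23, wherein that the receiving unit receives a read response message returned by the processor, and obtains, according to the read response message, the first error data recorded by the processor comprises: when the read response message carries the first error data, the receiving unit obtains the first error data recorded by the processor from the read response message.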
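25. The baseboard management controller according to claim 22 or 23, wherein that the receiving unit receives a read response message returned by the processor, and obtains, according to the read response message, the first error data recorded by the processor comprises:
when the read response message carries a read failure indication, the receiving unit instructs a warm reboot unit of the computer or a user to perform a warm reboot of the computer, so that the processor, during the warm reboot of the computer, executes a fault collection instruction of a basic input/output system of the computer, acquires the first error data according to the fault collection instruction of the basic input/output system, and sends the first error data to the receiving unit, wherein the read failure indication is used to indicate that reading the first error data from the processor has failed; and
the receiving unit receives the first error data sent by the processor.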
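26. The baseboard management controller according to claim 23, further comprising:
a fault alarm unit, configured to: after the determining unit receives the serious fault event indication sent by the processor, send an alarm message to a fault alarm unit of the computer or perform a print operation, to notify a user of the serious fault alarm event.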
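27. The baseboard management controller according to any one of claims 22 to 26, further comprising:
a fault processing unit, configured to parse, according to a fault parsing mechanism, the first error data to obtain fault parsing information of the first error data.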
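28. The baseboard management controller according to claim 27, wherein the fault processing unit is further configured to analyze, according to a preset fault processing mechanism, the fault parsing information of the first error data to obtain a fault processing suggestion.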
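29. The baseboard management controller according to claim 28, wherein the receiving unit is further configured to receive second error data sent by the processor;
the fault processing unit is further configured to parse, according to the fault parsing mechanism, the second error data to obtain fault parsing information of the second error data, wherein the second error data is error data generated by the computer within a preset time before the first error data is generated;
and that the fault processing unit is configured to analyze, according to a preset fault processing mechanism, the fault parsing information of the first error data to obtain a fault processing suggestion comprises:
the fault processing unit analyzes, according to the preset fault processing mechanism, the fault parsing information of the second error data and the fault parsing information of the first error data to obtain the fault processing suggestion.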
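30. The baseboard management controller according to any one of claims 27 to 29, wherein the fault processing unit is further configured to save at least one of the fault parsing information of the first error data, the fault parsing information of the second error data, the first error data, and the second error data to a fault information library of the computer.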
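31. A baseboard management controller, wherein the baseboard management controller comprises a processor, a memory, a bus, and a communication interface;
the memory is configured to store computer-executable instructions, the processor is connected to the memory through the bus, and when the baseboard management controller runs, the processor executes the computer-executable instructions stored in the memory, so that the baseboard management controller performs the fault processing method according to any one of claims 12 to 21.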
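32. A computer-readable medium, comprising computer-executable instructions, wherein when a processor of a computer executes the computer-executable instructions, the computer performs the fault processing method according to any one of claims 12 to 21.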
PCT/CN2014/080618 2014-06-24 2014-06-24 Fault processing method, related apparatus, and computer WO2015196365A1 (zh)

Priority Applications (18)

Application Number Priority Date Filing Date Title
NO14896215A NO3121726T3 (zh) 2014-06-24 2014-06-24
PCT/CN2014/080618 WO2015196365A1 (zh) 2014-06-24 2014-06-24 Fault processing method, related apparatus, and computer
BR112016022329A BR112016022329B1 (pt) 2014-06-24 2014-06-24 Method for fault processing, related apparatus, and computer
SG11201607545PA SG11201607545PA (en) 2014-06-24 2014-06-24 Fault processing method, related apparatus, and computer
DK14896215.2T DK3121726T3 (en) 2014-06-24 2014-06-24 PROCEDURE FOR TROUBLESHOOTING, RELATED DEVICE AND COMPUTER
EP14896215.2A EP3121726B1 (en) 2014-06-24 2014-06-24 Fault processing method, related device and computer
AU2014399227A AU2014399227B2 (en) 2014-06-24 2014-06-24 Fault Processing Method, Related Apparatus and Computer
JP2016562222A JP6333410B2 (ja) 2014-06-24 2014-06-24 Fault processing method, related apparatus, and computer
KR1020167027222A KR101944874B1 (ko) 2014-06-24 2014-06-24 Error processing method, related apparatus, and computer
CN201710454179.1A CN107357671A (zh) 2014-06-24 2014-06-24 Fault processing method, related apparatus, and computer
CA2942045A CA2942045C (en) 2014-06-24 2014-06-24 Fault processing method, related apparatus, and computer
ES14896215.2T ES2667322T3 (es) 2014-06-24 2014-06-24 Fault processing method, related device, and computer
EP17199084.9A EP3355197B1 (en) 2014-06-24 2014-06-24 Fault processing method, related apparatus, and computer
CN201480056020.9A CN105659215B (zh) 2014-06-24 2014-06-24 Fault processing method, related apparatus, and computer
ZA2016/06180A ZA201606180B (en) 2014-06-24 2016-09-06 Fault processing method, related apparatus, and computer
US15/385,701 US10353763B2 (en) 2014-06-24 2016-12-20 Fault processing method, related apparatus, and computer
US16/509,218 US20190332453A1 (en) 2014-06-24 2019-07-11 Fault processing method, related apparatus, and computer
US17/187,111 US11360842B2 (en) 2014-06-24 2021-02-26 Fault processing method, related apparatus, and computer

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2014/080618 WO2015196365A1 (zh) 2014-06-24 2014-06-24 Fault processing method, related apparatus, and computer

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US15/385,701 Continuation US10353763B2 (en) 2014-06-24 2016-12-20 Fault processing method, related apparatus, and computer

Publications (1)

Publication Number Publication Date
WO2015196365A1 2015-12-30

Family

ID=54936439

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/080618 WO2015196365A1 (zh) 2014-06-24 2014-06-24 Fault processing method, related apparatus, and computer

Country Status (14)

Country Link
US (3) US10353763B2 (zh)
EP (2) EP3121726B1 (zh)
JP (1) JP6333410B2 (zh)
KR (1) KR101944874B1 (zh)
CN (2) CN107357671A (zh)
AU (1) AU2014399227B2 (zh)
BR (1) BR112016022329B1 (zh)
CA (1) CA2942045C (zh)
DK (1) DK3121726T3 (zh)
ES (1) ES2667322T3 (zh)
NO (1) NO3121726T3 (zh)
SG (1) SG11201607545PA (zh)
WO (1) WO2015196365A1 (zh)
ZA (1) ZA201606180B (zh)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
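CN105975377A (zh) * 2016-04-29 2016-09-28 浪潮电子信息产业股份有限公司 Method and apparatus for monitoring memory
TWI709082B (zh) * 2019-07-08 2020-11-01 神雲科技股份有限公司 Debug message recording method applied to the boot stage and the post-boot operation stage
TWI715201B (zh) * 2019-09-18 2021-01-01 神雲科技股份有限公司 Boot error information recording method
CN112346786A (zh) * 2019-08-08 2021-02-09 佛山市顺德区顺达电脑厂有限公司 Debug information recording method applied to the boot stage and the post-boot operation stage
CN113190396A (zh) * 2021-03-15 2021-07-30 山东英信计算机技术有限公司 Method, system, and medium for collecting CPU register data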

Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107077408A (zh) 2016-12-05 2017-08-18 华为技术有限公司 Fault processing method, computer system, baseboard management controller, and system
JP7063445B2 (ja) * 2017-03-22 2022-05-09 Necプラットフォームズ株式会社 Fault information processing program, computer, fault notification method, and computer system
CN108108259A (zh) * 2018-01-11 2018-06-01 郑州云海信息技术有限公司 Kernel fault locating method and apparatus
CN108958965B (zh) * 2018-06-28 2021-03-02 苏州浪潮智能科技有限公司 Method, apparatus, and device for BMC monitoring of recoverable ECC errors
CN109240847A (zh) * 2018-09-27 2019-01-18 郑州云海信息技术有限公司 Method, apparatus, terminal, and storage medium for reporting memory errors during POST
US10846162B2 (en) * 2018-11-29 2020-11-24 Oracle International Corporation Secure forking of error telemetry data to independent processing units
CN109783325B (zh) * 2018-12-14 2023-07-25 平安证券股份有限公司 Service monitoring method, apparatus, device, and storage medium
CN109947585A (zh) * 2019-03-13 2019-06-28 西安易朴通讯技术有限公司 Method and apparatus for handling PCIe device faults
CN110532160B (zh) * 2019-09-03 2023-07-25 深圳市智微智能科技股份有限公司 Method for a BMC to record server system warm reboot events
US11243859B2 (en) * 2019-10-09 2022-02-08 Microsoft Technology Licensing, Llc Baseboard management controller that initiates a diagnostic operation to collect host information
CN111008091A (zh) * 2019-12-06 2020-04-14 苏州浪潮智能科技有限公司 Fault processing method, system, and related apparatus for memory CE
US11132314B2 (en) * 2020-02-24 2021-09-28 Dell Products L.P. System and method to reduce host interrupts for non-critical errors
CN113535502A (zh) * 2020-04-17 2021-10-22 捷普科技(上海)有限公司 Error log collection method for a server system
US11204821B1 (en) * 2020-05-07 2021-12-21 Xilinx, Inc. Error re-logging in electronic systems
CN111581058B (zh) * 2020-05-09 2024-03-19 西安易朴通讯技术有限公司 Fault management method, apparatus, device, and computer-readable storage medium
CN112181522A (zh) * 2020-09-28 2021-01-05 亚信科技(中国)有限公司 Data processing method, apparatus, and electronic device
CN112256467B (zh) * 2020-10-23 2022-08-02 英业达科技有限公司 Error type determination system and method
US11269729B1 (en) * 2020-12-21 2022-03-08 Microsoft Technology Licensing, Llc Overloading a boot error signaling mechanism to enable error mitigation actions to be performed
CN113076210B (zh) * 2021-03-26 2023-01-20 山东英信计算机技术有限公司 Server fault diagnosis result notification method, system, terminal, and storage medium
CN113726555A (zh) * 2021-08-02 2021-11-30 华迪计算机集团有限公司 System and method for assisted alarm parsing in a data communication network
CN114201360B (zh) * 2021-11-26 2023-11-17 苏州浪潮智能科技有限公司 AER function management method, apparatus, server, and storage medium
US11921582B2 (en) * 2022-04-29 2024-03-05 Microsoft Technology Licensing, Llc Out of band method to change boot firmware configuration
TWI800443B (zh) * 2022-08-15 2023-04-21 緯穎科技服務股份有限公司 Error reporting optimization method for a Peripheral Component Interconnect Express device and error reporting optimization system for a Peripheral Component Interconnect Express device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101126995A (zh) * 2006-08-14 2008-02-20 国际商业机器公司 Method and apparatus for handling serious hardware errors
US20100313072A1 (en) * 2009-06-03 2010-12-09 International Business Machines Corporation Failure Analysis Based on Time-Varying Failure Rates
CN102467440A (zh) * 2010-11-09 2012-05-23 鸿富锦精密工业(深圳)有限公司 Memory error detection system and method

Family Cites Families (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH02234241A (ja) * 1989-03-08 1990-09-17 Hitachi Ltd Reset/retry circuit
JPH0375844A (ja) * 1989-08-17 1991-03-29 Nec Corp Automatic fault analysis system
JPH05233377A (ja) * 1992-01-09 1993-09-10 Nec Corp Register information collection system
JPH09288602A (ja) * 1996-04-23 1997-11-04 Fujitsu Ltd Write protection device for a fault information storage device, and reset control method
JPH09286602A (ja) 1996-04-24 1997-11-04 Mitsubishi Gas Chem Co Inc Method for producing a mixed gas of carbon monoxide and hydrogen
US20030070115A1 (en) * 2001-10-05 2003-04-10 Nguyen Tom L. Logging and retrieving pre-boot error information
JP3902564B2 (ja) * 2003-04-15 2007-04-11 中部日本電気ソフトウェア株式会社 Fault reporting device and fault reporting method
US7844866B2 (en) * 2007-10-02 2010-11-30 International Business Machines Corporation Mechanism to report operating system events on an intelligent platform management interface compliant server
JP2005251060A (ja) * 2004-03-08 2005-09-15 Hitachi Ltd Fault display device and fault location display method
US7409594B2 (en) * 2004-07-06 2008-08-05 Intel Corporation System and method to detect errors and predict potential failures
US7546487B2 (en) * 2005-09-15 2009-06-09 Intel Corporation OS and firmware coordinated error handling using transparent firmware intercept and firmware services
US20070088988A1 (en) 2005-10-14 2007-04-19 Dell Products L.P. System and method for logging recoverable errors
US20070234123A1 (en) * 2006-03-31 2007-10-04 Inventec Corporation Method for detecting switching failure
US20080270827A1 (en) * 2007-04-26 2008-10-30 International Business Machines Corporation Recovering diagnostic data after out-of-band data capture failure
JP5514643B2 (ja) * 2010-06-21 2014-06-04 株式会社日立ソリューションズ Device and program for detecting changes in fault cause determination rules
CN102375775B (zh) 2010-08-11 2014-08-20 英业达股份有限公司 Computer system that detects a system unrecoverable error indication signal
JP5541519B2 (ja) * 2010-10-06 2014-07-09 エヌイーシーコンピュータテクノ株式会社 Information processing device, fault location determination method, and fault location determination program
CN102467417B (zh) 2010-11-19 2014-04-23 英业达股份有限公司 Computer system
TWI446161B (zh) * 2010-12-30 2014-07-21 Ibm Apparatus and method for handling a faulty processor in a multiprocessor information processing system
US8898408B2 (en) * 2011-12-12 2014-11-25 Dell Products L.P. Memory controller-independent memory mirroring
EP2859459B1 (en) * 2012-06-06 2019-12-25 Intel Corporation Recovery after input/ouput error-containment events
CN103514068A (zh) * 2012-06-28 2014-01-15 北京百度网讯科技有限公司 Automatic memory fault locating method
JP6087540B2 (ja) * 2012-08-30 2017-03-01 Necプラットフォームズ株式会社 Fault trace device, fault trace system, fault trace method, and fault trace program
CN103647804B (zh) * 2013-11-22 2017-04-26 华为技术有限公司 Data processing method, device, and system for a storage unit
AU2016247689B2 (en) 2015-04-13 2020-07-02 Samsung Electronics Co., Ltd. Technique for managing profile in communication system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3121726A4 *

Also Published As

Publication number Publication date
EP3121726A4 (en) 2017-05-03
US20210182136A1 (en) 2021-06-17
BR112016022329A2 (pt) 2017-08-15
CN105659215A (zh) 2016-06-08
CN105659215B (zh) 2017-08-25
JP6333410B2 (ja) 2018-05-30
US20190332453A1 (en) 2019-10-31
KR20160128404A (ko) 2016-11-07
ES2667322T3 (es) 2018-05-10
CA2942045C (en) 2019-04-16
AU2014399227A1 (en) 2016-09-22
ZA201606180B (en) 2019-04-24
US10353763B2 (en) 2019-07-16
US11360842B2 (en) 2022-06-14
CN107357671A (zh) 2017-11-17
AU2014399227B2 (en) 2017-07-27
SG11201607545PA (en) 2016-10-28
EP3355197A1 (en) 2018-08-01
BR112016022329B1 (pt) 2019-01-02
EP3355197B1 (en) 2019-10-23
US20170102985A1 (en) 2017-04-13
CA2942045A1 (en) 2015-12-30
JP2017517060A (ja) 2017-06-22
KR101944874B1 (ko) 2019-02-01
NO3121726T3 (zh) 2018-06-30
EP3121726A1 (en) 2017-01-25
EP3121726B1 (en) 2018-01-31
DK3121726T3 (en) 2018-05-22

Similar Documents

Publication Publication Date Title
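WO2015196365A1 (zh) Fault processing method, related apparatus, and computer
US11687391B2 (en) Serializing machine check exceptions for predictive failure analysis
WO2017063505A1 (zh) Server hardware fault detection method, apparatus thereof, and server
US20140019814A1 (en) Error framework for a microprocessor and system
US10496495B2 (en) On demand remote diagnostics for hardware component failure and disk drive data recovery using embedded storage media
CN117389790B (zh) Firmware detection system, method, storage medium, and server for recoverable faults
CN103823708A (zh) Method and apparatus for processing virtual machine read/write requests
US20200201706A1 (en) Recovery of application from error
CN112988442B (zh) Method and device for transmitting fault information during server runtime
JP6222759B2 (ja) Fault notification device, fault notification method, and program
TW201324115A (zh) Computer system and boot management method for a computer system
CN114217925A (zh) Service program operation monitoring method and system implementing automatic restart on exception
CN114356708A (zh) Device fault monitoring method, apparatus, device, and readable storage medium
TWI602054B (zh) Method for providing error status data for a computer device
CN116560936A (zh) Anomaly monitoring method, coprocessor, and computing device
JP2006011991A (ja) Computer control device and software execution recording method thereof
CN117931536A (zh) Fault processing method, apparatus, electronic device, and medium
JP2011159234A (ja) Fault response system and fault response method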

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 14896215; Country of ref document: EP; Kind code of ref document: A1)
ENP Entry into the national phase (Ref document number: 2942045; Country of ref document: CA)
ENP Entry into the national phase (Ref document number: 2014399227; Country of ref document: AU; Date of ref document: 20140624; Kind code of ref document: A)
ENP Entry into the national phase (Ref document number: 20167027222; Country of ref document: KR; Kind code of ref document: A)
ENP Entry into the national phase (Ref document number: 2016562222; Country of ref document: JP; Kind code of ref document: A)
REEP Request for entry into the European phase (Ref document number: 2014896215; Country of ref document: EP)
WWE WIPO information: entry into national phase (Ref document number: 2014896215; Country of ref document: EP)
REG Reference to national code (Ref country code: BR; Ref legal event code: B01A; Ref document number: 112016022329; Country of ref document: BR)
NENP Non-entry into the national phase (Ref country code: DE)
ENP Entry into the national phase (Ref document number: 112016022329; Country of ref document: BR; Kind code of ref document: A2; Effective date: 20160927)