CN114661511A - Equipment error reporting processing method, device, equipment and storage medium - Google Patents

Info

Publication number
CN114661511A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210331196.7A
Other languages
Chinese (zh)
Inventor
马井彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Application filed by Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202210331196.7A
Publication of CN114661511A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0793Remedial or corrective actions

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a method, an apparatus, a device and a storage medium for processing device error reports. In this scheme, after receiving an error detection instruction, the baseboard management controller cyclically detects, according to the instruction, whether a correctable error or an uncorrectable error that does not need to be processed exists; if so, it records the detected error. Because the system does not trigger a system management interrupt when such errors occur, the number of switches between the normal operation mode and the system management mode is reduced and system performance improves. In addition, detecting and recording these errors helps a user predict the failure trend of a component and take measures in advance, thereby avoiding the risk that a sudden component failure makes the server system unstable or causes data loss.

Description

Equipment error reporting processing method, device, equipment and storage medium
Technical Field
The present invention relates to the field of error reporting and handling, and in particular to a method, an apparatus, an electronic device and a storage medium for processing device error reports.
Background
A server system is composed of components such as a server mainboard, processors, memory, storage devices and PCIE (Peripheral Component Interconnect Express) devices, and is an important piece of infrastructure in a modern data center. While a server system is running, the most critical requirements are stable operation and the security and integrity of data. Many factors can cause hardware errors at run time, and if these errors are not handled in time they can create catastrophic hazards for the server system. A complete error reporting and handling mechanism, covering the detection, correction and recording of errors, is therefore crucial to stable operation. A server system with the X86 architecture handles error reports through the SMI (System Management Interrupt): the interrupt handler clears the hardware error, recovers data and performs similar operations so that errors do not accumulate and endanger the server system.
In an X86 server system, correctable errors are repaired by retry or by ECC (Error Checking and Correction); they either do not trigger an SMI at all or trigger one only after the number of errors reaches a threshold. Uncorrectable errors cannot be corrected, so an SMI must be triggered immediately, after which the interrupt handler is called to process the error. Both ways of handling correctable errors have drawbacks. When no SMI is triggered, the errors cannot be recorded in real time, and the faulty components and failure trends in the system cannot be discovered in real time, which is a hidden danger to safe and stable operation. When an SMI is triggered for every error, the system frequently switches from the normal operation mode to the system management mode to execute the interrupt handler and returns to the normal operation mode afterwards, which degrades performance; during an error storm the system spends most of its time in the system management mode, and programs in the normal mode cannot run properly. Therefore, how to reduce the number of SMIs in the system while still detecting and recording errors in real time is a problem to be solved by those skilled in the art.
Disclosure of Invention
The invention aims to provide a method, a device, equipment and a storage medium for processing equipment error report, so as to reduce the number of SMI interrupts in a system and detect and record errors in real time.
In order to achieve the above object, the present invention provides an error reporting processing method for a device, including:
the baseboard management controller receives an error detection instruction;
cyclically detecting, according to the error detection instruction, whether a correctable error or an uncorrectable error exists; wherein the uncorrectable error is an uncorrectable error that does not need to be processed, and neither the correctable error nor the uncorrectable error triggers a system management interrupt after it occurs;
if so, recording the detected error.
Wherein, cyclically detecting whether a correctable error exists according to the error detection instruction comprises:
circularly determining target PCIE equipment to be detected from the PCIE equipment list data;
reading register data of a correctable state register from target PCIE equipment;
judging whether a correctable error state bit is set or not according to the register data;
if not, judging that no correctable errors exist in the target PCIE equipment;
if yes, judging that a correctable error exists in the target PCIE device, and determining the error type corresponding to the set status bit and the PCIE device that caused the error report.
If it is determined that a correctable error exists in the target PCIE device, recording the detected error includes:
and recording the target PCIE equipment, the PCIE equipment causing the error report and the error type to a log system.
Wherein, after recording the target PCIE device, the PCIE device causing error report, and the error type to the log system, the method further includes:
and sending a status bit clearing instruction to the target PCIE device so as to clear the set status bit in the correctable status register of the target PCIE device.
Wherein, cyclically detecting whether a correctable error or an uncorrectable error exists according to the error detection instruction comprises:
circularly determining a target MCA state register to be detected from MCA state register list data;
reading status data from a target MCA status register;
determining whether an error status bit is set according to the status data; if not, judging that no correctable error or uncorrectable error is recorded in the target MCA status register;
if yes, determining whether a correctable error or an uncorrectable error exists according to the set status bit; if a correctable error or an uncorrectable error exists, determining the corresponding error type; and if neither exists, continuing to execute the step of cyclically determining the target MCA status register to be detected from the MCA status register list data.
Wherein, the determining whether correctable errors or uncorrectable errors exist according to the set status bits, and if correctable errors or uncorrectable errors exist, determining the corresponding error type includes:
determining whether the error is an uncorrectable error according to the set status bit;
if the error is an uncorrectable error, determining the error type of the uncorrectable error; if not, determining whether the error is a correctable error according to the set status bit;
if the error is not a correctable error, judging that no correctable error exists; if it is a correctable error, reading the total number of occurrences of correctable errors, and judging whether the total read this time has increased compared with the total read last time;
if the total read this time has increased compared with the total read last time, judging that a correctable error exists, and determining the error type of the correctable error;
if the total read this time has not increased compared with the total read last time, judging that no correctable error exists.
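The decision sequence above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation described by the application: the `McaStatus` fields and the `classify` function are hypothetical names, and the sketch assumes the raw status data has already been decoded into these fields.

```python
from dataclasses import dataclass

@dataclass
class McaStatus:
    """Decoded view of one MCA status register (hypothetical field names)."""
    error_set: bool        # an error status bit is set
    uncorrectable: bool    # the set bit indicates an uncorrectable error
    correctable: bool      # the set bit indicates a correctable error
    error_code: int        # code used to look up the error type
    corrected_count: int   # total number of correctable errors so far

def classify(status, last_count):
    """Return (kind, error_code, new_last_count) for one polling pass."""
    if not status.error_set:
        return ("none", None, last_count)
    if status.uncorrectable:
        return ("uncorrectable", status.error_code, last_count)
    if not status.correctable:
        return ("none", None, last_count)
    # A correctable error counts as new only if the total has grown
    # since the previous reading.
    if status.corrected_count > last_count:
        return ("correctable", status.error_code, status.corrected_count)
    return ("none", None, last_count)
```

Comparing the count with the previous reading is what keeps a polling loop from logging the same correctable error again on every pass while the status bit remains set.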
Wherein, if there is a correctable error or an uncorrectable error, the recording the detected error includes:
if the uncorrectable error exists, recording the target MCA status register and the error type of the uncorrectable error to a log system;
if a correctable error exists, recording the target MCA status register, the error type of the correctable error, and the total number of occurrences of correctable errors to the log system.
In order to achieve the above object, the present invention further provides a device error reporting processing apparatus applied to a baseboard management controller, comprising:
the receiving module is used for receiving an error detection instruction;
the detection module is used for cyclically detecting, according to the error detection instruction, whether a correctable error or an uncorrectable error exists; wherein the uncorrectable error is an uncorrectable error that does not need to be processed, and neither the correctable error nor the uncorrectable error triggers a system management interrupt after it occurs;
and the recording module is used for recording the detected errors when the correctable errors or the uncorrectable errors are detected.
To achieve the above object, the present invention further provides an electronic device comprising:
a memory for storing a computer program;
and a processor for implementing the steps of the above device error reporting processing method when executing the computer program.
To achieve the above object, the present invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above device error reporting processing method.
As can be seen from the above scheme, the embodiments of the present invention provide a method, an apparatus, a device and a storage medium for processing device error reports. In this scheme, after receiving an error detection instruction, the baseboard management controller cyclically detects, according to the instruction, whether a correctable error or an uncorrectable error exists; the uncorrectable error is an uncorrectable error that does not need to be processed, and neither kind of error triggers a system management interrupt after it occurs; if such an error exists, the detected error is recorded.
Therefore, in this scheme the system does not trigger a system management interrupt when a correctable error or an uncorrectable error that does not need to be processed occurs, which reduces the number of switches between the normal operation mode and the system management mode and improves system performance. In addition, detecting and recording these errors helps a user predict the failure trend of a component and take measures in advance, thereby avoiding the risk that a sudden component failure makes the server system unstable or causes data loss.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of an error reporting processing method of equipment according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of an error reporting system of a device according to an embodiment of the present invention;
FIG. 3 is a flow chart illustrating the configuration of the device before error detection according to the embodiment of the present invention;
FIG. 4 is a flowchart illustrating a specific error reporting processing method of the device according to the embodiment of the present invention;
FIG. 5 is a schematic flowchart of another error reporting method for a device according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an error reporting processing apparatus according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of an electronic device according to an embodiment of the disclosure.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention discloses a method, a device, equipment and a storage medium for processing equipment error report, which are used for reducing the number of SMI interrupts in a system and detecting and recording errors in real time.
Referring to FIG. 1, a schematic flowchart of a device error reporting processing method provided in an embodiment of the present invention, the method specifically includes:
S101, the baseboard management controller receives an error detection instruction;
Specifically, the baseboard management controller is a BMC (Baseboard Management Controller). In the present application, the error detection instruction received by the baseboard management controller may be a command sent by the BIOS (Basic Input Output System). Referring to FIG. 2, the device error reporting system completes error reporting processing through cooperation between the BIOS and the BMC. After the system starts, a configuration module in the BIOS configures and enables the error reporting function, then sends an error detection command to the BMC to notify it to start detecting the hardware error status. After receiving the command, the error processing module of the BMC polls the error status of the system hardware through the PECI (Platform Environment Control Interface), reads correctable errors and uncorrectable errors that do not need to be processed, and processes and records the error information.
S102, cyclically detecting, according to the error detection instruction, whether a correctable error or an uncorrectable error exists; wherein the uncorrectable error is an uncorrectable error that does not need to be processed, and neither the correctable error nor the uncorrectable error triggers a system management interrupt after it occurs; if yes, executing S103; if not, executing S102 again to continue detecting whether a correctable error or an uncorrectable error exists;
In this embodiment, uncorrectable errors are classified into catastrophic errors, critical errors and recoverable errors, and recoverable errors are further classified into errors that require no action, errors for which a software recovery action is optional, and errors for which a software recovery action is required. The uncorrectable errors targeted by the present application are those that do not need to be processed. In this scheme it is specifically necessary to detect whether correctable errors exist in the PCIE devices of the processor and the south bridge, and whether correctable errors or uncorrectable errors exist in non-PCIE devices. The PCIE devices in this scheme are those that support AER (Advanced Error Reporting). Devices that do not support AER are a special case: at present almost no external devices lack it, and the few such devices integrated in the south bridge chip may be ignored and left undetected. When a correctable error occurs in a PCIE device, the device sets the corresponding status bit in the correctable status register of its AER registers. The non-PCIE devices in this embodiment correspond to functional groups of the processor, whose error status is recorded in MCA (Machine Check Architecture) status registers; the present application can therefore determine whether a non-PCIE device has reported an error, and record the error, by reading the status data of the MCA status registers.
Before error detection is performed, preparation is required. Referring to FIG. 3, a flowchart of the configuration performed before error detection according to the embodiment of the present invention, after the system starts, the BIOS prepares the function by which the BMC polls error information through the PECI interface. The preparation specifically includes the following operations: setting the MCA status registers to be readable in the normal mode (by default the MCA status registers are visible only in the system management mode), disabling the CMCI (Corrected Machine Check Interrupt), and transmitting the MCA status register list to the BMC; enumerating the PCIE devices, disabling the correctable-error interrupts of the PCIE devices, and transmitting the PCIE device list and the correctable status register information to the BMC; and enabling the permission of the PECI interface to write PCIE device registers, enabling the error reporting function, and having the BIOS send a start-detection command to the BMC.
Specifically, in this scheme a CMCI disable bit is set in the MSR (Model Specific Register) error control register to disable generation of the CMCI interrupt, and the CMCI is set to be mapped to the SMI, so that after a correctable error or an uncorrectable error that does not need to be processed occurs, neither the CMCI interrupt nor the SMI interrupt is triggered. The MSR error control registers record the working environment, performance, temperature and similar information of the processor, and the MCA status registers are a subset of them. Each MCA status register can report multiple error codes. The scheme analyzes the MCA status registers one by one, lists the error type represented by each error code, forms the MCA status register list data, and transmits it to the BMC for storage. The error types are the error types of the functional blocks in the processor and are not specifically limited here. Each MCA status register is identified in the MCA status register list data by its register number. For example, if there are 100 MSR error control registers numbered 0 to 99 and those numbered 10 to 20 are MCA status registers, then reading the MCA status registers means reading the registers numbered 10 to 20 in sequence.
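The numbering example above can be written down as a small sketch. The register numbers are the illustrative ones from the text, not real MSR addresses, and `read_all_mca_status` and its `read_register` callback are hypothetical names; how the BMC actually performs the read over PECI is outside this sketch.

```python
# Illustrative numbering from the text: 100 MSR error control registers
# numbered 0-99, of which numbers 10-20 are MCA status registers.
MSR_REGISTER_NUMBERS = range(0, 100)
MCA_STATUS_REGISTER_NUMBERS = [n for n in MSR_REGISTER_NUMBERS if 10 <= n <= 20]

def read_all_mca_status(read_register):
    """Read every MCA status register in list order.

    read_register is a caller-supplied function mapping a register
    number to its raw value (hypothetical interface).
    """
    return {number: read_register(number) for number in MCA_STATUS_REGISTER_NUMBERS}
```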
When enumerating the PCIE devices, this scheme collects the bus number (Bus), device number (Device) and function number (Function) of each PCIE device that supports AER, together with the number of the correctable status register in that device's AER structure, generates the PCIE device list data, and transmits it to the BMC for storage. The BMC can then query each PCIE device in turn for error reports using the PCIE device list data; the bus number, device number and function number together indicate the address of the PCIE device. In addition, the scheme analyzes the correctable status register, lists the error type represented by each status bit when set, summarizes these into the correctable status register information data, and transmits it to the BMC for storage, so that the BMC can determine the error type corresponding to a set status bit. Furthermore, for security reasons the PECI interface can by default only read PCIE device registers. The PECI interface is given permission to write PCIE device registers by configuring the permission settings of the PECI management module. After configuration is complete, the error reporting function of the system is enabled, and a command is sent to the BMC to notify it to start polling the hardware error status of the system through the PECI interface.
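One possible shape for the two data sets handed to the BMC is sketched below. The class and variable names, the example device addresses and the register offset are hypothetical. The bit-to-name table is assumed to follow the Correctable Error Status register layout of PCI Express AER and should be checked against the specification before use.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PcieDeviceEntry:
    """One row of the PCIE device list data (hypothetical layout)."""
    bus: int
    device: int
    function: int
    correctable_status_register: int  # number/offset of the correctable status register

# Hypothetical example entries; real values come from BIOS enumeration.
PCIE_DEVICE_LIST = [
    PcieDeviceEntry(bus=0x00, device=0x01, function=0, correctable_status_register=0x110),
    PcieDeviceEntry(bus=0x17, device=0x00, function=0, correctable_status_register=0x110),
]

# Correctable status register information data: status bit -> error type.
# Assumed to match the PCIe AER Correctable Error Status register.
CORRECTABLE_STATUS_INFO = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}

def decode_correctable_status(register_value):
    """Return the error types whose status bits are set, lowest bit first."""
    return [name for bit, name in sorted(CORRECTABLE_STATUS_INFO.items())
            if register_value & (1 << bit)]
```

Keeping the bit-to-type table as data rather than code mirrors the text: the BIOS prepares it once and the BMC only performs lookups.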
That is to say, when detecting whether correctable errors or uncorrectable errors exist, on one hand, the present solution can search for the PCIE device to be detected according to the PCIE device list data, and determine whether correctable errors exist according to the status bits of the correctable status registers of the PCIE device, and if so, determine the error type of the correctable errors according to the correctable status register information data; on the other hand, it may also be queried based on the MCA status register list data whether the status bit of each MCA status register is set, and if so, continue to determine the type of error that is correctable or uncorrectable.
And S103, recording the detected errors.
Specifically, if the scheme detects that correctable errors or uncorrectable errors which do not need to be processed exist in the system, the detected errors can be recorded; moreover, since the hardware error reporting is unpredictable and may occur at any time, the scheme can circularly detect whether the correctable error or the uncorrectable error exists in a polling mode so as to read the error reporting information of the PCIE device and the error reporting information of the MCA status register in time.
In summary, for correctable errors, which account for a high proportion of the errors reported in the system, and for uncorrectable errors that do not need to be processed, this scheme does not trigger an SMI interrupt but detects and records the error reports by polling. Polling greatly reduces the number of SMI interrupts in the system, meets the requirement of detecting and recording hardware errors in real time, reduces the number of times the system switches modes to execute the interrupt handler, and improves system performance.
It should be noted that this scheme can perform error reporting processing on both PCIE devices and non-PCIE devices. For PCIE devices, processing mainly consists of detecting and recording errors in the correctable status register of the AER structure of the PCIE device; for non-PCIE devices, it mainly consists of detecting and recording errors in the MCA status registers. The two kinds of detection and recording complement each other and provide more comprehensive coverage of different types of devices.
Therefore, in this embodiment, after receiving the error detection instruction and starting to detect the error status of the system hardware, the BMC needs to check both the correctable status registers in the AER structures of the PCIE devices and the MCA status registers. The two may be checked simultaneously or one after the other in a preset order, which is not specifically limited here. FIG. 4 shows the specific process of error reporting processing for PCIE devices, and FIG. 5 shows the specific process for non-PCIE devices; each is described in detail below.
Referring to FIG. 4, a schematic flowchart of a specific device error reporting processing method according to an embodiment of the present invention, the method includes:
S201, the baseboard management controller receives an error detection instruction;
Specifically, after the BMC system in this scheme starts, the PECI controller in the BMC chip needs to be configured, for example by setting its clock, sampling rate, read mode and other parameters. The PECI controller is started after configuration is complete, after which it cyclically checks whether an error detection instruction sent by the BIOS has been received. Once the instruction is detected, the PCIE device list data, the correctable status register information data and the MCA status register list data sent by the BIOS are stored, and the registers in the processor are actively read according to this information in order to perform error reporting processing.
S202, cyclically determining the target PCIE device to be detected from the PCIE device list data;
S203, reading register data of the correctable status register from the target PCIE device;
In this embodiment, when detecting PCIE devices, the target PCIE device to be detected is first determined from the PCIE device list data, and the basic information of the target PCIE device is read from that data. The determination is performed cyclically: each PCIE device in the PCIE device list data is taken in turn as the target PCIE device. For example, if 10 PCIE devices are recorded in the PCIE device list data, the 1st PCIE device is detected as the target, then the 2nd, and so on; after all 10 have been detected, detection resumes with the 1st PCIE device, and this continues until the detection function is turned off or the device is powered off. The basic information includes the bus number, device number and function number of the target PCIE device and the number of the correctable status register in its AER structure. To read the register data of the correctable status register, the basic information of the target PCIE device is combined with the configuration mode (message type and address type), the command code, the read byte length and so on to form a PECI interface read command, which is sent to the target PCIE device in the processor to obtain the register data.
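The cyclic selection of the target device can be sketched as a round-robin loop. This is only an illustration: `poll_round_robin`, `check` and `should_stop` are hypothetical names, and the stop condition stands in for "the detection function is turned off or the device is powered off".

```python
import itertools

def poll_round_robin(entries, check, should_stop):
    """Visit entries cyclically: 1st, 2nd, ..., last, then the 1st again.

    check(entry) performs one detection pass on an entry;
    should_stop() returns True once detection must end.
    """
    for entry in itertools.cycle(entries):
        if should_stop():
            break
        check(entry)
```

With 10 entries and a stop condition that fires after 25 visits, the visiting order is entries 1 to 10, then 1 to 10 again, then 1 to 5.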
S204, judging whether a correctable error status bit is set according to the register data;
if not, executing S205; if yes, executing S206;
S205, judging that no correctable error exists in the target PCIE device, and continuing to execute S202; wherein a correctable error does not trigger a system management interrupt after it occurs;
After the register data is acquired, it is analyzed to judge whether any correctable error status bit is set. If none is set, S205 is executed: it is judged that no correctable error exists in the target PCIE device, and the register data of the correctable status register of the next target PCIE device is read according to the PCIE device list data. If a status bit is set, S206 to S208 are executed.
S206, judging that a correctable error exists in the target PCIE device, and determining the error type corresponding to the set status bit and the PCIE device that caused the error report;
Specifically, after it is determined that a correctable error exists in the target PCIE device, the error type of the correctable error is determined from the correctable status register information data. Data is then read from the error source register of the target PCIE device; this data records information about the device that caused the target PCIE device to report the error, for example the bus number, device number and function number of that PCIE device.
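Extracting the bus, device and function numbers from the error source data can be sketched as below. The sketch assumes the conventional 16-bit PCI Express requester ID encoding (bus in bits 15:8, device in bits 7:3, function in bits 2:0) and assumes the correctable error source occupies the low 16 bits of the register value; both assumptions, like the function names, are illustrative and should be checked against the PCI Express specification.

```python
def decode_requester_id(requester_id):
    """Split a 16-bit requester ID into (bus, device, function)."""
    bus = (requester_id >> 8) & 0xFF
    device = (requester_id >> 3) & 0x1F
    function = requester_id & 0x07
    return bus, device, function

def decode_correctable_error_source(error_source_register):
    """Decode the device that caused a correctable error report.

    Assumes the low 16 bits of the register hold that device's
    requester ID; the upper bits are ignored here.
    """
    return decode_requester_id(error_source_register & 0xFFFF)
```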
S207, recording the target PCIE equipment, the PCIE equipment causing the error report and the error type to a log system;
s208, sending a status bit clearing instruction to the target PCIE device, so as to clear the status bit set in the correctable status register of the target PCIE device, and continuing to execute S202.
In this embodiment, the bus number, the device number, the function number, and the error type of the target PCIE device reporting an error and the PCIE device causing the error may be recorded in the log system, and displayed in real time for the user to look up in time. And then sending a PECI interface command for setting a register to the error-reported target PCIE equipment so as to clear the error-reported state of the correctable state register, then acquiring the basic data of the next PCIE equipment from the PCIE equipment list, starting reading processing, repeating the operation until each PCIE equipment is detected, and waiting for the next round of detection.
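As an illustration only, one detection round of S202 to S208 might be sketched as follows in Python. The PECI transport is not modeled: `read_reg` and `write_reg` are hypothetical callables standing in for the PECI read and register-write commands, and `PcieDevice`, `decode_source_id`, `poll_pcie_round`, and `make_fake_registers` are names invented for this sketch. The bit names follow the PCIe AER Correctable Error Status register, the source ID is assumed to use the usual bus/device/function requester ID packing, and the status bits are assumed to be write-1-to-clear; the embodiment itself does not spell out these details.

```python
from dataclasses import dataclass

# Bit positions of the PCIe AER Correctable Error Status register.
CORRECTABLE_ERROR_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}


@dataclass(frozen=True)
class PcieDevice:
    """Basic information held in the PCIE device list data (hypothetical layout)."""
    bus: int
    device: int
    function: int
    status_reg: int  # register number of the correctable status register
    source_reg: int  # register number of the error source register


def decode_source_id(source_id):
    """Split a 16-bit requester ID into (bus, device, function)."""
    return (source_id >> 8) & 0xFF, (source_id >> 3) & 0x1F, source_id & 0x7


def poll_pcie_round(devices, read_reg, write_reg, log):
    """Run one detection round over every device in the list."""
    for dev in devices:
        status = read_reg(dev, dev.status_reg)
        types = [name for bit, name in CORRECTABLE_ERROR_BITS.items()
                 if status & (1 << bit)]
        if not types:
            continue  # S205: no correctable error, go on to the next device
        # S206: the low 16 bits are assumed to identify the device causing the error
        source = decode_source_id(read_reg(dev, dev.source_reg) & 0xFFFF)
        # S207: record the target device, the error source, and the error types
        log.append(((dev.bus, dev.device, dev.function), source, types))
        # S208: write back the bits that were read in order to clear them
        write_reg(dev, dev.status_reg, status)


def make_fake_registers(initial):
    """In-memory stand-in for the hardware, used only to exercise the sketch."""
    regs = dict(initial)

    def read_reg(dev, reg):
        return regs.get((dev, reg), 0)

    def write_reg(dev, reg, value):
        regs[(dev, reg)] = regs.get((dev, reg), 0) & ~value  # write-1-to-clear

    return read_reg, write_reg
```

With a fake device whose status register has bit 6 set and whose error source register holds `0x0300`, a single round logs a "Bad TLP" error attributed to bus 3, device 0, function 0, and a second round logs nothing because the status bit has been cleared.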
Referring to fig. 5, which is a schematic flow chart of another device error reporting processing method according to an embodiment of the present invention, the method specifically includes:
S301, the baseboard management controller receives an error detection instruction;
S302, cyclically determining a target MCA status register to be detected from the MCA status register list data;
Specifically, for detection of the MCA status registers in this embodiment, a target MCA status register to be detected is first determined from the MCA status register list data, and the basic information of the target MCA status register, such as its MCA register number, is read from that list data. The target MCA status register is determined cyclically. For example, if 10 MCA status registers are recorded in the MCA status register list data, the 1st MCA status register is detected as the target MCA status register, then the 2nd, and so on; after all 10 MCA status registers have been detected, detection starts again from the 1st MCA status register and continues until the detection function is turned off or the equipment is powered off.
S303, reading status data from the target MCA status register;
After the basic information of the target MCA status register is determined, it is combined with the processor ID, the read byte length, the command code, and so on, and a command to read the MCA status register is sent to the corresponding processor so as to read the corresponding status data from the target MCA status register.
S304, determining whether an error state is set according to the status data;
if not, executing S305; if yes, executing S306;
S305, judging that no correctable error or uncorrectable error exists in the target MCA status register, and continuing to execute S302;
In this embodiment, the returned status data is analyzed. If the error state is not set, S302 is executed again to obtain the MCA register number of the next target MCA status register from the MCA status register list, and S303 is then executed; if the error state is set, S306 is executed to determine, according to the set status bit, whether a correctable error or an uncorrectable error exists.
S306, determining whether a correctable error or an uncorrectable error exists according to the set status bit;
if a correctable error or an uncorrectable error exists, executing S307; if neither a correctable error nor an uncorrectable error exists, executing S305;
S307, determining the error type of the correctable error or the uncorrectable error;
S308, if an uncorrectable error exists, recording the target MCA status register and the error type of the uncorrectable error to a log system; if a correctable error exists, recording the target MCA status register, the error type of the correctable error, and the total number of occurrences of the correctable error to the log system; and continuing to execute S302.
In the present embodiment, when determining whether a correctable error or an uncorrectable error exists, it may first be determined from the set status bit whether the error is an uncorrectable error. If it is an uncorrectable error, the error type of the uncorrectable error is determined. If it is not, it is determined from the set status bit whether the error is a correctable error. If it is not a correctable error, it is judged that no correctable error exists. If it is a correctable error, the total number of occurrences of correctable errors is read, and it is judged whether the total read this time has increased compared with the total read last time. If it has increased, it is judged that a correctable error exists, and the error type of the correctable error is determined; if it has not increased, it is judged that no correctable error exists.
That is, it is first judged from the status bit whether the error is an uncorrectable error that does not need to be processed. If so, the error code in the status information is compared with the MCA status register list data to resolve the specific error type, and the MCA register number and the error type are recorded in the log system and displayed in real time. If not, it is then judged from the status bit whether the error is a correctable error. If not, S308 is executed, the basic information of the next target MCA status register is read, and S303 is executed. If so, it is judged whether the number of correctable errors has increased compared with the last reading: the MCA status register records the total number of correctable errors, so when the BMC reads the data on two adjacent occasions, whether a correctable error occurred in the interval between the two reads can be determined by judging whether the total number of occurrences has increased. If the total has not increased, no new correctable error has occurred, so the next target MCA status register to be detected is determined from the MCA status register list data and S303 is executed. If the total has increased, the error code in the status information is compared with the MCA status register list data to resolve the specific error type, and the MCA register number, the number of correctable errors, and the error type are recorded in the log system and displayed in real time so that the user can review them promptly; processing of the next target MCA status register then begins according to the MCA register number in the MCA status register list data. Once every MCA status register in the MCA status register list data has been detected, the next round of detection begins.
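The decision order described above (error state set, then uncorrectable, then correctable with an increased count) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the embodiment's implementation: the bit positions are borrowed from the layout of Intel's IA32_MCi_STATUS register (valid flag in bit 63, uncorrected flag in bit 61, corrected error count in bits 52:38, error code in bits 15:0), which the embodiment does not name; a valid status without the uncorrected flag is treated as correctable; and `classify_mca`, `poll_mca_round`, `read_status`, and the `error_names` table are hypothetical.

```python
# Assumed field layout, modeled on Intel's IA32_MCi_STATUS register.
VAL_BIT = 1 << 63                      # error state is set
UC_BIT = 1 << 61                       # error is uncorrectable
COUNT_SHIFT, COUNT_MASK = 38, 0x7FFF   # total number of correctable errors
CODE_MASK = 0xFFFF                     # error code used to look up the error type


def classify_mca(status, last_count, error_names):
    """Return (kind, error_type, count_to_remember) for one status value.

    kind is "none", "uncorrectable", or "correctable".
    """
    if not status & VAL_BIT:
        return "none", None, last_count          # error state not set
    error_type = error_names.get(status & CODE_MASK, "unknown")
    if status & UC_BIT:
        return "uncorrectable", error_type, last_count
    count = (status >> COUNT_SHIFT) & COUNT_MASK
    if count > last_count:
        return "correctable", error_type, count  # new errors since the last read
    return "none", None, count                   # total did not increase


def poll_mca_round(register_numbers, read_status, last_counts, error_names, log):
    """Check every MCA status register in the list once."""
    for reg in register_numbers:
        kind, error_type, count = classify_mca(
            read_status(reg), last_counts.get(reg, 0), error_names)
        last_counts[reg] = count
        if kind == "uncorrectable":
            log.append((reg, kind, error_type))
        elif kind == "correctable":
            log.append((reg, kind, error_type, count))
```

Because the count remembered for each register is the value from the previous read, a register whose total stays at 3 across two consecutive rounds is logged once, in the round where the total first rose, and not again.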
In conclusion, this scheme handles error reports by polling. The BMC polls the error reporting status of the system hardware through the PECI interface; it records, displays, and clears in real time the correctable errors in the correctable status registers of PCIE devices, and it records and displays in real time the correctable errors and the uncorrectable errors that do not need to be processed in the MCA status registers. Moreover, because correctable errors and uncorrectable errors that do not need to be processed do not generate an SMI when they occur, this approach can greatly reduce the number of SMIs generated in the system, reduce the number of switches between the normal operation mode and the system management mode, and improve system performance.
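The polling scheme summarized above can be pictured as a round-based loop that keeps running until detection is switched off. The following Python sketch is only one possible arrangement: running a PCIE pass and an MCA pass in the same loop, the pause between rounds (`interval_s`), the `max_rounds` limit, and the use of `threading.Event` as the "detection function closed" signal are all assumptions of this sketch, and `run_detection` is a hypothetical name.

```python
import threading


def run_detection(passes, stop, interval_s=1.0, max_rounds=None):
    """Call every pass once per round until `stop` is set.

    passes: callables taking no arguments, e.g. a PCIE pass and an MCA pass.
    stop: threading.Event that is set when the detection function is turned off.
    max_rounds: optional limit, useful for testing the loop.
    Returns the number of completed rounds.
    """
    rounds = 0
    while not stop.is_set():
        for run_pass in passes:
            run_pass()
        rounds += 1
        if max_rounds is not None and rounds >= max_rounds:
            break
        stop.wait(interval_s)  # pause between rounds, waking early if stopped
    return rounds
```

Waiting on the event rather than sleeping lets the loop exit promptly when detection is turned off in the middle of a pause.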
The processing apparatus, device, and storage medium provided by the embodiments of the present invention are described below; the processing apparatus, device, and storage medium described below and the processing method described above may be referred to in correspondence with each other.
Referring to fig. 6, which is a schematic structural diagram of a device error reporting processing apparatus according to an embodiment of the present invention, the apparatus is applied to a baseboard management controller and includes:
a receiving module 11, configured to receive an error detection instruction;
a detection module 12, configured to cyclically detect whether a correctable error or an uncorrectable error exists according to the error detection instruction; wherein the uncorrectable error is an uncorrectable error that does not need to be processed, and neither the correctable error nor the uncorrectable error triggers a system management interrupt when it occurs;
and a recording module 13, configured to record the detected error when a correctable error or an uncorrectable error is detected.
Wherein the detection module comprises:
a first determining unit, configured to cyclically determine a target PCIE device to be detected from the PCIE device list data;
a first reading unit, configured to read register data of the correctable status register from the target PCIE device;
a first judging unit, configured to judge whether a correctable error status bit is set according to the register data; if not, judging that no correctable error exists in the target PCIE device; if yes, triggering a second determining unit;
and the second determining unit, configured to judge that a correctable error exists in the target PCIE device, and to determine the error type corresponding to the set status bit and the PCIE device that caused the error report.
Wherein the recording module comprises:
a first recording unit, configured to record the target PCIE device, the PCIE device that caused the error report, and the error type to the log system when a correctable error exists in the target PCIE device.
Wherein the detection module further comprises:
a sending unit, configured to send a status bit clearing instruction to the target PCIE device so as to clear the set status bit in the correctable status register of the target PCIE device.
Wherein the detection module comprises:
a third determining unit, configured to cyclically determine a target MCA status register to be detected from the MCA status register list data;
a second reading unit, configured to read status data from the target MCA status register;
a second judging unit, configured to determine whether an error state is set according to the status data; if not, judging that no correctable error or uncorrectable error exists in the target MCA status register; if yes, triggering a third judging unit;
the third judging unit, configured to determine whether a correctable error or an uncorrectable error exists according to the set status bit; if a correctable error or an uncorrectable error exists, triggering a fourth determining unit; if neither a correctable error nor an uncorrectable error exists, continuing to trigger the third determining unit;
and the fourth determining unit, configured to determine the corresponding error type.
Wherein the third judging unit comprises:
a first judging subunit, configured to determine whether the error is an uncorrectable error according to the set status bit; if the error is an uncorrectable error, triggering the fourth determining unit to determine the error type of the uncorrectable error; if the error is not an uncorrectable error, triggering a second judging subunit;
the second judging subunit, configured to determine whether the error is a correctable error according to the set status bit; if the error is not a correctable error, judging that no correctable error exists; if the error is a correctable error, triggering a reading subunit;
the reading subunit, configured to read the total number of occurrences of correctable errors;
a third judging subunit, configured to judge whether the total number of occurrences read this time has increased compared with the total number of occurrences read last time; if it has increased, judging that a correctable error exists, and triggering the fourth determining unit to determine the error type of the correctable error; if it has not increased, judging that no correctable error exists.
Wherein the recording module comprises:
a second recording unit, configured to record the target MCA status register and the error type of the uncorrectable error to the log system when an uncorrectable error exists;
and a third recording unit, configured to record the target MCA status register, the error type of the correctable error, and the total number of occurrences of the correctable error to the log system when a correctable error exists.
Referring to fig. 7, which is a schematic structural diagram of an electronic device provided in an embodiment of the present invention, the electronic device includes:
a memory 21, configured to store a computer program;
a processor 22, configured to implement the steps of the device error reporting processing method according to any of the above method embodiments when executing the computer program.
In this embodiment, the device may be a PC (Personal Computer), or may be a terminal device such as a smart phone, a tablet computer, a palmtop computer, or a portable computer.
The device may include a memory 21, a processor 22, and a bus 23.
The memory 21 includes at least one type of readable storage medium, including a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments the memory 21 may be an internal storage unit of the device, for example a hard disk of the device. In other embodiments the memory 21 may be an external storage device of the device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash memory card (Flash Card) provided on the device. Further, the memory 21 may include both an internal storage unit of the device and an external storage device. The memory 21 may be used not only to store application software installed in the device and various types of data, such as program code for executing the error reporting processing method, but also to temporarily store data that has been output or is to be output.
The processor 22 may in some embodiments be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip, and is used for running program code stored in the memory 21 or processing data, for example program code for executing the error reporting processing method.
The bus 23 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 7, but this is not intended to represent only one bus or type of bus.
Further, the device may also include a network interface 24. The network interface 24 may optionally include a wired interface and/or a wireless interface (e.g., a Wi-Fi interface or a Bluetooth interface), which is generally used to establish a communication connection between the device and other electronic devices.
Optionally, the device may further comprise a user interface 25. The user interface 25 may comprise a display and an input unit such as a keyboard, and may optionally also comprise a standard wired interface and a wireless interface. In some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is used for displaying information processed in the device and for displaying a visualized user interface.
Fig. 7 shows only a device with the components 21-25. It will be understood by those skilled in the art that the structure shown in fig. 7 does not constitute a limitation of the device, which may comprise fewer or more components than those shown, may combine some components, or may use a different arrangement of components.
An embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of the device error reporting processing method according to any of the above method embodiments are implemented.
The storage medium may include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The embodiments in this description are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar the embodiments may be referred to one another.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A device error reporting processing method, characterized by comprising:
a baseboard management controller receiving an error detection instruction;
cyclically detecting, according to the error detection instruction, whether a correctable error or an uncorrectable error exists; wherein the uncorrectable error is an uncorrectable error that does not need to be processed, and neither the correctable error nor the uncorrectable error triggers a system management interrupt when it occurs;
if so, recording the detected error.
2. The device error reporting processing method according to claim 1, wherein cyclically detecting whether a correctable error exists according to the error detection instruction comprises:
cyclically determining a target PCIE device to be detected from PCIE device list data;
reading register data of a correctable status register from the target PCIE device;
judging whether a correctable error status bit is set according to the register data;
if not, judging that no correctable error exists in the target PCIE device;
if yes, judging that a correctable error exists in the target PCIE device, and determining the error type corresponding to the set status bit and the PCIE device that caused the error report.
3. The device error reporting processing method according to claim 2, wherein, if it is judged that a correctable error exists in the target PCIE device, recording the detected error comprises:
recording the target PCIE device, the PCIE device that caused the error report, and the error type to a log system.
4. The device error reporting processing method according to claim 3, wherein after recording the target PCIE device, the PCIE device that caused the error report, and the error type to the log system, the method further comprises:
sending a status bit clearing instruction to the target PCIE device so as to clear the set status bit in the correctable status register of the target PCIE device.
5. The device error reporting processing method according to claim 1, wherein cyclically detecting whether a correctable error or an uncorrectable error exists according to the error detection instruction comprises:
cyclically determining a target MCA status register to be detected from MCA status register list data;
reading status data from the target MCA status register;
determining whether an error state is set according to the status data; if not, judging that no correctable error or uncorrectable error exists in the target MCA status register;
if yes, determining whether a correctable error or an uncorrectable error exists according to the set status bit; if a correctable error or an uncorrectable error exists, determining the corresponding error type; and if neither a correctable error nor an uncorrectable error exists, continuing to execute the step of cyclically determining the target MCA status register to be detected from the MCA status register list data.
6. The device error reporting processing method according to claim 5, wherein determining whether a correctable error or an uncorrectable error exists according to the set status bit, and determining the corresponding error type if a correctable error or an uncorrectable error exists, comprises:
determining whether the error is an uncorrectable error according to the set status bit;
if the error is an uncorrectable error, determining the error type of the uncorrectable error; if not, determining whether the error is a correctable error according to the set status bit;
if the error is not a correctable error, judging that no correctable error exists; if the error is a correctable error, reading the total number of occurrences of correctable errors, and judging whether the total number of occurrences read this time has increased compared with the total number of occurrences read last time;
if the total number of occurrences read this time has increased compared with the total number of occurrences read last time, judging that a correctable error exists, and determining the error type of the correctable error;
if the total number of occurrences read this time has not increased compared with the total number of occurrences read last time, judging that no correctable error exists.
7. The device error reporting processing method according to claim 5, wherein, if a correctable error or an uncorrectable error exists, recording the detected error comprises:
if an uncorrectable error exists, recording the target MCA status register and the error type of the uncorrectable error to a log system;
if a correctable error exists, recording the target MCA status register, the error type of the correctable error, and the total number of occurrences of the correctable error to the log system.
8. A device error reporting processing apparatus, characterized in that it is applied to a baseboard management controller and comprises:
a receiving module, configured to receive an error detection instruction;
a detection module, configured to cyclically detect whether a correctable error or an uncorrectable error exists according to the error detection instruction; wherein the uncorrectable error is an uncorrectable error that does not need to be processed, and neither the correctable error nor the uncorrectable error triggers a system management interrupt when it occurs;
and a recording module, configured to record the detected error when a correctable error or an uncorrectable error is detected.
9. An electronic device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the device error reporting processing method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the device error reporting processing method according to any one of claims 1 to 7.
CN202210331196.7A 2022-03-31 2022-03-31 Equipment error reporting processing method, device, equipment and storage medium Pending CN114661511A (en)

Publications (1)

Publication Number Publication Date
CN114661511A true CN114661511A (en) 2022-06-24

Family

ID=82032759

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination