CN117271190A - Hardware correctable error processing method and system - Google Patents

Info

Publication number
CN117271190A
Authority
CN
China
Prior art keywords
error
register
hardware
correctable
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311253309.7A
Other languages
Chinese (zh)
Inventor
刘骏
张旭芳
魏浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Metabrain Intelligent Technology Co Ltd
Original Assignee
Suzhou Metabrain Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Metabrain Intelligent Technology Co Ltd
Priority to CN202311253309.7A
Publication of CN117271190A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/07: Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703: Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793: Remedial or corrective actions
    • G06F11/079: Root cause analysis, i.e. error or fault diagnosis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Detection And Correction Of Errors (AREA)

Abstract

The invention provides a hardware correctable error processing method and system. The method comprises: acquiring, through preset register bits in a global control model-specific register, the error register group number information recorded when a hardware device generates a correctable error at the current moment; determining a target error register group from a plurality of error register groups according to the error register group number information, wherein the target error register group stores the hardware error data generated when the hardware device generates the correctable error at the current moment; and acquiring the hardware error data in the target error register group and processing it to obtain a hardware correctable error processing result. The invention does not need to traverse and check the register information of all error register groups, which avoids wasted resource usage and improves the efficiency and speed with which the system handles hardware faults.

Description

Hardware correctable error processing method and system
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method and a system for processing a hardware correctable error.
Background
The corrected machine check interrupt (Corrected Machine Check Interrupt, abbreviated as CMCI) architecture is an enhancement of the machine check architecture (Machine Check Architecture, abbreviated as MCA). It is mainly used to report faults such as hardware corrected errors (Corrected Error, abbreviated as CE) and uncorrected errors that require no action (Uncorrected No Action Required, abbreviated as UCNA) to system software by means of an interrupt. After receiving the interrupt signal, the system software executes an interrupt handler that records the error information in an error buffer; a user-mode application can then read the fault information from the error buffer and process it accordingly.
In the current error handling mechanism, the basic input/output system (Basic Input Output System, abbreviated as BIOS) reports each CE fault to the system software through a CMCI interrupt. When handling the interrupt, the system software has to check the register information of every error register group (i.e. BANK), which wastes considerable resources and degrades system performance.
Accordingly, there is a need for a hardware-correctable error handling method and system that addresses the above issues.
Disclosure of Invention
Aiming at the problems existing in the prior art, the invention provides a hardware correctable error processing method and a system.
The invention provides a hardware correctable error processing method, which comprises the following steps:
acquiring, through preset register bits in a global control model-specific register, the error register group number information recorded when a hardware device generates a correctable error at the current moment;
determining a target error register group from a plurality of error register groups according to the error register group number information, wherein the target error register group stores the hardware error data generated when the hardware device generates the correctable error at the current moment;
and acquiring the hardware error data in the target error register group, and processing the hardware error data to obtain a hardware correctable error processing result.
According to the hardware correctable error processing method provided by the invention, before acquiring, through the preset register bits in the global control model-specific register, the error register group number information recorded when the hardware device generates a correctable error at the current moment, the method further comprises:
when it is determined that the hardware device generates a correctable error at the current moment, acquiring correctable error data of the hardware device through a basic input/output system;
writing the correctable error data into the corresponding target error register group, and acquiring the group number of the target error register group to obtain the error register group number information;
and writing the error register group number information into the preset register bits of the global control model-specific register.
According to the hardware correctable error processing method provided by the invention, acquiring the correctable error data of the hardware device through the basic input/output system when it is determined that the hardware device generates a correctable error at the current moment comprises:
determining, according to a hardware system management interrupt signal, that the hardware device generates a correctable error at the current moment, wherein the hardware system management interrupt signal is generated when the hardware device generates the correctable error;
performing error correction processing on the acquired correctable error data through the basic input/output system, and determining first correctable error data and second correctable error data, wherein the first correctable error data is correctable error data for which the error correction processing succeeded, and the second correctable error data is correctable error data for which the error correction processing failed;
the writing of the correctable error data into the corresponding target error register group comprises:
writing the first correctable error data and the second correctable error data into the corresponding target error register groups through the basic input/output system.
According to the hardware correctable error processing method provided by the invention, before writing the correctable error data into the corresponding target error register group and acquiring the group number of the target error register group, the method further comprises:
receiving a first input, the first input comprising an operation of selecting, in the global control model-specific register, a register bit for recording error register group number information;
setting, in response to the first input, the corresponding register bit in the global control model-specific register as the preset register bit;
the writing of the error register group number information into the preset register bits of the global control model-specific register comprises:
checking the register value of the preset register bit into which the error register group number information is to be written and, if the value meets a preset threshold condition, determining that preset register bit as a target register bit and writing the error register group number information into the target register bit.
According to the hardware correctable error processing method provided by the invention, after the error register group number information is written into the preset register bits of the global control model-specific register, the method further comprises:
generating a corrected machine check interrupt signal through the basic input/output system;
reading, based on the corrected machine check interrupt signal, the preset register bits in the global control model-specific register, and acquiring the error register group number information recorded when the hardware device generates the correctable error at the current moment.
According to the hardware correctable error processing method provided by the invention, the method further comprises the following steps:
judging whether the number of error register groups is larger than a preset number and, if so, expanding the number of preset register bits in the global control model-specific register according to the number of error register groups.
According to the hardware correctable error processing method provided by the invention, after the hardware error data in the target error register group is acquired and processed to obtain the hardware correctable error processing result, the method further comprises:
generating corresponding hardware error log information according to the hardware correctable error processing result;
and storing the hardware error log information in a system log, and displaying the hardware error log information in the system log through a display device.
The invention also provides a hardware-correctable error processing system, comprising:
a first processing module, configured to acquire, through preset register bits in a global control model-specific register, the error register group number information recorded when a hardware device generates a correctable error at the current moment;
an error register group determining module, configured to determine a target error register group from a plurality of error register groups according to the error register group number information, wherein the target error register group stores the hardware error data generated when the hardware device generates the correctable error at the current moment;
and a second processing module, configured to acquire the hardware error data in the target error register group and process the hardware error data to obtain a hardware correctable error processing result.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing a hardware-correctable error handling method as described in any of the above when executing the program.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a hardware-correctable error processing method as described in any of the above.
According to the hardware correctable error processing method and system provided by the invention, the error register group number information recorded when the hardware device generates a correctable error at the current moment is obtained through the preset register bits in the global control model-specific register. Only the error register group identified by that number has to be read; the register information of all error register groups no longer needs to be traversed and checked, which avoids wasted resource usage and improves the efficiency and speed with which the system handles hardware faults.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a hardware correctable error handling method according to the present invention;
FIG. 2 is a schematic diagram of the register architecture for correctable errors according to the present invention;
FIG. 3 is a schematic diagram of the overall flow of a hardware correctable error processing method according to the present invention;
FIG. 4 is a schematic structural diagram of a hardware correctable error processing system according to the present invention;
FIG. 5 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
MCA is an error handling architecture/mechanism of the central processing unit (Central Processing Unit, abbreviated as CPU) for detecting server hardware errors. Through it, both correctable and uncorrectable errors inside the CPU can be reported to system software and recorded.
When the CPU detects an internal hardware error or a bus error, the MCA mechanism attempts to repair the error where possible and reports it to the system software through an interrupt. Its scope covers all modules in the CPU, so the fault source can be identified and the fault information recorded in the relevant BANK registers. To classify and grade errors, MCA uses the BANK as its processing unit. A BANK is a group of error registers; each hardware unit is associated with one BANK, which represents that unit as an error source, for example IFU, DCU, DTLB, MLC, PCU, UPI, IIO, M2M, CHA, iMC, and the like. The model-specific registers (Model Specific Register, abbreviated as MSR) contained in the BANK are used to record the error information.
After receiving the interrupt or exception signal, the system software responds with a corresponding action, such as error repair, an alarm, or another policy, so that fault-tolerant handling takes place before a failure occurs. Because the MCA mechanism samples hardware errors in time windows, more than one error may be found at the end of a sampling window even though only one interrupt or exception is triggered. Therefore, when handling errors, the system software must check the CPU MSR registers of every BANK, for example IA32_MCi_CTL, IA32_MCi_STATUS, IA32_MCi_ADDR, IA32_MCi_MISC and IA32_MCi_CTL2, to ensure that every generated error is processed.
The firmware first mode (Firmware First Mode, abbreviated as FFM) provided by the first-generation and second-generation enhanced models of the MCA architecture (EMCA Gen1 and EMCA Gen2) raises a hardware system management interrupt (System Management Interrupt, abbreviated as SMI) before a hardware fault is reported to the system software, so that each fault is first sent to the BIOS for preprocessing. The BIOS first attempts to correct the error, writes the error correction data into the CPU MSR registers of the corresponding BANK, such as IA32_MCi_CTL, IA32_MCi_STATUS, IA32_MCi_ADDR, IA32_MCi_MISC and IA32_MCi_CTL2, and then hands the fault to the system software through a CMCI interrupt.
After receiving the CMCI interrupt, the system software needs to read the MSR register information of all BANKs. A user-mode application then reads the error information from the error buffer, processes it accordingly, and records the result in a system log.
In the current error handling mechanisms of EMCA Gen1 and EMCA Gen2, the BIOS reports each CE fault to the system software through a CMCI interrupt, and the system software has to check the MSR register information of every BANK when handling the interrupt. This wastes considerable resources, increases processing time, and degrades system performance.
In view of the above problems, the present invention provides a method for improving the correctable error handling performance of a system. When the system handles a CMCI interrupt, the method avoids reading the many hardware registers that have reported no error, and thereby improves the efficiency and speed with which the system handles hardware faults. With firmware first mode enabled, the BIOS preprocesses each correctable error, writes the number of the faulty BANK into preset bits of the relevant MSR register, and then reports the error to the system software through a CMCI interrupt. The system software reads the BANK number from the preset bits of that MSR register and checks only the fault information of the corresponding BANK instead of checking all BANK registers; a user-mode application then processes the fault information and records it in a system log. With the method provided by the invention, the MSR register information of all BANKs no longer needs to be traversed and checked, which avoids wasted CPU resource usage, improves the efficiency and speed with which the system handles hardware faults, and thereby helps ensure system stability and service performance.
FIG. 1 is a flow chart of a hardware correctable error processing method according to the present invention. As shown in FIG. 1, the present invention provides a hardware correctable error processing method, including:
Step 101, acquiring, through preset register bits in a global control model-specific register, the error register group number information recorded when the hardware device generates a correctable error at the current moment.
The MCA mechanism uses a set of hardware-dependent 64-bit MSR registers to detect hardware errors and record the detected error information. For a given hardware error, four global control MSR registers and the MSR registers contained in the multiple BANKs are available for this purpose. In the present invention, the number of the BANK (i.e., the target error register group) that stores the error information (i.e., the hardware error data) of the hardware device at the current moment is stored in preset register bits of a global control MSR register. When the system software handles an interrupt, it only has to read those preset register bits to obtain the error register group number information recorded when the hardware device generated the correctable error at the current moment.
Step 102, determining a target error register group from a plurality of error register groups according to the error register group number information, wherein the target error register group stores the hardware error data generated when the hardware device generates the correctable error at the current moment.
In the invention, when the system software handles the interrupt, it reads the preset register bits of the global control MSR register to obtain the number of the BANK into which the hardware error data was written. It no longer needs to traverse and check all BANK registers; it only checks the BANK registers that hold the hardware error, as identified by the BANK number obtained from the global control MSR.
Step 103, acquiring the hardware error data in the target error register group, and processing the hardware error data to obtain a hardware correctable error processing result.
In the invention, after the acquired hardware error data has been parsed, a user-mode application processes it accordingly and records the result in a system log. Correctable errors of the hardware device are thus handled without traversing and checking the MSR register information of all BANKs.
According to the hardware correctable error processing method provided by the invention, the error register group number information recorded when the hardware device generates a correctable error at the current moment is obtained through the preset register bits in the global control model-specific register. Only the error register group identified by that number has to be read; the register information of all error register groups no longer needs to be traversed and checked, which avoids wasted resource usage and improves the efficiency and speed with which the system handles hardware faults.
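Steps 101 to 103 can be sketched as a small simulation. This is only an illustration under stated assumptions: the MSRs are modelled as a Python dictionary rather than read from hardware, and the placement of the BANK number in bits 4-8 of IA32_MCG_STATUS (the names BANK_FIELD_SHIFT and BANK_FIELD_WIDTH) is hypothetical, since the text says only that reserved bits are used. The status and address values are arbitrary.

```python
# Hypothetical field layout (an assumption, not taken from the patent text):
# the BANK number occupies bits 4-8 of IA32_MCG_STATUS.
IA32_MCG_STATUS = 0x17A
BANK_FIELD_SHIFT = 4
BANK_FIELD_WIDTH = 5


def read_bank_number(msrs):
    """Step 101: extract the BANK number from the preset register bits."""
    mask = (1 << BANK_FIELD_WIDTH) - 1
    return (msrs[IA32_MCG_STATUS] >> BANK_FIELD_SHIFT) & mask


def process_correctable_error(msrs, banks):
    """Steps 102-103: select the target BANK and process only its data."""
    bank_no = read_bank_number(msrs)
    target = banks[bank_no]  # the other BANKs are never touched
    return {"bank": bank_no, "status": target["status"], "addr": target["addr"]}


# Simulated state: BANK 7 holds the error; bits 0-3 keep their ordinary flags.
msrs = {IA32_MCG_STATUS: (7 << BANK_FIELD_SHIFT) | 0b0100}
banks = {i: {"status": 0, "addr": 0} for i in range(32)}
banks[7] = {"status": 0x8C00004000010090, "addr": 0x12345000}  # arbitrary values
result = process_correctable_error(msrs, banks)
```

Masking the field before use keeps the ordinary status flags in bits 0-3 from leaking into the BANK number.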
On the basis of the above embodiment, before acquiring, through the preset register bits in the global control model-specific register, the error register group number information recorded when the hardware device generates a correctable error at the current moment, the method further includes:
when it is determined that the hardware device generates a correctable error at the current moment, acquiring correctable error data of the hardware device through a basic input/output system;
writing the correctable error data into the corresponding target error register group, and acquiring the group number of the target error register group to obtain the error register group number information;
and writing the error register group number information into the preset register bits of the global control model-specific register.
In the invention, when a hardware device generates a correctable error and firmware first mode is enabled, the BIOS preprocesses each correctable error, writes the error information into the CPU MSR registers of the corresponding BANK, writes the number of the faulty BANK into the reserved preset bits (i.e. the preset register bits) of the global control MSR register, and then reports the error to the system software through a CMCI interrupt. FIG. 2 is a schematic diagram of the register architecture for correctable errors provided by the present invention. Referring to FIG. 2, each BANK (i.e. Error-Reporting Bank Registers) includes five MSR registers, for example, BANK1: IA32_MC1_CTL, IA32_MC1_STATUS, IA32_MC1_ADDR, IA32_MC1_MISC and IA32_MC1_CTL2; BANK2: IA32_MC2_CTL, IA32_MC2_STATUS, IA32_MC2_ADDR, IA32_MC2_MISC and IA32_MC2_CTL2.
The number of BANKs and the type of error source each one records depend on the specific CPU model.
The global control MSR registers (i.e., Global Control MSRs) include: IA32_MCG_CAP, IA32_MCG_STATUS, IA32_MCG_CTL and IA32_MCG_EXT_CTL, the latter two of which are optional.
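The register layout described above can be made concrete with the MSR addresses. The sketch below uses the architectural addresses as listed in the Intel Software Developer's Manual (0x179 to 0x17B and 0x4D0 for the global registers, 0x400 + 4i for the per-BANK registers, 0x280 + i for IA32_MCi_CTL2); they are not given in this document and should be checked against the manual for a specific CPU. The helper name bank_msr_addresses is illustrative.

```python
# Global control MSR addresses (per the Intel SDM; verify for your CPU).
IA32_MCG_CAP = 0x179
IA32_MCG_STATUS = 0x17A
IA32_MCG_CTL = 0x17B
IA32_MCG_EXT_CTL = 0x4D0


def bank_msr_addresses(i):
    """Return the addresses of the five MSRs that make up BANK i."""
    base = 0x400 + 4 * i  # CTL, STATUS, ADDR and MISC are contiguous
    return {
        "CTL": base,
        "STATUS": base + 1,
        "ADDR": base + 2,
        "MISC": base + 3,
        "CTL2": 0x280 + i,  # the CTL2 registers live in a separate range
    }


bank1 = bank_msr_addresses(1)
```

For BANK1 this yields IA32_MC1_CTL at 0x404 and IA32_MC1_CTL2 at 0x281.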
In the present invention, after the correctable error data is obtained, it is written into the corresponding target error register group so that it is kept apart from other types of error data for subsequent analysis and processing. At the same time, the group number of the target error register group is acquired; this number uniquely identifies the target error register group and simplifies later access. To store and manage the error register group number information, it is written into the preset register bits of the global control MSR, from which the system software can quickly retrieve it when needed.
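The BIOS-side write described in this paragraph might look like the following sketch. It is a simulation with dictionaries standing in for the MSRs, and the field position (bits 4-8 of IA32_MCG_STATUS) and the function name bios_record_error are assumptions made for illustration only.

```python
IA32_MCG_STATUS = 0x17A
BANK_FIELD_SHIFT = 4  # assumption: field starts at the first reserved bit
BANK_FIELD_WIDTH = 5  # assumption: 5 bits, enough for 32 BANKs


def bios_record_error(msrs, banks, bank_no, error_data):
    """Write the error data into its BANK, then publish the BANK number."""
    if not 0 <= bank_no < (1 << BANK_FIELD_WIDTH):
        raise ValueError("BANK number does not fit in the preset field")
    banks[bank_no] = dict(error_data)
    field_mask = ((1 << BANK_FIELD_WIDTH) - 1) << BANK_FIELD_SHIFT
    old = msrs.get(IA32_MCG_STATUS, 0)
    # Clear the previous BANK number but keep every other bit unchanged.
    msrs[IA32_MCG_STATUS] = (old & ~field_mask) | (bank_no << BANK_FIELD_SHIFT)
    return bank_no


msrs = {IA32_MCG_STATUS: 0b0101 | (3 << BANK_FIELD_SHIFT)}  # stale value 3
banks = {}
bios_record_error(msrs, banks, 12, {"status": 1, "addr": 0x2000})
```

The read-modify-write preserves the flags in bits 0-3, which the text says are already in use.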
On the basis of the above embodiment, acquiring the correctable error data of the hardware device through the basic input/output system when it is determined that the hardware device generates a correctable error at the current moment includes:
determining, according to a hardware system management interrupt signal, that the hardware device generates a correctable error at the current moment, wherein the hardware system management interrupt signal is generated when the hardware device generates the correctable error;
performing error correction processing on the acquired correctable error data through the basic input/output system, and determining first correctable error data and second correctable error data, wherein the first correctable error data is correctable error data for which the error correction processing succeeded, and the second correctable error data is correctable error data for which the error correction processing failed;
the writing of the correctable error data into the corresponding target error register group includes:
writing the first correctable error data and the second correctable error data into the corresponding target error register groups through the basic input/output system.
In the invention, the hardware system management interrupt signal (SMI interrupt) is used to judge whether the hardware device generates a correctable error at the current moment; that is, once the BIOS receives the SMI interrupt, it can conclude that the hardware device has a correctable error at that moment. The BIOS then preprocesses the acquired correctable error data by attempting error correction, which minimizes the impact of the error on system performance. Because errors are corrected promptly, the system can continue to operate normally, and interruptions, delays, or performance loss caused by errors are reduced. The first correctable error data and the second correctable error data are then determined according to the preprocessing result.
Further, the BIOS stores the error data that was successfully corrected in the target error register group as the first correctable error data for subsequent analysis and processing. Error data that could not be corrected is also written into the target error register group, as the second correctable error data. In this way the error information of the hardware device is better recorded and managed, which supports further fault diagnosis and correction.
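The split into first and second correctable error data can be illustrated with a minimal sketch. The record type and its corrected flag are hypothetical; the text does not say how the BIOS marks the outcome of its correction attempt.

```python
from dataclasses import dataclass


@dataclass
class ErrorRecord:
    bank: int        # BANK the record will be written to
    status: int      # raw status value (arbitrary in this sketch)
    corrected: bool  # hypothetical flag: did BIOS preprocessing succeed?


def partition_records(records):
    """Split records into (first, second) correctable error data."""
    first = [r for r in records if r.corrected]       # correction succeeded
    second = [r for r in records if not r.corrected]  # correction failed
    return first, second


records = [
    ErrorRecord(bank=7, status=0x90, corrected=True),
    ErrorRecord(bank=7, status=0x91, corrected=False),
    ErrorRecord(bank=9, status=0x92, corrected=True),
]
first, second = partition_records(records)
```

Both lists are written to their BANKs; the split only preserves the outcome for later diagnosis.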
On the basis of the foregoing embodiment, before writing the correctable error data into the corresponding target error register group and acquiring the group number of the target error register group, the method further includes:
receiving a first input, the first input comprising an operation of selecting, in the global control model-specific register, a register bit for recording error register group number information;
setting, in response to the first input, the corresponding register bit in the global control model-specific register as the preset register bit;
the writing of the error register group number information into the preset register bits of the global control model-specific register includes:
checking the register value of the preset register bit into which the error register group number information is to be written and, if the value meets a preset threshold condition, determining that preset register bit as a target register bit and writing the error register group number information into the target register bit.
In the present invention, the global control MSR registers contain some reserved bits, and the preset register bits can be chosen among them in advance through a corresponding input operation. Specifically, the present invention uses IA32_MCG_STATUS and IA32_MCG_CAP.MCG_EXT_P9 in the global control MSR registers as preset register bits. IA32_MCG_STATUS describes the state of the current processor after an error has occurred; this 64-bit register uses only bits 0-3, and bits 4-63 are reserved, so these reserved bits can serve as preset register bits. When the value of IA32_MCG_CAP.MCG_EXT_P9 is 1, additional extended MSR registers are enabled so that more detailed error information can be recorded, and this bit is likewise set as a preset register bit.
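A check on the chosen preset bits could be sketched as follows. The text does not define the "preset threshold condition", so this sketch assumes one possible reading: the field must lie entirely within the reserved bits 4-63 of IA32_MCG_STATUS and must currently read as zero. The function names are illustrative.

```python
MCG_STATUS_USED_BITS = 4  # bits 0-3 are in use, bits 4-63 reserved (per the text)
MSR_WIDTH = 64


def field_is_reserved(shift, width):
    """True if the field [shift, shift + width) lies within the reserved bits."""
    return width > 0 and shift >= MCG_STATUS_USED_BITS and shift + width <= MSR_WIDTH


def can_write_bank_number(register_value, shift, width):
    """Assumed threshold condition: field is reserved and currently zero."""
    if not field_is_reserved(shift, width):
        return False
    mask = ((1 << width) - 1) << shift
    return (register_value & mask) == 0


ok = can_write_bank_number(0b0101, shift=4, width=5)
```

Rejecting fields that overlap bits 0-3 guards against corrupting the existing status flags.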
On the basis of the above embodiment, after the error register group number information is written into the preset register bits of the global control model-specific register, the method further includes:
generating a corrected machine check interrupt signal through the basic input/output system;
reading, based on the corrected machine check interrupt signal, the preset register bits in the global control model-specific register, and acquiring the error register group number information recorded when the hardware device generates the correctable error at the current moment.
In the existing approach, after the BIOS reports the CMCI interrupt signal to the system software, the system software has to traverse every BANK to find the error information. In the invention, the BIOS writes the number of the faulty BANK into the preset register bits of the corresponding global control MSR when it generates the CMCI interrupt signal. When the system software receives the CMCI interrupt signal, it does not traverse all BANKs; instead it reads from the global control MSR the error register group number information for the correctable error at the current moment and checks only the error information of that BANK, which improves the efficiency with which the system software handles hardware faults.
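The difference between the two handlers can be illustrated by counting simulated register reads. This is a toy model, not a measurement: it assumes that the traversal reads all five registers of every BANK and that the proposed handler needs one extra read for the BANK number. The class and function names are invented for the sketch.

```python
REGS_PER_BANK = 5  # CTL, STATUS, ADDR, MISC, CTL2
STATUS_INDEX = 1


class CountingMsrReader:
    """Stand-in for MSR access that counts how many reads are issued."""

    def __init__(self, values):
        self.values = values
        self.reads = 0

    def read(self, key):
        self.reads += 1
        return self.values.get(key, 0)


def handle_by_traversal(reader, num_banks):
    """Existing approach: inspect every BANK."""
    faulty = []
    for bank in range(num_banks):
        regs = [reader.read((bank, r)) for r in range(REGS_PER_BANK)]
        if regs[STATUS_INDEX]:
            faulty.append(bank)
    return faulty


def handle_by_bank_number(reader):
    """Proposed approach: read the BANK number, then only that BANK."""
    bank = reader.read("bank_number")
    regs = [reader.read((bank, r)) for r in range(REGS_PER_BANK)]
    return [bank] if regs[STATUS_INDEX] else []


values = {"bank_number": 9, (9, STATUS_INDEX): 1}
legacy, proposed = CountingMsrReader(values), CountingMsrReader(values)
found_legacy = handle_by_traversal(legacy, num_banks=28)
found_proposed = handle_by_bank_number(proposed)
```

Under these assumptions, 28 BANKs cost 140 reads with traversal and 6 reads with the BANK number lookup, while both handlers find the same BANK.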
On the basis of the above embodiment, the method further includes:
judging whether the number of error register groups is larger than a preset number, and if so, expanding the number of the preset register bits in the global control model-specific register according to the number of error register groups.
In the present invention, existing CPUs have more than twenty BANKs, so at least 5 binary register bits are needed: a 5-bit binary number can represent at most 32 BANK numbers (2^5 = 32). If later CPUs have more BANKs, the field can be expanded to 6 bits (2^6 = 64), which supports recording 64 BANK numbers. Accordingly, when the number of BANKs increases, the number of reserved bits used in the global control MSR (i.e., the preset register bits) is increased to match. Expanding the number of preset register bits ensures that the system can record and manage more error register groups and provides more complete data for subsequent fault handling and analysis.
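The bit-width arithmetic above can be checked with a short sketch. The helper names `bits_needed` and `expanded_width` are hypothetical, and the "preset number" is assumed here to be the number of BANKs that the current field width can represent.

```python
def bits_needed(bank_count: int) -> int:
    """Smallest number of bits that can encode BANK numbers 0..bank_count-1."""
    if bank_count < 1:
        raise ValueError("bank_count must be at least 1")
    return max(1, (bank_count - 1).bit_length())


def expanded_width(bank_count: int, current_width: int = 5) -> int:
    """Widen the preset field only when the BANK count exceeds its capacity."""
    preset_number = 1 << current_width  # assumption: capacity of the current field
    if bank_count > preset_number:
        return bits_needed(bank_count)
    return current_width


print(bits_needed(24))     # 5
print(bits_needed(32))     # 5
print(expanded_width(40))  # 6
```

With 5 bits the field covers up to 32 BANKs; a CPU with 33 to 64 BANKs would need 6 bits.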
On the basis of the foregoing embodiment, after the obtaining of the hardware error data in the target error register group and the processing of the hardware error data to obtain a hardware correctable error processing result, the method further includes:
generating corresponding hardware error log information according to the hardware correctable error processing result;
and storing the hardware error log information in a system log, and displaying the hardware error log information in the system log through a display device.
In the present invention, related information such as the error code, fault location, and error type can be obtained from the hardware correctable error processing result. Corresponding hardware error log information is generated from this information, and a timestamp and other metadata are added so that errors can be tracked and analyzed during subsequent troubleshooting and repair. The generated hardware error log information is then saved in the system log, which is a file that records the running state, events, and errors of the system. Displaying the hardware error log information from the system log on a display device (such as a monitoring screen or a terminal interface) helps the relevant personnel quickly find and resolve hardware faults and carry out any necessary maintenance or replacement.
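A minimal sketch of the log generation step is shown below. The record layout, the field names (`bank`, `error_code`, `error_type`), the `severity` label, and the use of JSON lines are all assumptions for illustration; the text does not specify a log format. The system log is modelled as an in-memory list.

```python
import json
from datetime import datetime, timezone
from typing import List, Optional


def build_error_log(result: dict, when: Optional[datetime] = None) -> str:
    """Turn a processing result into one log line with a timestamp added."""
    when = when or datetime.now(timezone.utc)
    record = {"timestamp": when.isoformat(), "severity": "corrected", **result}
    return json.dumps(record, sort_keys=True)


def store_and_display(system_log: List[str], line: str) -> None:
    """Append the entry to the (simulated) system log and show it."""
    system_log.append(line)
    print(line)


system_log: List[str] = []
entry = build_error_log(
    {"bank": 17, "error_code": "0x90", "error_type": "memory ECC"},
    when=datetime(2023, 9, 26, tzinfo=timezone.utc),
)
store_and_display(system_log, entry)
```

Passing the timestamp explicitly keeps the example reproducible; in practice the default of the current time would normally be used.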
In an embodiment, the hardware correctable error processing method provided by the present invention is described as a whole. Fig. 3 is a schematic overall flow diagram of the method. Referring to fig. 3, when a hardware correctable error occurs, the hardware first sends the fault data to the BIOS for processing through an SMI interrupt. Then, with the firmware-first mode enabled in the BIOS, the BIOS attempts to correct the error, writes the error data into the CPU MSR registers of the corresponding BANK, and writes the number of the BANK in which the error occurred into the preset reserved bits of the global control MSR register. After the BIOS has finished writing the registers of the BANK, it reports the error to the system software for subsequent processing through a CMCI interrupt. When the system software processes the CMCI interrupt, it first reads the preset register bits of the global control MSR register to obtain the number of the BANK in which the error occurred. The system software therefore does not need to traverse and check all BANK registers; it only needs to check the registers of the indicated BANK to obtain and parse the error information. Finally, the system software reads the error information from the error buffer through a user-mode application, processes it accordingly, and records the result in the system log.
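The overall flow can be sketched as a small simulation. Everything here is hypothetical scaffolding: the `SimulatedCpu` class, the 24-BANK count, the position and width of the preset bits, and the function names. The sketch only mirrors the order of the steps (BIOS handles the SMI, writes the BANK and the BANK number, raises a CMCI; system software reads the number and checks one BANK).

```python
from dataclasses import dataclass, field

BANK_SHIFT, BANK_WIDTH = 4, 5  # assumed position and width of the preset bits
BANK_MASK = ((1 << BANK_WIDTH) - 1) << BANK_SHIFT


@dataclass
class SimulatedCpu:
    global_msr: int = 0
    banks: list = field(default_factory=lambda: [None] * 24)


def bios_handle_smi(cpu: SimulatedCpu, bank: int, error_data: str) -> str:
    """BIOS side: log the error in its BANK, then record the BANK number."""
    cpu.banks[bank] = error_data
    cpu.global_msr = (cpu.global_msr & ~BANK_MASK) | (bank << BANK_SHIFT)
    return "CMCI"  # stands in for raising the interrupt to system software


def os_handle_cmci(cpu: SimulatedCpu, system_log: list) -> int:
    """System software side: read the BANK number, then check only that BANK."""
    bank = (cpu.global_msr & BANK_MASK) >> BANK_SHIFT
    system_log.append({"bank": bank, "error": cpu.banks[bank]})
    return bank


cpu = SimulatedCpu()
log = []
if bios_handle_smi(cpu, 9, "corrected ECC error") == "CMCI":
    os_handle_cmci(cpu, log)
print(log)  # [{'bank': 9, 'error': 'corrected ECC error'}]
```

Error correction itself and the user-mode application are left out; the point of the sketch is that the handler never iterates over `cpu.banks`.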
The hardware-correctable error processing system provided by the present invention will be described below, and the hardware-correctable error processing system described below and the hardware-correctable error processing method described above may be referred to correspondingly with each other.
Fig. 4 is a schematic structural diagram of the hardware correctable error processing system provided by the present invention. As shown in fig. 4, the system includes a first processing module 401, an error register group determining module 402, and a second processing module 403. The first processing module 401 is configured to obtain, through a preset register bit in a global control model-specific register, the error register group number information corresponding to a correctable error that occurs in a hardware device at the current moment. The error register group determining module 402 is configured to determine a target error register group from a plurality of error register groups according to the error register group number information, where the target error register group stores the hardware error data generated when the hardware device produces the correctable error at the current moment. The second processing module 403 is configured to obtain the hardware error data in the target error register group and process it to obtain a hardware correctable error processing result.
In the present invention, the number (i.e., the error register group number information) of the BANK (i.e., the target error register group) that stores the error information (i.e., the hardware error data) of the hardware device at the current moment may be stored in the preset register bit of the global control MSR register. When the system software processes the interrupt, the first processing module 401 only needs to read that preset register bit to obtain the error register group number information corresponding to the correctable error that occurred in the hardware device at the current moment.
Because the first processing module 401 obtains the number of the BANK into which the hardware error data was written by reading the preset register bit of the global control MSR register, there is no need to traverse and check all BANK registers. The error register group determining module 402 then checks only the BANK registers that hold the hardware error, according to the BANK number obtained from the global control MSR.
Finally, after parsing the obtained hardware error data, the second processing module 403 processes it through a user-mode application and records the result in the system log. Correctable errors of the hardware device are thus handled without traversing and checking the MSR register information of all BANKs.
According to the hardware correctable error processing system provided by the present invention, the error register group number information for the correctable error that occurs in the hardware device at the current moment is obtained through the preset register bit in the global control model-specific register. Only the error register group identified by that number needs to be read, and the register information of all error register groups does not have to be traversed and checked. This avoids wasted resource usage and improves the efficiency and speed with which the system handles hardware faults.
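The three-module structure can be sketched as three small classes. The class names follow the module names in the text, but the constructor parameters, the dictionary representation of a BANK, and the default field position and width are assumptions for illustration only.

```python
from typing import Callable, Dict, List


class FirstProcessingModule:  # module 401
    """Reads the BANK number from the preset bits of the global control MSR."""

    def __init__(self, read_global_msr: Callable[[], int], shift: int = 4, width: int = 5):
        self._read = read_global_msr
        self._shift = shift
        self._mask = (1 << width) - 1

    def bank_number(self) -> int:
        return (self._read() >> self._shift) & self._mask


class ErrorRegisterGroupDeterminingModule:  # module 402
    """Selects the target BANK by number, without scanning the others."""

    def __init__(self, banks: List[Dict]):
        self._banks = banks

    def target_bank(self, number: int) -> Dict:
        return self._banks[number]


class SecondProcessingModule:  # module 403
    """Processes the hardware error data held in the target BANK."""

    def process(self, bank: Dict) -> Dict:
        return {"status": "handled", **bank}


banks = [{"id": i, "error": None} for i in range(24)]
banks[3]["error"] = "ECC"
first = FirstProcessingModule(lambda: 3 << 4)  # simulated MSR read
target = ErrorRegisterGroupDeterminingModule(banks).target_bank(first.bank_number())
print(SecondProcessingModule().process(target))
```

Injecting the MSR read as a callable keeps the sketch independent of any platform-specific register access.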
The system provided by the invention is used for executing the method embodiments, and specific flow and details refer to the embodiments and are not repeated herein.
Fig. 5 is a schematic structural diagram of an electronic device according to the present invention. As shown in fig. 5, the electronic device may include a processor 501, a communication interface 502, a memory 503, and a communication bus 504, where the processor 501, the communication interface 502, and the memory 503 communicate with each other through the communication bus 504. The processor 501 may invoke logic instructions in the memory 503 to perform a hardware correctable error processing method comprising: acquiring, through a preset register bit in a global control model-specific register, error register group number information corresponding to a correctable error that occurs in a hardware device at the current moment; determining a target error register group from a plurality of error register groups according to the error register group number information, wherein the target error register group stores hardware error data generated when the hardware device produces the correctable error at the current moment; and acquiring the hardware error data in the target error register group, and processing the hardware error data to obtain a hardware correctable error processing result.
Further, the logic instructions in the memory 503 described above may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and comprises several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the hardware correctable error processing method provided above, the method comprising: acquiring, through a preset register bit in a global control model-specific register, error register group number information corresponding to a correctable error that occurs in a hardware device at the current moment; determining a target error register group from a plurality of error register groups according to the error register group number information, wherein the target error register group stores hardware error data generated when the hardware device produces the correctable error at the current moment; and acquiring the hardware error data in the target error register group, and processing the hardware error data to obtain a hardware correctable error processing result.
In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the hardware correctable error processing method provided by the above embodiments, the method comprising: acquiring, through a preset register bit in a global control model-specific register, error register group number information corresponding to a correctable error that occurs in a hardware device at the current moment; determining a target error register group from a plurality of error register groups according to the error register group number information, wherein the target error register group stores hardware error data generated when the hardware device produces the correctable error at the current moment; and acquiring the hardware error data in the target error register group, and processing the hardware error data to obtain a hardware correctable error processing result.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present invention without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A method of hardware-correctable error processing, comprising:
acquiring, through a preset register bit in a global control model-specific register, error register group number information corresponding to a correctable error that occurs in a hardware device at the current moment;
determining a target error register group from a plurality of error register groups according to the error register group number information, wherein the target error register group stores hardware error data generated when the hardware device produces the correctable error at the current moment;
and acquiring the hardware error data in the target error register group, and processing the hardware error data to obtain a hardware correctable error processing result.
2. The hardware correctable error processing method according to claim 1, wherein before the acquiring, through the preset register bit in the global control model-specific register, of the error register group number information corresponding to the correctable error that occurs in the hardware device at the current moment, the method further comprises:
when it is determined that the hardware device has a correctable error at the current moment, acquiring correctable error data of the hardware device through a basic input/output system;
writing the correctable error data into the corresponding target error register group, and acquiring the group number information of the target error register group to obtain the error register group number information;
and writing the error register group number information into the preset register bit of the global control model-specific register.
3. The hardware correctable error processing method according to claim 2, wherein the acquiring, through the basic input/output system, of the correctable error data of the hardware device when it is determined that the hardware device has a correctable error at the current moment comprises:
determining, according to a hardware system management interrupt signal, that the hardware device has a correctable error at the current moment, wherein the hardware system management interrupt signal is generated when the hardware device produces the correctable error;
performing error correction processing on the acquired correctable error data through the basic input/output system, and determining first correctable error data and second correctable error data, wherein the first correctable error data is correctable error data for which the error correction processing succeeded, and the second correctable error data is correctable error data for which the error correction processing failed;
the writing of the correctable error data into the corresponding target error register group comprises:
writing the first correctable error data and the second correctable error data into the corresponding target error register groups through the basic input/output system.
4. The hardware correctable error processing method according to claim 2, wherein before the writing of the correctable error data into the corresponding target error register group and the acquiring of the group number information of the target error register group, the method further comprises:
receiving a first input, the first input comprising an operation of selecting a register bit in the global control model-specific register for recording error register group number information;
setting, in response to the first input, the corresponding register bit in the global control model-specific register as the preset register bit;
the writing of the error register group number information into the preset register bit of the global control model-specific register comprises:
judging the register value corresponding to the preset register bit into which the error register group number information is to be written, and if a preset threshold condition is met, determining that preset register bit as a target register bit and writing the error register group number information into the target register bit.
5. The hardware correctable error processing method according to claim 3, wherein after the writing of the error register group number information into the preset register bit of the global control model-specific register, the method further comprises:
generating a corrected machine check interrupt signal through the basic input/output system;
based on the corrected machine check interrupt signal, reading the preset register bit in the global control model-specific register, and acquiring the error register group number information corresponding to the correctable error that occurs in the hardware device at the current moment.
6. The hardware-correctable error processing method of claim 2, further comprising:
judging whether the number of error register groups is larger than a preset number, and if so, expanding the number of the preset register bits in the global control model-specific register according to the number of error register groups.
7. The hardware correctable error processing method according to claim 1, wherein after the acquiring of the hardware error data in the target error register group and the processing of the hardware error data, the method further comprises:
generating corresponding hardware error log information according to the hardware correctable error processing result;
and storing the hardware error log information in a system log, and displaying the hardware error log information in the system log through a display device.
8. A hardware correctable error processing system, comprising:
a first processing module, configured to acquire, through a preset register bit in a global control model-specific register, error register group number information corresponding to a correctable error that occurs in a hardware device at the current moment;
an error register group determining module, configured to determine a target error register group from a plurality of error register groups according to the error register group number information, wherein the target error register group stores hardware error data generated when the hardware device produces the correctable error at the current moment;
and a second processing module, configured to acquire the hardware error data in the target error register group and process the hardware error data to obtain a hardware correctable error processing result.
9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the hardware correctable error processing method of any one of claims 1 to 7 when executing the computer program.
10. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the hardware correctable error processing method of any one of claims 1 to 7.
CN202311253309.7A 2023-09-26 2023-09-26 Hardware correctable error processing method and system Pending CN117271190A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311253309.7A CN117271190A (en) 2023-09-26 2023-09-26 Hardware correctable error processing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311253309.7A CN117271190A (en) 2023-09-26 2023-09-26 Hardware correctable error processing method and system

Publications (1)

Publication Number Publication Date
CN117271190A true CN117271190A (en) 2023-12-22

Family

ID=89217358

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311253309.7A Pending CN117271190A (en) 2023-09-26 2023-09-26 Hardware correctable error processing method and system

Country Status (1)

Country Link
CN (1) CN117271190A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination