CN111143125B - MCE error processing method and device, electronic equipment and storage medium - Google Patents



Publication number
CN111143125B
Authority
CN
China
Prior art keywords
error
mce
dcpmm
user interaction
mce error
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911328687.0A
Other languages
Chinese (zh)
Other versions
CN111143125A (en)
Inventor
来炜国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Electronic Information Industry Co Ltd
Original Assignee
Inspur Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Electronic Information Industry Co Ltd
Priority to CN201911328687.0A
Publication of CN111143125A
Application granted
Publication of CN111143125B
Current legal status: Active
Anticipated expiration


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/07: Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14: Error detection or correction of the data by redundancy in operation
    • G06F11/1479: Generic software techniques for error detection or fault masking

Abstract

The application discloses an MCE error processing method, device, equipment and medium. The method comprises the following steps: receiving an MCE error sent by a memory controller of a DCPMM device; if the operation corresponding to the MCE error has no corresponding application program, sending the MCE error to a user interaction process; if the operation corresponding to the MCE error has a corresponding application program, sending the MCE error to a DCPMM recovery process for repair processing; if the DCPMM recovery process fails to repair the MCE error, sending the MCE error to the user interaction process; and determining an error processing mode by utilizing the user interaction process, and processing the MCE error according to the error processing mode. Depending on whether the operation corresponding to the MCE error has a corresponding application program, the MCE error is sent either to the DCPMM recovery process or to the user interaction process for error processing, so that MCE errors from a DCPMM device are handled.

Description

MCE error processing method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to an MCE error processing method and apparatus, an electronic device, and a computer-readable storage medium.
Background
Modern processors all implement an MCA (Machine Check Architecture). When a system hardware error occurs, the processor records the relevant information in MSRs (Model Specific Registers) and raises an MCE (Machine Check Exception) to the operating system. The MCE handler in the operating system is responsible for processing the MCE. If the hardware error may be recoverable by software, that is, it is an SRAR (Software Recoverable Action Required) error, the MCE handler executes the corresponding code to try to repair it. If the repair fails, or there is no corresponding repair code, or the hardware error is not recoverable by software, the operating system logs the hardware error and enters an error state.
DCPMM (Data Center Persistent Memory Module) is a persistent memory device that uses the physical form factor of a DIMM (Dual Inline Memory Module), and has the advantages of large capacity, long service life, byte-level access, and the like. However, compared with DRAM (dynamic random access memory), the memory cells of a DCPMM device are more prone to error. How to process the MCE signal generated by a DCPMM device is therefore a problem to be solved by those skilled in the art.
Disclosure of Invention
The application aims to provide an MCE error processing method and device, an electronic device and a computer-readable storage medium for processing MCE errors generated by a DCPMM device.
In order to achieve the above object, the present application provides an MCE error handling method, including:
when the DCPMM device has an error, receiving an MCE error sent by a memory controller of the DCPMM device;
if the operation corresponding to the MCE error does not have the corresponding application program, directly sending the MCE error to a user interaction process;
if the operation corresponding to the MCE error has a corresponding application program, sending the MCE error to a DCPMM recovery process for repair processing;
if the DCPMM recovery process fails to repair the MCE error, sending the MCE error to the user interaction process;
and determining an error processing mode by utilizing the user interaction process so as to process the MCE error according to the error processing mode.
Optionally, if the operation corresponding to the MCE error has a corresponding application program, sending the MCE error to the DCPMM recovery process for repair processing includes:
if the operation corresponding to the MCE error is a memory read-write operation, sending the MCE error to the application program corresponding to the operation;
and determining, by the application program, the corresponding detailed description information according to the error address corresponding to the MCE error, and sending the detailed description information to the DCPMM recovery process, so that the DCPMM recovery process can repair the MCE error by using a data protection mechanism.
Optionally, sending the MCE error to a DCPMM recovery process for repair processing includes:
sending the MCE error to the DCPMM recovery process;
determining an error address corresponding to the MCE error by using the DCPMM recovery process;
if the target data corresponding to the error address is stored in the cache storage, performing a flushing operation on the cache storage so as to flush the target data to the error address;
if the storage area where the error address is located has mirror image storage, reading the target data from the mirror image storage and writing the target data into the error address;
and if the error address belongs to the address in the software disk array, performing data recovery by using a RAID data recovery mechanism.
Optionally, if the operation corresponding to the MCE error does not have a corresponding application program, directly sending the MCE error to the user interaction process includes:
if the operation corresponding to the MCE error is a memory address patrol operation, directly sending the MCE error to the user interaction process.
Optionally, the determining, by using the user interaction process, an error handling manner so as to handle the MCE error according to the error handling manner includes:
determining a use mode of an error address corresponding to the MCE error by using the user interaction process;
and providing a corresponding error processing mode by using the user interaction process according to the use mode, and receiving a selection instruction aiming at the error processing mode through a preset interface so as to process the MCE error according to the error processing mode selected by the selection instruction.
Optionally, the providing, according to the usage mode, a corresponding error handling manner by using the user interaction process includes:
if the use mode is a memory mode, providing a first type of processing mode by using the user interaction process; the first type of processing mode comprises the following options: ignoring the current MCE error and continuing system operation; or restarting the system.
Optionally, the providing, according to the usage mode, a corresponding error handling manner by using the user interaction process includes:
if the use mode is a persistent storage mode, providing a second type of processing mode by utilizing the user interaction process; the second type of processing mode comprises the following options: deleting the current data at the error address; or overwriting the current data at the error address.
To achieve the above object, the present application provides an MCE error processing apparatus, comprising:
the error receiving module is used for receiving MCE errors sent by a memory controller of the DCPMM device after the DCPMM device generates errors;
the first sending module is used for directly sending the MCE error to a user interaction process if the operation corresponding to the MCE error does not have a corresponding application program;
the second sending module is used for sending the MCE error to a DCPMM recovery process for repair if the operation corresponding to the MCE error has a corresponding application program;
a third sending module, configured to send the MCE error to the user interaction process if the DCPMM recovery process fails to repair the MCE error;
and the error processing module is used for determining an error processing mode by utilizing the user interaction process so as to process the MCE error according to the error processing mode.
To achieve the above object, the present application provides an electronic device including:
a memory for storing a computer program;
a processor for implementing the steps of any one of the MCE error handling methods disclosed above when executing the computer program.
To achieve the above object, the present application provides a computer-readable storage medium having a computer program stored thereon, which when executed by a processor, implements the steps of any one of the MCE error handling methods disclosed in the foregoing.
According to the above scheme, the MCE error processing method provided by the present application includes: when the DCPMM device has an error, receiving an MCE error sent by a memory controller of the DCPMM device; if the operation corresponding to the MCE error does not have a corresponding application program, directly sending the MCE error to a user interaction process; if the operation corresponding to the MCE error has a corresponding application program, sending the MCE error to a DCPMM recovery process for repair processing; if the DCPMM recovery process fails to repair the MCE error, sending the MCE error to the user interaction process; and determining an error processing mode by utilizing the user interaction process so as to process the MCE error according to the error processing mode. Thus, after the DCPMM device generates an error, the memory controller sends an MCE error; depending on whether the operation corresponding to the MCE error has a corresponding application program, the MCE error is either sent to the DCPMM recovery process for repair or sent to the user interaction process, which determines the error processing mode and processes the MCE error. In this way, MCE errors generated by the DCPMM device are handled.
The application also discloses an MCE error processing device, an electronic device and a computer readable storage medium, and the technical effects can be realized.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of an MCE error handling method disclosed in an embodiment of the present application;
Fig. 2 is a flowchart of another MCE error handling method disclosed in an embodiment of the present application;
Fig. 3 is a flowchart of another MCE error handling method disclosed in an embodiment of the present application;
Fig. 4 is a structural diagram of an MCE error handling apparatus disclosed in an embodiment of the present application;
Fig. 5 is a block diagram of an electronic device disclosed in an embodiment of the present application;
Fig. 6 is a block diagram of another electronic device disclosed in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the prior art, compared with a DRAM memory, a memory cell of a DCPMM device is more prone to error, and the DCPMM device is a novel device, and how to process an MCE signal generated by the DCPMM device is a problem to be solved by those skilled in the art.
Therefore, the embodiment of the application discloses an MCE error processing method, which realizes the processing of MCE errors generated by a DCPMM device.
Referring to fig. 1, an MCE error handling method disclosed in an embodiment of the present application includes:
S101: when the DCPMM device has an error, receiving an MCE error sent by a memory controller of the DCPMM device;
In a specific implementation, because the memory cells of a DCPMM device are more prone to errors, data reads and writes on the DCPMM device are ECC-checked. When the data in a memory cell suffers an uncorrectable error, the DCPMM device marks the data of that memory cell as poisoned and notifies the memory controller through a DDRT bus signal. On learning that an uncorrectable error has occurred on the DCPMM device, the memory controller generates an MSMI signal and sends it to the BIOS, and generates an MCE signal and sends it to the operating system.
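The notification chain in S101 can be pictured with a small simulation. This is only a minimal sketch for illustration: the `Cell` class, the `report_uncorrectable` function and the string labels are hypothetical names, and no real firmware or operating system interface is modelled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cell:
    """One DCPMM memory cell in this toy model (hypothetical type)."""
    address: int
    poisoned: bool = False


def report_uncorrectable(cell: Cell) -> List[str]:
    """Model the notifications that follow an uncorrectable ECC error.

    The device marks the cell as poisoned and signals the memory
    controller over DDRT; the controller then raises MSMI to the BIOS
    and MCE to the operating system. The ordered list of notifications
    is returned so that a caller can inspect it.
    """
    cell.poisoned = True  # the device marks the data as poisoned
    return [
        f"DDRT: device -> memory controller (address {cell.address:#x})",
        "MSMI: memory controller -> BIOS",
        "MCE: memory controller -> operating system",
    ]


cell = Cell(address=0x1000)
events = report_uncorrectable(cell)
```

After the call, `cell.poisoned` is set and `events` holds the three notifications in the order given in the description.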
S102: if the operation corresponding to the MCE error does not have the corresponding application program, directly sending the MCE error to a user interaction process;
S103: if the operation corresponding to the MCE error has a corresponding application program, sending the MCE error to a DCPMM recovery process for repair processing;
After the operating system receives the MCE error, the MCE handler is used to determine the corresponding error type, and the MCE handler is responsible for processing the MCE signal.
For MCE errors generated by DCPMM devices, the memory controller error types may include, but are not limited to, memory read-write errors, memory patrol errors, generic undefined requests, and so on. When the MCE handler determines that the operation corresponding to the current MCE error has no corresponding application program, that is, the operation is a memory address patrol operation or a generic undefined request, no further detailed information about the error address can be obtained, and the MCE error is therefore sent directly to the user interaction process. When the MCE handler determines that the operation has a corresponding application program, that is, the operation is a memory read-write operation, the MCE error can be sent to the DCPMM recovery process for repair processing.
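The routing decision in S102 and S103 reduces to a test on the operation type. The following is a minimal sketch under that reading; the enum and function names are hypothetical and are not taken from any operating system API.

```python
from enum import Enum, auto


class Operation(Enum):
    MEMORY_READ_WRITE = auto()   # issued on behalf of an application
    MEMORY_PATROL = auto()       # background patrol, no owning application
    GENERIC_UNDEFINED = auto()   # generic undefined request


class Target(Enum):
    USER_INTERACTION = auto()
    DCPMM_RECOVERY = auto()


def route(op: Operation) -> Target:
    """Pick the first recipient of an MCE error from the operation type."""
    if op is Operation.MEMORY_READ_WRITE:
        # An application owns the access, so repair can be attempted.
        return Target.DCPMM_RECOVERY
    # No application is associated with the operation.
    return Target.USER_INTERACTION
```

Only a read-write operation reaches the recovery process first; every other operation type goes straight to the user interaction process.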
Specifically, the process of sending the MCE error to the DCPMM recovery process for repair processing may be: sending the MCE error to the application program corresponding to the operation; and determining, by the application program, the corresponding detailed description information according to the error address corresponding to the MCE error, and sending the detailed description information to the DCPMM recovery process so that the DCPMM recovery process can repair the current MCE error by using a data protection mechanism.
It will be appreciated that the above application program is specifically a DCPMM-friendly application program. After the application program receives the signal corresponding to an MCE error sent by the MCE handler, it first obtains the detailed description information related to the MCE error according to the error address corresponding to the current MCE error, and then sends the detailed description information together with the error address to the DCPMM recovery process, so that the DCPMM recovery process can process the MCE error. If the processing is successful, the MCE processing mechanism may be exited.
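One way an application might derive detailed description information from an error address is to keep a sorted table of the regions it has placed in persistent memory. The sketch below assumes such a table; the `REGIONS` entries, the field names of the returned dictionary and the `describe` function are all hypothetical, since the source does not specify the content of the description.

```python
import bisect
from typing import List, Optional, Tuple

# (start address, length in bytes, description), sorted by start address.
# The entries are invented for illustration.
REGIONS: List[Tuple[int, int, str]] = [
    (0x1000, 0x400, "index.db: B-tree root page"),
    (0x2000, 0x800, "journal.log: write-ahead log"),
]
STARTS = [start for start, _, _ in REGIONS]


def describe(error_address: int) -> Optional[dict]:
    """Return the description to send to the recovery process, or None
    if the address is not inside any region known to the application."""
    i = bisect.bisect_right(STARTS, error_address) - 1
    if i < 0:
        return None  # address lies below the first region
    start, length, what = REGIONS[i]
    if error_address >= start + length:
        return None  # address falls in a gap after region i
    return {
        "error_address": error_address,
        "region_start": start,
        "offset": error_address - start,
        "description": what,
    }
```

The binary search keeps the lookup cheap even when the application tracks many regions.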
S104: if the DCPMM recovery process fails to repair the MCE error, sending the MCE error to the user interaction process;
It should be noted that, if the MCE error is not successfully repaired by the DCPMM recovery process, the MCE error that failed to be repaired may be sent to the user interaction process for processing.
S105: and determining an error processing mode by utilizing the user interaction process so as to process the MCE error according to the error processing mode.
In the embodiment of the application, detailed description information of the current MCE error can be provided to a user by using a user interaction process, and the user selects a processing mode for the current MCE error. Specifically, detailed description information of the current MCE error can be displayed through a preset interactive interface, and further, all selectable processing modes can be displayed, and a user can select the error processing mode through clicking or touching and the like so as to process the MCE error according to the processing mode.
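Steps S102 to S105 can be combined into one control flow with a fallback to the user. The following is a minimal sketch, assuming the repair attempt and the user prompt are supplied as callables; `handle_mce` and its return labels are hypothetical.

```python
from typing import Callable


def handle_mce(
    has_application: bool,
    try_repair: Callable[[], bool],
    ask_user: Callable[[], str],
) -> str:
    """Return a short label describing how the error was resolved."""
    # S103: attempt repair only when an application owns the operation.
    if has_application and try_repair():
        return "repaired"  # repair succeeded, exit MCE handling
    # S102 (no application) or S104 (repair failed): the user interaction
    # process determines the error processing mode (S105).
    return "user:" + ask_user()
```

Because of short-circuit evaluation, `try_repair` is never called when no application is associated with the operation.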
According to the above scheme, the MCE error processing method provided by the present application includes: when the DCPMM device has an error, receiving an MCE error sent by a memory controller of the DCPMM device; if the operation corresponding to the MCE error does not have a corresponding application program, directly sending the MCE error to a user interaction process; if the operation corresponding to the MCE error has a corresponding application program, sending the MCE error to a DCPMM recovery process for repair processing; if the DCPMM recovery process fails to repair the MCE error, sending the MCE error to the user interaction process; and determining an error processing mode by utilizing the user interaction process so as to process the MCE error according to the error processing mode. Thus, after the DCPMM device generates an error, the memory controller sends an MCE error; depending on whether the operation corresponding to the MCE error has a corresponding application program, the MCE error is either sent to the DCPMM recovery process for repair or sent to the user interaction process, which determines the error processing mode and processes the MCE error. In this way, MCE errors generated by the DCPMM device are handled.
The embodiment of the application discloses another MCE error processing method, and compared with the previous embodiment, the embodiment further describes and optimizes the repair processing process of the DCPMM recovery process. Referring to fig. 2, specifically:
S201: when the DCPMM device has an error, receiving an MCE error sent by a memory controller of the DCPMM device;
S202: if the operation corresponding to the MCE error does not have the corresponding application program, directly sending the MCE error to a user interaction process;
S203: if the operation corresponding to the MCE error has a corresponding application program, sending the MCE error to the DCPMM recovery process;
S204: determining an error address corresponding to the MCE error by using the DCPMM recovery process;
S205: if the target data corresponding to the error address is stored in the cache storage, performing a flushing operation on the cache storage so as to flush the target data to the error address;
It should be noted that, in this embodiment, when the operation corresponding to the MCE error has a corresponding application program, for example when the operation is a memory read-write operation, the MCE error is sent to the DCPMM recovery process, which determines the error address corresponding to the current MCE error and then decides how to rewrite the data at that address. The process checks whether target data corresponding to the error address exists in the cache storage; if so, it performs a flushing operation on the cache storage so that the target data in the cache storage is rewritten to the current error address.
S206: if the storage area where the error address is located has mirror image storage, reading the target data from the mirror image storage and writing the target data into the error address;
It can be understood that, if the storage area where the error address is located has mirror storage, the embodiment of the present application may read the target data corresponding to the current error address from the mirror storage and rewrite the target data to the current error address, so that the write succeeds with a higher probability.
S207: if the error address belongs to the address in the software disk array, performing data recovery by using a RAID data recovery mechanism;
If the error address belongs to an address in a software disk array, such as RAID5 or RAID6, a general RAID data recovery mechanism can be used directly to complete the data recovery.
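For single-parity layouts such as RAID 5, a general recovery mechanism rebuilds a lost block by XOR-ing the surviving blocks of the stripe with the parity block. The sketch below shows only that arithmetic, assuming equally sized blocks and exactly one lost block per stripe; the function name and sample data are hypothetical. RAID 6 adds a second, differently computed parity block and is not covered here.

```python
from functools import reduce
from typing import List


def xor_blocks(blocks: List[bytes]) -> bytes:
    """XOR equally sized blocks byte by byte."""
    if not blocks:
        raise ValueError("need at least one block")
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("blocks must have equal length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


# Three data blocks of one stripe and their parity block.
data = [b"\x01\x02", b"\x10\x20", b"\xaa\x55"]
parity = xor_blocks(data)

# Suppose the block holding the error address (index 1) is unreadable:
# XOR of the surviving data blocks and the parity restores it.
rebuilt = xor_blocks([data[0], data[2], parity])
```

The rebuilt block can then be written back to the error address by the recovery process.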
S208: if the DCPMM recovery process fails to repair the MCE error, sending the MCE error to the user interaction process;
S209: and determining an error processing mode by utilizing the user interaction process so as to process the MCE error according to the error processing mode.
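Steps S205 to S208 can be read as a chain of data sources that the DCPMM recovery process consults for a good copy of the data. The following is a minimal sketch under that reading. The order of the checks, the dictionary-based stores and the `raid_rebuild` callback are assumptions made for illustration, not details given in the source.

```python
from typing import Callable, Dict, Optional


def repair(
    address: int,
    pmem: Dict[int, bytes],
    cache: Dict[int, bytes],
    mirror: Optional[Dict[int, bytes]] = None,
    raid_rebuild: Optional[Callable[[int], Optional[bytes]]] = None,
) -> bool:
    """Rewrite the data at `address` from the first source that has it.

    Returns True if the data was rewritten, False if no source could
    supply it (the error then goes to the user interaction process).
    """
    if address in cache:                      # S205: flush the cached copy
        pmem[address] = cache[address]
        return True
    if mirror is not None and address in mirror:   # S206: mirror storage
        pmem[address] = mirror[address]
        return True
    if raid_rebuild is not None:              # S207: RAID data recovery
        rebuilt = raid_rebuild(address)
        if rebuilt is not None:
            pmem[address] = rebuilt
            return True
    return False                              # S208: repair failed
```

A `False` result corresponds to S208, where the MCE error is handed over to the user interaction process.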
The embodiment of the application discloses another MCE error processing method, and compared with the previous embodiment, the embodiment further describes and optimizes the technical scheme. Referring to fig. 3, specifically:
S301: when the DCPMM device has an error, receiving an MCE error sent by a memory controller of the DCPMM device;
S302: if the operation corresponding to the MCE error does not have the corresponding application program, directly sending the MCE error to a user interaction process;
S303: if the operation corresponding to the MCE error has a corresponding application program, sending the MCE error to a DCPMM recovery process for repair processing;
S304: if the DCPMM recovery process fails to repair the MCE error, sending the MCE error to the user interaction process;
S305: determining a use mode of an error address corresponding to the MCE error by using the user interaction process;
In this embodiment of the application, if the operation corresponding to an MCE error has no corresponding application program, or the DCPMM recovery process fails to repair the MCE error, the MCE error may be sent to the user interaction process. Specifically, information such as the error address corresponding to the MCE error, the use mode to which the error address belongs, the address mapping type, and the related file type may be sent.
It should be noted that the DCPMM generally has two use modes: a memory mode and a persistent storage mode. In the memory mode, the DCPMM device is volatile, that is, all contents are cleared after the system is restarted, so an uncorrectable error in the memory mode automatically disappears after the system is restarted. In the persistent storage mode, the DCPMM device acts as persistent storage, and the data in the DCPMM device does not change after the system is shut down or restarted.
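The difference between the two use modes can be stated as a rule about which poisoned addresses survive a restart. The sketch below encodes only that rule; `UsageMode` and `poisoned_after_restart` are hypothetical names.

```python
from enum import Enum


class UsageMode(Enum):
    MEMORY = "memory"            # volatile: contents cleared on restart
    PERSISTENT = "persistent"    # contents survive shutdown and restart


def poisoned_after_restart(poisoned: set, mode: UsageMode) -> set:
    """Return the poisoned addresses that still exist after a restart."""
    if mode is UsageMode.MEMORY:
        return set()          # everything is cleared, errors disappear
    return set(poisoned)      # persistent data, and its errors, remain
```

This is why a restart is offered as a remedy only in the memory mode, while the persistent storage mode needs the data itself to be deleted or overwritten.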
S306: and providing a corresponding error processing mode by using the user interaction process according to the use mode, and receiving a selection instruction aiming at the error processing mode through a preset interface so as to process the MCE error according to the error processing mode selected by the selection instruction.
As a feasible implementation manner, if the use mode to which the error address corresponding to the MCE error belongs is the memory mode, that is, the error will automatically disappear after the system is restarted or shut down, a first type of processing mode may be provided by the user interaction process. The first type of processing mode may include, but is not limited to: ignoring the current MCE error and continuing system operation; or restarting the system. That is, if the current error does not affect subsequent operation or use of the system, the current MCE error can be ignored, the MCE processing mechanism exited, and system operation continued; if the current error does affect subsequent operation or use of the system, the MCE processing mechanism can be exited and the system restarted after important system information has been saved.
If the use mode to which the error address corresponding to the MCE error belongs is the persistent storage mode, that is, the error still exists after the system is restarted or shut down, a second type of processing mode can be provided to the user by the user interaction process. The second type of processing mode may include, but is not limited to: deleting the current data at the error address; or overwriting the current data at the error address. When the current data at the error address is overwritten, a specific value may be used, for example the value 0; other files may also be used, for example the last backup of the file corresponding to the error address. The overwrite method may be a system default or may be specified by the user as required, and is not specifically limited here.
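The two overwrite variants, a fixed value or the last backup, can be sketched on an in-memory buffer. This is an illustration only: the `overwrite` function is hypothetical, a `bytearray` stands in for the persistent region, and the backup is assumed to have the same layout as the region.

```python
from typing import Optional


def overwrite(region: bytearray, offset: int, length: int,
              backup: Optional[bytes] = None) -> None:
    """Overwrite `length` bytes at `offset` in place.

    Without a backup the range is filled with zero bytes; with a backup
    the same range is copied from it.
    """
    end = offset + length
    if offset < 0 or length < 0 or end > len(region):
        raise ValueError("range outside region")
    if backup is None:
        region[offset:end] = bytes(length)      # fill with the value 0
    else:
        if len(backup) < end:
            raise ValueError("backup too short")
        region[offset:end] = backup[offset:end]  # restore from backup


damaged = bytearray(b"ABxxxFGH")
overwrite(damaged, 2, 3, backup=b"ABCDEFGH")
```

After the call, `damaged` again holds the eight bytes of the backup; calling `overwrite` without `backup` would zero the three damaged bytes instead.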
Specifically, a user interaction process may be utilized to provide a corresponding error handling mode for a user, and a selection instruction for the error handling mode issued by the user is received through a preset interface, so that the MCE error may be handled according to the error handling mode selected by the selection instruction.
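Receiving a selection instruction amounts to offering the options that match the use mode and validating the choice. The following is a minimal sketch; the `OPTIONS` table, its keys and the `select` function are hypothetical, and a real preset interface would be a graphical or command-line front end.

```python
from typing import Dict, List

# Options offered per use mode, following the two types of processing
# mode in the description. Keys and labels are invented for illustration.
OPTIONS: Dict[str, List[str]] = {
    "memory": ["ignore and continue", "restart system"],
    "persistent": ["delete data", "overwrite data"],
}


def select(mode: str, choice: int) -> str:
    """Validate a selection instruction and return the chosen handling."""
    options = OPTIONS[mode]
    if not 0 <= choice < len(options):
        raise ValueError(f"choice must be in 0..{len(options) - 1}")
    return options[choice]
```

For example, `select("memory", 1)` returns `"restart system"`, and an out-of-range choice is rejected rather than silently mapped to a default.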
In the following, an MCE error processing apparatus provided in an embodiment of the present application is introduced, and an MCE error processing apparatus described below and an MCE error processing method described above may be referred to each other.
Referring to fig. 4, an MCE error processing apparatus provided in an embodiment of the present application includes:
an error receiving module 401, configured to receive, when an error occurs in the DCPMM device, an MCE error sent by a memory controller of the DCPMM device;
a first sending module 402, configured to directly send the MCE error to a user interaction process if the operation corresponding to the MCE error does not have a corresponding application program;
a second sending module 403, configured to send the MCE error to a DCPMM recovery process for repair if the operation corresponding to the MCE error has a corresponding application program;
a third sending module 404, configured to send the MCE error to the user interaction process if the DCPMM recovery process fails to repair the MCE error;
an error processing module 405, configured to determine an error processing manner by using the user interaction process, so as to process the MCE error according to the error processing manner.
For the specific implementation process of the modules 401 to 405, reference may be made to the corresponding content disclosed in the foregoing embodiments, and details are not repeated here.
The present application further provides an electronic device, and as shown in fig. 5, an electronic device provided in an embodiment of the present application includes:
a memory 100 for storing a computer program;
the processor 200, when executing the computer program, may implement the steps provided by the above embodiments.
Specifically, the memory 100 includes a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and computer-readable instructions, and the internal memory provides an environment for the operating system and the computer-readable instructions in the non-volatile storage medium to run. The processor 200 may be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor or other data Processing chip in some embodiments, and provides computing and controlling capabilities for an electronic device, and when executing the computer program stored in the memory 100, the steps of the MCE error Processing method provided in any of the foregoing embodiments may be implemented.
On the basis of the above embodiment, as a preferred implementation, referring to fig. 6, the electronic device further includes:
and an input interface 300 connected to the processor 200, for acquiring computer programs, parameters and instructions imported from the outside, and storing the computer programs, parameters and instructions into the memory 100 under the control of the processor 200. The input interface 300 may be connected to an input device for receiving parameters or instructions manually input by a user. The input device may be a touch layer covered on a display screen, or a button, a track ball or a touch pad arranged on a terminal shell, or a keyboard, a touch pad or a mouse, etc.
And a display unit 400 connected to the processor 200 for displaying data processed by the processor 200 and for displaying a visualized user interface. The display unit 400 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch panel, or the like.
And a network port 500 connected to the processor 200 for performing communication connection with each external terminal device. The communication technology adopted by the communication connection can be a wired communication technology or a wireless communication technology, such as Mobile High-Definition Link (MHL), Universal Serial Bus (USB), High-Definition Multimedia Interface (HDMI), Wireless Fidelity (WiFi), Bluetooth, Bluetooth Low Energy, an IEEE 802.11s-based communication technology, and the like.
While FIG. 6 shows only an electronic device having the components 100 to 500, those skilled in the art will appreciate that the configuration shown in FIG. 6 does not limit the electronic device, which may include fewer or more components than those shown, combine some components, or arrange the components differently.
The present application also provides a computer-readable storage medium, which may be any medium capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disc. A computer program is stored on the storage medium and, when executed by a processor, implements the steps of the MCE error processing method provided in any of the foregoing embodiments.
According to the present application, after an error occurs in the DCPMM device, the memory controller sends an MCE error. Depending on whether the operation corresponding to the MCE error has a corresponding application program, the MCE error is either sent to the DCPMM recovery process for repair or sent to the user interaction process, which determines the error processing mode and processes the MCE error. In this way, MCE errors generated by the DCPMM device are handled.
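The routing summarized above can be illustrated with a minimal Python sketch. It is not the implementation of the present application: the names `MceError`, `UsageMode`, `options_for`, and `dispatch`, and the callbacks `try_repair` and `choose`, are hypothetical stand-ins for the DCPMM recovery process and the user interaction process. The option lists follow the two usage modes recited in claims 4 and 5.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional


class UsageMode(Enum):
    MEMORY = "memory"          # address is used as volatile memory
    PERSISTENT = "persistent"  # address is used as persistent storage


@dataclass
class MceError:
    address: int
    app_id: Optional[str]  # None when no application issued the operation (e.g. address polling)
    usage_mode: UsageMode


def options_for(mode: UsageMode) -> List[str]:
    """Processing modes offered by the (assumed) user interaction process."""
    if mode is UsageMode.MEMORY:
        return ["ignore", "reboot"]
    return ["delete", "overwrite"]


def dispatch(
    err: MceError,
    try_repair: Callable[[MceError], bool],
    choose: Callable[[List[str]], str],
) -> str:
    """Route one MCE error: attempt recovery first only if an application is involved."""
    if err.app_id is not None and try_repair(err):
        return "repaired"
    # No application, or repair failed: fall back to the user interaction path.
    return choose(options_for(err.usage_mode))
```

In this sketch an error without an application never reaches `try_repair`, which mirrors the "directly sending" wording; an error whose repair fails takes the same user interaction path as a polling error.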
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments may be cross-referenced. Because the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant details can be found in the description of the method. It should be noted that those skilled in the art may make improvements and modifications to the present application without departing from its principle, and such improvements and modifications also fall within the scope of the claims of the present application.
It is further noted that, in the present specification, relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Claims (8)

1. An MCE error handling method, comprising:
when the DCPMM device has an error, receiving an MCE error sent by a memory controller of the DCPMM device;
if the operation corresponding to the MCE error does not have the corresponding application program, directly sending the MCE error to a user interaction process;
if the operation corresponding to the MCE error has a corresponding application program, sending the MCE error to a DCPMM recovery process for repair processing; if the operation corresponding to the MCE error is a memory read-write operation, sending the MCE error to an application program corresponding to the operation; determining corresponding detailed description information according to the error address corresponding to the MCE error by using the application program, and sending the detailed description information to the DCPMM recovery process so that the DCPMM recovery process can repair the MCE error by using a data protection mechanism;
if the DCPMM recovery process fails to repair the MCE error, sending the MCE error to the user interaction process;
determining a use mode of an error address corresponding to the MCE error by using the user interaction process;
and providing a corresponding error processing mode by using the user interaction process according to the use mode, and receiving a selection instruction aiming at the error processing mode through a preset interface so as to process the MCE error according to the error processing mode selected by the selection instruction.
2. The MCE error handling method according to claim 1, wherein sending the MCE error to a DCPMM recovery process for repair processing comprises:
sending the MCE error to the DCPMM recovery process;
determining an error address corresponding to the MCE error by using the DCPMM recovery process;
if the target data corresponding to the error address is stored in the cache storage, performing a flushing operation on the cache storage so as to flush the target data to the error address;
if the storage area where the error address is located has mirror image storage, reading the target data from the mirror image storage and writing the target data into the error address;
and if the error address belongs to the address in the software disk array, performing data recovery by using a RAID data recovery mechanism.
3. The MCE error handling method according to claim 1, wherein directly sending the MCE error to a user interaction process if the operation corresponding to the MCE error does not have a corresponding application program comprises:
and if the operation corresponding to the MCE error is a memory address polling operation, directly sending the MCE error to a user interaction process.
4. The MCE error handling method according to any one of claims 1 to 3, wherein the providing a corresponding error handling manner by using the user interaction process according to the usage pattern includes:
if the use mode is a memory mode, providing a first type of processing mode by using the user interaction process; the first type of processing mode comprises the following steps: ignoring the current MCE error and continuing the system operation; and restarting the system.
5. The MCE error handling method according to any one of claims 1 to 3, wherein the providing a corresponding error handling manner by using the user interaction process according to the usage pattern includes:
if the use mode is a persistent storage mode, providing a second type of processing mode by utilizing the user interaction process; the second type of processing mode comprises the following steps: deleting the current data of the error address; and overwriting the current data of the error address.
6. An MCE error handling apparatus, comprising:
the error receiving module is used for receiving MCE errors sent by a memory controller of the DCPMM device after the DCPMM device generates errors;
the first sending module is used for directly sending the MCE error to a user interaction process if the operation corresponding to the MCE error does not have a corresponding application program;
the second sending module is used for sending the MCE error to a DCPMM recovery process for repair if the operation corresponding to the MCE error has a corresponding application program; if the operation corresponding to the MCE error is a memory read-write operation, sending the MCE error to an application program corresponding to the operation; determining corresponding detailed description information according to the error address corresponding to the MCE error by using the application program, and sending the detailed description information to the DCPMM recovery process so that the DCPMM recovery process can repair the MCE error by using a data protection mechanism;
a third sending module, configured to send the MCE error to the user interaction process if the DCPMM recovery process fails to repair the MCE error;
the error processing module is used for determining the use mode of the error address corresponding to the MCE error by utilizing the user interaction process; and providing a corresponding error processing mode by using the user interaction process according to the use mode, and receiving a selection instruction aiming at the error processing mode through a preset interface so as to process the MCE error according to the error processing mode selected by the selection instruction.
7. An electronic device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the MCE error handling method as claimed in any one of claims 1 to 5 when executing the computer program.
8. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program which, when being executed by a processor, carries out the steps of the MCE error handling method according to any one of claims 1 to 5.
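As an illustration of the repair sequence recited in claim 2 (cache flush, mirror copy, software RAID recovery), the following Python sketch tries each data protection mechanism in turn. It is a sketch under stated assumptions, not the claimed implementation: `RepairStep` and `repair` are hypothetical names, the predicates and actions are supplied by the caller, and treating the three conditions as an ordered fallback chain is an assumption, since the claim lists them without fixing an order.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class RepairStep:
    name: str
    applies: Callable[[int], bool]  # does this mechanism cover the error address?
    run: Callable[[int], bool]      # returns True when the data was restored


def repair(address: int, steps: List[RepairStep]) -> Optional[str]:
    """Return the name of the first step that restores the address, else None."""
    for step in steps:
        if step.applies(address) and step.run(address):
            return step.name
    # No mechanism succeeded; the caller would hand the error to user interaction.
    return None
```

A return value of `None` corresponds to the case in claim 1 where the DCPMM recovery process fails and the MCE error is sent to the user interaction process.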
CN201911328687.0A 2019-12-20 2019-12-20 MCE error processing method and device, electronic equipment and storage medium Active CN111143125B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911328687.0A CN111143125B (en) 2019-12-20 2019-12-20 MCE error processing method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111143125A CN111143125A (en) 2020-05-12
CN111143125B true CN111143125B (en) 2022-04-22

Family

ID=70519174

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911328687.0A Active CN111143125B (en) 2019-12-20 2019-12-20 MCE error processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111143125B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113127388A (en) * 2021-04-13 2021-07-16 郑州云海信息技术有限公司 Metadata writing method and related device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102308285A (en) * 2011-07-26 2012-01-04 华为技术有限公司 Memory bug application of application program
CN103593251A (en) * 2013-11-07 2014-02-19 浪潮电子信息产业股份有限公司 Fault-tolerant system based on process redundancy and design method thereof
CN105354101A (en) * 2015-10-09 2016-02-24 上海瀚银信息技术有限公司 Operation error processing method and system and intelligent terminal
CN105589762A (en) * 2014-08-19 2016-05-18 三星电子株式会社 Memory Devices, Memory Modules And Method For Correction
CN105893166A (en) * 2016-04-29 2016-08-24 浪潮电子信息产业股份有限公司 Method and device for processing memory errors

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7949904B2 (en) * 2005-05-04 2011-05-24 Microsoft Corporation System and method for hardware error reporting and recovery
US9934086B2 (en) * 2016-06-06 2018-04-03 Micron Technology, Inc. Apparatuses and methods for selective determination of data error repair
US20190034252A1 (en) * 2017-07-28 2019-01-31 Hewlett Packard Enterprise Development Lp Processor error event handler

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
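Research and Implementation of a Cluster-Oriented Computer Fault Management System (面向集群结构的计算机故障管理系统的研究与实现); Cheng Long (程龙); China Excellent Doctoral and Master's Theses Full-text Database (Master's), Information Science and Technology Series; 2016-03-15; Vol. 2016 (No. 03); full text *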

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant