CN111143125B - MCE error processing method and device, electronic equipment and storage medium
- Publication number
- CN111143125B CN201911328687.0A
- Authority
- CN
- China
- Prior art keywords
- error
- mce
- dcpmm
- user interaction
- mce error
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/14—Error detection or correction of the data by redundancy in operation
- G06F11/1479—Generic software techniques for error detection or fault masking
Abstract
The application discloses an MCE error processing method, apparatus, device and medium. The method comprises the following steps: receiving an MCE error sent by a memory controller of a DCPMM device; if the operation corresponding to the MCE error has no corresponding application program, sending the MCE error to a user interaction process; if the operation corresponding to the MCE error has a corresponding application program, sending the MCE error to a DCPMM recovery process for repair processing; if the DCPMM recovery process fails to repair the MCE error, sending the MCE error to the user interaction process; and determining an error processing mode by using the user interaction process, and processing the MCE error according to that error processing mode. Depending on whether the operation corresponding to the MCE error has a corresponding application program, the MCE error is sent either to the DCPMM recovery process or to the user interaction process for error processing, thereby achieving MCE error processing for the DCPMM device.
Description
Technical Field
The present application relates to the field of computer technologies, and in particular, to an MCE error processing method and apparatus, an electronic device, and a computer-readable storage medium.
Background
Modern processors provide an MCA (Machine Check Architecture). When a system hardware error occurs, the processor records the relevant information in an MSR (Model Specific Register) and raises an MCE (Machine Check Exception) to the operating system. The MCE handler in the operating system is responsible for processing the MCE. If the hardware error may be recoverable by software, that is, it is of type SRAR (Software Recoverable Action Required), the MCE handler executes the corresponding code to try to repair it. If the repair fails, or there is no corresponding repair code, or the hardware error is not recoverable by software, the operating system logs the hardware error and enters an error state.
DCPMM (Data Center Persistent Memory Module) is a persistent memory device that uses the physical form factor of a DIMM (Dual In-line Memory Module) and offers advantages such as large capacity, long service life, and byte-addressable access. However, compared with DRAM (Dynamic Random Access Memory), the memory cells of a DCPMM device are more prone to errors. How to process the MCE signals generated by a DCPMM device is therefore a problem to be solved by those skilled in the art.
Disclosure of Invention
The application aims to provide an MCE error processing method and apparatus, an electronic device, and a computer-readable storage medium for processing MCE errors generated by a DCPMM device.
In order to achieve the above object, the present application provides an MCE error handling method, including:
when the DCPMM device has an error, receiving an MCE error sent by a memory controller of the DCPMM device;
if the operation corresponding to the MCE error does not have the corresponding application program, directly sending the MCE error to a user interaction process;
if the operation corresponding to the MCE error has a corresponding application program, sending the MCE error to a DCPMM recovery process for repair processing;
if the DCPMM recovery process fails to repair the MCE error, sending the MCE error to the user interaction process;
and determining an error processing mode by utilizing the user interaction process so as to process the MCE error according to the error processing mode.
Optionally, if the operation corresponding to the MCE error has a corresponding application program, sending the MCE error to a DCPMM recovery process for repair processing, where the repair processing includes:
if the operation corresponding to the MCE error is a memory read-write operation, the MCE error is sent to an application program corresponding to the operation;
and determining corresponding detailed description information by using the application program according to the error address corresponding to the MCE error, and sending the detailed description information to the DCPMM recovery process, so that the DCPMM recovery process can repair the MCE error by using a data protection mechanism.
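The step in which the application program derives detailed description information from the error address can be illustrated as a range lookup. The following is a minimal sketch only: the application does not specify how the application program stores this information, so the `Region` and `RegionTable` names, the fields, and the dictionary returned by `describe` are all hypothetical.

```python
import bisect
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Region:
    """Hypothetical record of one persistent-memory range owned by the application."""
    start: int   # first address of the range (inclusive)
    length: int  # size of the range in bytes
    label: str   # what the application stores there


class RegionTable:
    """Hypothetical application-side table mapping address ranges to descriptions."""

    def __init__(self, regions: List[Region]):
        self._regions = sorted(regions, key=lambda r: r.start)
        self._starts = [r.start for r in self._regions]

    def describe(self, error_addr: int) -> Optional[dict]:
        """Return description info for error_addr, or None if no range contains it."""
        i = bisect.bisect_right(self._starts, error_addr) - 1
        if i < 0:
            return None
        region = self._regions[i]
        if error_addr >= region.start + region.length:
            return None
        return {
            "error_addr": error_addr,
            "region": region.label,
            "offset": error_addr - region.start,
        }
```

In such a sketch, the dictionary returned by `describe` would be what the application program passes to the DCPMM recovery process together with the error address; a `None` result would mean the application program has nothing further to report.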
Optionally, sending the MCE error to a DCPMM recovery process for repair processing includes:
sending the MCE error to the DCPMM recovery process;
determining an error address corresponding to the MCE error by using the DCPMM recovery process;
if the target data corresponding to the error address is stored in the cache storage, performing a flushing operation on the cache storage so as to flush the target data to the error address;
if the storage area where the error address is located has mirror image storage, reading the target data from the mirror image storage and writing the target data into the error address;
and if the error address belongs to the address in the software disk array, performing data recovery by using a RAID data recovery mechanism.
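The three repair paths above can be read as a chain of attempts by the DCPMM recovery process. The sketch below is illustrative only and makes several assumptions: memory, cache and mirror are modeled as plain dictionaries, the RAID path is an injected callback, the names `RecoveryContext` and `repair` are hypothetical, and the order in which the paths are tried is a choice of the sketch, since the application lists them as separate conditions without stating a priority.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class RecoveryContext:
    """Hypothetical view of the state available to the DCPMM recovery process."""
    media: Dict[int, bytes]                                  # address -> data on the DCPMM
    cache: Dict[int, bytes] = field(default_factory=dict)    # cached copies of target data
    mirror: Dict[int, bytes] = field(default_factory=dict)   # mirrored copies of target data
    raid_rebuild: Optional[Callable[[int], Optional[bytes]]] = None  # RAID recovery hook


def repair(ctx: RecoveryContext, addr: int) -> Optional[str]:
    """Try each data protection mechanism; return the name of the one that succeeded."""
    if addr in ctx.cache:
        # Target data is still cached: flush it back to the error address.
        ctx.media[addr] = ctx.cache[addr]
        return "cache-flush"
    if addr in ctx.mirror:
        # The storage area is mirrored: copy the target data from the mirror.
        ctx.media[addr] = ctx.mirror[addr]
        return "mirror"
    if ctx.raid_rebuild is not None:
        # The address belongs to a software disk array: use RAID data recovery.
        data = ctx.raid_rebuild(addr)
        if data is not None:
            ctx.media[addr] = data
            return "raid"
    return None  # repair failed; the MCE error goes to the user interaction process
```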
Optionally, if there is no corresponding application program in the operation corresponding to the MCE error, directly sending the MCE error to the user interaction process, including:
and if the operation corresponding to the MCE error is a memory patrol operation, directly sending the MCE error to the user interaction process.
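The routing decision by operation type can be sketched as a small classification function. This is an illustration under stated assumptions: the enumeration names are hypothetical, and the three operation kinds are the ones named in the application (memory read/write, memory patrol or polling, and generic undefined request), which itself says the list is not exhaustive.

```python
from enum import Enum, auto


class Operation(Enum):
    """Operation kinds named in the application; the names here are hypothetical."""
    MEMORY_READ_WRITE = auto()
    MEMORY_PATROL = auto()
    GENERIC_UNDEFINED = auto()


class FirstHop(Enum):
    """Where the MCE handler sends the MCE error first."""
    DCPMM_RECOVERY = auto()
    USER_INTERACTION = auto()


def first_hop(op: Operation) -> FirstHop:
    """Only a memory read/write operation has a corresponding application program."""
    if op is Operation.MEMORY_READ_WRITE:
        return FirstHop.DCPMM_RECOVERY
    return FirstHop.USER_INTERACTION
```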
Optionally, the determining, by using the user interaction process, an error handling manner so as to handle the MCE error according to the error handling manner includes:
determining a use mode of an error address corresponding to the MCE error by using the user interaction process;
and providing a corresponding error processing mode by using the user interaction process according to the use mode, and receiving a selection instruction aiming at the error processing mode through a preset interface so as to process the MCE error according to the error processing mode selected by the selection instruction.
Optionally, the providing, according to the usage mode, a corresponding error handling manner by using the user interaction process includes:
if the use mode is a memory mode, providing a first type of processing mode by using the user interaction process; the first type of processing mode comprises: ignoring the current MCE error and continuing system operation; or restarting the system.
Optionally, the providing, according to the usage mode, a corresponding error handling manner by using the user interaction process includes:
if the use mode is a persistent storage mode, providing a second type of processing mode by utilizing the user interaction process; the second type of processing mode comprises the following steps: deleting the current data of the error address; and overwriting the current data of the error address.
To achieve the above object, the present application provides an MCE error processing apparatus, comprising:
the error receiving module is used for receiving MCE errors sent by a memory controller of the DCPMM device after the DCPMM device generates errors;
the first sending module is used for directly sending the MCE error to a user interaction process if the operation corresponding to the MCE error does not have a corresponding application program;
the second sending module is used for sending the MCE error to a DCPMM recovery process for repair if the operation corresponding to the MCE error has a corresponding application program;
a third sending module, configured to send the MCE error to the user interaction process if the DCPMM recovery process fails to repair the MCE error;
and the error processing module is used for determining an error processing mode by utilizing the user interaction process so as to process the MCE error according to the error processing mode.
To achieve the above object, the present application provides an electronic device including:
a memory for storing a computer program;
a processor for implementing the steps of any one of the MCE error handling methods disclosed above when executing the computer program.
To achieve the above object, the present application provides a computer-readable storage medium having a computer program stored thereon, which when executed by a processor, implements the steps of any one of the MCE error handling methods disclosed in the foregoing.
According to the above scheme, the MCE error processing method provided by the present application includes: when the DCPMM device has an error, receiving an MCE error sent by a memory controller of the DCPMM device; if the operation corresponding to the MCE error does not have the corresponding application program, directly sending the MCE error to a user interaction process; if the operation corresponding to the MCE error has a corresponding application program, sending the MCE error to a DCPMM recovery process for repair processing; if the DCPMM recovery process fails to repair the MCE error, sending the MCE error to the user interaction process; and determining an error processing mode by utilizing the user interaction process so as to process the MCE error according to the error processing mode. Therefore, according to the application, after the DCPMM device generates an error, the memory controller sends an MCE error. Depending on whether the operation corresponding to the MCE error has a corresponding application program, the MCE error is either sent to the DCPMM recovery process for repair or sent to the user interaction process, which determines the error processing mode and processes the MCE error. Processing of MCE errors generated by the DCPMM device is thereby realized.
The application also discloses an MCE error processing device, an electronic device and a computer readable storage medium, and the technical effects can be realized.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of an MCE error handling method disclosed in an embodiment of the present application;
Fig. 2 is a flowchart of another MCE error handling method disclosed in an embodiment of the present application;
Fig. 3 is a flowchart of another MCE error handling method disclosed in an embodiment of the present application;
Fig. 4 is a structural diagram of an MCE error handling apparatus disclosed in an embodiment of the present application;
Fig. 5 is a block diagram of an electronic device disclosed in an embodiment of the present application;
Fig. 6 is a block diagram of another electronic device disclosed in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the prior art, compared with a DRAM memory, a memory cell of a DCPMM device is more prone to error, and the DCPMM device is a novel device, and how to process an MCE signal generated by the DCPMM device is a problem to be solved by those skilled in the art.
Therefore, the embodiment of the application discloses an MCE error processing method, which realizes the processing of MCE errors generated by a DCPMM device.
Referring to fig. 1, an MCE error handling method disclosed in an embodiment of the present application includes:
S101: when the DCPMM device has an error, receiving an MCE error sent by a memory controller of the DCPMM device;
In a specific implementation, because the memory cells of a DCPMM device are more prone to errors, ECC checking is required for data reads and writes on the DCPMM device. When an uncorrectable error occurs in the data of a memory cell, the DCPMM device marks that data as poisoned and notifies the memory controller through a DDRT bus signal. On learning that an uncorrectable error has occurred on the DCPMM device, the memory controller generates an MSMI signal and sends it to the BIOS, and generates an MCE signal and sends it to the operating system.
S102: if the operation corresponding to the MCE error does not have the corresponding application program, directly sending the MCE error to a user interaction process;
S103: if the operation corresponding to the MCE error has a corresponding application program, sending the MCE error to a DCPMM recovery process for repair processing;
After the operating system receives the MCE error, the MCE handler, which is responsible for processing the MCE signal, is used to determine the corresponding error type.
For MCE errors generated by DCPMM devices, the memory controller error types may include, but are not limited to, memory read/write errors, memory patrol errors, and generic undefined requests. When the MCE handler determines that the operation corresponding to the current MCE error has no corresponding application program, that is, the operation is a memory patrol operation or a generic undefined request, no further detailed information about the error address corresponding to the current MCE error can be obtained, and the MCE error is therefore sent directly to the user interaction process. When the MCE handler determines that the operation corresponding to the current MCE error has a corresponding application program, that is, the operation is a memory read/write operation, the MCE error can be sent to the DCPMM recovery process for repair processing.
Specifically, the specific process of sending the MCE error to the DCPMM recovery process for repair processing may be: sending the MCE error to an application program corresponding to the operation; and determining corresponding detailed description information according to the error address corresponding to the MCE error by using the application program, and sending the detailed description information to the DCPMM recovery process so that the DCPMM recovery process can repair the current MCE error by using a data protection mechanism.
It will be appreciated that the above-described application is specifically a DCPMM-friendly application. After the application program receives a signal corresponding to an MCE error sent by an MCE handler, first, detailed description information related to the MCE error can be obtained according to an error address corresponding to the current MCE error, and the detailed description information and the error address are sent to a DCPMM recovery process together, so that the DCPMM recovery process can process the MCE error. If the processing is successful, the MCE processing mechanism may be exited.
S104: if the DCPMM recovery process fails to repair the MCE error, sending the MCE error to the user interaction process;
It should be noted that, if the MCE error is not successfully repaired by the DCPMM recovery process, the MCE error that has failed to be repaired may be sent to the user interaction process for processing.
S105: and determining an error processing mode by utilizing the user interaction process so as to process the MCE error according to the error processing mode.
In the embodiment of the application, detailed description information of the current MCE error can be provided to a user by using a user interaction process, and the user selects a processing mode for the current MCE error. Specifically, detailed description information of the current MCE error can be displayed through a preset interactive interface, and further, all selectable processing modes can be displayed, and a user can select the error processing mode through clicking or touching and the like so as to process the MCE error according to the processing mode.
According to the above scheme, the MCE error processing method provided by the present application includes: when the DCPMM device has an error, receiving an MCE error sent by a memory controller of the DCPMM device; if the operation corresponding to the MCE error does not have the corresponding application program, directly sending the MCE error to a user interaction process; if the operation corresponding to the MCE error has a corresponding application program, sending the MCE error to a DCPMM recovery process for repair processing; if the DCPMM recovery process fails to repair the MCE error, sending the MCE error to the user interaction process; and determining an error processing mode by utilizing the user interaction process so as to process the MCE error according to the error processing mode. Therefore, according to the application, after the DCPMM device generates an error, the memory controller sends an MCE error. Depending on whether the operation corresponding to the MCE error has a corresponding application program, the MCE error is either sent to the DCPMM recovery process for repair or sent to the user interaction process, which determines the error processing mode and processes the MCE error. Processing of MCE errors generated by the DCPMM device is thereby realized.
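The overall flow of steps S101 to S105 can be sketched as a single dispatch function. This is a minimal illustration, not the implementation of the application: the `MceError` fields, the function name `handle_mce`, and the two callbacks standing in for the DCPMM recovery process and the user interaction process are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class MceError:
    """Hypothetical representation of an MCE error received in S101."""
    address: int           # error address reported with the MCE
    has_application: bool  # True when the operation has a corresponding application program


def handle_mce(
    err: MceError,
    try_repair: Callable[[MceError], bool],  # stands in for the DCPMM recovery process
    ask_user: Callable[[MceError], str],     # stands in for the user interaction process
) -> str:
    """Route one MCE error and report how it was handled."""
    # S103: only errors with a corresponding application program go to recovery.
    if err.has_application and try_repair(err):
        return "repaired"
    # S102 (no application program) or S104 (repair failed): go to the user.
    mode = ask_user(err)
    # S105: the error is processed according to the mode chosen by the user.
    return f"user:{mode}"
```

Because the recovery callback is only invoked when `has_application` is true, a patrol error never reaches the recovery process in this sketch, which matches the "directly sending" wording of S102.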
The embodiment of the application discloses another MCE error processing method, and compared with the previous embodiment, the embodiment further describes and optimizes the repair processing process of the DCPMM recovery process. Referring to fig. 2, specifically:
S201: when the DCPMM device has an error, receiving an MCE error sent by a memory controller of the DCPMM device;
S202: if the operation corresponding to the MCE error does not have the corresponding application program, directly sending the MCE error to a user interaction process;
S203: if the operation corresponding to the MCE error has a corresponding application program, sending the MCE error to the DCPMM recovery process;
S204: determining an error address corresponding to the MCE error by using the DCPMM recovery process;
S205: if the target data corresponding to the error address is stored in the cache storage, performing a flushing operation on the cache storage so as to flush the target data to the error address;
It should be noted that, in the embodiment of the present application, when the operation corresponding to the MCE error has a corresponding application program, for example when the operation is a memory read/write operation, the MCE error is sent to the DCPMM recovery process, which determines the error address corresponding to the current MCE error and then decides how to rewrite the data at that address. The recovery process checks whether target data corresponding to the error address exists in the cache storage; if so, it performs a flushing operation on the cache storage so that the target data in the cache storage is rewritten to the current error address.
S206: if the storage area where the error address is located has mirror image storage, reading the target data from the mirror image storage and writing the target data into the error address;
It can be understood that, if the storage area where the error address is located has mirror storage, the embodiment of the present application may read the target data corresponding to the current error address from the mirror storage and rewrite it to the current error address, so that the write succeeds with a higher probability.
S207: if the error address belongs to the address in the software disk array, performing data recovery by using a RAID data recovery mechanism;
If the error address belongs to an address in a software disk array, such as RAID5 or RAID6, data recovery can be performed directly by using a general RAID data recovery mechanism.
S208: if the DCPMM recovery process fails to repair the MCE error, sending the MCE error to the user interaction process;
S209: and determining an error processing mode by utilizing the user interaction process so as to process the MCE error according to the error processing mode.
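The application refers only to "a general RAID data recovery mechanism" and gives no details. As one well-known instance, single-parity recovery as used in RAID5 rebuilds a lost block by XOR-ing the surviving data blocks of the stripe with the parity block. The sketch below shows only that arithmetic; the function names are hypothetical, and RAID6, which uses a second syndrome, is not covered.

```python
from functools import reduce
from typing import List


def xor_blocks(blocks: List[bytes]) -> bytes:
    """XOR equal-length blocks byte by byte."""
    if not blocks:
        raise ValueError("at least one block is required")
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving: List[bytes]) -> bytes:
    """Rebuild the one lost block of a stripe from the surviving data and parity blocks."""
    # parity = d0 ^ d1 ^ ... ^ dn, so XOR of everything except the lost block yields it.
    return xor_blocks(surviving)
```

For example, with data blocks `d0`, `d1`, `d2` and parity `p = xor_blocks([d0, d1, d2])`, the call `rebuild_missing([d0, d2, p])` returns `d1`, which could then be written back to the error address.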
The embodiment of the application discloses another MCE error processing method, and compared with the previous embodiment, the embodiment further describes and optimizes the technical scheme. Referring to fig. 3, specifically:
S301: when the DCPMM device has an error, receiving an MCE error sent by a memory controller of the DCPMM device;
S302: if the operation corresponding to the MCE error does not have the corresponding application program, directly sending the MCE error to a user interaction process;
S303: if the operation corresponding to the MCE error has a corresponding application program, sending the MCE error to a DCPMM recovery process for repair processing;
S304: if the DCPMM recovery process fails to repair the MCE error, sending the MCE error to the user interaction process;
S305: determining a use mode of an error address corresponding to the MCE error by using the user interaction process;
In this embodiment of the application, if the operation corresponding to the MCE error has no corresponding application program, or the DCPMM recovery process fails to repair the MCE error, the MCE error may be sent to the user interaction process. Specifically, information such as the error address corresponding to the MCE error, the usage mode to which the error address belongs, the address mapping type, and the related file type may be sent.
It should be noted that a DCPMM device generally has two usage modes: a memory mode and a persistent storage mode. In the memory mode, the DCPMM device is volatile, that is, all contents are cleared after the system is restarted, so an uncorrectable error in the memory mode disappears automatically after a restart. In the persistent storage mode, the DCPMM device acts as persistent storage, and the data in it does not change after the system is shut down or restarted.
S306: and providing a corresponding error processing mode by using the user interaction process according to the use mode, and receiving a selection instruction aiming at the error processing mode through a preset interface so as to process the MCE error according to the error processing mode selected by the selection instruction.
As a feasible implementation, if the usage mode to which the error address corresponding to the MCE error belongs is the memory mode, that is, the error will disappear automatically after the system is restarted or shut down, a first type of processing mode may be provided by using the user interaction process. The first type of processing mode may include, but is not limited to: ignoring the current MCE error and continuing system operation; or restarting the system. That is, if the current error has no influence on the subsequent operation or use of the system, the current MCE error can be ignored, the MCE processing mechanism exited, and system operation continued; if the current error does influence the subsequent operation or use of the system, the MCE processing mechanism can be exited and the system restarted after important system information has been saved.
If the usage mode to which the error address corresponding to the MCE error belongs is the persistent storage mode, that is, the error still exists after the system is restarted or shut down, a second type of processing mode can be provided to the user by using the user interaction process. The second type of processing mode may include, but is not limited to: deleting the current data at the error address; or overwriting the current data at the error address. When the current data at the error address is overwritten, a specific value may be used, for example the value 0; other files may also be used, for example the latest backup of the file corresponding to the error address. The overwriting manner may be a system default or may be specified by the user as required, and is not specifically limited herein.
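The two overwriting options, a fixed value such as 0 or the content of a backup, can be sketched as follows. This is illustrative only: the persistent region is modeled as an in-memory `bytearray`, the function name `overwrite_range` is hypothetical, and the backup is assumed to be a byte-for-byte copy of the same region so that the same offset applies to both.

```python
from typing import Optional


def overwrite_range(
    media: bytearray,
    offset: int,
    length: int,
    backup: Optional[bytes] = None,
) -> None:
    """Overwrite media[offset:offset+length] with zeros, or with the backup's bytes."""
    if offset < 0 or length < 0 or offset + length > len(media):
        raise ValueError("range lies outside the region")
    if backup is None:
        # Overwrite with the specific value 0.
        media[offset:offset + length] = bytes(length)
    else:
        if len(backup) < offset + length:
            raise ValueError("backup does not cover the requested range")
        # Overwrite with the same range taken from the latest backup.
        media[offset:offset + length] = backup[offset:offset + length]
```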
Specifically, a user interaction process may be utilized to provide a corresponding error handling mode for a user, and a selection instruction for the error handling mode issued by the user is received through a preset interface, so that the MCE error may be handled according to the error handling mode selected by the selection instruction.
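The mapping from usage mode to the options offered, followed by the user's selection, can be sketched as a lookup table. The enumeration and function names are hypothetical, the selection is modeled as a list index rather than a real interface, and the table holds only the options named in the application, which states that the lists are not limiting.

```python
from enum import Enum
from typing import Dict, Tuple


class UsageMode(Enum):
    MEMORY = "memory"
    PERSISTENT = "persistent"


class Action(Enum):
    IGNORE = "ignore the error and continue"
    RESTART = "restart the system"
    DELETE = "delete the data at the error address"
    OVERWRITE = "overwrite the data at the error address"


# First type of processing mode for memory mode, second type for persistent storage mode.
OPTIONS: Dict[UsageMode, Tuple[Action, ...]] = {
    UsageMode.MEMORY: (Action.IGNORE, Action.RESTART),
    UsageMode.PERSISTENT: (Action.DELETE, Action.OVERWRITE),
}


def choose_action(mode: UsageMode, selection_index: int) -> Action:
    """Return the action the user selected from the options offered for this mode."""
    options = OPTIONS[mode]
    if not 0 <= selection_index < len(options):
        raise ValueError("selection is not one of the offered options")
    return options[selection_index]
```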
In the following, an MCE error processing apparatus provided in an embodiment of the present application is introduced, and an MCE error processing apparatus described below and an MCE error processing method described above may be referred to each other.
Referring to fig. 4, an MCE error processing apparatus provided in an embodiment of the present application includes:
an error receiving module 401, configured to receive, when an error occurs in the DCPMM device, an MCE error sent by a memory controller of the DCPMM device;
a first sending module 402, configured to directly send the MCE error to a user interaction process if the operation corresponding to the MCE error does not have a corresponding application program;
a second sending module 403, configured to send the MCE error to a DCPMM recovery process for repair if the operation corresponding to the MCE error has a corresponding application program;
a third sending module 404, configured to send the MCE error to the user interaction process if the DCPMM recovery process fails to repair the MCE error;
an error processing module 405, configured to determine an error processing manner by using the user interaction process, so as to process the MCE error according to the error processing manner.
For the specific implementation process of the modules 401 to 405, reference may be made to the corresponding content disclosed in the foregoing embodiments, and details are not repeated here.
The present application further provides an electronic device, and as shown in fig. 5, an electronic device provided in an embodiment of the present application includes:
a memory 100 for storing a computer program;
the processor 200, when executing the computer program, may implement the steps provided by the above embodiments.
Specifically, the memory 100 includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and computer-readable instructions, and the internal memory provides an environment in which the operating system and the computer-readable instructions in the non-volatile storage medium run. In some embodiments, the processor 200 may be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor or another data processing chip; it provides computing and control capabilities for the electronic device, and when it executes the computer program stored in the memory 100, the steps of the MCE error processing method provided in any of the foregoing embodiments may be implemented.
On the basis of the above embodiment, as a preferred implementation and referring to FIG. 6, the electronic device further includes:
an input interface 300, connected to the processor 200, for acquiring externally imported computer programs, parameters, and instructions, and storing them in the memory 100 under the control of the processor 200. The input interface 300 may be connected to an input device to receive parameters or instructions entered manually by a user. The input device may be a touch layer covering a display screen; a button, trackball, or touch pad arranged on the terminal housing; or a keyboard, touch pad, or mouse.
a display unit 400, connected to the processor 200, for displaying data processed by the processor 200 and for displaying a visual user interface. The display unit 400 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch panel, or the like.
a network port 500, connected to the processor 200, for establishing communication connections with external terminal devices. The communication technology used for the connection may be wired or wireless, such as Mobile High-Definition Link (MHL), Universal Serial Bus (USB), High-Definition Multimedia Interface (HDMI), wireless fidelity (WiFi), Bluetooth, Bluetooth Low Energy, or an IEEE 802.11s-based communication technology.
While FIG. 6 shows only an electronic device having components 100 to 500, those skilled in the art will appreciate that the configuration shown in FIG. 6 does not limit the electronic device, which may include fewer or more components than shown, combine some components, or arrange the components differently.
The present application also provides a computer-readable storage medium, which may be any medium capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc. The storage medium stores a computer program which, when executed by a processor, implements the steps of the MCE error processing method provided in any of the foregoing embodiments.
According to the present application, after an error occurs in the DCPMM device, the memory controller sends an MCE error. Depending on whether the operation corresponding to the MCE error has a corresponding application program, the MCE error is either sent to the DCPMM recovery process for repair or sent to the user interaction process, which determines the error processing manner and processes the MCE error. In this way, MCE errors generated by the DCPMM device are handled.
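The routing summarized above can be illustrated with a minimal Python sketch. It is not the implementation of the present application: the names `MceError`, `Operation`, `route_mce_error`, `try_recover`, and `ask_user` are hypothetical, and the recovery process and the user interaction process are modeled as plain callables supplied by the caller.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable, Optional


class Operation(Enum):
    MEMORY_READ_WRITE = auto()  # issued on behalf of an application program
    ADDRESS_POLLING = auto()    # background scan with no owning application


@dataclass
class MceError:
    address: int                    # error address reported with the MCE error
    operation: Operation            # operation during which the error occurred
    app_name: Optional[str] = None  # None when no application owns the operation


def route_mce_error(
    error: MceError,
    try_recover: Callable[[MceError], bool],  # stand-in for the DCPMM recovery process
    ask_user: Callable[[MceError], str],      # stand-in for the user interaction process
) -> str:
    """Return a label describing how the error was handled."""
    if error.app_name is None:
        # No corresponding application: go straight to user interaction.
        return ask_user(error)
    if try_recover(error):
        return "repaired"
    # Repair failed: fall back to user interaction.
    return ask_user(error)
```

In this sketch the fallback to `ask_user` is taken both when no application owns the operation and when `try_recover` reports failure, so the user interaction process is the last resort on every path.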
The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments may be referred to one another. Because the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant details can be found in the description of the method. It should be noted that those skilled in the art may make improvements and modifications to the present application without departing from its principle, and such improvements and modifications also fall within the scope of the claims of the present application.
It is further noted that, in this specification, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
Claims (8)
1. An MCE error handling method, comprising:
when the DCPMM device has an error, receiving an MCE error sent by a memory controller of the DCPMM device;
if the operation corresponding to the MCE error does not have the corresponding application program, directly sending the MCE error to a user interaction process;
if the operation corresponding to the MCE error has a corresponding application program, sending the MCE error to a DCPMM recovery process for repair processing; if the operation corresponding to the MCE error is a memory read-write operation, sending the MCE error to an application program corresponding to the operation; determining corresponding detailed description information according to the error address corresponding to the MCE error by using the application program, and sending the detailed description information to the DCPMM recovery process so that the DCPMM recovery process can repair the MCE error by using a data protection mechanism;
if the DCPMM recovery process fails to repair the MCE error, sending the MCE error to the user interaction process;
determining a use mode of an error address corresponding to the MCE error by using the user interaction process;
and providing a corresponding error processing mode by using the user interaction process according to the use mode, and receiving a selection instruction aiming at the error processing mode through a preset interface so as to process the MCE error according to the error processing mode selected by the selection instruction.
2. The MCE error handling method according to claim 1, wherein sending the MCE error to a DCPMM recovery process for repair processing comprises:
sending the MCE error to the DCPMM recovery process;
determining an error address corresponding to the MCE error by using the DCPMM recovery process;
if the target data corresponding to the error address is stored in the cache storage, performing a flushing operation on the cache storage so as to flush the target data to the error address;
if the storage area where the error address is located has mirror image storage, reading the target data from the mirror image storage and writing the target data into the error address;
and if the error address belongs to the address in the software disk array, performing data recovery by using a RAID data recovery mechanism.
3. The MCE error processing method according to claim 1, wherein if there is no corresponding application program in the operation corresponding to the MCE error, directly sending the MCE error to a user interaction process, includes:
and if the operation corresponding to the MCE error is a memory address polling operation, directly sending the MCE error to a user interaction process.
4. The MCE error handling method according to any one of claims 1 to 3, wherein the providing a corresponding error processing mode by using the user interaction process according to the use mode includes:
if the use mode is a memory mode, providing a first type of processing mode by using the user interaction process; the first type of processing mode comprises: ignoring the current MCE error and continuing system operation; and restarting the system.
5. The MCE error handling method according to any one of claims 1 to 3, wherein the providing a corresponding error processing mode by using the user interaction process according to the use mode includes:
if the use mode is a persistent storage mode, providing a second type of processing mode by using the user interaction process; the second type of processing mode comprises: deleting the current data at the error address; and overwriting the current data at the error address.
6. An MCE error handling apparatus, comprising:
an error receiving module, configured to receive an MCE error sent by a memory controller of a DCPMM device after the DCPMM device generates an error;
a first sending module, configured to directly send the MCE error to a user interaction process if the operation corresponding to the MCE error does not have a corresponding application program;
a second sending module, configured to send the MCE error to a DCPMM recovery process for repair if the operation corresponding to the MCE error has a corresponding application program; if the operation corresponding to the MCE error is a memory read-write operation, send the MCE error to the application program corresponding to the operation; and determine, by using the application program, corresponding detailed description information according to the error address corresponding to the MCE error, and send the detailed description information to the DCPMM recovery process so that the DCPMM recovery process repairs the MCE error by using a data protection mechanism;
a third sending module, configured to send the MCE error to the user interaction process if the DCPMM recovery process fails to repair the MCE error;
an error processing module, configured to determine, by using the user interaction process, the use mode of the error address corresponding to the MCE error; provide, by using the user interaction process, a corresponding error processing mode according to the use mode; and receive, through a preset interface, a selection instruction for the error processing mode, so as to process the MCE error according to the error processing mode selected by the selection instruction.
7. An electronic device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the MCE error handling method as claimed in any one of claims 1 to 5 when executing the computer program.
8. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program which, when being executed by a processor, carries out the steps of the MCE error handling method according to any one of claims 1 to 5.
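As an illustration only, the repair conditions recited in claim 2 and the processing modes recited in claims 4 and 5 can be sketched in Python as follows. The names `recover`, `Strategy`, `UsageMode`, and `options_for` are hypothetical, and trying the repair strategies one after another in a fixed order is an assumption of this sketch; the claims state the conditions without prescribing an order.

```python
from enum import Enum
from typing import Callable, List, Optional, Tuple

# (name, applies, repair): `applies` checks the precondition for an address
# (data still in cache, mirror storage present, address inside a software
# RAID array); `repair` attempts the fix and reports success.
Strategy = Tuple[str, Callable[[int], bool], Callable[[int], bool]]


def recover(address: int, strategies: List[Strategy]) -> Optional[str]:
    """Return the name of the first applicable strategy that succeeds, else None."""
    for name, applies, repair in strategies:
        if applies(address) and repair(address):
            return name
    return None  # caller would hand the error to the user interaction process


class UsageMode(Enum):
    MEMORY = "memory"
    PERSISTENT = "persistent"


# Options offered to the user, keyed by the use mode of the error address.
OPTIONS = {
    UsageMode.MEMORY: [
        "ignore the current MCE error and continue",
        "restart the system",
    ],
    UsageMode.PERSISTENT: [
        "delete the current data at the error address",
        "overwrite the current data at the error address",
    ],
}


def options_for(mode: UsageMode) -> List[str]:
    """Return a copy of the processing modes offered for the given use mode."""
    return list(OPTIONS[mode])
```

A caller would build the strategy list from whatever cache, mirror, and RAID facilities the platform exposes, then present `options_for(mode)` through the preset interface only when `recover` returns `None`.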
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911328687.0A CN111143125B (en) | 2019-12-20 | 2019-12-20 | MCE error processing method and device, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911328687.0A CN111143125B (en) | 2019-12-20 | 2019-12-20 | MCE error processing method and device, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111143125A (en) | 2020-05-12 |
CN111143125B (en) | 2022-04-22 |
Family
ID=70519174
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911328687.0A, Active, CN111143125B (en) | 2019-12-20 | 2019-12-20 | MCE error processing method and device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111143125B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113127388A (en) * | 2021-04-13 | 2021-07-16 | 郑州云海信息技术有限公司 | Metadata writing method and related device |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102308285A (en) * | 2011-07-26 | 2012-01-04 | 华为技术有限公司 | Memory bug application of application program |
CN103593251A (en) * | 2013-11-07 | 2014-02-19 | 浪潮电子信息产业股份有限公司 | Fault-tolerant system based on process redundancy and design method thereof |
CN105354101A (en) * | 2015-10-09 | 2016-02-24 | 上海瀚银信息技术有限公司 | Operation error processing method and system and intelligent terminal |
CN105589762A (en) * | 2014-08-19 | 2016-05-18 | 三星电子株式会社 | Memory Devices, Memory Modules And Method For Correction |
CN105893166A (en) * | 2016-04-29 | 2016-08-24 | 浪潮电子信息产业股份有限公司 | Method and device for processing memory errors |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7949904B2 (en) * | 2005-05-04 | 2011-05-24 | Microsoft Corporation | System and method for hardware error reporting and recovery |
US9934086B2 (en) * | 2016-06-06 | 2018-04-03 | Micron Technology, Inc. | Apparatuses and methods for selective determination of data error repair |
US20190034252A1 (en) * | 2017-07-28 | 2019-01-31 | Hewlett Packard Enterprise Development Lp | Processor error event handler |
Non-Patent Citations (1)
Title |
---|
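Research and Implementation of a Cluster-Oriented Computer Fault Management System; Cheng Long; China Excellent Doctoral and Master's Theses Full-text Database (Master's), Information Science and Technology; 20160315; Vol. 2016 (No. 03); full text *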
Also Published As
Publication number | Publication date |
---|---|
CN111143125A (en) | 2020-05-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10146627B2 (en) | Mobile flash storage boot partition and/or logical unit shadowing | |
US8645749B2 (en) | Systems and methods for storing and recovering controller data in non-volatile memory devices | |
KR102147970B1 (en) | Method of reparing non-volatile memory based storage device and operation method of electronic system including the storage device | |
EP2567319B1 (en) | Methods and system for verifying memory device integrity | |
CN108399134A (en) | The operating method of storage device and storage device | |
US10303560B2 (en) | Systems and methods for eliminating write-hole problems on parity-based storage resources during an unexpected power loss | |
US10120769B2 (en) | Raid rebuild algorithm with low I/O impact | |
US20120066568A1 (en) | Storage device, electronic device, and data error correction method | |
US20130191705A1 (en) | Semiconductor storage device | |
EP2567320B1 (en) | Methods and system for verifying memory device integrity | |
US8086841B2 (en) | BIOS switching system and a method thereof | |
US11138080B2 (en) | Apparatus and method for reducing cell disturb in an open block of a memory system during a recovery procedure | |
US8301942B2 (en) | Managing possibly logically bad blocks in storage devices | |
US7783918B2 (en) | Data protection method of storage device | |
US20200356476A1 (en) | Incomplete Write Group Journal | |
KR20160074025A (en) | Operating method for data storage device | |
CN111143125B (en) | MCE error processing method and device, electronic equipment and storage medium | |
US8667325B2 (en) | Method, apparatus and system for providing memory sparing information | |
CN111752475B (en) | Method and device for data access management in storage server | |
CN114765051A (en) | Memory test method and device, readable storage medium and electronic equipment | |
JP5842655B2 (en) | Information processing apparatus, program, and error processing method | |
US11809742B2 (en) | Recovery from HMB loss | |
TWI712052B (en) | Memory management method, storage controller and storage device | |
CN115756946A (en) | File inspection method and device | |
EP2469412B1 (en) | Methods and system for verifying memory device integrity |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||