CN114461436A - Memory fault processing method and device and computer readable storage medium - Google Patents

Memory fault processing method and device and computer readable storage medium

Info

Publication number
CN114461436A
Authority
CN
China
Prior art keywords
memory
fault
memory address
physical memory
address
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210362920.2A
Other languages
Chinese (zh)
Inventor
张玉峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd filed Critical Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202210362920.2A priority Critical patent/CN114461436A/en
Publication of CN114461436A publication Critical patent/CN114461436A/en
Priority to PCT/CN2022/115340 priority patent/WO2023193396A1/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation, the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/073 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation, the processing taking place on a specific hardware platform or in a specific software environment, in a memory management context, e.g. virtual memory or cache management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793 Remedial or corrective actions

Abstract

The application discloses a memory fault processing method and device and a computer-readable storage medium, and relates to the technical field of computers. The method monitors fault information of the memory of a server, acquires the redundant space of the memory, and judges whether the redundant space is smaller than a first threshold value. If not, it acquires a fault physical memory address and the corresponding virtual memory address according to the fault information; isolates the fault physical memory address through a redundancy mechanism of the memory and acquires a new physical memory address; backs up the data in the fault physical memory address; and maps the virtual memory address to the new physical memory address so that the data can be migrated to it. The scheme thus permanently isolates the faulty memory through the redundancy mechanism and, by changing the mapping of the virtual memory, also isolates the faulty memory at the software layer without losing the data it held. This effectively reduces downtime caused by memory faults, reduces unnecessary memory replacement, and lowers operation and maintenance costs.

Description

Memory fault processing method and device and computer readable storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for processing a memory fault, and a computer-readable storage medium.
Background
Server memory is random-access memory (RAM) that incorporates additional technologies, such as Error Correction Code (ECC), which give it very high stability and error-correction capability. Modern operating systems do not access the physical memory of the server directly but through an intermediate layer called Virtual Memory (VM): the operating system accesses the physical memory to which a virtual address is mapped. The physical memory address to which a virtual address is mapped can also be changed, after which the operating system reaches the newly mapped physical address through the same virtual address.
However, in the operation and use of servers, hardware fault diagnosis and fault prediction are pain points in the field of server operation and maintenance, and are also technically difficult. Faults caused by memory account for the highest proportion of all server faults, so effectively diagnosing memory faults and isolating them by technical means can effectively reduce server faults.
In view of the above, designing a reliable memory fault processing method is a problem urgently to be solved by those skilled in the art.
Disclosure of Invention
The application aims to provide a memory fault processing method and device and a computer readable storage medium.
In order to solve the above technical problem, the present application provides a memory fault processing method, including:
monitoring fault information of a memory of a server to confirm that the memory has a fault;
acquiring a redundant space of the memory;
judging whether the redundant space is smaller than a first threshold value;
if not, acquiring a fault physical memory address and a corresponding virtual memory address thereof according to the fault information;
isolating the fault physical memory address through a redundancy mechanism of the memory, and acquiring a new physical memory address; wherein the space of the new physical memory address is equal to the space of the failed physical memory address in size;
backing up data in the fault physical memory address corresponding to the virtual memory address;
mapping the virtual memory address to the new physical memory address for migrating the data to the new physical memory address.
Preferably, the monitoring of the fault information of the memory of the server includes:
monitoring the fault information of the memory through MCA technology, recording the fault information in an interrupt mask control register, and generating a fault log.
Preferably, the monitoring of the fault information of the memory of the server includes:
judging whether the quantity of the fault information is greater than a second threshold value within a first preset time; wherein the quantity of the fault information is decremented periodically, with a second preset time as the period, and the second preset time is less than the first preset time;
if yes, confirming that the memory has a fault, and entering the step of acquiring the redundant space of the memory.
Preferably, the acquiring of the fault physical memory address and its corresponding virtual memory address according to the fault information includes:
parsing the interrupt mask control register to obtain the fault physical memory address;
acquiring the virtual memory address corresponding to the fault physical memory address through a memory management unit according to the fault log.
Preferably, after the mapping of the virtual memory address to the new physical memory address, the method further includes:
marking the fault physical memory address.
Preferably, after the marking of the fault physical memory address, the method further includes:
triggering a memory fault alarm.
Preferably, if it is determined that the redundant space is smaller than the first threshold value, the method further includes:
outputting information prompting replacement of the memory.
In order to solve the above technical problem, the present application further provides a memory fault processing apparatus, including:
the monitoring module is used for monitoring fault information of the memory of the server so as to confirm that the memory has faults;
the first acquisition module is used for acquiring the redundant space of the memory;
the judging module is used for judging whether the redundant space is smaller than a first threshold value or not; if not, triggering a second acquisition module;
the second obtaining module is used for obtaining a fault physical memory address and a virtual memory address corresponding to the fault physical memory address according to the fault information;
the redundancy module is used for isolating the fault physical memory address through a redundancy mechanism of the memory and acquiring a new physical memory address; wherein the space of the new physical memory address is equal to the space of the failed physical memory address in size;
the data backup module is used for backing up data in the fault physical memory address corresponding to the virtual memory address;
a mapping module, configured to map the virtual memory address to the new physical memory address, so as to migrate the data to the new physical memory address.
In order to solve the above technical problem, the present application further provides another memory failure processing apparatus, including:
a memory, configured to store a computer program;
a processor, configured to implement the steps of the memory fault processing method when executing the computer program.
In order to solve the above technical problem, the present application further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the memory fault handling method are implemented.
According to the memory fault processing method provided by the application, fault information of the memory of the server is monitored, the redundant space of the memory is acquired, and whether the redundant space is smaller than a first threshold value is judged; if not, a fault physical memory address and its corresponding virtual memory address are acquired according to the fault information; the fault physical memory address is isolated through a redundancy mechanism of the memory, and a new physical memory address is acquired, wherein the space of the new physical memory address is equal in size to the space of the fault physical memory address; the data in the fault physical memory address corresponding to the virtual memory address is backed up, and the virtual memory address is mapped to the new physical memory address so as to migrate the data to it. The scheme thus permanently isolates the faulty memory through the memory redundancy mechanism and, while the operating system is running, also isolates the fault at the software layer by changing the mapping of the virtual memory, without losing the data in the faulty memory. It can effectively reduce the downtime rate caused by memory faults, reduce unnecessary memory replacement, and greatly reduce operation and maintenance costs.
In addition, the embodiment of the application also provides a memory fault processing device and a computer readable storage medium, and the effect is the same as above.
Drawings
In order to more clearly illustrate the embodiments of the present application, the drawings needed for the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings can be obtained by those skilled in the art without inventive effort.
Fig. 1 is a flowchart of a memory fault processing method according to an embodiment of the present disclosure;
Fig. 2 is a flowchart of another memory fault processing method according to an embodiment of the present disclosure;
Fig. 3 is a schematic structural diagram of a memory fault handling apparatus according to an embodiment of the present disclosure;
Fig. 4 is a schematic structural diagram of another memory failure processing apparatus according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without any creative effort belong to the protection scope of the present application.
The core of the application is to provide a reliable memory fault processing method, a device and a computer readable storage medium.
In order that those skilled in the art will better understand the disclosure, the following detailed description will be given with reference to the accompanying drawings.
In the operation and use of the server, the server hardware fault diagnosis and fault prediction are pain points in the field of server operation and maintenance and are also technical difficulties. The server fault caused by the memory accounts for the highest percentage of all faults, so if the server memory fault can be effectively diagnosed and technically isolated, the server fault can be effectively reduced. Therefore, the embodiment of the present application provides a memory fault handling method. Fig. 1 is a flowchart of a memory failure processing method according to an embodiment of the present disclosure. As shown in fig. 1, the method comprises:
S10: Monitor fault information of the memory of the server to confirm that the memory has a fault.
S11: Acquire the redundant space of the memory.
S12: Judge whether the redundant space is smaller than a first threshold value; if not, proceed to step S13.
S13: Acquire a fault physical memory address and its corresponding virtual memory address according to the fault information.
S14: Isolate the fault physical memory address through a redundancy mechanism of the memory, and acquire a new physical memory address; the space of the new physical memory address is equal in size to the space of the fault physical memory address.
S15: Back up the data in the fault physical memory address corresponding to the virtual memory address.
S16: Map the virtual memory address to the new physical memory address so as to migrate the data to the new physical memory address.
It will be appreciated that, during operation of the server, the memory may fail for various reasons. Memory faults fall into two types: Correctable Errors (CE) and Uncorrectable Errors (UCE). When a CE occurs, the memory can correct it automatically through the ECC mechanism, but excessive or frequent CEs often foreshadow a UCE. A UCE is generally accompanied by server downtime and is a serious server fault. CEs therefore need to be handled appropriately once they are found. Accordingly, in this embodiment the fault information of the memory is monitored first; CEs are detected as the fault information, and a corresponding policy is applied in response to them so as to avoid the occurrence of a UCE. This embodiment does not limit the way in which the fault information is monitored, which depends on the specific implementation.
After it is confirmed that the memory has a fault, the redundant space of the memory is acquired. When manufacturing memory, a manufacturer provisions redundant space so that damage to part of the physical space does not render the whole memory unusable. For example, a memory chip with a nominal capacity of 128M may actually provide 130M; the extra 2M is the redundant space of the memory. Before the memory leaves the factory, the manufacturer tests it comprehensively, finds any damaged regions in the normal physical memory, and, through address encoding in the memory firmware, redirects each damaged region to a region of the same size in the redundant physical memory space. This ensures that the full 128M is available. Note that if the damaged space exceeds 2M, the redundancy is insufficient and the memory must be discarded. Therefore, to determine whether the redundant space of the faulty memory is sufficient to cover the fault, the redundant space is acquired and compared with a first threshold value. This embodiment does not limit the first threshold value, which depends on the specific implementation. If the redundant space is not smaller than the first threshold value, it is deemed sufficient and the subsequent redundancy operation can proceed.
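The comparison against the first threshold can be sketched as follows. This is a minimal Python illustration that assumes capacities are expressed in bytes and reads "128M" and "130M" as mebibytes; the function name `has_enough_redundancy` and the example threshold values are hypothetical and are not taken from the application.

```python
MIB = 1024 * 1024  # bytes per mebibyte (assumed unit for the "M" figures)


def has_enough_redundancy(redundant_bytes: int, first_threshold_bytes: int) -> bool:
    """Redundancy is sufficient when it is NOT smaller than the first threshold."""
    return redundant_bytes >= first_threshold_bytes


# Figures from the example: nominal 128M, actually 130M.
nominal_bytes = 128 * MIB
actual_bytes = 130 * MIB
redundant_bytes = actual_bytes - nominal_bytes  # the extra 2M

print(redundant_bytes // MIB)                           # redundant space in MiB
print(has_enough_redundancy(redundant_bytes, 1 * MIB))  # hypothetical 1 MiB threshold
print(has_enough_redundancy(redundant_bytes, 4 * MIB))  # hypothetical 4 MiB threshold
```

A redundant space exactly equal to the threshold counts as sufficient, because the condition in the text is "not smaller than".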
After the redundant space of the memory is determined to be sufficient, the fault physical memory address and its corresponding virtual memory address are acquired according to the fault information. As noted above, modern operating systems do not access real physical addresses directly; the operating system manages physical memory through a mechanism called Virtual Memory (VM). Specifically, the address a program uses to access memory is not a real physical memory address; it is translated by the Memory Management Unit (MMU), the address translation unit of the Central Processing Unit (CPU), under control of the operating system, according to a mapping between virtual addresses and real physical addresses. The operating system divides the memory into multiple spaces for use by different programs, and application programs use the memory through their virtual address spaces. The fault physical memory address of the faulty memory therefore also has a corresponding virtual memory address. This embodiment does not limit the way in which the fault physical memory address and the virtual memory address are obtained, which depends on the specific implementation.
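The relationship between a virtual address and the physical address it maps to, and the reverse lookup from a faulty physical address back to its virtual address, can be sketched with a toy single-level page table. This is only an illustration: the 4096-byte page size, the class `PageTable`, and its method names are assumptions, and a real MMU uses multi-level tables maintained by the operating system kernel.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


class PageTable:
    """Toy mapping from virtual page numbers (VPN) to physical frame numbers (PFN)."""

    def __init__(self):
        self._vpn_to_pfn = {}

    def map_page(self, vpn: int, pfn: int) -> None:
        """Create or change the mapping of one virtual page."""
        self._vpn_to_pfn[vpn] = pfn

    def translate(self, vaddr: int) -> int:
        """Virtual address -> physical address (what the MMU does on each access)."""
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        return self._vpn_to_pfn[vpn] * PAGE_SIZE + offset

    def reverse_lookup(self, paddr: int) -> list:
        """Physical address -> every virtual address currently mapped to it."""
        pfn, offset = divmod(paddr, PAGE_SIZE)
        return [vpn * PAGE_SIZE + offset
                for vpn, mapped in self._vpn_to_pfn.items()
                if mapped == pfn]


table = PageTable()
table.map_page(vpn=5, pfn=42)
faulty_paddr = table.translate(5 * PAGE_SIZE + 16)
print(hex(faulty_paddr), [hex(v) for v in table.reverse_lookup(faulty_paddr)])
```

The reverse lookup returns a list because several virtual pages may share one physical frame; an unmapped physical address yields an empty list.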
Further, after the fault physical memory address is obtained, it is isolated through a redundancy mechanism of the memory, and a new physical memory address is acquired. Specifically, the redundancy of the memory is implemented by Post Package Repair (PPR) technology, which replaces damaged rows in the memory with redundant rows. The space of the new physical memory address is equal in size to the space of the fault physical memory address, so that it can hold the data of the fault physical memory address. At the same time, the data in the fault physical memory address corresponding to the obtained virtual memory address is backed up to prevent it from being lost. Finally, the virtual memory address is mapped to the new physical memory address so that the backed-up data can be migrated there, which completes the handling of the faulty memory.
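The backup, remapping, and migration sequence can be sketched as follows. This is a simplified simulation under stated assumptions: physical frames are modeled as `bytearray` objects in a dictionary, the page table is a plain dictionary from virtual page number to frame number, and the replacement frame is simply passed in rather than obtained from PPR hardware. The function name `migrate_page` is hypothetical.

```python
from typing import Dict


def migrate_page(page_table: Dict[int, int],
                 frames: Dict[int, bytearray],
                 vpn: int,
                 new_pfn: int) -> int:
    """Move the data behind one virtual page to a replacement frame.

    Returns the old (faulty) frame number so the caller can mark it.
    """
    old_pfn = page_table[vpn]
    # The new space must be equal in size to the faulty space.
    if len(frames[new_pfn]) != len(frames[old_pfn]):
        raise ValueError("replacement frame must match the faulty frame in size")
    backup = bytes(frames[old_pfn])  # back up the data (step S15)
    page_table[vpn] = new_pfn        # map the virtual page to the new frame (step S16)
    frames[new_pfn][:] = backup      # migrate the backed-up data
    return old_pfn


frames = {7: bytearray(b"data"), 9: bytearray(4)}
page_table = {3: 7}
old_pfn = migrate_page(page_table, frames, vpn=3, new_pfn=9)
print(old_pfn, page_table[3], bytes(frames[9]))
```

Taking the backup before changing the mapping means the data survives even if the faulty frame becomes unreadable during the remap.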
In this embodiment, fault information of the memory of the server is monitored, the redundant space of the memory is acquired, and whether the redundant space is smaller than a first threshold value is judged; if not, a fault physical memory address and its corresponding virtual memory address are acquired according to the fault information; the fault physical memory address is isolated through a redundancy mechanism of the memory, and a new physical memory address is acquired, wherein the space of the new physical memory address is equal in size to the space of the fault physical memory address; the data in the fault physical memory address corresponding to the virtual memory address is backed up, and the virtual memory address is mapped to the new physical memory address so as to migrate the data to it. The scheme thus permanently isolates the faulty memory through the memory redundancy mechanism and, while the operating system is running, also isolates the fault at the software layer by changing the mapping of the virtual memory, without losing the data in the faulty memory. It can effectively reduce the downtime rate caused by memory faults, reduce unnecessary memory replacement, and greatly reduce operation and maintenance costs.
On the basis of the above-described embodiment:
As a preferred embodiment, the monitoring of the fault information of the memory of the server includes:
monitoring the fault information of the memory through MCA technology, recording the fault information in an interrupt mask control register, and generating a fault log.
The above embodiment does not limit the way in which the fault information is monitored, which depends on the specific implementation. As a preferred embodiment, in this embodiment the fault information of the memory is monitored through MCA technology and recorded in an interrupt mask control register, and a fault log is generated.
Note that, starting with the Pentium 4, Intel added to its CPUs a hardware error detection mechanism called the Machine Check Architecture (MCA). It is used to detect hardware errors such as system bus errors, Error Correcting Code (ECC) errors, and parity errors. The mechanism is implemented through a number of model-specific registers (MSRs), which fall into two groups: one for configuration and the other for describing the hardware errors that occur. In the MCA technical architecture, the memory takes the form of Dual In-line Memory Modules (DIMMs), a type of memory module that appeared after the introduction of the Pentium CPU and provides a 64-bit data channel.
With MCA technology, the details of a fault are recorded in the interrupt mask control register (IMC) regardless of whether a CE or a UCE occurs. The IMC is a set of registers in the MCA architecture that can store detailed fault information. If a CE occurs in the memory, the CPU reports the detailed fault information through the MCA architecture. On a Linux system, the operating system records the error details in the fault log MCELOG for use in subsequent operations.
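How a monitoring component might distinguish a CE from a UCE in a recorded status word can be sketched as follows. The application does not give a register layout, so this sketch assumes the status word follows the bit layout of Intel's IA32_MCi_STATUS machine-check register (VAL in bit 63, UC in bit 61, ADDRV in bit 58). The function names `classify_status` and `has_fault_address` are hypothetical.

```python
VAL = 1 << 63    # the register holds a valid error record
UC = 1 << 61     # the error was not corrected by hardware
ADDRV = 1 << 58  # the companion address register holds a valid address


def classify_status(status: int) -> str:
    """Return "none", "CE", or "UCE" for an assumed IA32_MCi_STATUS-style value."""
    if not status & VAL:
        return "none"
    return "UCE" if status & UC else "CE"


def has_fault_address(status: int) -> bool:
    """True when a valid record also carries a usable fault address."""
    return bool(status & VAL) and bool(status & ADDRV)


sample = VAL | ADDRV  # a corrected error with a recorded address
print(classify_status(sample), has_fault_address(sample))
```

Only records with the address-valid flag set can feed the later step that resolves the fault physical memory address.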
In this embodiment, the fault information of the memory is monitored by the MCA technology, the fault information is recorded in the interrupt mask control register, and a fault log is generated, so that the fault information is monitored and stored.
On the basis of the above-described embodiment: as a preferred embodiment, the monitoring fault information of the memory of the server includes:
judging whether the quantity of the fault information is greater than a second threshold value within first preset time; the quantity of the fault information is decreased progressively by taking second preset time as a period, and the second preset time is less than the first preset time;
if yes, confirming that the memory fails, and entering the step of obtaining the redundant space of the memory.
In a specific implementation, monitoring of the memory of the server may also pick up faults that can be corrected or that fall within an allowable range, so the mere appearance of fault information does not mean that the memory has failed. Specifically, while monitoring the fault information, it is judged whether the quantity of fault information is greater than a second threshold value within a first preset time, that is, whether the number of faults within a time period exceeds the allowable number. The quantity of fault information is decremented periodically, with a second preset time as the period, and the second preset time is shorter than the first preset time. When only correctable or allowable faults occur, the count is thus decremented over time until it returns to zero, and memory fault processing is not triggered. This embodiment does not limit the second threshold value, the first preset time, or the second preset time, all of which depend on the specific implementation. When the quantity of fault information is greater than the second threshold value within the first preset time, the number of faults has exceeded the allowable range, so the memory is confirmed to have a fault and the subsequent memory fault processing steps are performed.
In the embodiment, whether the quantity of the fault information is greater than a second threshold value within a first preset time is judged; the quantity of the fault information is decreased progressively by taking second preset time as a period, and the second preset time is less than the first preset time; if so, confirming that the memory fails, and entering the step of acquiring the redundant space of the memory, so as to realize accurate judgment of whether the memory fails.
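One possible reading of the decrementing fault count is a leaky-bucket style counter, sketched below. This is an interpretation rather than the application's implementation: each fault adds one to the count, the count drops by one for every elapsed second preset time, and a fault is confirmed once the count exceeds the second threshold. The first preset time is not modeled as a separate window here; with the decay in place, the count can only exceed the threshold when faults arrive in a burst. The class name `FaultCounter` and all parameter values are hypothetical.

```python
class FaultCounter:
    """Counts fault reports and decays the count by one per elapsed period."""

    def __init__(self, second_threshold: int, decay_period: float):
        self.second_threshold = second_threshold
        self.decay_period = decay_period  # the "second preset time", in seconds
        self.count = 0
        self._last_decay = None

    def _decay(self, now: float) -> None:
        if self._last_decay is None:
            self._last_decay = now
            return
        periods = int((now - self._last_decay) // self.decay_period)
        if periods > 0:
            self.count = max(0, self.count - periods)  # never below zero
            self._last_decay += periods * self.decay_period

    def record(self, now: float) -> bool:
        """Register one fault at time `now`; return True if the memory is deemed faulty."""
        self._decay(now)
        self.count += 1
        return self.count > self.second_threshold


burst = FaultCounter(second_threshold=3, decay_period=10.0)
print([burst.record(t) for t in (0.0, 1.0, 2.0, 3.0)])

sparse = FaultCounter(second_threshold=3, decay_period=10.0)
print([sparse.record(t) for t in (0.0, 15.0, 30.0, 45.0, 60.0)])
```

In the usage above, the burst of four faults within a few seconds crosses the threshold, while the sparse sequence keeps decaying back toward zero and never triggers fault processing.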
On the basis of the above-described embodiment:
As a preferred embodiment, the acquiring of the fault physical memory address and its corresponding virtual memory address according to the fault information includes:
parsing the interrupt mask control register to obtain the fault physical memory address;
acquiring the virtual memory address corresponding to the fault physical memory address through a memory management unit according to the fault log.
The above embodiment does not limit the way in which the fault physical memory address and the virtual memory address are obtained, which depends on the specific implementation. As a preferred embodiment, since the detailed fault information is recorded in the interrupt mask control register (IMC), this embodiment obtains the fault physical memory address by parsing the detailed fault information in the IMC. On a Linux system, the operating system records the error details in the fault log MCELOG and manages the physical memory through virtual memory, so the virtual memory address corresponding to the fault physical memory address can be obtained through the MMU according to the fault log.
In this embodiment, the failure physical memory address is obtained by analyzing the interrupt mask control register, and the virtual memory address corresponding to the failure physical memory address is obtained by the memory management unit according to the failure log, so that the failure physical memory address and the virtual memory address are obtained, and the mapping address is changed subsequently.
Fig. 2 is a flowchart of another memory failure processing method according to an embodiment of the present disclosure. As shown in fig. 2, after mapping the virtual memory address to the new physical memory address, the method further includes:
S17: Mark the fault physical memory address.
S18: Trigger a memory fault alarm.
In a specific implementation, the fault physical memory address is marked in the kernel of the operating system, which ensures that this physical memory is not allocated to application programs afterwards and prevents the memory fault from recurring.
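The effect of marking a faulty physical address so that it is never handed out again can be sketched with a toy frame allocator. This is an illustration only; an operating system kernel tracks page state very differently, and the class `FrameAllocator` and its methods are hypothetical names.

```python
class FrameAllocator:
    """Toy physical frame allocator that never hands out frames marked as bad."""

    def __init__(self, total_frames: int):
        self.free = set(range(total_frames))
        self.bad = set()

    def mark_bad(self, pfn: int) -> None:
        """Record a faulty frame and withdraw it from the free pool."""
        self.bad.add(pfn)
        self.free.discard(pfn)

    def allocate(self) -> int:
        """Return the lowest-numbered healthy free frame."""
        if not self.free:
            raise MemoryError("no healthy frames available")
        pfn = min(self.free)
        self.free.remove(pfn)
        return pfn

    def release(self, pfn: int) -> None:
        """Return a frame to the pool unless it has been marked bad."""
        if pfn not in self.bad:
            self.free.add(pfn)


allocator = FrameAllocator(total_frames=4)
allocator.mark_bad(0)
print(allocator.allocate())  # frame 0 is skipped because it is marked
```

Checking the bad set in `release` matters as much as in `allocate`: a frame that was in use when it was marked must not re-enter the free pool when its owner frees it.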
After the fault physical memory address is marked, a memory fault alarm is triggered to prompt maintenance personnel to service the faulty memory.
As shown in fig. 2, in an implementation, if it is determined that the redundant space is smaller than the first threshold, the method further includes:
S19: Output information prompting replacement of the memory.
It can be understood that when the redundant space is smaller than the first threshold value, the faulty space of the current memory is considered to exceed the redundant space of the memory, so the redundant space is insufficient to replace the faulty memory space. The memory fault cannot be handled in this case, and the memory can only be replaced. Information prompting replacement of the memory is therefore output, so that maintenance personnel replace the faulty memory.
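The branch between in-place repair and memory replacement in Fig. 2 can be summarized in a short sketch. It only returns the ordered step labels for a confirmed fault; the function name `plan_steps` is hypothetical, and the step descriptions paraphrase the text.

```python
from typing import List


def plan_steps(redundant_space: int, first_threshold: int) -> List[str]:
    """Return the steps that follow a confirmed memory fault (after S10 and S11)."""
    if redundant_space < first_threshold:
        # Redundancy is insufficient: the fault cannot be repaired in place.
        return ["S19: output information prompting replacement of the memory"]
    return [
        "S13: acquire the fault physical address and its virtual address",
        "S14: isolate the fault physical address and acquire a new one",
        "S15: back up the data in the fault physical address",
        "S16: map the virtual address to the new physical address",
        "S17: mark the fault physical address",
        "S18: trigger a memory fault alarm",
    ]


for step in plan_steps(redundant_space=2, first_threshold=1):
    print(step)
print(plan_steps(redundant_space=0, first_threshold=1))
```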
In the foregoing embodiment, a detailed description is given of the memory fault processing method, and the present application also provides an embodiment corresponding to the memory fault processing apparatus. It should be noted that the present application describes the embodiments of the apparatus portion from two perspectives, one is from the perspective of the functional module, and the other is from the perspective of the hardware structure.
Fig. 3 is a schematic structural diagram of a memory failure processing apparatus according to an embodiment of the present application, and as shown in fig. 3, the memory failure processing apparatus includes:
A monitoring module 10, configured to monitor fault information of a memory of the server to confirm that the memory has a fault.
A first obtaining module 11, configured to obtain the redundant space of the memory.
A judging module 12, configured to judge whether the redundant space is smaller than a first threshold value and, if not, to trigger a second obtaining module.
A second obtaining module 13, configured to obtain a fault physical memory address and its corresponding virtual memory address according to the fault information.
A redundancy module 14, configured to isolate the fault physical memory address through a redundancy mechanism of the memory and obtain a new physical memory address, wherein the space of the new physical memory address is equal in size to the space of the fault physical memory address.
A data backup module 15, configured to back up the data in the fault physical memory address corresponding to the virtual memory address.
A mapping module 16, configured to map the virtual memory address to the new physical memory address so as to migrate the data to the new physical memory address.
Because the apparatus embodiments correspond to the method embodiments, refer to the description of the method embodiments for details of the apparatus embodiments, which are not repeated here.
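The cooperation of the modules listed above can be illustrated by the following minimal sketch. It is a toy software model only, not the implementation of the present application: `MemoryModel`, `handle_fault`, and all field names are hypothetical, every address stands for one block of equal size, the redundant space is counted in blocks, and the behavior when no virtual address maps to the faulty block is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class MemoryModel:
    """Toy model in which every address stands for one equally sized block."""
    page_table: Dict[int, int]   # virtual address -> physical address
    contents: Dict[int, bytes]   # physical address -> stored data
    spare: List[int]             # redundant (spare) physical addresses
    isolated: Set[int] = field(default_factory=set)


def handle_fault(mem: MemoryModel, faulty_phys: int, first_threshold: int) -> str:
    # Modules 11 and 12: obtain the redundant space and compare it with the threshold.
    if len(mem.spare) < first_threshold:
        return "replace memory"

    # Module 13: look up the virtual address mapped to the faulty physical address.
    virt = next((v for v, p in mem.page_table.items() if p == faulty_phys), None)

    # Module 14: isolate the faulty address.
    mem.isolated.add(faulty_phys)
    if virt is None:
        # Assumption: an unmapped faulty block holds no data to migrate.
        return "isolated"
    new_phys = mem.spare.pop(0)  # new block of equal size taken from the spare pool

    # Module 15: back up the data held at the faulty address.
    backup = mem.contents.pop(faulty_phys, b"")

    # Module 16: remap the virtual address and migrate the backed-up data.
    mem.page_table[virt] = new_phys
    mem.contents[new_phys] = backup
    return "repaired"
```

For example, with `page_table={0x1000: 7}`, `contents={7: b"data"}` and `spare=[42, 43]`, calling `handle_fault(mem, 7, 1)` leaves the virtual address `0x1000` mapped to block 42, which then holds the migrated data, while block 7 is recorded as isolated.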
Fig. 4 is a schematic structural diagram of another memory fault processing apparatus according to an embodiment of the present application. As shown in Fig. 4, the memory fault processing apparatus includes:
A memory 20, configured to store a computer program.
A processor 21, configured to implement the steps of the memory fault processing method described in the above embodiments when executing the computer program.
The memory fault handling apparatus provided in this embodiment may include, but is not limited to, a smart phone, a tablet computer, a notebook computer, or a desktop computer.
The processor 21 may include one or more processing cores, for example a 4-core processor or an 8-core processor. The processor 21 may be implemented in hardware using at least one of a Digital Signal Processor (DSP), a Field-Programmable Gate Array (FPGA), and a Programmable Logic Array (PLA). The processor 21 may also include a main processor and a coprocessor: the main processor, also called a Central Processing Unit (CPU), processes data in the awake state, and the coprocessor is a low-power processor that processes data in the standby state. In some embodiments, the processor 21 may be integrated with a Graphics Processing Unit (GPU), which is responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 21 may further include an Artificial Intelligence (AI) processor for processing computational operations related to machine learning.
The memory 20 may include one or more computer-readable storage media, which may be non-transitory. The memory 20 may also include high-speed random access memory as well as non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In this embodiment, the memory 20 is at least used to store a computer program 201 which, after being loaded and executed by the processor 21, implements the relevant steps of the memory fault handling method disclosed in any of the foregoing embodiments. In addition, the resources stored in the memory 20 may include an operating system 202, data 203, and the like, and may be stored transiently or permanently. The operating system 202 may include Windows, Unix, Linux, and the like. The data 203 may include, but is not limited to, data involved in the memory fault handling method.
In some embodiments, the memory failure processing device may further include a display 22, an input/output interface 23, a communication interface 24, a power supply 25, and a communication bus 26.
Those skilled in the art will appreciate that the configuration shown in Fig. 4 does not limit the memory fault processing apparatus, which may include more or fewer components than those shown.
Finally, the application also provides a corresponding embodiment of the computer readable storage medium. The computer-readable storage medium has stored thereon a computer program which, when being executed by a processor, carries out the steps as set forth in the above-mentioned method embodiments.
It is to be understood that if the method in the above embodiments is implemented in the form of software functional units and sold or used as a stand-alone product, it can be stored in a computer-readable storage medium. Based on this understanding, all or part of the technical solutions of the present application may be embodied in the form of a software product that is stored in a storage medium and, when executed, performs all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The memory fault processing method, the memory fault processing device and the computer-readable storage medium provided by the present application are described in detail above. The embodiments in this specification are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar the embodiments may be referred to one another. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is brief; for relevant details, refer to the description of the method. It should be noted that those skilled in the art may make several improvements and modifications to the present application without departing from its principle, and such improvements and modifications also fall within the scope of the claims of the present application.
It is further noted that, in the present specification, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Claims (10)

1. A memory fault handling method, characterized by comprising the following steps:
monitoring fault information of a memory of a server to confirm that the memory has a fault;
acquiring a redundant space of the memory;
judging whether the redundant space is smaller than a first threshold value;
if not, acquiring a fault physical memory address and a corresponding virtual memory address thereof according to the fault information;
isolating the fault physical memory address through a redundancy mechanism of the memory, and acquiring a new physical memory address; wherein the space of the new physical memory address is equal to the space of the failed physical memory address in size;
backing up data in the fault physical memory address corresponding to the virtual memory address;
mapping the virtual memory address to the new physical memory address for migrating the data to the new physical memory address.
2. The memory fault handling method according to claim 1, wherein the monitoring fault information of the memory of the server includes:
monitoring the fault information of the memory through an MCA technology, recording the fault information in an interrupt shielding control register, and generating a fault log.
3. The memory fault handling method according to claim 1, wherein the monitoring fault information of the memory of the server includes:
judging whether the quantity of the fault information is greater than a second threshold value within a first preset time; wherein the quantity of the fault information is decremented periodically, with a second preset time as the period, and the second preset time is less than the first preset time;
if yes, confirming that the memory fails, and entering the step of acquiring the redundant space of the memory.
4. The method according to claim 2, wherein the obtaining the physical memory address with the fault and the corresponding virtual memory address according to the fault information comprises:
analyzing the interrupt shielding control register to obtain the fault physical memory address;
and acquiring the virtual memory address corresponding to the fault physical memory address through a memory management unit according to the fault log.
5. The memory fault handling method according to claim 1, further comprising, after said mapping the virtual memory address to the new physical memory address:
marking the fault physical memory address.
6. The memory fault handling method according to claim 5, further comprising, after said marking the fault physical memory address:
triggering a memory fault alarm.
7. The memory fault handling method according to any one of claims 1 to 6, further comprising, if it is determined that the redundant space is smaller than the first threshold value:
outputting information for prompting replacement of the memory.
8. A memory fault handling device, comprising:
a monitoring module, configured to monitor fault information of a memory of a server to confirm that the memory has a fault;
a first obtaining module, configured to obtain a redundant space of the memory;
a judging module, configured to judge whether the redundant space is smaller than a first threshold value and, if not, to trigger a second obtaining module;
the second obtaining module, configured to obtain a fault physical memory address and a corresponding virtual memory address thereof according to the fault information;
a redundancy module, configured to isolate the fault physical memory address through a redundancy mechanism of the memory and to obtain a new physical memory address; wherein the space of the new physical memory address is equal in size to the space of the fault physical memory address;
a data backup module, configured to back up data in the fault physical memory address corresponding to the virtual memory address;
a mapping module, configured to map the virtual memory address to the new physical memory address, so as to migrate the data to the new physical memory address.
9. A memory fault handling device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the memory fault handling method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, having stored thereon a computer program which, when being executed by a processor, carries out the steps of the memory fault handling method according to any one of claims 1 to 7.
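
For illustration only, and without limiting the claims, the counting rule recited in claim 3 can be read as a leaky counter. The following is a minimal sketch under that reading. The class `LeakyFaultCounter` and its parameter names are hypothetical; restarting the count when the first preset time elapses is an assumption, since the text does not state how the observation window is managed.

```python
from typing import Optional


class LeakyFaultCounter:
    """Counts fault reports; the count leaks by one every `second_period`."""

    def __init__(self, first_period: float, second_period: float, second_threshold: int):
        if not 0 < second_period < first_period:
            raise ValueError("second_period must be positive and shorter than first_period")
        self.first_period = first_period
        self.second_period = second_period
        self.second_threshold = second_threshold
        self.count = 0
        self.window_start: Optional[float] = None  # start of the current observation window
        self.last_leak: Optional[float] = None     # time up to which leaks have been applied

    def record_fault(self, now: float) -> bool:
        """Record one fault report at time `now`; return True if the memory is judged faulty."""
        if self.window_start is None or now - self.window_start > self.first_period:
            # Assumption: a new observation window starts and the count restarts.
            self.window_start = now
            self.last_leak = now
            self.count = 0
        # Apply one decrement for each full second_period elapsed since the last leak.
        leaks = int((now - self.last_leak) // self.second_period)
        if leaks:
            self.count = max(0, self.count - leaks)
            self.last_leak += leaks * self.second_period
        self.count += 1
        return self.count > self.second_threshold
```

With `first_period=60`, `second_period=10` and `second_threshold=2`, three reports at times 0, 1 and 2 cause the third call to return `True`, whereas reports spaced 15 time units apart never exceed the threshold because the count leaks between them.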
CN202210362920.2A 2022-04-08 2022-04-08 Memory fault processing method and device and computer readable storage medium Pending CN114461436A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210362920.2A CN114461436A (en) 2022-04-08 2022-04-08 Memory fault processing method and device and computer readable storage medium
PCT/CN2022/115340 WO2023193396A1 (en) 2022-04-08 2022-08-28 Memory fault processing method and device, and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210362920.2A CN114461436A (en) 2022-04-08 2022-04-08 Memory fault processing method and device and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN114461436A true CN114461436A (en) 2022-05-10

Family

ID=81418248

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210362920.2A Pending CN114461436A (en) 2022-04-08 2022-04-08 Memory fault processing method and device and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN114461436A (en)
WO (1) WO2023193396A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023193396A1 (en) * 2022-04-08 2023-10-12 苏州浪潮智能科技有限公司 Memory fault processing method and device, and computer readable storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103197999A (en) * 2013-03-22 2013-07-10 北京百度网讯科技有限公司 Method and device for automatically positioning internal memory fault
CN103631721A (en) * 2012-08-23 2014-03-12 华为技术有限公司 Method and system for isolating bad blocks in internal storage
WO2016115661A1 (en) * 2015-01-19 2016-07-28 华为技术有限公司 Memory fault isolation method and device
CN109086151A (en) * 2017-06-13 2018-12-25 中兴通讯股份有限公司 The method and device of memory failure is isolated on a kind of server
CN112667422A (en) * 2019-10-16 2021-04-16 华为技术有限公司 Memory fault processing method and device, computing equipment and storage medium
CN112667445A (en) * 2021-01-12 2021-04-16 长鑫存储技术有限公司 Method and device for repairing packaged memory, storage medium and electronic equipment
CN113282434A (en) * 2021-07-19 2021-08-20 苏州浪潮智能科技有限公司 Memory repair method based on post-package repair technology and related components
CN113742123A (en) * 2021-08-20 2021-12-03 新华三技术有限公司合肥分公司 Memory fault information recording method and equipment

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10268612B1 (en) * 2016-09-23 2019-04-23 Amazon Technologies, Inc. Hardware controller supporting memory page migration
CN114064333A (en) * 2020-08-05 2022-02-18 华为技术有限公司 Memory fault processing method and device
CN114461436A (en) * 2022-04-08 2022-05-10 苏州浪潮智能科技有限公司 Memory fault processing method and device and computer readable storage medium

Also Published As

Publication number Publication date
WO2023193396A1 (en) 2023-10-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20220510