CN115421947A - Memory fault processing method and device and storage medium - Google Patents

Memory fault processing method and device and storage medium

Publication number
CN115421947A
Authority
CN
China
Prior art keywords
cache block
fault
target
memory
replacement
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210911674.1A
Other languages
Chinese (zh)
Inventor
李胜
鲍全洋
张光彪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Henan Kunlun Technology Co ltd
Original Assignee
XFusion Digital Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by XFusion Digital Technologies Co Ltd
Priority to CN202210911674.1A
Publication of CN115421947A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/073 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a memory management context, e.g. virtual memory or cache management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793 Remedial or corrective actions

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The application discloses a memory fault processing method, a memory fault processing device, and a storage medium, relates to the technical field of memories, and is used for improving the efficiency of repairing memory faults. The method comprises the following steps: a central processing unit (CPU) receives a repair request sent by an out-of-band controller; the repair request is used for requesting fault repair of a target cache block in the memory, and the repair request carries the fault severity of the target cache block; when an idle replacement cache block exists, the CPU replaces the target cache block with the idle replacement cache block; when no idle replacement cache block exists, the CPU determines a target replacement cache block based on the fault severity of the target cache block; the target replacement cache block is a replacement cache block that currently replaces a historical fault cache block, and the fault severity of the target cache block is greater than or equal to the fault severity of the historical fault cache block; the CPU replaces the target cache block with the target replacement cache block.

Description

Memory fault processing method and device and storage medium
Technical Field
The present application relates to the field of memory technologies, and in particular, to a method and an apparatus for processing a memory fault, and a storage medium.
Background
Currently, the internal memory of a computer device serves as an important storage module from which the computer device acquires, and in which it stores, related data. If the memory fails, the operation of the computer device may be affected; for example, data may be lost. One existing fault repair method repairs a memory fault by replacing a faulty cache block with a replacement cache block, and is called the partial cache line sparing (PCLS) technique. However, because of memory cost, the number of replacement cache blocks is limited, so how to efficiently repair faults in the memory with the limited replacement cache blocks is a problem to be solved.
Disclosure of Invention
The embodiment of the application provides a memory fault processing method, a memory fault processing device and a storage medium, which are used for improving the efficiency of repairing memory faults.
In order to achieve the above purpose, the embodiment of the present application adopts the following technical solutions:
In a first aspect, a memory fault processing method is provided, applied to processor firmware, the method including: a central processing unit (CPU) receives a repair request sent by an out-of-band controller; the repair request is used for requesting fault repair of a target cache block in the memory, and the repair request carries the fault severity of the target cache block; when an idle replacement cache block exists, the CPU replaces the target cache block with the idle replacement cache block; when no idle replacement cache block exists, the CPU determines a target replacement cache block based on the fault severity of the target cache block; the target replacement cache block is a replacement cache block that currently replaces a historical fault cache block, and the fault severity of the target cache block is greater than or equal to the fault severity of the historical fault cache block; the CPU replaces the target cache block with the target replacement cache block.
When a memory fault is currently repaired with the PCLS technique, the number of replacement cache blocks is limited. To address this, the present application proposes the following: when an idle replacement cache block exists, the faulty cache block is repaired with it; when no idle replacement cache block exists, then, based on the fault severity of the faulty cache block, a replacement cache block occupied by a historical fault cache block whose fault severity is lower than that of the faulty cache block is reclaimed and used to repair the faulty cache block. This helps devote the limited repair resources to the faults that are more likely to affect the system, and thereby improves the efficiency of repairing memory faults.
In one possible implementation, the method further includes: when the target cache block is successfully repaired by adopting the idle replacement cache block, storing the corresponding relation between the target cache block and the replacement cache block for replacing the target cache block; and when the target cache block is successfully repaired by adopting the target replacement cache block, updating the corresponding relation between the target replacement cache block and the historical fault cache block into the corresponding relation between the target replacement cache block and the target cache block.
In this possible implementation, updating the correspondence between the faulty cache block and the repair resource makes it possible to determine, from the fault repair result, whether the target cache block occupies one of the limited replacement cache blocks. When a fault next occurs, the fault severity of the newly faulty cache block can then be compared with that of the target cache block in order to determine the target replacement cache block for the newly faulty cache block.
In one possible implementation, the target replacement cache block is the replacement cache block that currently replaces the historical fault cache block with the lowest fault severity.
In this possible implementation, determining the replacement cache block corresponding to the historical fault cache block with the lowest fault severity as the target replacement cache block helps reserve the replacement cache blocks for faults of higher severity, reducing the risk that a target cache block with a higher fault severity affects the system.
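The selection rule of the first aspect, together with the correspondence update and the lowest-severity choice described above, can be sketched as follows. This is a minimal illustration and not the claimed implementation: the class `SparePool`, its fields, and the use of integer severities are hypothetical names and assumptions introduced only for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, Hashable, Optional, Tuple


@dataclass
class SparePool:
    """Hypothetical model of the replacement cache blocks of one memory channel."""
    capacity: int
    # Correspondence table: spare index -> (repaired cache block, its fault severity).
    in_use: Dict[int, Tuple[Hashable, int]] = field(default_factory=dict)

    def repair(self, target: Hashable, severity: int) -> Optional[int]:
        """Return the index of the spare that now replaces `target`, or None."""
        # Case 1: an idle replacement cache block exists, so use it.
        for idx in range(self.capacity):
            if idx not in self.in_use:
                self.in_use[idx] = (target, severity)
                return idx
        if not self.in_use:
            return None  # pool has no spares at all
        # Case 2: no idle spare. Pick the spare whose historical fault is least severe.
        victim = min(self.in_use, key=lambda i: self.in_use[i][1])
        if self.in_use[victim][1] > severity:
            return None  # every historical fault is more severe than the target
        # Reassign the spare and update the stored correspondence.
        self.in_use[victim] = (target, severity)
        return victim
```

With a pool of two spares, faults of severity 3 and 5 occupy both; a later fault of severity 1 is not repaired, while a later fault of severity 4 takes over the spare that covered the severity-3 fault.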
In one possible implementation, the fault severity of the target cache block is the number of times the target cache block has failed within a preset time period, where the preset time period is the period from the startup of the server in which the memory is located to the present.
This possible implementation provides a specific way of representing the fault severity of the target cache block: representing the fault severity by the number of fault occurrences helps evaluate each faulty cache block and identify the faults that are more likely to affect the system.
In one possible implementation, the fault severity of the target cache block is the output of a fault model, where the fault model is used to determine the fault severity based on fault information of the target cache block, and the fault information includes at least one of location information, fault occurrence time, and the number of fault occurrences.
This possible implementation provides another specific way of representing the fault severity of the target cache block: using the output of the fault model as the fault severity spares the user from taking part in calculation and data processing, avoids manual errors caused by user operation, and thereby improves the accuracy of the fault prediction result.
In a second aspect, a memory fault processing apparatus is provided, including functional units for executing any one of the methods provided by the first aspect, where the actions performed by the respective functional units are implemented by hardware or by hardware executing corresponding software. For example, the apparatus may include a receiving unit and a processing unit. The receiving unit is configured to receive a repair request sent by an out-of-band controller; the repair request is used for requesting fault repair of a target cache block in the memory, and the repair request carries the fault severity of the target cache block. The processing unit is configured to replace the target cache block with an idle replacement cache block when an idle replacement cache block exists. The processing unit is further configured to determine a target replacement cache block based on the fault severity of the target cache block when no idle replacement cache block exists, where the target replacement cache block is a replacement cache block that currently replaces a historical fault cache block, and the fault severity of the target cache block is greater than or equal to the fault severity of the historical fault cache block, and to replace the target cache block with the target replacement cache block.
In a third aspect, a computer device is provided, comprising: a processor and a memory. The processor is connected with the memory, the memory is used for storing computer execution instructions, and the processor executes the computer execution instructions stored by the memory, so as to realize any one of the methods provided by the first aspect.
In a fourth aspect, there is provided a chip comprising: a processor and an interface circuit; the interface circuit is used for receiving the code instruction and transmitting the code instruction to the processor; a processor for executing code instructions to perform any of the methods provided by the first aspect.
In a fifth aspect, a computer-readable storage medium is provided, which comprises computer-executable instructions, which, when executed on a computer, cause the computer to perform any one of the methods provided in the first aspect.
In a sixth aspect, there is provided a computer program product comprising computer executable instructions which, when executed on a computer, cause the computer to perform any one of the methods provided in the first aspect.
For technical effects brought by any implementation manner of the second aspect to the sixth aspect, reference may be made to technical effects brought by a corresponding implementation manner in the first aspect, and details are not described here.
Drawings
FIG. 1 is a schematic diagram of a computer device according to an embodiment of the present disclosure;
FIG. 2 is a flow chart illustrating a method for repairing a memory failure;
FIG. 3 is a flow chart illustrating a method for repairing a memory failure;
FIG. 4 is a schematic flowchart of a memory fault repairing method according to an embodiment of the present disclosure;
FIG. 5 is a schematic flowchart of a memory fault repairing method according to an embodiment of the present application;
FIG. 6 is an interaction diagram of a memory fault repair system according to an embodiment of the present disclosure;
FIG. 7 is a schematic composition diagram of a memory fault repairing apparatus according to an embodiment of the present disclosure.
Detailed Description
In the description of this application, "/" means "or" unless otherwise stated; for example, A/B may mean A or B. "And/or" herein merely describes an association between associated objects and indicates that three relationships are possible; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. Further, "at least one" means one or more, and "a plurality" means two or more. The terms "first", "second", and the like do not limit quantity or execution order, nor do they necessarily indicate that the items are different.
It is noted that, in the present application, words such as "exemplary" or "for example" are used to mean exemplary, illustrative, or descriptive. Any embodiment or design described herein as "exemplary" or "e.g.," is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, use of the word "exemplary" or "such as" is intended to present relevant concepts in a concrete fashion.
Fig. 1 is a schematic diagram of a system architecture according to an embodiment of the present invention. The system architecture diagram is an architecture diagram of a computer device. Referring to fig. 1, a hardware portion of the computer device includes a processor, an out-of-band controller, and a memory, and a software portion mainly includes an out-of-band management module, processor firmware, and an Operating System (OS) management unit. The out-of-band management module is located in the out-of-band controller, the OS management unit is located in the processor, and the processor firmware may be located in the processor (as shown in fig. 1), or the processor firmware may be located in a firmware chip (not shown in fig. 1) outside the processor.
The out-of-band management module may be a management unit of the non-service module. For example, the server may be remotely maintained and managed through a dedicated data channel by an out-of-band management module, which is completely independent from the operating system of the server and may communicate with a Basic Input Output System (BIOS) and an OS (or OS management unit) through an out-of-band management interface of the server.
For example, the out-of-band management module may include a management unit for the operating state of the computer device, a management system in a management chip outside the processor, a baseboard management controller (BMC) of the computer device, a System Management Module (SMM), and the like. It should be noted that the embodiments of the present application do not limit the specific form of the out-of-band management module, and the above is only an exemplary description. In the following embodiments, the out-of-band management module is described using the BMC as an example.
The processor firmware may also be referred to as a processor firmware program. Specifically, the processor firmware includes firmware such as a basic input output system (BIOS), a management engine (ME), microcode, or an intelligent management unit (IMU). It should be noted that the embodiments of the present application do not limit the specific form of the processor firmware, and the above description is only an example. In the following embodiments, the processor firmware is described using the BIOS as an example.
It should be noted that the management units or modules listed for the out-of-band management module, and the firmware listed for the processor firmware, are only examples. In fact, some management units may also run in the computer as a processor firmware program; for example, the SMM may also provide a service for the user to perform BIOS-related functions. Similarly, some processor firmware, such as the ME or the IMU, may also perform BMC-related functions as a management unit of the non-service module.
The memory, also called internal memory or main memory, is installed in a memory slot on the motherboard of the computer device, and the memory and the memory controller communicate with each other through a memory channel (channel). The memory has at least one memory column (rank), and each memory column is located on one side of the memory. Each memory column includes at least one sub memory column (branch), and the memory column or sub memory column includes a plurality of memory chips (device). Each memory chip is divided into a plurality of memory array groups (bank group), each memory array group includes a plurality of memory arrays (bank), and each memory array is divided into a plurality of memory cells (cell). Each memory cell has a row (row) address and a column (column) address and includes one or more bits. That is, once a row and a column on the memory array are specified, one memory cell can be located on the memory array. The minimum unit of memory failure is a memory cell on the memory array.
In this embodiment, a memory cell may be referred to as a cache block; the cache block may be referred to as a nibble entry when it occupies four bits, and as a byte entry when it occupies eight bits. The row address and the column address of the memory cell indicate the position of the memory cell in the memory. In one division mode, the memory may be divided, from upper level to lower level, into memory chips, memory array groups, memory arrays, memory rows/memory columns, memory cells (cache blocks), and bits, where the addresses of the memory chips, memory array groups, memory arrays, memory rows, memory columns, memory cells (cache blocks), and bits in the memory are real physical addresses. In another division mode, a central processing unit (CPU) divides a memory chip into a plurality of memory pages (pages) based on a paging mechanism, where the addresses of the memory pages are virtual addresses that are translated into real physical addresses.
It should be noted that the system architecture and the application scenario described in the embodiments of the present application are intended to illustrate the technical solution of the embodiments more clearly, and do not constitute a limitation on the technical solution provided in the embodiments of the present application. A person having ordinary skill in the art knows that, as system architectures evolve and new service scenarios appear, the technical solution provided in the embodiments of the present application is also applicable to similar technical problems.
The method provided by the embodiments of the present application is applicable to, but not limited to, dynamic random access memory (DRAM), static random access memory (SRAM), and other memories; the method does not limit the type of the memory.
Currently, most computer devices support memory checking by the processor of the computer device and correction of the errors found, that is, repair of faults in the memory. For example, each time the memory performs a read/write task, the processor identifies and repairs faults in the memory by using an error checking and correcting (ECC) method. The ECC method can identify and correct errors when only a small number of bits in the memory fail. An error that can be corrected is referred to as a correctable error (CE), and may also be referred to as a correctable fault. If the capability of the error correction algorithm is exceeded, for example when a large range of multi-bit failures exists in the memory, error correction fails and an uncorrectable error (UCE) is generated, which may also be referred to as an uncorrectable fault. When a UCE is generated, the system of the computer device may fail severely, for example by crashing, and data in the memory may be lost.
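The difference between a CE and a UCE can be illustrated with a toy single-error-correcting, double-error-detecting code. The sketch below uses an extended Hamming(8,4) code chosen purely for illustration; the application does not specify which ECC code is used, and real memory ECC operates on much wider words. The function names `encode` and `decode` are hypothetical.

```python
from typing import List, Tuple


def encode(data: List[int]) -> List[int]:
    """Encode 4 data bits into an 8-bit extended Hamming codeword.

    Index 0 holds the overall parity bit; indices 1..7 are the classic
    Hamming(7,4) positions (parity at 1, 2, 4; data at 3, 5, 6, 7).
    """
    d1, d2, d3, d4 = data
    word = [0] * 8
    word[3], word[5], word[6], word[7] = d1, d2, d3, d4
    word[1] = d1 ^ d2 ^ d4       # covers positions 3, 5, 7
    word[2] = d1 ^ d3 ^ d4       # covers positions 3, 6, 7
    word[4] = d2 ^ d3 ^ d4       # covers positions 5, 6, 7
    word[0] = sum(word[1:]) % 2  # makes the parity of the whole word even
    return word


def decode(word: List[int]) -> Tuple[str, List[int]]:
    """Return ("OK" | "CE" | "UCE", data bits after any correction)."""
    word = list(word)
    syndrome = 0
    for pos in range(1, 8):
        if word[pos]:
            syndrome ^= pos      # XOR of the positions of all set bits
    overall = sum(word) % 2
    if syndrome == 0 and overall == 0:
        status = "OK"            # no error detected
    elif overall == 1:
        word[syndrome] ^= 1      # single-bit error; syndrome 0 means index 0
        status = "CE"            # corrected, i.e. a correctable error
    else:
        status = "UCE"           # two bits flipped: detected but not correctable
    return status, [word[3], word[5], word[6], word[7]]
```

Flipping one bit of a codeword yields `"CE"` with the original data restored, while flipping two bits yields `"UCE"`, which mirrors the case where the capability of the error correction algorithm is exceeded.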
In order to repair a faulty cache block in a memory, embodiments of the present application provide a technique that repairs the memory fault by replacing the faulty cache block with a replacement cache block; this technique is referred to as the PCLS technique. When the cache block is a nibble entry, the replacement cache block may be referred to as a nibble replacement entry. The replacement cache blocks are typically stored in the memory controller, and their number is limited in view of the cost of the memory controller. For example, the number of replacement cache blocks for each memory channel is 16. Therefore, how to efficiently repair memory faults with the limited replacement cache blocks is an urgent problem to be solved at present. It should be noted that the embodiments of the present application do not limit the number of replacement cache blocks in the memory controller, and the above is only an exemplary description.
In some embodiments, as shown in fig. 2, the memory fault is repaired by using PCLS technology according to the following steps S201 to S204.
S201, when a fault occurs and the number of fault occurrences reaches a threshold, a fault repair process is triggered.
S202, search for an idle PCLS resource.
It can be understood that a PCLS resource is a replacement cache block as described above. An idle PCLS resource is a replacement cache block that is not currently being used to repair a fault.
S203, if an idle PCLS resource exists, execute the PCLS repair task.
Specifically, the faulty cache block is replaced with the idle PCLS resource.
If no idle PCLS resource exists, the PCLS repair task is not executed.
It can be understood that when there is no idle PCLS resource to replace a faulty cache block, the faulty cache block is not repaired and may affect the system.
S204, update the mark of the PCLS resource to "used".
The mark of a PCLS resource indicates whether the resource has been used to repair a faulty cache block.
Steps S201 to S204 above describe a basic scheme for repairing a memory fault by using the PCLS technique. However, PCLS resources are usually limited, and in this scheme, once all PCLS resources have been used, cache blocks that fail subsequently cannot be repaired. Clearly, this method does not make reasonable use of the repair resources.
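The basic scheme of steps S201 to S204 can be sketched as follows. The class `BasicPcls`, the default threshold of 3, and the use of string identifiers for cache blocks are assumptions made only for this illustration; the default of 16 spares follows the per-channel example given earlier.

```python
from typing import Dict, List, Optional


class BasicPcls:
    """Hypothetical model of the basic scheme (S201 to S204) for one memory channel."""

    def __init__(self, num_spares: int = 16, threshold: int = 3) -> None:
        self.threshold = threshold                    # assumed trigger threshold for S201
        self.fault_counts: Dict[str, int] = {}
        self.used: List[bool] = [False] * num_spares  # "used" mark of each PCLS resource
        self.owner: List[Optional[str]] = [None] * num_spares

    def on_fault(self, block: str) -> Optional[int]:
        """Record one fault of `block`; return the spare index if the block is repaired."""
        self.fault_counts[block] = self.fault_counts.get(block, 0) + 1
        if self.fault_counts[block] < self.threshold:  # S201: threshold not reached
            return None
        if block in self.owner:                        # already repaired earlier
            return self.owner.index(block)
        for idx, in_use in enumerate(self.used):       # S202: look for an idle resource
            if not in_use:
                self.owner[idx] = block                # S203: replace the faulty block
                self.used[idx] = True                  # S204: mark the resource as used
                return idx
        return None                                    # no idle resource: not repaired
```

Once every entry of `used` is `True`, `on_fault` returns `None` for every newly faulty block, which is the limitation of this scheme noted above.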
In other embodiments, as shown in FIG. 3, a memory failure is repaired using PCLS techniques according to the following steps S301-S306.
S301, when a fault occurs and the number of fault occurrences reaches a threshold, a fault repair process is triggered.
S302, search for an idle PCLS resource.
S303, if an idle PCLS resource exists, execute the PCLS repair task.
S304, if no idle PCLS resource exists, reclaim the oldest PCLS resource, and then execute the PCLS repair task.
The oldest PCLS resource is the replacement cache block that was used to repair a fault earliest among the occupied repair resources. For example, if replacement cache blocks a, b, and c were used to repair faults in that chronological order, replacement cache block a is the oldest PCLS resource.
In step S304, the oldest PCLS resource is reclaimed and used to replace the currently faulty cache block.
S305, update the mark of the PCLS resource to "used".
S306, update the set of used PCLS resources.
It can be understood that the set of used PCLS resources includes a plurality of replacement cache blocks, each corresponding to its time of use, or the replacement cache blocks are arranged in the order in which they were used; when other cache blocks in the memory fail later, a repair resource is reclaimed from the set of used PCLS resources.
Steps S301 to S306 above describe a scheme that reclaims PCLS resources in order to repair memory faults with the PCLS technique. However, the reclaim strategy in this scheme is determined by the chronological order in which the PCLS resources were used, and reusing the resource of the earliest-repaired faulty cache block to repair the currently faulty cache block likewise does not make reasonable use of the repair resources.
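The oldest-first reclaim strategy of steps S302 to S306 can be sketched with an ordered mapping whose insertion order stands for the order of use. The class name `OldestFirstPcls` is hypothetical, and the threshold check of S301 is omitted for brevity.

```python
from collections import OrderedDict
from typing import List, Optional


class OldestFirstPcls:
    """Hypothetical model of the reclaim-oldest scheme (S302 to S306)."""

    def __init__(self, num_spares: int) -> None:
        self.idle: List[int] = list(range(num_spares))
        # Set of used PCLS resources: spare index -> repaired block, oldest first.
        self.used: "OrderedDict[int, str]" = OrderedDict()

    def repair(self, block: str) -> Optional[int]:
        if self.idle:                                  # S302/S303: idle resource exists
            idx = self.idle.pop(0)
        elif self.used:                                # S304: reclaim the oldest resource
            idx, _previous_block = self.used.popitem(last=False)
        else:
            return None                                # no PCLS resources at all
        self.used[idx] = block                         # S305/S306: mark used, now newest
        return idx
```

Note that the reclaimed spare is chosen without regard to how severe the fault it previously covered was, which is the weakness of this scheme pointed out above.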
In view of the above, the following examples of the present application provide a memory fault processing method in which, when there is no idle replacement cache block for repairing a fault, a historical fault cache block whose fault severity is lower than that of the target cache block is determined, according to the fault severity of the target cache block, from among the historical fault cache blocks that already occupy repair resources, and the target cache block is then repaired with the replacement cache block occupied by that historical fault cache block. Because the way a cache block is repaired is determined by the fault severity of the faulty cache block, the utilization of the repair resources is improved and the possibility that a fault affects the system is reduced.
As shown in fig. 4, a flowchart of a memory failure processing method provided in the embodiment of the present application includes steps S401 to S403.
S401, the CPU obtains fault information of the memory.
The fault information indicates that at least one cache block has failed. Any one of the at least one cache block may be referred to as a target cache block. The fault information includes location information, in the memory, of the at least one failed cache block, for example the address information described above that indicates the position of the memory cell in the memory.
Optionally, the fault information includes the fault occurrence time of the cache block. The fault occurrence time is used by the out-of-band controller to subsequently determine the fault occurrence frequency of the cache block within a preset time period. For example, a fault whose occurrence time falls within the preset time period is counted, and a fault whose occurrence time falls outside the preset time period is not counted.
It is to be understood that the fault information is used to indicate basic information of the fault occurrence, and the information included in the fault information is only an example, and more or less information may be included, and the present application is not limited thereto.
In the related art, each time the memory executes a read/write task, the CPU may perform fault detection on the memory based on an ECC method, and correct a detected error if a fault is detected.
S402, the CPU sends fault information to the out-of-band controller.
Step S402 enables the out-of-band controller to diagnose the fault according to the fault information.
Specifically, the diagnosis includes steps S501 to S503 shown in fig. 5.
S501, judging whether the cache block is a single-point fault by the out-of-band controller.
Optionally, the out-of-band controller determines whether the target cache block is a single point of failure according to the location information of the failure of the at least one cache block. And the out-of-band controller determines the fault type of at least one cache block according to the position information of the fault of the cache block, and determines the cache block with the fault type of a single point fault as a target cache block.
Specifically, the out-of-band controller determines whether the target cache block is a single point of failure by determining whether the address information of the target cache block is different from the address information of other failed cache blocks. Wherein the other failing cache blocks include at least one cache block indicated in the failure information and historical failing cache blocks. A historical faulting cache block refers to a cache block that has failed and is repaired before the target cache block fails. Wherein, judging whether the address information is different comprises: whether the row address of the target cache block is different from the row addresses of the other failing cache blocks, and whether the column address of the target cache block is different from the column addresses of the other failing cache blocks. And when the row address of the target cache block is different from the row addresses of other failed cache blocks and the column address of the target cache block is different from the column addresses of other failed cache blocks, determining that the target cache block is a single point of failure.
It can be understood that when multiple (e.g., two) failed cache blocks are present in the same row or the same column of the memory, the out-of-band controller may diagnose that the entire memory row or column is likely to fail. In that case, the memory row or column needs to be repaired by other techniques, instead of replacing the failed cache block with a replacement cache block by the PCLS technique.
In some embodiments, after the CPU sends the fault information to the out-of-band controller, if the out-of-band controller determines according to the fault information that the fault type of the cache block is not a single point fault, the out-of-band controller may return the fault information of the cache block to the CPU. Alternatively, the out-of-band controller feeds back to the CPU information indicating that the fault type of the cache block is not a single point fault, so that the CPU determines the repair mode of the cache block in another way.
Specifically, when the out-of-band controller determines that the cache block is a single point of failure, the cache block is a target cache block, the out-of-band controller determines to repair the target cache block based on PCLS technology, and the following step S502 is executed; when the out-of-band controller determines that the cache block is not a single point of failure, no PCLS repair is performed.
S502, determining the fault severity of the target cache block by the out-of-band controller.
Optionally, the fault severity of the target cache block is used to indicate the probability that the fault of the target cache block turns into a UCE.
In one example, the out-of-band controller may determine the fault severity of the target cache block by counting the number of occurrences of the fault for the target cache block. Wherein the failure occurrence number is used for representing the failure severity of the target cache block.
It can be understood that, in a general case, the memory repairs a failed cache block, and if the repair succeeds, the fault is marked as a CE. If the cache block subsequently fails repeatedly and is repaired multiple times, then in order to prevent the fault of the cache block from changing from CE to UCE, the PCLS technique may be used to replace the failed cache block with a replacement cache block, thereby resolving the repeated failure of the cache block.
Optionally, the out-of-band controller counts the number of fault occurrences within a preset time period according to the fault occurrence times of the target cache block, and then determines the fault occurrence frequency of the target cache block according to that number. By calculating the fault occurrence frequency, faults that occur more frequently within the preset time period can be identified; such faults have a higher fault severity within the preset time period. For example, the preset time period is the time period from the start of the server in which the memory is located to the acquisition of the fault information.
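A minimal sketch of the counting described above is given below. It assumes that fault occurrence times are available as timestamps in seconds and that the frequency is expressed in faults per hour; the function name `fault_frequency` and the choice of unit are assumptions made for illustration.

```python
def fault_frequency(fault_times, window_start, window_end):
    """Return (count, faults_per_hour) for the faults whose timestamps
    (in seconds) fall within [window_start, window_end]."""
    if window_end <= window_start:
        raise ValueError("the preset time period must have positive length")
    count = sum(1 for t in fault_times if window_start <= t <= window_end)
    hours = (window_end - window_start) / 3600.0
    return count, count / hours


# Server started at t = 0 s; the fault information was acquired at t = 7200 s.
print(fault_frequency([600, 1800, 7000, 9000], 0, 7200))  # (3, 1.5)
```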
In one example, the out-of-band controller inputs fault information for the target cache block into a fault model and outputs a fault severity for the target cache block. Wherein the fault information includes one or more of location information, time of occurrence of the fault, and number of occurrences of the fault.
The fault model is a machine learning model trained in advance. The fault model may be trained iteratively using training samples and sample labels, wherein the training samples comprise fault information of a plurality of memories, and the sample labels comprise the fault severity of each training sample. The fault model may be an artificial intelligence (AI) fault model.
In the embodiment, the fault model is trained in advance, so that the fault severity of the target memory can be obtained by inputting the fault information of the target memory into the fault model without the participation of a user in calculation and data processing, thereby not only improving the speed of fault prediction, but also avoiding manual errors caused by user operation, and further improving the accuracy of a fault prediction result.
In some embodiments, the machine learning model may be a hierarchical threshold algorithm; or one or more machine learning algorithms such as random forest, gradient boosting decision tree (GBDT), extreme gradient boosting (XGBoost), naive Bayes, and support vector machine (SVM); or one or more deep learning algorithms such as a convolutional neural network (CNN) and a long short-term memory network (LSTM); or one or more federated learning optimization algorithms such as federated averaging (FedAvg), FedProx, and FedCS.
S503, the out-of-band controller sends a repair request to the CPU.
The repair request is used for requesting fault repair of a target cache block in the memory, and the repair request carries the fault severity of the target cache block.
It can be understood that the repair request may also be regarded as the fault diagnosis result that the out-of-band controller feeds back to the CPU after diagnosing the single point fault.
S403, the CPU receives the repair request sent by the out-of-band controller, and repairs the target cache block according to the repair request.
Specifically, step S403 includes the following steps S403a-S403d.
S403a, the CPU determines whether there is a free replacement cache block.
Here, a free replacement cache block refers to a repair resource that is still available in the current PCLS technique. It can be understood that, when a failed cache block is repaired based on the PCLS technique, the failed cache block is replaced with a replacement cache block; because the number of replacement cache blocks is limited, the number of faults that can be repaired is also limited.
Specifically, if the CPU determines that there is a free replacement cache block, step S403b is executed; if the CPU determines that there is no free replacement cache block, it executes step S403c.
S403b, the CPU repairs the target cache block by using the free replacement cache block.
Specifically, the CPU repairs the target cache block in a manner that the target cache block is replaced by a free replacement cache block.
It can be understood that, when a cache block in the memory fails, the CPU cannot read or write data in that cache block. In the repair manner in which a replacement cache block replaces the target cache block, the mapping relationship through which the CPU accesses the target cache block is adjusted to a mapping relationship through which the CPU accesses the replacement cache block, so that the CPU subsequently reads or writes the replacement cache block instead, thereby repairing the memory fault. Specifically, the mapping relationship may be the location information of the cache block corresponding to the data read or written by the CPU.
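The adjustment of the mapping relationship can be pictured with the following sketch. It is a software analogy only: the class `RemapTable`, its methods, and the use of a (row, column) tuple as location information are assumptions for illustration and are not the hardware mechanism of the PCLS technique.

```python
class RemapTable:
    """Hypothetical table that redirects accesses from a failed cache
    block to the replacement cache block that repairs it."""

    def __init__(self):
        self._remap = {}  # failed block location -> replacement block id

    def repair(self, failed_loc, spare_id):
        """Record that accesses to failed_loc are served by spare_id."""
        self._remap[failed_loc] = spare_id

    def resolve(self, loc):
        """Return where a read or write to loc is actually served from."""
        if loc in self._remap:
            return ("spare", self._remap[loc])
        return ("memory", loc)


table = RemapTable()
table.repair((5, 9), spare_id=0)
print(table.resolve((5, 9)))  # ('spare', 0)
print(table.resolve((1, 1)))  # ('memory', (1, 1))
```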
S403c, the CPU determines a target replacement cache block based on the fault severity of the target cache block, and replaces the target cache block with the target replacement cache block.
Wherein the target replacement cache block is used to replace the target cache block, the target replacement cache block being a cache block that currently replaces a historical failing cache block. The failure severity of the target cache block is greater than or equal to the failure severity of the historical failed cache blocks.
It can be understood that, in step S403, the CPU obtains the fault severity of the failed cache block and compares the fault severity of the target cache block with the fault severity of the historical fault cache blocks. If there is a historical fault cache block whose fault severity is lower than that of the target cache block, the replacement cache block used to repair that historical fault cache block may be recovered and determined as the target replacement cache block for replacing the target cache block. There may be a plurality of historical fault cache blocks whose fault severity is lower than that of the target cache block, and accordingly there may be a plurality of recoverable repair resources.
In one example, the CPU determines a priority sequence based on the severity of the fault for historical fault cache blocks, where historical fault cache blocks with high severity of fault correspond to high priority and historical fault cache blocks with low severity of fault correspond to low priority. Further, the CPU compares the fault severity of the target cache block with the priority sequence, and determines the position of the target cache block in the priority sequence according to the fault severity of the target cache block. If the priority of the target cache block is high, it indicates that the target cache block has a high possibility of affecting the system. If the priority of the target cache block is low, it means that the target cache block has less influence on the system.
The priority sequence is used to represent the ordering of the cache blocks by fault severity. Illustratively, suppose the fault severity of each historical fault cache block is identified by its number of fault occurrences, and historical fault cache block A has failed 6 times, historical fault cache block B has failed 8 times, and historical fault cache block C has failed 15 times. Sorting the historical fault cache blocks by fault severity from high to low gives the priority sequence C, B, A. That is to say, the fault severity of historical fault cache block C is higher than that of historical fault cache block B, and the fault severity of historical fault cache block B is higher than that of historical fault cache block A.
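The example above can be reproduced with a short sketch. The function names are hypothetical, and the position of the target cache block is assumed to be the number of historical fault cache blocks with a strictly higher fault count.

```python
def priority_sequence(fault_counts):
    """Sort historical fault cache blocks from highest to lowest severity,
    where severity is the number of fault occurrences."""
    return sorted(fault_counts, key=fault_counts.get, reverse=True)


def position_in_sequence(target_count, fault_counts):
    """Return the 0-based position the target cache block would take in the
    priority sequence (number of historical blocks that are more severe)."""
    return sum(1 for c in fault_counts.values() if c > target_count)


history = {"A": 6, "B": 8, "C": 15}
print(priority_sequence(history))         # ['C', 'B', 'A']
print(position_in_sequence(10, history))  # 1, i.e. between C and B
```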
It can be understood that, the number of the current replacement cache blocks is limited, and by comparing the failure severity of the target cache block with the failure severity of the historical failure cache block, it is helpful to flexibly allocate the limited replacement cache blocks to repair the cache block with higher failure severity, thereby avoiding the impact of the failure on the system as much as possible.
In one example, the CPU determines one or more historical failing cache blocks having a lower priority than the target cache block based on the failure severity of the target cache block, and further determines replacement cache blocks used by the one or more historical failing cache blocks. The CPU determines a target replacement cache block among the replacement cache blocks used by the one or more historically failing cache blocks, the target replacement cache block being used to repair the target cache block when there are no free replacement cache blocks.
In one example, the CPU determines a least severe historical failing cache block of the one or more historical failing cache blocks, and treats a replacement cache block used by the historical failing cache block as a target replacement cache block. Or the CPU selects a replacement cache block occupied by any one historical fault cache block from one or more historical fault cache blocks as a target replacement cache block.
It can be understood that the CPU determines a priority sequence from the fault severity of the target cache block and the fault severity of the historical fault cache blocks. If the fault severity of the target cache block is higher and all replacement cache blocks have already been used to repair historical fault cache blocks, the CPU recovers the replacement cache block corresponding to the historical fault cache block with the lowest fault severity in the priority sequence and uses the recovered replacement cache block as the target replacement cache block. When there is no free replacement cache block during repair, the target replacement cache block is used to repair the target cache block.
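The selection of the target replacement cache block may be sketched as follows. The function name `pick_target_replacement` and the layout of `in_use` are assumptions. The sketch uses a strict comparison (historical severity lower than the target severity); an embodiment that also allows equal severity would change `<` to `<=`.

```python
def pick_target_replacement(target_severity, in_use):
    """Pick the replacement cache block to recover for the target.

    in_use maps replacement block id -> (historical fault block, severity).
    Returns the id of the replacement block whose historical fault is the
    least severe among those less severe than the target, or None when no
    recoverable repair resource exists.
    """
    candidates = {
        spare: severity
        for spare, (_, severity) in in_use.items()
        if severity < target_severity
    }
    if not candidates:
        return None
    return min(candidates, key=candidates.get)


in_use = {0: ("A", 6), 1: ("B", 8), 2: ("C", 15)}
print(pick_target_replacement(10, in_use))  # 0, the block repairing A
print(pick_target_replacement(5, in_use))   # None
```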
In one example, the CPU marks a reclaimable replacement cache block with a reclaimable identifier, which indicates that the replacement cache block can serve as a target replacement cache block. Specifically, the reclaimable identifier takes one of two values, reclaimable and non-reclaimable, represented for example by 0 and 1 respectively. Alternatively, the CPU determines a value range for the reclaimable identifier, where the range corresponds to the number of replacement cache blocks in the repair resources. For example, if the number of replacement cache blocks is 16, the range of the reclaimable identifier is 0-15; when there are multiple recoverable repair resources, they are assigned the reclaimable identifiers 0, 1, ..., n, where n is less than or equal to 15, in ascending order of the fault severity corresponding to the recoverable repair resources.
S403d, when a recoverable repair resource exists, the CPU repairs the target cache block by using the recoverable repair resource.
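The second numbering scheme (identifiers 0 to n in ascending order of fault severity) can be sketched as below. The function name, the string keys such as "spare3", and the default of 16 replacement cache blocks are illustrative assumptions.

```python
def assign_reclaimable_ids(recoverable, total_spares=16):
    """Assign reclaimable identifiers 0..n to recoverable repair resources,
    in ascending order of the fault severity they currently repair.

    recoverable maps replacement block name -> fault severity.
    """
    if len(recoverable) > total_spares:
        raise ValueError("more recoverable resources than replacement cache blocks")
    ordered = sorted(recoverable, key=recoverable.get)
    return {spare: idx for idx, spare in enumerate(ordered)}


print(assign_reclaimable_ids({"spare3": 8, "spare7": 6}))
# {'spare7': 0, 'spare3': 1}
```

With this numbering, identifier 0 always denotes the resource that would be recovered first.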
Specifically, the CPU replaces the target cache block with the target replacement cache block for repair.
The repair method of replacing the target cache block with the target replacement cache block is the same as the repair method of replacing the target cache block with the idle replacement cache block, and is not described again.
Optionally, the method further includes step S403e, when there is no recoverable repair resource, the CPU does not repair the target cache block.
Optionally, after step S403e, step S403f is further included, where the CPU returns failure information for indicating that the failed cache block is not repaired.
Through the above steps S401 to S403, the CPU determines a repair mode for the target cache block according to the fault severity of the target cache block sent by the out-of-band controller. When there is no free repair resource, recoverable repair resources may be determined according to the fault severity, and a target replacement cache block is determined among the recoverable repair resources to repair the target cache block, wherein a recoverable repair resource is a repair resource occupied by a historical fault cache block. That is, occupied repair resources are recovered according to fault severity and used to repair the target cache block. The method helps to reduce the possibility that a memory fault affects the system and improves the repair efficiency of memory faults.
Optionally, after step S403b, the method further includes: the CPU stores a correspondence between the target cache block and a replacement cache block that replaces the target cache block. The fault repairing result may further include the fault information, such as location information, fault occurrence time, fault severity, and the like.
Optionally, after step S403c, the method further includes: the CPU updates the corresponding relation between the target replacement cache block and the historical fault cache block into the corresponding relation between the target replacement cache block and the target cache block. The tag information of the target replacement cache block may be the above-mentioned reclaimable identifier.
It can be understood that the CPU stores the above correspondence so that, when another cache block subsequently fails with a fault severity higher than that of the target cache block, the replacement cache block used to repair the target cache block can be recovered to repair the newly failed cache block, and the correspondence is modified accordingly.
The methods performed in fig. 4 and fig. 5 can be applied to the memory failure recovery system shown in fig. 6, which at least includes an out-of-band controller and a CPU. The method includes steps S601-S612.
S601, the CPU receives the fault information of the memory and reports the fault information to the out-of-band controller.
As in steps S401 and S402 above.
S602, the out-of-band controller diagnoses whether a single point fault exists.
As in step S501 above.
S603, when the out-of-band controller determines that a single point fault exists, the out-of-band controller determines the fault severity of the target cache block.
As in step S502 above.
S604, the out-of-band controller sends a repair request to the CPU.
As in step S503 above.
S605, the CPU searches whether the idle repair resources exist.
If yes, step S610 is executed, and PCLS repair is executed.
As described above in steps S403a-S403b.
S606, when no free repair resource exists, the CPU determines whether a recoverable repair resource exists.
As in step S403c above.
S607, the CPU compares the fault severity of the target cache block and the historical fault cache block.
S608, the CPU determines whether the fault severity of the historical fault cache block is smaller than that of the target cache block.
S609, when such a historical fault cache block exists, the CPU determines that a recoverable repair resource exists.
Reference may be made to step S403c above.
When not present, step S611 is executed, and PCLS repair is not executed.
S610, the CPU executes PCLS repairing tasks.
S612, the CPU updates the correspondence between the repair resources and the failed cache blocks.
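The CPU-side portion of the flow (S605 to S612) may be summarized by the following sketch. The class `PclsState`, the function `handle_repair_request`, and the representation of fault severity as a number are assumptions for illustration; the out-of-band diagnosis in S601 to S604 is assumed to have already produced the target cache block and its fault severity.

```python
from dataclasses import dataclass, field


@dataclass
class PclsState:
    """Hypothetical bookkeeping for replacement cache blocks."""
    free_spares: list                           # ids of free replacement blocks
    in_use: dict = field(default_factory=dict)  # spare id -> (failed block, severity)


def handle_repair_request(state, target, severity):
    """Return the replacement block id used for the target, or None."""
    # S605 / S610: a free repair resource exists, so PCLS repair is executed.
    if state.free_spares:
        spare = state.free_spares.pop()
        state.in_use[spare] = (target, severity)  # S612
        return spare
    # S606-S608: look for historical faults that are less severe than the target.
    candidates = [s for s, (_, sev) in state.in_use.items() if sev < severity]
    if not candidates:
        return None  # S611: PCLS repair is not executed
    # S609: recover the resource that repairs the least severe historical fault.
    spare = min(candidates, key=lambda s: state.in_use[s][1])
    state.in_use[spare] = (target, severity)  # S610 and S612
    return spare


state = PclsState(free_spares=[0])
print(handle_repair_request(state, "X", 6))  # 0, free resource used
print(handle_repair_request(state, "Y", 8))  # 0, recovered from X
print(handle_repair_request(state, "Z", 3))  # None, nothing recoverable
```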
In the above scheme, the CPU is configured to execute the repair action logic and the recovery action logic, and the out-of-band controller is configured to execute the fault diagnosis logic. The repair action logic is used for responding to faults generated by the hardware, reporting fault information to the fault diagnosis logic, and processing fault repair tasks issued by the fault diagnosis logic. The recovery action logic is used for recovering the specified repair resources so that subsequent repair tasks can be completed. The fault diagnosis logic is used for receiving the reported fault information and identifying fault characteristics.
By executing the above method, the out-of-band controller and the CPU respectively execute fault diagnosis and fault repair, and an available replacement cache block is determined for the target cache block from among the replacement cache blocks occupied by historical fault cache blocks. This helps to perform repair according to the fault severity of the failed cache block, so that limited repair resources are reasonably used to repair more serious faults and the utilization of memory fault repair resources is improved.
The above description has been directed primarily to the embodiments of the present application from a methodological perspective. It is to be understood that the memory failure handling apparatus includes at least one of a hardware structure and a software module corresponding to each function in order to implement the above functions. Those of skill in the art will readily appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as hardware or combinations of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiment of the present application, the memory failure processing apparatus may be divided into the functional units according to the above method example, for example, each functional unit may be divided corresponding to each function, or two or more functions may be integrated into one processing unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit. It should be noted that the division of the unit in the embodiment of the present application is schematic, and is only a logic function division, and there may be another division manner in actual implementation.
Fig. 7 shows a schematic diagram of a possible structure of the memory failure processing apparatus according to the foregoing embodiment, in a case where each functional module is divided according to each function. As shown in fig. 7, the memory failure processing apparatus 70 includes a receiving unit 701 and a processing unit 702.
A receiving unit 701, configured to receive a repair request sent by an out-of-band controller; the repair request is used for requesting fault repair of a target cache block in the memory, and the repair request carries the fault severity of the target cache block.
A processing unit 702, configured to replace a target cache block with a free replacement cache block when the free replacement cache block exists.
A processing unit 702, further configured to determine a target replacement cache block based on the fault severity of the target cache block when there is no free replacement cache block; the target replacement cache block is a cache block which currently replaces a historical fault cache block, and the fault severity of the target cache block is greater than or equal to the fault severity of the historical fault cache block; and to replace the target cache block with the target replacement cache block.
In an example, the processing unit 702 is further configured to, when the target cache block is successfully repaired with a free replacement cache block, save a correspondence between the target cache block and a replacement cache block that replaces the target cache block; and when the target cache block is successfully repaired by adopting the target replacement cache block, updating the corresponding relation between the target replacement cache block and the historical fault cache block into the corresponding relation between the target replacement cache block and the target cache block.
In one example, the target replacement cache block is the cache block that currently replaces the historical failing cache block with the least severity of failure.
In one example, the severity of the failure of the target cache block is the number of times of failure occurrence of the target cache block within a preset time period, where the preset time period is a time period from the start of the server where the memory is located to the present.
In one example, the fault severity of the target cache block is an output of a fault model, the fault model being used to determine the fault severity based on fault information of the target cache block, the fault information including at least one of location information, time of occurrence of the fault, and number of occurrences of the fault.
For all relevant contents of the steps in the above method embodiment, reference may be made to the functional descriptions of the corresponding functional modules, and details are not described herein again.
Certainly, the memory failure processing apparatus provided in the embodiment of the present application includes, but is not limited to, the above units, for example: a storage unit 703 may also be included.
The storage unit 703 may be used to store program codes and data of the memory failure handling apparatus.
Embodiments of the present application also provide a computer-readable storage medium having a computer program stored thereon, which, when run on a computer, causes the computer to perform any one of the methods provided above.
For the explanation and the description of the beneficial effects of any of the computer-readable storage media provided above, reference may be made to the corresponding embodiments described above, and details are not repeated here.
The embodiment of the application also provides a chip. The chip integrates a control circuit and one or more ports for realizing the functions of the memory fault processing device. Optionally, for the functions supported by the chip, reference may be made to the above description, and details are not described herein again. Those skilled in the art will appreciate that all or part of the steps for implementing the above embodiments may be implemented by a program instructing the associated hardware to perform the steps. The program may be stored in a computer-readable storage medium. The above-mentioned storage medium may be a read-only memory, a random access memory, or the like. The processing unit or processor may be a central processing unit, a general purpose processor, an application specific integrated circuit (ASIC), a digital signal processor (DSP), a field programmable gate array (FPGA) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof.
Embodiments of the present application further provide a computer program product containing instructions, which when executed on a computer, cause the computer to execute any one of the methods in the foregoing embodiments. The computer program product includes one or more computer instructions. The procedures or functions according to the embodiments of the present application are all or partially generated when the computer program instructions are loaded and executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored on a computer readable storage medium or transmitted from one computer readable storage medium to another computer readable storage medium, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wire (e.g., coaxial cable, fiber optic, digital subscriber line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). Computer-readable storage media can be any available media that can be accessed by a computer or can comprise one or more data storage devices, such as servers, data centers, and the like, that can be integrated with the media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., SSD), among others.
It should be noted that the devices provided in the embodiments of the present application for storing computer instructions or computer programs, such as, but not limited to, the above memories, computer-readable storage media, and communication chips, are all non-volatile.
In the above embodiments, all or part of the implementation may be realized by software, hardware, firmware, or any combination thereof. When implemented using a software program, the embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions according to the embodiments of the present application are all or partially generated when the computer program instructions are loaded and executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored on a computer readable storage medium or transmitted from one computer readable storage medium to another computer readable storage medium, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wire (e.g., coaxial cable, fiber optic, digital subscriber line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). Computer-readable storage media can be any available media that can be accessed by a computer or can comprise one or more data storage devices, such as servers, data centers, and the like, that can be integrated with the media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)), among others.
While the present application has been described in connection with various embodiments, other variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed application, from a review of the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps, and the word "a" or "an" does not exclude a plurality. A single processor or other unit may fulfill the functions of several items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
Although the present application has been described in conjunction with specific features and embodiments thereof, it will be evident that various modifications and combinations may be made thereto without departing from the spirit and scope of the application. Accordingly, the specification and figures are merely exemplary of the present application as defined in the appended claims and are intended to cover any and all modifications, variations, combinations, or equivalents within the scope of the present application. It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (7)

1. A memory fault handling method is characterized by comprising the following steps:
a Central Processing Unit (CPU) receives a repair request sent by an out-of-band controller; the repair request is used for requesting fault repair of a target cache block in a memory, and the repair request carries the fault severity of the target cache block;
when a free replacement cache block exists, the CPU replaces the target cache block with the free replacement cache block;
when there is no free replacement cache block, the CPU determining a target replacement cache block based on a severity of a failure of the target cache block; the target replacement cache block is a cache block which currently replaces a historical fault cache block, and the fault severity of the target cache block is greater than or equal to the fault severity of the historical fault cache block; and the CPU replaces the target cache block with the target replacement cache block.
2. The method of claim 1, further comprising:
when the target cache block is successfully repaired by using the free replacement cache block, storing the correspondence between the target cache block and the replacement cache block that replaces the target cache block;
and when the target cache block is successfully repaired by adopting the target replacement cache block, updating the corresponding relation between the target replacement cache block and the historical fault cache block into the corresponding relation between the target replacement cache block and the target cache block.
3. The method of claim 1 or 2, wherein the target replacement cache block is a cache block that currently replaces a historical failing cache block with a least severity of failure.
4. The method according to any one of claims 1 to 3, wherein the fault severity of the target cache block is the number of fault occurrences of the target cache block within a preset time period, and the preset time period is the time period from the start of the server in which the memory is located to the present.
5. The method according to any one of claims 1 to 3, wherein the fault severity of the target cache block is an output result of a fault model, the fault model is used to determine the fault severity based on fault information of the target cache block, and the fault information comprises at least one of location information, fault occurrence time, and number of fault occurrences.
6. A memory fault processing device, characterized by comprising a memory and a processor; the memory is used for storing program codes; the processor is configured to invoke the program code to perform the method of any of claims 1-5.
7. A computer-readable storage medium, comprising program code which, when run on a computer or processor, causes the computer or processor to perform the method of any of claims 1-5.
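The selection logic recited in claims 1 to 4 can be illustrated with a minimal sketch. It is not the patented implementation: the class `RepairTable`, its fields, and the method names are hypothetical, the fault severity is assumed to be the fault count since server start (as in claim 4), and the real replacement is performed by the CPU rather than by application code.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class RepairTable:
    """Hypothetical bookkeeping for replacement cache blocks (illustrative only)."""
    free_replacements: List[int]                                 # idle replacement block ids
    in_use: Dict[int, int] = field(default_factory=dict)         # replacement id -> faulty block id
    fault_counts: Dict[int, int] = field(default_factory=dict)   # block id -> faults since start

    def record_fault(self, block: int) -> None:
        """Count one more fault for a block (severity as in claim 4, by assumption)."""
        self.fault_counts[block] = self.fault_counts.get(block, 0) + 1

    def severity(self, block: int) -> int:
        return self.fault_counts.get(block, 0)

    def repair(self, target: int) -> Optional[int]:
        """Return the replacement block now covering `target`, or None if not repaired."""
        # Idle replacement available: use it and store the correspondence (claim 2).
        if self.free_replacements:
            repl = self.free_replacements.pop()
            self.in_use[repl] = target
            return repl
        if not self.in_use:
            return None
        # No idle replacement: pick the replacement whose historical fault
        # block has the lowest severity (claim 3).
        repl = min(self.in_use, key=lambda r: self.severity(self.in_use[r]))
        historical = self.in_use[repl]
        # Reassign only if the target is at least as severe (claim 1).
        if self.severity(target) < self.severity(historical):
            return None
        # Update the correspondence to point at the new target (claim 2).
        self.in_use[repl] = target
        return repl


table = RepairTable(free_replacements=[100])
for block, faults in {1: 3, 2: 1, 3: 2}.items():
    for _ in range(faults):
        table.record_fault(block)

first = table.repair(2)    # uses the idle replacement block
second = table.repair(3)   # block 3 is more severe than block 2, so it takes over
```

In this sketch, ties between equally severe historical fault blocks are broken by dictionary insertion order, which is an arbitrary choice the claims do not specify.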
CN202210911674.1A 2022-07-30 2022-07-30 Memory fault processing method and device and storage medium Pending CN115421947A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210911674.1A CN115421947A (en) 2022-07-30 2022-07-30 Memory fault processing method and device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210911674.1A CN115421947A (en) 2022-07-30 2022-07-30 Memory fault processing method and device and storage medium

Publications (1)

Publication Number Publication Date
CN115421947A true CN115421947A (en) 2022-12-02

Family

ID=84196119

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210911674.1A Pending CN115421947A (en) 2022-07-30 2022-07-30 Memory fault processing method and device and storage medium

Country Status (1)

Country Link
CN (1) CN115421947A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024027325A1 (en) * 2022-07-30 2024-02-08 XFusion Digital Technologies Co., Ltd. Memory fault handling methods and apparatuses, and storage medium

Similar Documents

Publication Publication Date Title
JP7158586B2 (en) Hard disk failure prediction method, apparatus and storage medium
CN100363907C (en) Self test method and apparatus for identifying partially defective memory
US8108724B2 (en) Field replaceable unit failure determination
US4209846A (en) Memory error logger which sorts transient errors from solid errors
US20220148674A1 (en) Memory fault handling method and apparatus, device, and storage medium
CN114968652A (en) Fault processing method and computing device
US20230185659A1 (en) Memory Fault Handling Method and Apparatus
Du et al. Predicting uncorrectable memory errors for proactive replacement: An empirical study on large-scale field data
WO2024027325A1 (en) Memory fault handling methods and apparatuses, and storage medium
CN115421947A (en) Memory fault processing method and device and storage medium
CN115394344A (en) Method and device for determining memory fault repair mode and storage medium
US8176388B1 (en) System and method for soft error scrubbing
CN114860487A (en) Memory fault identification method and memory fault isolation method
CN117668706A (en) Method and device for isolating memory faults of server, storage medium and electronic equipment
CN115080331A (en) Fault processing method and computing device
WO2022217795A1 (en) Method and apparatus for repairing fail location
JP3931757B2 (en) Shared cache memory failure handling method
CN114020525A (en) Fault isolation method, device, equipment and storage medium
CN210136722U (en) Memory device
KR20210158317A (en) Storage device block-level failure prediction-based data placement
CN113391937A (en) Method, electronic device and computer program product for storage management
CN115269245B (en) Memory fault processing method and computing device
CN115658358A (en) Memory fault processing method and computer equipment
CN115391072A (en) Memory fault processing method, system and storage medium
CN115391075A (en) Memory fault processing method, system and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20231110

Address after: 450046, 10th Floor, North Chuangzhi Tiandi Building, Shigeng Street, Longzihu Wisdom Island Middle Road East, Zhengdong New District, Zhengzhou City, Henan Province

Applicant after: Henan Kunlun Technology Co.,Ltd.

Address before: 450046 Floor 9, building 1, Zhengshang Boya Plaza, Longzihu wisdom Island, Zhengdong New Area, Zhengzhou City, Henan Province

Applicant before: XFusion Digital Technologies Co., Ltd.
