CN116483630A - Memory fault repairing method

Info

Publication number: CN116483630A
Application number: CN202310258520.1A
Authority: CN
Inventors: 墨芹; 张光彪; 陈祝荣
Original assignee: XFusion Digital Technologies Co Ltd
Current assignee: XFusion Digital Technologies Co Ltd
Priority date: 2023-03-16
Filing date: 2023-03-16
Publication date: 2023-07-25

Abstract

The application discloses a memory fault repairing method, relates to the technical field of memory fault repairing, and is used for improving the success rate of memory fault repairing and avoiding downtime of a system. The method comprises the following steps: acquiring fault information, wherein the fault information comprises a fault address; acquiring redundant resource information based on the fault address, wherein the redundant resource information indicates whether redundant resources corresponding to faults exist or not; and when the redundant resource information indicates that the redundant resource corresponding to the fault exists, repairing the fault by using the redundant resource.

Description

Memory fault repairing method

Technical Field

The application relates to the technical field of memory fault repair, in particular to a memory fault repair method.

Background

Memory is an indispensable component of a computing device during operation and processing, and the computing device uses it to acquire or store data. When the memory fails, the overall operation of the computing device may be affected. Currently, in order to repair faults occurring in a memory, a memory repair mode called post package repair (PPR) has been proposed, in which PPR redundant resources replace the resources affected by a row fault in the memory, so that the computing device reads and writes data through the PPR redundant resources. However, in the current repair process, the computing device cannot learn how many PPR redundant resources remain, so that when the PPR redundant resources are used up, the fault cannot be repaired through PPR, which results in repair failure and affects the normal operation of the computing device.

Disclosure of Invention

The embodiment of the application provides a memory fault repairing method which is used for improving the success rate of memory fault repairing and avoiding downtime of a system.

In order to achieve the above purpose, the embodiments of the present application adopt the following technical solutions:

In a first aspect, a memory fault repair method is provided, where the method includes: acquiring fault information, wherein the fault information comprises a fault address of a fault; acquiring redundant resource information based on the fault address, wherein the redundant resource information indicates whether a redundant resource corresponding to the fault exists; and when the redundant resource information indicates that the redundant resource corresponding to the fault exists, repairing the fault by using the redundant resource.

When PPR is adopted for memory fault repair, the computing device cannot learn how many PPR redundant resources remain, so that when the PPR redundant resources are used up, the fault cannot be repaired through PPR, the repair fails, and the normal operation of the computing device is affected. The method provided by the embodiments of the application helps the computing device learn the remaining redundant resources and perform the repair only after determining that a redundant resource exists, which avoids executing invalid operations and wasting computing resources; further, the repair efficiency of memory faults is improved, and system downtime is avoided.

In one possible implementation, obtaining redundant resource information based on a failure address includes: determining a fault granularity of the fault based on the fault address; wherein the failure granularity comprises a row failure; redundant resource information is obtained based on the failure granularity.

The possible implementation manner is beneficial to acquiring redundant resource information corresponding to the fault granularity after determining the fault granularity of the fault, and is beneficial to determining a corresponding repair mode according to the redundant resource information, so that the fault repair efficiency is improved.

In one possible implementation, when the redundant resource information indicates that there is a redundant resource corresponding to the fault, repairing the fault with the redundant resource includes: adopting a first repairing mode, and repairing the fault by using redundant resources; the redundant resources in the first repair mode are recyclable redundant resources.

In this possible implementation, after a row fault has been repaired with redundant resources, those redundant resources can subsequently be reclaimed and applied to the repair of other faults, which improves the utilization efficiency of the redundant resources.

In one possible implementation, the method further includes: restarting a basic input output system BIOS; adopting a second repairing mode, and repairing the fault by using redundant resources; the redundant resources in the second repair mode are non-recoverable redundant resources.

This possible implementation helps to permanently repair the row failure with redundant resources, thereby avoiding the failure affecting the operational state of the computing device.
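The difference between the two repair modes can be sketched as simple resource bookkeeping. The following is a minimal illustration only: the class `RedundantRowPool`, its method names, and the assumption that both modes draw from a single per-bank pool are hypothetical and are not taken from the application.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class RedundantRowPool:
    """Hypothetical bookkeeping for the redundant rows of one bank."""
    free: int                                            # redundant rows still available
    recoverable: Set[int] = field(default_factory=set)   # rows repaired in the first mode
    permanent: Set[int] = field(default_factory=set)     # rows repaired in the second mode

    def repair_first_mode(self, row: int) -> bool:
        """First mode: repair with a redundant resource that can be reclaimed later."""
        if self.free == 0:
            return False
        self.free -= 1
        self.recoverable.add(row)
        return True

    def reclaim(self, row: int) -> None:
        """Return a first-mode resource to the pool so another fault can use it."""
        self.recoverable.remove(row)
        self.free += 1

    def repair_second_mode(self, row: int, bios_restarted: bool) -> bool:
        """Second mode: permanent repair, modeled here as allowed only after a BIOS restart."""
        if not bios_restarted or self.free == 0:
            return False
        self.free -= 1
        self.permanent.add(row)
        return True
```

In this sketch a first-mode repair can be undone with `reclaim`, whereas a second-mode repair has no corresponding release operation, which mirrors the recoverable and non-recoverable distinction described above.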

In one possible implementation, when the redundant resource information indicates that the redundant resource corresponding to the fault exists, repairing the fault by using the redundant resource includes: determining the severity of the fault according to the fault address; and when the severity is greater than a preset severity, repairing the fault by using the redundant resource.

In this possible implementation, the severity of the fault is determined, and the fault is repaired by redundant-resource replacement only when the severity is high. This improves the rational utilization of redundant resources and avoids wasting limited redundant resources, which would otherwise leave serious faults unable to be repaired in time.

In one possible implementation, when the redundant resource information indicates that there is a redundant resource corresponding to the fault, repairing the fault with the redundant resource includes: determining the residual quantity of the redundant resources according to the redundant resource information; and when the residual quantity of the redundant resources is larger than a preset threshold value, repairing the fault by using the redundant resources.

The possible implementation manner provides a judging condition for repairing by adopting the redundant resources, and by judging the residual quantity of the redundant resources, under the condition that the redundant resources are sufficient, the redundant resources are directly adopted for repairing, and under the condition that the redundant resources are insufficient, the severity degree can be combined or other existing repairing modes can be adopted for repairing, so that the utilization efficiency of the redundant resources is improved.
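The two judging conditions above (remaining quantity and severity) can be combined into one decision function. The sketch below is illustrative only: the names `RepairPolicy` and `should_repair_with_redundancy`, the integer severity scale, and the way the two conditions are combined are assumptions rather than details given in the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepairPolicy:
    """Hypothetical container for the two preset values mentioned in the text."""
    severity_threshold: int    # the "preset severity"
    remaining_threshold: int   # the "preset threshold" for remaining redundant resources


def should_repair_with_redundancy(remaining: int, severity: int, policy: RepairPolicy) -> bool:
    """Decide whether to spend a redundant resource on this fault."""
    if remaining <= 0:
        # No redundant resource exists, so this repair mode cannot succeed.
        return False
    if remaining > policy.remaining_threshold:
        # Redundant resources are sufficient: repair directly.
        return True
    # Redundant resources are scarce: reserve them for severe faults.
    return severity > policy.severity_threshold
```

For example, with `RepairPolicy(severity_threshold=5, remaining_threshold=2)`, a fault of severity 3 is repaired when 4 redundant resources remain but not when only 1 remains.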

In a second aspect, a memory fault repair method is provided, applied to a processor, and the method includes: the processor acquires fault information, wherein the fault information comprises a fault address of a fault; the processor acquires redundant resource information based on the fault address, wherein the redundant resource information indicates whether a redundant resource corresponding to the fault exists; the processor sends the fault information and the redundant resource information to the out-of-band controller so as to instruct the out-of-band controller to determine, based on the redundant resource information, a repair mode for repairing the fault; the processor repairs the fault based on the repair mode.

When PPR is adopted for memory fault repair, the computing device cannot learn how many PPR redundant resources remain, so that when the PPR redundant resources are used up, the fault cannot be repaired through PPR, the repair fails, and the normal operation of the computing device is affected. The method provided by the embodiments of the application helps a processor in the computing device learn the remaining PPR redundant resources and feed them back to the out-of-band controller, and the out-of-band controller performs PPR repair only when a redundant resource exists, which avoids executing invalid operations and wasting computing resources; further, the repair efficiency of memory faults is improved, and system downtime is avoided.

In one possible implementation, the processor obtains redundant resource information based on the failure address, including: the processor determines the fault granularity of the fault based on the fault address; wherein the failure granularity comprises a row failure; the processor obtains redundant resource information based on the failure granularity.

In one possible implementation, the processor repairs the fault based on a repair method, including: the processor adopts a first repairing mode, and the fault is repaired by using redundant resources, wherein the redundant resources in the first repairing mode are recyclable redundant resources.

In one possible implementation, the method further includes: restarting the basic input output system BIOS by the processor; the processor adopts a second repairing mode, and the fault is repaired by using redundant resources; the redundant resources in the second repair mode are non-recoverable redundant resources.

In a third aspect, a memory fault repair method is provided, applied to an out-of-band controller, and the method includes: the out-of-band controller receives fault information and redundant resource information, wherein the fault information comprises a fault address of a fault, and the redundant resource information indicates whether redundant resources corresponding to the fault exist or not; the out-of-band controller determines a repairing mode adopted for repairing the fault based on the redundant resource information; the out-of-band controller instructs the processor to repair the fault based on the repair mode.

In one possible implementation, the out-of-band controller determining a repair mode for repairing the fault based on the redundant resource information includes: the out-of-band controller determines the severity of the fault according to the fault address; and when the severity is greater than a preset severity, the out-of-band controller determines to repair the fault by using the redundant resource.

In one possible implementation, the determining, by the out-of-band controller, a repair manner for repairing the fault based on the redundant resource information includes: the out-of-band controller determines the residual quantity of the redundant resources according to the redundant resource information; and when the residual quantity of the redundant resources is larger than a preset threshold value, the out-of-band controller determines to repair the fault by using the redundant resources.
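The division of work between the processor (second aspect) and the out-of-band controller (third aspect) can be sketched as a small message exchange. This is a simplified illustration: the classes, the `FaultReport` message, and the `redundancy_lookup` callback are hypothetical names, and the real interface between the two components is not specified here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Tuple


class RepairMode(Enum):
    REDUNDANT_RESOURCE = "redundant_resource"   # repair by replacing with a redundant resource
    OTHER = "other"                             # any other existing repair mode


@dataclass(frozen=True)
class FaultReport:
    """Hypothetical message sent from the processor to the out-of-band controller."""
    fault_address: int
    has_redundant_resource: bool


class OutOfBandController:
    def decide(self, report: FaultReport) -> RepairMode:
        """Determine the repair mode based on the redundant resource information."""
        if report.has_redundant_resource:
            return RepairMode.REDUNDANT_RESOURCE
        return RepairMode.OTHER


class Processor:
    def __init__(self, controller: OutOfBandController,
                 redundancy_lookup: Callable[[int], bool]) -> None:
        self.controller = controller
        self.redundancy_lookup = redundancy_lookup   # stands in for reading the memory registers
        self.applied: List[Tuple[int, RepairMode]] = []

    def handle_fault(self, fault_address: int) -> RepairMode:
        # Acquire redundant resource information based on the fault address.
        report = FaultReport(fault_address, self.redundancy_lookup(fault_address))
        # Send both pieces of information to the controller and receive the repair mode.
        mode = self.controller.decide(report)
        # Repair the fault based on the repair mode (recorded only, in this sketch).
        self.applied.append((fault_address, mode))
        return mode
```

The controller never reads the memory itself in this sketch; it decides purely from what the processor reports, which is the point of sending the redundant resource information along with the fault information.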

In a fourth aspect, a memory fault repair apparatus is provided, including functional units for executing any of the methods provided in the first aspect, where the actions executed by the respective functional units are implemented by hardware or by hardware executing corresponding software. For example, the apparatus may include an acquisition unit and a processing unit. The acquisition unit is used for acquiring fault information, wherein the fault information comprises a fault address of a fault; the acquisition unit is also used for acquiring redundant resource information based on the fault address, wherein the redundant resource information indicates whether a redundant resource corresponding to the fault exists; and the processing unit is used for repairing the fault by using the redundant resource when the redundant resource information indicates that the redundant resource corresponding to the fault exists.

In a fifth aspect, a memory fault repair apparatus is provided, including functional units for performing any of the methods provided in the second aspect, where the actions performed by the respective functional units are implemented by hardware or by hardware executing corresponding software. For example, the apparatus may include an acquisition unit, a sending unit, and a processing unit. The acquisition unit is used for acquiring fault information, wherein the fault information comprises a fault address of a fault; the acquisition unit is also used for acquiring redundant resource information based on the fault address, wherein the redundant resource information indicates whether a redundant resource corresponding to the fault exists; the sending unit is used for sending the fault information and the redundant resource information to the out-of-band controller so as to instruct the out-of-band controller to determine, based on the redundant resource information, a repair mode for repairing the fault; and the processing unit is used for repairing the fault based on the repair mode.

In a sixth aspect, a memory failure repair device is provided, including: functional units for performing any of the methods provided in the third aspect, the actions performed by the respective functional units are implemented by hardware or by hardware executing corresponding software. For example, the apparatus may include: a receiving unit, a determining unit and a transmitting unit. The receiving unit is used for receiving fault information and redundant resource information, wherein the fault information comprises a fault address of a fault, and the redundant resource information indicates whether redundant resources corresponding to the fault exist or not; the determining unit is used for determining a repairing mode adopted by repairing the fault based on the redundant resource information; and the sending unit is used for indicating the processor to repair the fault based on the repair mode.

In a seventh aspect, a processor is provided, configured to: acquire fault information, wherein the fault information comprises a fault address of a fault; acquire redundant resource information based on the fault address, wherein the redundant resource information indicates whether a redundant resource corresponding to the fault exists; and when the redundant resource information indicates that the redundant resource corresponding to the fault exists, repair the fault by using the redundant resource.

In an eighth aspect, a computing device is provided, including a processor and an out-of-band controller, the processor being electrically connected to the out-of-band controller. The processor is used for acquiring fault information, wherein the fault information comprises a fault address of a fault, and for acquiring redundant resource information based on the fault address, wherein the redundant resource information indicates whether a redundant resource corresponding to the fault exists; the processor is also used for sending the fault information and the redundant resource information to the out-of-band controller; the out-of-band controller is used for receiving the fault information and the redundant resource information; the out-of-band controller is also used for determining a repair mode for repairing the fault; the out-of-band controller is also used for sending the repair mode to the processor; and the processor is also used for receiving the repair mode and repairing the fault based on the repair mode.

In a ninth aspect, there is provided a chip comprising: a processor and interface circuit; the interface circuit is used for receiving the code instruction and transmitting the code instruction to the processor; a processor for executing code instructions to perform any one of the methods provided in the first to third aspects.

In a tenth aspect, there is provided a computer readable storage medium comprising computer executable instructions which, when run on a computer, cause the computer to perform any one of the methods provided in the first to third aspects.

In an eleventh aspect, there is provided a computer program product comprising computer-executable instructions which, when run on a computer, cause the computer to perform any one of the methods provided in the first to third aspects.

Technical effects caused by any implementation manner of the fourth aspect to the eleventh aspect may be referred to technical effects caused by corresponding implementation manners of the first aspect to the third aspect, and are not described herein.

Drawings

FIG. 1 is a system architecture diagram provided in an embodiment of the present application;

FIG. 2 is a flow chart of a memory fault repair method according to an embodiment of the present application;

FIG. 3 is a schematic diagram of a DDR5 memory structure according to an embodiment of the present application;

FIG. 4 is a flow chart of a memory fault repair method according to an embodiment of the present application;

FIG. 5 is a schematic diagram of a processor according to an embodiment of the present disclosure;

FIG. 6 is a schematic diagram of a processor according to an embodiment of the present disclosure;

FIG. 7 is a schematic diagram of a composition of an out-of-band controller according to an embodiment of the present application.

Detailed Description

In the description of the present application, "/" means "or" unless otherwise indicated; for example, A/B may mean A or B. "And/or" herein merely describes an association relationship between associated objects and means that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist together, or B exists alone. Furthermore, "at least one" means one or more, and "a plurality" means two or more. The terms "first," "second," and the like do not limit number or order of execution, and items labeled "first" and "second" are not necessarily different.

In this application, the terms "exemplary" or "such as" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "for example" should not be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete fashion.

FIG. 1 is a system architecture diagram provided in an embodiment of the present application. The system architecture diagram is an architecture diagram of a computing device. Referring to FIG. 1, the hardware portion of the computing device includes a processor, an out-of-band controller, and a memory, and the software portion mainly includes an out-of-band management module, processor firmware, and an operating system (OS) management unit. The OS management unit is located in the processor, the out-of-band management module is located in the out-of-band controller, and the processor firmware may be located in the processor (as shown in FIG. 1) or in a firmware chip (not shown in FIG. 1) outside the processor. The processor in the computing device is mainly used for meeting the service requirements of the user and can also be understood as a general central processing unit (CPU), and the out-of-band controller is mainly used for monitoring and maintaining the computing device. The processor and the out-of-band controller may be collectively referred to as a processor.

Processor firmware may also be referred to as a processor firmware program. Specifically, the processor firmware includes firmware such as the basic input output system (BIOS), the management engine (ME), microcode, or an intelligent management unit (IMU). The BIOS is the first software to run after the computing device is started; it configures the hardware at startup to prepare for the operation of the OS management unit.

It should be noted that the embodiments of the present application are not limited to the specific form of the processor firmware, and the above are only exemplary descriptions.

The out-of-band management module may be a management unit of a non-business module. For example, the out-of-band management module may be completely independent of the operating system of the computing device, may communicate with the processor firmware and the OS management unit through an out-of-band management interface of the out-of-band controller, and may remotely maintain and manage the computing device via a dedicated data channel. By way of example, the out-of-band management module may include a management unit for the operating state of the computing device, a management system in a management chip external to the processor, a baseboard management controller (BMC), a system management mode (SMM) module, and the like. The BMC may check or upgrade software in the processor even when the computing device is not booted.

It should be noted that, the embodiments of the present application are not limited to the specific form of the out-of-band management module, and the above is merely exemplary. In the following embodiments, only the out-of-band management module is taken as a BMC for illustration.

It should be noted that the above-mentioned out-of-band management module and part of the management unit or module and firmware included in the processor firmware are only examples. In fact, part of the management unit may also run in the computing device as a processor firmware program, e.g. SMM may also provide business services for the user, performing the relevant functions of the BIOS. Similarly, some of the processor firmware may also perform BMC related functions as management units for non-business modules, such as ME, IMU, etc.

The memory, also called internal memory or main memory, is mainly used for temporarily storing operation data in the computing device and data exchanged with an external memory (external storage for short) such as a hard disk. The memory and the memory controller in the processor communicate via a memory channel (channel). Currently, memory is typically installed in memory slots on a motherboard of the computing device in the form of memory modules, where each slot may serve as an identifier of the memory module plugged into it. When the memory fails, the memory module in the slot can be repaired or replaced.

The memory has at least one memory rank (rank), each memory rank is located on a surface of the memory, each memory rank includes at least one sub-memory rank (sub-memory rank), the memory rank or sub-memory rank includes a plurality of memory chips (devices), each memory chip is divided into a plurality of memory array groups (bank groups), each memory array group includes a plurality of memory arrays (bank), each memory array is divided into a plurality of memory cells (cells), each memory cell has a row (row) address and a column (column) address, and each memory cell includes one or more bits. That is, one memory cell may be located on the memory array whenever a row (row) and a column (column) on the memory array are specified. The minimum unit of the memory failure is a memory cell on the memory array. When at least two memory cells of the same row and different columns fail in one memory array, then a row failure in the memory array can be determined.

It will be appreciated that the number of memory cells in the same row and different columns that constitutes a row fault is not limited. A first threshold may be preset in the computing device, and when the number of such failed memory cells is greater than or equal to the first threshold, the memory cells are considered to constitute a row fault.

The embodiments of the present application do not limit the specific form of the computing device. For example, the computing device may be a terminal device or a network device. The terminal device may be referred to as a terminal, user equipment (UE), an access terminal, a subscriber unit, a subscriber station, a mobile station, a remote terminal, a mobile device, a user terminal, a wireless communication device, a user agent, or the like. The terminal device may be a mobile phone, an augmented reality (AR) device, a virtual reality (VR) device, a tablet, a notebook, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (PDA), or the like. The network device may specifically be a server or the like. The server may be a physical or logical server, which is not limited here.

The method provided in the embodiment of the present application is applicable to, but not limited to, dynamic random access memory (dynamic random access memory, DRAM), static random access memory (static random access memory, SRAM) and other memories, and the method in the embodiment of the present application is not limited to the type of memory.

It should be noted that the structure shown in FIG. 1 does not constitute a limitation of the computing device, and the computing device may include more or fewer components than those shown in FIG. 1, may combine some components, or may have a different arrangement of components.

It should be noted that, the system architecture and the application scenario described in the embodiments of the present application are for more clearly describing the technical solution of the embodiments of the present application, and do not constitute a limitation on the technical solution provided in the embodiments of the present application, and those skilled in the art can know that, with the evolution of the system architecture and the appearance of a new service scenario, the technical solution provided in the embodiments of the present application is equally applicable to similar technical problems.

Currently, most computing devices support checking the memory and correcting the errors that are detected, that is, repairing faults in the memory. For example, each time the memory performs a read or write task, an error checking and correcting (ECC) method is used to identify faults in the memory. The ECC method can identify and correct errors when only a few bits in the memory fail. Errors that can be corrected are referred to as corrected errors (CE), and may also be referred to as correctable faults. If the capability of the error correction algorithm is exceeded, for example, when a wide range of multi-bit failures exists in the memory, error correction fails, resulting in an uncorrected error (UCE), which may also be referred to as an uncorrectable fault. When a UCE occurs, the computing device may fail seriously, for example, crash, resulting in loss of data in the memory.
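The distinction between a corrected error and an uncorrected error can be illustrated with a textbook single-error-correcting, double-error-detecting (SECDED) Hamming code over 4 data bits. This is a teaching example chosen for brevity and is not the ECC scheme of any particular memory or of this application; the function names `encode` and `decode` and the status strings are hypothetical.

```python
from typing import List, Optional, Tuple


def encode(data: List[int]) -> List[int]:
    """Encode 4 data bits into 8 bits: index 0 is overall parity, 1..7 are Hamming positions."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4          # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4          # covers positions 4, 5, 6, 7
    word = [p1, p2, d1, p4, d2, d3, d4]
    p0 = 0
    for bit in word:
        p0 ^= bit              # overall parity over the 7 Hamming bits
    return [p0] + word


def decode(code: List[int]) -> Tuple[str, Optional[List[int]]]:
    """Return ("OK" | "CE" | "UCE", data); data is None when the error is uncorrectable."""
    code = list(code)
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position   # XOR of set positions is 0 for a valid codeword
    overall = 0
    for bit in code:
        overall ^= bit
    if syndrome == 0 and overall == 0:
        status = "OK"
    elif overall == 1:
        # Odd number of flipped bits: treat as a single-bit error and correct it.
        # A syndrome of 0 here means the overall parity bit (index 0) was flipped.
        code[syndrome] ^= 1
        status = "CE"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped, cannot correct.
        return "UCE", None
    return status, [code[3], code[5], code[6], code[7]]
```

Flipping one bit of an encoded word yields a "CE" result with the original data recovered, while flipping two bits yields "UCE", which is the situation the repair method aims to prevent.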

In order to avoid UCE, when the memory fails, the BIOS reports the failure information to the BMC, and the BMC determines the repair mode, thereby realizing timely repair of the memory failure. When the memory fails, the BMC can determine to adopt a PPR repair mode to instruct the BIOS to replace the failed resource by utilizing the redundant row resource in the memory, thereby realizing repair. However, in the current repair process, the computing device cannot learn the remaining condition of the PPR redundancy resource, so that when the PPR redundancy resource is used up, the fault repair cannot be completed through the PPR, and the repair fails, thereby affecting the normal operation of the computing device.

The application provides a memory fault repairing method, as shown in FIG. 2, comprising steps S201-S203.

S201, the computing device acquires fault information.

The fault information comprises a fault address of a fault. Illustratively, the failure address includes a row address and a column address of at least one memory cell.

In one possible implementation, when a fault occurs in a memory in the computing device, the fault address is reported to the processor for indicating the address of the fault in the memory.

Optionally, the computing device determines the failure granularity based on the failure address in the failure information. The fault granularity corresponds to the division mode of the memory resources. Illustratively, the granularity of faults includes cell faults, row faults, rank faults, and the like.

For example, a cell fault may be considered to occur in the memory when the fault address indicates one or more memory cells located in different rows and different columns. Alternatively, a row fault may be considered to occur in the memory when the fault address indicates multiple memory cells in different columns of the same row.

Optionally, the fault granularity may also be determined in combination with the number of memory cells. For example, a row fault is established if the number of failed memory cells in different columns of the same row reaches the first threshold. Similarly, a rank fault is established if the ratio of the number of failed memory cells to the total number of memory cells reaches a ratio threshold.
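The granularity rules above can be sketched as a small classifier. This is an illustration under stated assumptions: the function name `classify`, the default threshold values, and the order in which the rules are checked (rank first, then row, then cell) are hypothetical and are not fixed by the application.

```python
from collections import Counter
from enum import Enum
from typing import Set, Tuple


class Granularity(Enum):
    CELL = "cell"
    ROW = "row"
    RANK = "rank"


def classify(failed_cells: Set[Tuple[int, int]], total_cells: int,
             row_threshold: int = 2, rank_ratio: float = 0.5) -> Granularity:
    """Classify a fault from the set of failed (row, column) addresses.

    row_threshold plays the role of the "first threshold" and rank_ratio the
    "ratio threshold"; both default values are placeholders.
    """
    # Rank fault: failed cells make up a large enough share of all cells.
    if total_cells > 0 and len(failed_cells) / total_cells >= rank_ratio:
        return Granularity.RANK
    # Row fault: enough failed cells share one row (a set guarantees distinct columns).
    per_row = Counter(row for row, _col in failed_cells)
    if any(count >= row_threshold for count in per_row.values()):
        return Granularity.ROW
    # Otherwise the failed cells are scattered: cell fault.
    return Granularity.CELL
```

For example, two failed cells at (3, 1) and (3, 7) in a 1024-cell array are classified as a row fault, while failed cells at (1, 1) and (2, 2) are classified as a cell fault.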

In step S201, in the computing device shown in FIG. 1, the fault information may specifically be reported from the memory to the processor; further, the processor may also forward the fault information to the BMC.

S202, the computing device acquires redundant resource information based on the fault address.

Wherein the redundant resource information indicates whether there is a redundant resource corresponding to the failure.

The redundant resource information is recorded at bank granularity and represents the remaining redundant resources of each bank. In combination with the memory resource division described above, each bank contains multiple storage rows and is configured with one or more redundant rows; when a storage row in the bank fails, a redundant row can replace it to repair the row fault in the bank.

In one possible implementation, the computing device obtains redundant resource information in the memory, and determines whether a redundant resource corresponding to the failure exists based on the failure address. Specifically, the redundant resource information is stored in a register in the memory, and the computing device reads the redundant resource information recorded by each register. Further, according to the fault address, the remaining condition of redundant resources of the bank to which the fault belongs is determined.

In another possible implementation, the computing device obtains corresponding redundant resource information based on the failure granularity. Illustratively, when the failure granularity is a row failure, the redundant resource information is redundant row resource information.

In the memory, in order to avoid that faults occur in the memory to affect the overall operation of the computing device, for a row fault in the memory, redundant row resources are usually preset for replacing the memory resources with the row fault, and the repairing mode is PPR. For example, the processor modifies the read-write operation for the failed address to the read-write operation for the address of the redundant row resource, thereby achieving the repair of the memory failure.
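The redirection described above, where accesses to a failed row are served by a redundant row, can be sketched with a small remapping table. The sketch is a software model for illustration only; the class `Bank`, its methods, and the convention that redundant rows are numbered after the normal rows are assumptions, and in real memory the replacement is performed inside the device.

```python
from typing import Dict, List, Tuple


class Bank:
    """Hypothetical model of one bank with a few redundant rows."""

    def __init__(self, rows: int, spare_rows: int) -> None:
        self._cells: Dict[Tuple[int, int], int] = {}      # (physical row, column) -> value
        self._remap: Dict[int, int] = {}                  # failed row -> redundant row
        self._free_spares: List[int] = [rows + i for i in range(spare_rows)]

    def repair_row(self, row: int) -> bool:
        """Replace a failed row with a redundant row; False if none is left."""
        if row in self._remap:
            return True
        if not self._free_spares:
            return False
        self._remap[row] = self._free_spares.pop(0)
        return True

    def _physical(self, row: int) -> int:
        # Accesses to a repaired row are redirected to its redundant row.
        return self._remap.get(row, row)

    def write(self, row: int, col: int, value: int) -> None:
        self._cells[(self._physical(row), col)] = value

    def read(self, row: int, col: int) -> int:
        return self._cells.get((self._physical(row), col), 0)
```

After `repair_row(3)` succeeds, callers keep using row address 3 while the data actually lives in the redundant row. A second repair on a bank with a single redundant row returns `False`, which is the exhausted-resource case the method is designed to detect in advance.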

Specifically, the redundant resource information is stored in a register in the memory. The computing device reads the corresponding register based on the fault address in the fault information to obtain redundant resource information. The fault address includes a row address and a column address of at least one memory cell, and devices, a bank group, and a bank to which the row address and the column address belong.

In a practical application scenario, memory technology standards have advanced to the fifth generation of double data rate synchronous dynamic random access memory (DDR5). In DDR5, there are at most 8 bank groups, each bank group comprises 4 banks, and one 8-bit register can record the redundant resource information of the 8 banks in two bank groups. For example, each bit indicates the redundant resource information of one bank: "0" indicates that there is no redundant resource, and "1" indicates that there is a redundant resource.

For example, fig. 3 is a schematic diagram of a DDR5 memory architecture. The architecture includes bank groups 0 to 7, and each bank group includes bank 0, bank 1, bank 2, and bank 3. One 8-bit register can record the redundant resource information of each bank in two bank groups. For example, bank groups 0 to 7 in fig. 3 record redundant resource information through 4 registers, namely mode register (MR) 54, MR55, MR56, and MR57, where MR54 records the redundant resource information of each bank in bank groups 0 and 1. For example, "1000 0000" indicates that bank 0 in bank group 0 has redundant resources, while the remaining banks do not.

It should be understood that the DDR5 memory architecture shown in fig. 3 is only an example, and the resource distribution and division may be adjusted according to practical applications, which is not limited in this application. In addition, the redundant resource information described above may be stored in other ways, or in more or fewer registers.
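The per-bank bitmap of the fig. 3 example can be illustrated with a minimal Python sketch. The function names are hypothetical, and the bit order is an assumption: the text gives only the "1000 0000" example, from which the sketch takes the leftmost bit to be bank 0 of the lower-numbered bank group covered by a register.

```python
BANKS_PER_GROUP = 4
GROUPS_PER_REGISTER = 2
FIRST_REGISTER = 54  # MR54 covers bank groups 0 and 1 in the fig. 3 example


def register_for(bank_group: int) -> int:
    """Return the mode register number that records the given bank group."""
    return FIRST_REGISTER + bank_group // GROUPS_PER_REGISTER


def has_redundant_row(register_value: int, bank_group: int, bank: int) -> bool:
    """Check the bit for (bank_group, bank) in an 8-bit register value.

    Assumption: the leftmost (most significant) bit is bank 0 of the
    lower-numbered bank group, matching the "1000 0000" example.
    """
    index = (bank_group % GROUPS_PER_REGISTER) * BANKS_PER_GROUP + bank  # 0..7
    return bool((register_value >> (7 - index)) & 1)


mr54 = 0b1000_0000
print(register_for(0), has_redundant_row(mr54, 0, 0))  # 54 True
print(register_for(5), has_redundant_row(mr54, 1, 3))  # 56 False
```

Only the register that covers the faulty bank has to be read, which is the on-demand lookup the description relies on.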

In this way, the bank to which the fault belongs can be determined based on the fault address, and only the register that records the redundant resource information of that bank needs to be read. The redundant resource information is thus acquired on demand, which saves computing resources and improves the efficiency of reading the information.

Optionally, the redundant resource information includes the remaining amount of redundant resources corresponding to the bank. For example, in an 8-bit register, 4 bits represent a bank identifier, and the other 4 bits indicate the remaining amount of redundant resources corresponding to that bank identifier. In this way, a remaining amount that is not zero indicates that redundant resources exist for repairing a row fault.
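The identifier-plus-count layout can be sketched as follows. This is illustrative only: the text does not say which 4 bits hold the bank identifier, so the sketch assumes the high 4 bits, and `decode_remaining` is a hypothetical name.

```python
def decode_remaining(register_value: int) -> tuple[int, int]:
    """Split an 8-bit value into (bank identifier, remaining redundant rows).

    Assumption: the high 4 bits carry the bank identifier and the low
    4 bits carry the remaining amount; the source does not fix the order.
    """
    if not 0 <= register_value <= 0xFF:
        raise ValueError("expected an 8-bit value")
    return register_value >> 4, register_value & 0x0F


bank_id, remaining = decode_remaining(0b0011_0010)
print(bank_id, remaining, remaining > 0)  # 3 2 True
```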

With reference to the computing device shown in fig. 1, step S202 may specifically be implemented as follows: the processor, or the BIOS running in the processor, reads the redundant resource information recorded in the register in the memory based on the fault information; or, after the processor feeds back the fault information to the BMC, the BMC instructs the processor, or the BIOS running in the processor, to read the redundant resource information recorded in the register in the memory and then forward it to the BMC.

S203: When the redundant resource information indicates that a redundant resource corresponding to the fault exists, the computing device repairs the fault by using the redundant resource.

It will be appreciated that when a fault occurs in the memory, the CPU cannot read data from or write data to the one or more memory cells corresponding to the fault. Repairing the fault by using the redundant resource specifically means that the mapping from the CPU to the faulty storage resource is changed to a mapping from the CPU to the redundant resource, so that the CPU subsequently reads and writes the redundant resource instead, which repairs the memory fault. The mapping may be the location information of the storage unit that the CPU reads data from or writes data to.

When the fault granularity of the fault is a row fault, the fault is repaired by using PPR. PPR includes a first repair mode (soft PPR, abbreviated as sPPR) and a second repair mode (hard PPR, abbreviated as hPPR). The redundant resources in the first repair mode are recoverable, and the redundant resources in the second repair mode are non-recoverable.

It will be appreciated that PPR achieves fault repair by replacing the address of the row fault with the address of the redundant resource. When the first repair mode is used, the association between the replaced addresses is canceled after the computing device is restarted, so the redundant resource is recovered. In other words, the redundant resource no longer replaces the previously faulty resource and can be used again to repair other faults. When the second repair mode is used, restarting the computing device does not affect the association between the replaced addresses, and accordingly the redundant resource used for the repair cannot be recovered.
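The difference between the two repair modes can be modeled with a toy remapping table. The sketch below is illustrative only: `RowRemapper` and its methods are hypothetical names, and a real repair is carried out in the memory device rather than in a Python dictionary.

```python
class RowRemapper:
    """Toy model of row replacement; not a real memory-controller API."""

    def __init__(self, redundant_rows: list[int]) -> None:
        self.free = list(redundant_rows)  # redundant rows still available
        self.soft: dict[int, int] = {}    # sPPR mappings, canceled on restart
        self.hard: dict[int, int] = {}    # hPPR mappings, kept on restart

    def repair(self, faulty_row: int, permanent: bool) -> int:
        """Map faulty_row onto a free redundant row and return that row."""
        if not self.free:
            raise RuntimeError("no redundant row left")
        spare = self.free.pop(0)
        (self.hard if permanent else self.soft)[faulty_row] = spare
        return spare

    def resolve(self, row: int) -> int:
        """Return the row that is actually accessed for a requested row."""
        return self.hard.get(row, self.soft.get(row, row))

    def restart(self) -> None:
        """Cancel soft mappings and recover their redundant rows."""
        self.free.extend(self.soft.values())
        self.soft.clear()


remapper = RowRemapper([1000, 1001])
remapper.repair(7, permanent=False)  # first repair mode (sPPR)
remapper.repair(9, permanent=True)   # second repair mode (hPPR)
print(remapper.resolve(7), remapper.resolve(9))  # 1000 1001
remapper.restart()
print(remapper.resolve(7), remapper.resolve(9), remapper.free)  # 7 1001 [1000]
```

After the restart, row 7 is accessed directly again and its redundant row returns to the free list, while the replacement of row 9 persists.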

In one possible implementation, the computing device repairs the row fault according to a preset first repair mode or second repair mode. Specifically, if the computing device is preset to repair faults according to the first repair mode, then when a fault occurs the redundant resource is used for the repair and is recovered after the computing device is subsequently restarted. The recovered redundant resource can then be used to repair a fault that is reported again. If the computing device is preset to repair faults according to the second repair mode, the redundant resource is used to perform the fault repair after the computing device is restarted, and the redundant resource permanently replaces the address where the fault is located.

In another possible implementation, the computing device repairs the row fault according to a preset first repair mode and, after the BIOS is restarted, repairs the row fault again by using the second repair mode. Specifically, when a fault occurs, the computing device repairs the fault in the first repair mode and stores the fault address of the fault. After the computing device is restarted, it repairs the fault again in the second repair mode according to the stored fault address.
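The two-stage flow (first repair mode immediately, second repair mode after the restart) can be sketched as follows. This is a minimal sketch under stated assumptions: the JSON file stands in for whatever persistent storage the computing device uses, and `apply_sppr` / `apply_hppr` are hypothetical callbacks rather than real firmware interfaces.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def repair_now(row: int, store: Path, apply_sppr: Callable[[int], None]) -> None:
    """Repair in the first mode and remember the fault address for the next boot."""
    apply_sppr(row)
    pending = json.loads(store.read_text()) if store.exists() else []
    if row not in pending:
        pending.append(row)
    store.write_text(json.dumps(pending))


def repair_after_restart(store: Path, apply_hppr: Callable[[int], None]) -> list[int]:
    """Replay stored fault addresses in the second mode, then clear the record."""
    pending = json.loads(store.read_text()) if store.exists() else []
    for row in pending:
        apply_hppr(row)
    store.write_text("[]")
    return pending


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "pending_rows.json"  # hypothetical storage location
    soft_log: list[int] = []
    hard_log: list[int] = []
    repair_now(42, store, soft_log.append)                    # at fault time
    replayed = repair_after_restart(store, hard_log.append)   # after restart
print(soft_log, hard_log, replayed)  # [42] [42] [42]
```

Storing the address is what allows the second repair to target the same row even though the first repair's mapping was canceled by the restart.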

Optionally, the computing device may flexibly select between the first repair mode and the second repair mode. For example, the computing device determines the severity of the fault and, when the severity is greater than a preset severity, adopts the second repair mode. This avoids the situation in which, after a repair in the first repair mode, the fault again affects the operation of the computing device once the device is restarted.

With reference to the computing device shown in fig. 1, the repair mode in step S203 is generally preset in the BIOS of the processor, and the BIOS performs the fault repair. Either the processor determines, according to the redundant resource information, whether the BIOS is to perform the fault repair, or the BMC instructs the processor, based on the redundant resource information, whether the BIOS is to perform the fault repair.

At present, a memory fault is repaired in the first repair mode or the second repair mode preset in the BIOS. However, in the first repair mode, the redundant resource used for the repair is recovered after the BIOS is restarted, so the fault still exists after the restart. According to the technical solution provided in the embodiments of the present application, sPPR is used preferentially to repair the fault, and hPPR is used to repair the fault again after the restart. This prevents the memory fault from affecting the operation of the computing device after the restart, allows the repair mode to be configured flexibly, and improves fault repair efficiency.

Optionally, when the computing device determines, based on the redundant resource information, that no redundant resource exists, it applies an adaptive double device data correction (ADDDC) sparing or partial cache line sparing (PCLS) memory repair mode to the fault.

ADDDC sparing is used to repair faults at rank or bank granularity, and PCLS replaces a faulty storage unit with a redundant storage unit.

It should be noted that, in the practical application, when the granularity of the fault is smaller than that of the row fault, the fault repair can also be performed by adopting the above method, which is not limited. For example, when a single storage unit fails, the failure repair can be implemented by replacing the storage row in which the single storage unit is located.

In this way, the computing device has other repair modes available when it determines that PPR cannot be used to repair the memory fault. This avoids a repair failure, and the resulting impact on the operation of the computing device, when PPR is unavailable, and improves fault repair efficiency.

In step S202, in conjunction with the computing device shown in fig. 1, the BMC in the computing device may specifically determine a repair mode, and instruct the BIOS to execute repair based on the repair mode.

In practical applications, after obtaining the redundant resource information, the processor or the out-of-band controller may also directly determine the repair mode according to the redundant resource information and perform the repair, which is not limited.

Optionally, the step S203 specifically includes: and the computing equipment determines the severity degree of the fault according to the fault address, and when the severity degree of the fault is greater than the preset severity degree, the fault is repaired by using the redundant resource.

Wherein the severity of the fault is used to indicate the probability of the fault transitioning to UCE.

It can be understood that, under normal circumstances, the memory repairs a faulty storage unit and, if the repair succeeds, the fault is marked as a CE. If another storage unit in the same row as that storage unit is also faulty, then, to prevent the fault from changing from a CE to a UCE, the storage units in that row may be replaced with a redundant resource through PPR, which resolves the faults of the multiple storage units together.

In one possible implementation, when the failure is a row failure, the severity of the failure is determined based on the number of failed storage units in the same row address. Specifically, when the number of failed storage units in the same row address is greater than the second threshold, the severity of the failure is considered to be greater than the preset severity. That is, the preset severity is represented by the second threshold described above.

Further, whether a row fault is formed may be determined by comparing the number of faulty storage units in different columns of the same row with a first threshold. In that case, the second threshold is greater than or equal to the first threshold. That is, when the second threshold equals the first threshold, the severity of the row fault is considered to meet the preset severity as soon as the condition for forming a row fault is met; when the second threshold is greater than the first threshold, the severity is considered to meet the preset severity only if, in addition, the number of faulty storage units in different columns of the same row is greater than the second threshold.

In another possible implementation, when the fault is a row fault, the severity of the fault is determined based on the number of times faults occur in storage units in different columns of the same row. Specifically, the computing device obtains the occurrence time of each fault, determines from these times the number of fault occurrences within a preset time period, and considers the severity of the row fault to be greater than the preset severity when that number is greater than a third threshold.
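Both severity checks can be illustrated with a short sketch. The function names, the use of seconds as the time unit, and the treatment of the preset time as a window ending at the current time are all assumptions made for illustration.

```python
def severe_by_count(failed_columns: set[int], second_threshold: int) -> bool:
    """Severity check based on the number of faulty storage units in one row."""
    return len(failed_columns) > second_threshold


def severe_by_rate(fault_times: list[float], now: float,
                   window: float, third_threshold: int) -> bool:
    """Severity check based on fault occurrences within the preset time.

    Assumption: the preset time is a sliding window that ends at `now`.
    """
    recent = [t for t in fault_times if now - window <= t <= now]
    return len(recent) > third_threshold


# Three faulty columns in the same row against a second threshold of 2.
print(severe_by_count({3, 17, 42}, second_threshold=2))  # True
# Three faults in the last 10 seconds against a third threshold of 2.
print(severe_by_rate([10.0, 50.0, 55.0, 58.0], now=60.0,
                     window=10.0, third_threshold=2))  # True
```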

It should be noted that the above possible implementations are merely examples, and the computing device may also determine the severity of the fault by other means or other information of the fault, which is not limited.

Optionally, when the computing device determines that the severity of the fault is less than or equal to the preset severity, executing an ADDDC sparing or PCLS memory repair mode for the fault.

With reference to the computing device shown in fig. 1, in practical applications the severity of the fault may be determined by the processor or by the BMC in the computing device. Taking the BMC as an example, when the BMC determines that the severity is greater than the preset severity and that a redundant resource is available for the replacement, it instructs the BIOS to repair the fault with PPR. When the BMC determines that the severity of the fault is less than or equal to the preset severity, or that no redundant resource is available for the replacement, it instructs the BIOS to apply an ADDDC sparing or PCLS memory repair mode to the row fault.

Fig. 4 is a flowchart of another memory fault repair method provided in the present application. The method is applied to a processor and an out-of-band controller in a computing device, where a BIOS runs in the processor, and includes the following steps S401 to S406.

S401, the processor acquires fault information.

The fault information comprises a fault address of a fault.

S402, the processor acquires redundant resource information based on the fault address.

Wherein the redundant resource information indicates whether a redundant resource corresponding to the failure exists.

In the above steps S401 to S402, the specific implementation manner of the processor to acquire the failure information and the redundant resource information may refer to the above description about steps S201 to S202.

S403, the processor sends fault information and redundant resource information to the out-of-band controller.

Accordingly, the out-of-band controller receives the failure information and the redundant resource information.

S404, the out-of-band controller determines a repairing mode adopted for repairing the fault based on the redundant resource information.

In a first possible implementation, the out-of-band controller determines the repair mode according to the remaining redundant resources indicated by the redundant resource information. Specifically, if the redundant resource information indicates that a redundant resource corresponding to the row fault exists, the out-of-band controller determines to repair the fault by using the redundant resource; if the redundant resource information indicates that no redundant resource corresponding to the row fault exists, the out-of-band controller determines to repair the fault by using an ADDDC sparing or PCLS memory repair mode.

In a second possible implementation, the out-of-band controller determines the severity of the fault based on the fault address; and when the severity is greater than the preset severity, the out-of-band controller determines to repair the fault by using the redundant resource.

In a third possible implementation manner, the out-of-band controller determines the residual quantity of the redundant resource according to the redundant resource information; and when the residual quantity of the redundant resources is larger than a preset threshold value, the out-of-band controller determines to repair the fault by using the redundant resources.

In combination with the description about the redundant resource information, the out-of-band controller can acquire the residual quantity of the redundant resource according to the redundant resource information; or, the out-of-band controller may further instruct the processor to acquire the remaining amount of the redundant resource from the memory when determining that the redundant resource exists according to the redundant resource information.

Optionally, the second and third possible implementations may be considered together to determine the repair mode. Specifically, when the severity is greater than the preset severity, the out-of-band controller indicates repair by using the redundant resource regardless of whether the remaining amount of redundant resources is greater than the preset threshold; when the severity is less than or equal to the preset severity and the remaining amount of redundant resources is greater than the preset threshold, the out-of-band controller indicates repair by using the redundant resource; and when the severity is less than or equal to the preset severity and the remaining amount of redundant resources is less than or equal to the preset threshold, the out-of-band controller indicates repair by using an ADDDC sparing or PCLS memory repair mode.
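The combined decision can be written as a small function. The sketch below is illustrative: `Repair` and `choose_repair` are hypothetical names, severity is treated as a number for simplicity, and the initial check for a remaining amount of zero is an added assumption that follows the fallback described for the case in which no redundant resource exists.

```python
from enum import Enum


class Repair(Enum):
    PPR = "PPR with redundant resource"
    FALLBACK = "ADDDC sparing or PCLS"


def choose_repair(severity: int, preset_severity: int,
                  remaining: int, preset_threshold: int) -> Repair:
    """Combine the severity check with the remaining-amount check."""
    if remaining <= 0:
        # Assumption: with nothing left, PPR is impossible whatever the severity.
        return Repair.FALLBACK
    if severity > preset_severity:
        # High severity: use the redundant resource regardless of the threshold.
        return Repair.PPR
    if remaining > preset_threshold:
        # Low severity, but plenty of redundant resources remain.
        return Repair.PPR
    return Repair.FALLBACK


print(choose_repair(severity=5, preset_severity=3, remaining=1, preset_threshold=2).name)  # PPR
print(choose_repair(severity=2, preset_severity=3, remaining=4, preset_threshold=2).name)  # PPR
print(choose_repair(severity=2, preset_severity=3, remaining=2, preset_threshold=2).name)  # FALLBACK
```

Keeping the threshold check after the severity check reserves the last few redundant resources for high-severity faults.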

It can be appreciated that the implementations above allow the out-of-band controller to decide, under different conditions, whether to repair the row fault by using the redundant resource, which improves the flexibility of the solution. Taking both the severity and the remaining amount of redundant resources into account, faults of high severity are repaired preferentially, while faults of low severity are repaired with redundant resources only when a large amount remains and are otherwise repaired in other modes, which makes the solution more reasonable.

S405, the out-of-band controller sends the repairing mode to the processor.

Accordingly, the processor receives the repair pattern.

S406, the processor repairs the fault based on the repair mode.

Optionally, the processor repairs the fault in the first repair mode or the second repair mode. Alternatively, the processor preferentially repairs the fault in the first repair mode and, after the BIOS is restarted, repairs the fault again in the second repair mode.

In this way, the out-of-band controller can determine different fault repair modes for different scenarios, which improves the flexibility and efficiency of fault repair.
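The interaction in steps S401 to S406 can be sketched as two cooperating objects. This is a simplified illustration: all class and method names are hypothetical, the out-of-band controller applies only the first possible implementation of S404, and the repair in S406 is recorded rather than performed.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FaultInfo:
    bank_group: int
    bank: int
    row: int


class OutOfBandController:
    def determine_repair(self, fault: FaultInfo, has_redundant: bool) -> str:
        """S404: choose a repair mode from the redundant resource information."""
        return "PPR" if has_redundant else "ADDDC sparing or PCLS"


@dataclass
class Processor:
    controller: OutOfBandController
    redundant_by_bank: dict[tuple[int, int], bool]
    repairs: list[tuple[int, str]] = field(default_factory=list)

    def handle(self, fault: FaultInfo) -> str:
        # S401 is the arrival of `fault`; S402 looks up the bank's resources.
        key = (fault.bank_group, fault.bank)
        has_redundant = self.redundant_by_bank.get(key, False)
        # S403 to S405: send both pieces of information, receive the mode.
        mode = self.controller.determine_repair(fault, has_redundant)
        # S406: the repair is only recorded here, not performed.
        self.repairs.append((fault.row, mode))
        return mode


cpu = Processor(OutOfBandController(), {(0, 0): True, (0, 1): False})
print(cpu.handle(FaultInfo(0, 0, 120)))  # PPR
print(cpu.handle(FaultInfo(0, 1, 7)))    # ADDDC sparing or PCLS
```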

The foregoing description of the embodiments of the present application has been presented primarily from a method perspective. It is to be understood that the computing device, in order to implement the above-described functionality, includes at least one of corresponding hardware structures and software modules that perform the respective functions. Those of skill in the art will readily appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented as hardware or combinations of hardware and computer software. Whether a function is implemented as hardware or computer software driven hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

The embodiment of the application may divide the functional units of the computing device according to the above method example, for example, each functional unit may be divided corresponding to each function, or two or more functions may be integrated in one processing unit. The integrated units may be implemented in hardware or in software functional units. It should be noted that, in the embodiment of the present application, the division of the units is schematic, which is merely a logic function division, and other division manners may be implemented in actual practice.

Fig. 5 shows a possible schematic structural diagram of the processor involved in the above embodiments, in the case where functional modules are divided according to their corresponding functions. As shown in fig. 5, the processor 50 includes an obtaining unit 501 and a processing unit 502.

An obtaining unit 501, configured to obtain fault information, where the fault information includes a fault address of a fault.

The obtaining unit 501 is further configured to obtain redundant resource information based on the failure address, where the redundant resource information indicates whether a redundant resource corresponding to the failure exists.

And the processing unit 502 is configured to repair the fault by using the redundant resource when the redundant resource information indicates that the redundant resource corresponding to the fault exists.

In one example, the obtaining unit 501 is specifically configured to determine a fault granularity of the fault based on the fault address; wherein the failure granularity comprises a row failure; redundant resource information is obtained based on the failure granularity.

In one example, the processing unit 502 is specifically configured to repair the fault using the redundant resource in a first repair manner; the redundant resources in the first repair mode are recyclable redundant resources.

In one example, the processing unit 502 is further configured to restart a basic input output system BIOS; adopting a second repairing mode, and repairing the fault by using redundant resources; the redundant resources in the second repair mode are non-recoverable redundant resources.

In one example, the fault information includes a fault address of the fault, and the processing unit 502 is specifically configured to determine a severity of the fault according to the fault address; and when the severity is greater than the preset severity, repairing the fault by using the redundant resource.

In one example, the processing unit 502 is specifically configured to determine a remaining amount of the redundant resource according to the redundant resource information; and when the residual quantity of the redundant resources is larger than a preset threshold value, repairing the fault by using the redundant resources.

In one example, the processor 50 further includes a storage unit 503. The storage unit 503 is configured to store computer-executable instructions, and other units in the processor may perform corresponding actions according to the computer-executable instructions stored in the storage unit 503.

For a specific description of the above alternative modes, reference may be made to the foregoing method embodiments, and details are not repeated here. In addition, any explanation and description of the beneficial effects of the processor 50 provided above may refer to the corresponding method embodiments described above, and will not be repeated.

Fig. 6 shows another possible schematic structural diagram of the processor involved in the above embodiments, in the case where functional modules are divided according to their corresponding functions. As shown in fig. 6, the processor 60 includes an obtaining unit 601, a sending unit 602, and a processing unit 603.

An obtaining unit 601, configured to obtain fault information, where the fault information includes a fault address of a fault; the obtaining unit 601 is further configured to obtain redundant resource information based on the failure address, where the redundant resource information indicates whether a redundant resource corresponding to the failure exists; a sending unit 602, configured to send the fault information and the redundant resource information to the out-of-band controller, so as to instruct the out-of-band controller to determine a repair mode adopted for repairing the fault based on the redundant resource information; the processing unit 603 is configured to repair the fault based on the repair mode.

In one example, the processing unit 603 is specifically configured to repair the fault by using the redundant resource in the first repair manner, where the redundant resource in the first repair manner is a recoverable redundant resource.

In one example, the processing unit 603 is further configured to restart the basic input output system BIOS; adopting a second repairing mode, and repairing the fault by using redundant resources; the redundant resources in the second repair mode are non-recoverable redundant resources.

In one example, the processor 60 also includes a storage unit 604. The storage unit 604 is configured to store computer-executable instructions, and other units in the processor may perform corresponding actions according to the computer-executable instructions stored in the storage unit 604.

Fig. 7 shows a possible schematic structural diagram of the out-of-band controller involved in the above embodiments, in the case where functional modules are divided according to their corresponding functions. As shown in fig. 7, the out-of-band controller 70 includes a receiving unit 701, a determining unit 702, and a sending unit 703.

A receiving unit 701, configured to receive failure information and redundant resource information, where the failure information indicates that a failure occurs in the memory, and the redundant resource information indicates whether a redundant resource corresponding to the failure exists; a determining unit 702, configured to determine a repair mode adopted for repairing the fault based on the redundant resource information; and a sending unit 703, configured to instruct the processor to repair the fault based on the repair mode.

In one example, the fault information includes a fault address of the fault, and the determining unit 702 is specifically configured to determine a severity of the fault according to the fault address; and when the severity is greater than the preset severity, determining to repair the fault by using the redundant resource.

In one example, the determining unit 702 is specifically configured to determine, according to the redundant resource information, a remaining amount of the redundant resource; and when the residual quantity of the redundant resources is larger than a preset threshold value, determining to repair the fault by using the redundant resources.

In one example, the out-of-band controller 70 further includes a storage unit 704. The storage unit 704 is configured to store computer-executable instructions, and other units in the out-of-band controller may perform corresponding actions according to the computer-executable instructions stored in the storage unit 704.

For a specific description of the above alternative modes, reference may be made to the foregoing method embodiments, and details are not repeated here. In addition, any explanation and description of the beneficial effects of the out-of-band controller 70 provided above may refer to the corresponding method embodiments described above, and will not be repeated.

Embodiments of the present application also provide a computer-readable storage medium having stored thereon a computer program which, when run on a computer, causes the computer to perform the method performed by any one of the computing devices provided above.

For the explanation of the relevant content and the description of the beneficial effects in any of the above-mentioned computer-readable storage media, reference may be made to the above-mentioned corresponding embodiments, and the description thereof will not be repeated here.

The embodiments of the present application also provide a chip. The chip integrates control circuitry and one or more ports for implementing the functions of the computing device described above. Optionally, for the functions supported by the chip, reference may be made to the description above, and details are not repeated here. Those of ordinary skill in the art will appreciate that all or some of the steps of the above embodiments may be implemented by a program instructing the associated hardware. The program may be stored in a computer-readable storage medium. The storage medium may be a read-only memory, a random access memory, or the like. The processing unit or processor may be a central processing unit, a general purpose processor, an application specific integrated circuit (ASIC), a digital signal processor (DSP), a field programmable gate array (FPGA) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof.

Embodiments of the present application also provide a computer program product comprising instructions which, when run on a computer, cause the computer to perform any of the methods of the above embodiments. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions in accordance with the embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, hard disk, or magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state disk (SSD)).

It should be noted that the devices for storing computer instructions or computer programs provided in the embodiments of the present application, such as, but not limited to, the above-mentioned memories, computer-readable storage media, and communication chips, are all non-volatile (non-transitory).

The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented using a software program, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions in accordance with the embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, hard disk, or magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state disk (SSD)).

Although the present application has been described herein in connection with various embodiments, those skilled in the art, in practicing the claimed application, can understand and effect other variations of the disclosed embodiments from a review of the figures, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps, and the indefinite article "a" or "an" does not exclude a plurality. A single processor or other unit may fulfill the functions of several items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.

Although the present application has been described in connection with specific features and embodiments thereof, it will be apparent to those skilled in the art that various modifications, variations, and combinations can be made without departing from the spirit and scope of the application. Accordingly, the specification and drawings are merely exemplary illustrations of the present application as defined in the appended claims, and are considered to cover any and all modifications, variations, combinations, or equivalents that fall within the scope of the present application. Thus, if such modifications and variations fall within the scope of the claims of the present application and their equivalents, the present application is intended to cover them.

Claims

1. A memory fault repair method, the method comprising:

obtaining fault information, wherein the fault information comprises a fault address, and the fault address is used for indicating the position of a fault in a memory;

acquiring redundant resource information based on the fault address, wherein the redundant resource information indicates whether redundant resources corresponding to the fault exist or not;

and when the redundant resource information indicates that the redundant resource corresponding to the fault exists, repairing the fault by using the redundant resource.

2. The method of claim 1, wherein the acquiring redundant resource information based on the fault address comprises:

determining a fault granularity of the fault based on the fault address, wherein the fault granularity comprises a row fault;

and acquiring redundant resource information based on the fault granularity.

3. The method according to claim 1 or 2, wherein the repairing the fault by using the redundant resource when the redundant resource information indicates that the redundant resource corresponding to the fault exists comprises:

repairing the fault by using the redundant resource in a first repair mode, wherein the redundant resource in the first repair mode is a recoverable redundant resource.

4. A method according to any one of claims 1-3, wherein the method further comprises:

restarting a basic input/output system (BIOS);

repairing the fault by using the redundant resource in a second repair mode, wherein the redundant resource in the second repair mode is an unrecoverable redundant resource.

5. The method according to any one of claims 1-4, wherein the repairing the fault by using the redundant resource when the redundant resource information indicates that the redundant resource corresponding to the fault exists comprises:

determining the severity of the fault according to the fault address;

and when the severity is greater than a preset severity, repairing the fault by using the redundant resource.

6. The method according to any one of claims 1-5, wherein the repairing the fault by using the redundant resource when the redundant resource information indicates that the redundant resource corresponding to the fault exists comprises:

determining the residual quantity of the redundant resources according to the redundant resource information;

and when the residual quantity of the redundant resources is larger than a preset threshold value, repairing the fault by using the redundant resources.

7. A memory fault repair method, applied to a processor, comprising:

the processor acquires fault information, wherein the fault information comprises a fault address of a fault;

the processor acquires redundant resource information based on the fault address, wherein the redundant resource information indicates whether a redundant resource corresponding to the fault exists;

the processor sends the fault information and the redundant resource information to an out-of-band controller so as to instruct the out-of-band controller to determine a repairing mode adopted for repairing the fault based on the redundant resource information;

the processor repairs the fault based on the repair mode.

8. The method of claim 7, wherein the processor acquiring redundant resource information based on the fault address comprises:

the processor determines a fault granularity of the fault based on the fault address, wherein the fault granularity comprises a row fault;

the processor acquires the redundant resource information based on the fault granularity.

9. The method of claim 7 or 8, wherein the processor repairing the fault based on the repair mode comprises:

the processor repairs the fault by using the redundant resource in a first repair mode, wherein the redundant resource in the first repair mode is a recoverable redundant resource.

10. The method according to any one of claims 7-9, further comprising:

the processor restarts a basic input/output system (BIOS);

the processor repairs the fault by using the redundant resource in a second repair mode, wherein the redundant resource in the second repair mode is an unrecoverable redundant resource.

11. A memory fault repair method, applied to an out-of-band controller, comprising:

the out-of-band controller receives fault information and redundant resource information, wherein the fault information comprises a fault address of a fault, and the redundant resource information indicates whether a redundant resource corresponding to the fault exists;

the out-of-band controller determines a repairing mode adopted for repairing the fault based on the redundant resource information;

the out-of-band controller instructs the processor to repair the fault based on the repair mode.

12. The method of claim 11, wherein the out-of-band controller determining a repair mode adopted for repairing the fault based on the redundant resource information comprises:

the out-of-band controller determines the severity of the fault according to the fault address;

and when the severity is greater than a preset severity, the out-of-band controller determines to repair the fault by using the redundant resource.

13. The method according to claim 11 or 12, wherein the out-of-band controller determining a repair mode adopted to repair the fault based on the redundant resource information comprises:

the out-of-band controller determines the residual quantity of the redundant resources according to the redundant resource information;

and when the residual quantity of the redundant resources is larger than a preset threshold value, the out-of-band controller determines to repair the fault by adopting the redundant resources.
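
For illustration only, the decision recited in claims 1, 5, and 6 can be sketched in Python as follows. This is a minimal sketch and does not form part of the claims. The names `FaultInfo`, `RedundantResourceInfo`, and `decide_repair`, the numeric `severity` field, and the choice to apply the severity check and the remaining-quantity check together are all assumptions made for the example; the claims recite these checks as separate dependent claims and do not prescribe any data layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultInfo:
    fault_address: int  # position of the fault in the memory (claim 1)
    severity: int       # hypothetical score assumed to be derived from the address (claim 5)


@dataclass(frozen=True)
class RedundantResourceInfo:
    remaining: int      # hypothetical count of redundant resources left for this fault


def decide_repair(fault: FaultInfo,
                  resources: RedundantResourceInfo,
                  preset_severity: int,
                  preset_threshold: int) -> bool:
    """Return True when the fault should be repaired with a redundant resource."""
    # Claim 1: a redundant resource corresponding to the fault must exist.
    if resources.remaining <= 0:
        return False
    # Claim 5: repair only when the severity is greater than the preset severity.
    if fault.severity <= preset_severity:
        return False
    # Claim 6: repair only when the remaining quantity is larger than the preset threshold.
    return resources.remaining > preset_threshold


if __name__ == "__main__":
    fault = FaultInfo(fault_address=0x1000, severity=5)
    spare = RedundantResourceInfo(remaining=3)
    print(decide_repair(fault, spare, preset_severity=2, preset_threshold=1))
```

In this sketch a fault is left unrepaired whenever any one of the three conditions fails, so the redundant resources are spent only on faults that are both severe enough and still covered by sufficient remaining resources.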