CN115168088A

CN115168088A - Method and device for repairing uncorrectable errors of memory

Info

Publication number: CN115168088A
Application number: CN202210797617.5A
Authority: CN
Inventors: 张光彪; 鲍全洋; 曹瑞; 韦炜玮
Original assignee: XFusion Digital Technologies Co Ltd
Current assignee: Henan Kunlun Technology Co ltd
Priority date: 2022-07-08
Filing date: 2022-07-08
Publication date: 2022-10-11

Abstract

The application provides a method for repairing uncorrectable errors of a memory, which comprises the following steps: acquiring fault information of the memory, wherein the fault information comprises uncorrectable error information of the memory; when the fault is determined to be an uncorrectable error storm based on the uncorrectable error information, acquiring service load information, wherein the service load information comprises a service load and/or a service load increment; and determining to repair the fault of the memory based on the service load and/or the service load increment. Because the method takes the service load of the system into account, the patrol UCE is repaired at a more reasonable time, and the probability of system downtime caused by unreasonable use of self-healing resources is reduced.

Description

Method and device for repairing uncorrectable errors of memory

Technical Field

The present application relates to the field of server technologies, and in particular, to a method and an apparatus for repairing an uncorrectable error in a memory.

Background

A Dynamic Random Access Memory (DRAM) is a type of random access memory that is widely used in the storage and IT fields. As DRAM integration density increases and the manufacturing process shrinks, the baseline failure rate of DRAM rises. Memory failures have become one of the important causes of server downtime.

Mainstream server systems generally provide a memory patrol mechanism as part of their reliability, availability, and serviceability (RAS) features. As the baseline failure rate of memory keeps rising, the probability of a patrol Correctable Error (CE) or a patrol Uncorrectable Error (UCE) keeps increasing as well. Memory typically fails because of a UCE, which in turn causes the server to go down.

Disclosure of Invention

The embodiment of the application provides a method for repairing uncorrectable errors of a memory. In addition to handling patrol UCE faults, the method takes the system service load into account, so that the patrol UCE is repaired at a more reasonable time and the probability of system downtime caused by unreasonable use of self-healing resources is reduced.

In a first aspect, an embodiment of the present application provides a method for repairing an uncorrectable error of a memory, including obtaining fault information of the memory, where the fault information includes uncorrectable error information of the memory; based on the uncorrectable error information, when the fault is determined to be an uncorrectable error storm, acquiring service load information, wherein the service load information comprises a service load and/or a service load increment; and determining to repair the fault of the memory based on the service load and/or the service load increment.

The method for repairing the uncorrectable errors of the memory provided by the embodiment of the application takes the service load of the system into account, so that the patrol UCE is repaired at a more reasonable time and the probability of system downtime caused by unreasonable use of self-healing resources is reduced.

In one possible implementation, determining to repair the fault based on the load information includes: when the service load is greater than or equal to a first threshold, determining to repair the fault.

In another possible implementation, determining to repair the fault based on the load information includes: when the service load increment is greater than or equal to a second threshold, determining to repair the fault.

In these possible implementations, when the service load or the service load increment is greater than the preset value, a decision is made to repair the fault; for example, a self-healing mechanism request is issued for the fault corresponding to the patrol UCE, so that the BIOS repairs the fault. Combining the self-healing of patrol UCE faults with the real-time service load or load increment allows identified, potentially serious memory faults to be repaired at a more appropriate time, uses self-healing resources more appropriately, and reduces the probability of subsequent system downtime.

In another possible implementation, the service load is determined based on CPU occupancy and/or memory occupancy.

In another possible implementation, the service load increment is determined based on an increase in CPU occupancy and/or an increase in memory occupancy.

In another possible implementation, the fault includes a first type of fault, and the first type of fault is repaired when the number of occurrences of the first type of fault is greater than a third threshold, where the first type of fault includes at least one of a Cell fault, a Row fault, a Col fault, a Bank fault, a Device fault, a Rank fault, or a Dimm fault.

That is to say, only faults whose number of occurrences in the patrol UCE storm is greater than a certain threshold are repaired, which ensures that each repaired fault is a real fault. This effectively avoids unreasonable use of self-healing resources on sporadic patrol soft faults and further reduces the probability of subsequent system downtime.

In another possible implementation, the fault includes one or more of a Cell fault, a Row fault, a Col fault, a Bank fault, a Device fault, a Rank fault, or a Dimm fault.

In another possible implementation, the method further includes defining the patrol UCE storm: uncorrectable errors that occur at least a preset number of times within a preset time length are determined to be the patrol uncorrectable error storm.

In a second aspect, an embodiment of the present application provides a processing apparatus, including:

the device comprises an acquisition module, a storage module and a processing module, wherein the acquisition module is used for acquiring fault information of a memory, and the fault information comprises uncorrectable error information of the memory;

the acquisition module is further used for acquiring service load information when the fault is determined, based on the uncorrectable error information, to be an uncorrectable error storm, wherein the service load information comprises a service load and/or a service load increment;

and a determining module, used for determining to repair the fault based on the service load and/or the service load increment.

In one possible implementation, the determining module is specifically configured to:

when the service load is greater than or equal to a first threshold, determine to repair the fault.

In another possible implementation, the determining module is specifically configured to:

when the service load increment is greater than or equal to a second threshold, determine to repair the fault.

In another possible implementation, the service load is determined based on CPU occupancy and/or memory occupancy.

In another possible implementation, the service load increment is determined based on an increase in CPU occupancy and/or an increase in memory occupancy.

In another possible implementation, the processing device further includes:

a setting module, used for determining uncorrectable errors that occur at least a preset number of times within a preset time length as the patrol uncorrectable error storm.

In a third aspect, an embodiment of the present application provides a computing device, including a memory and a processor, where the memory stores instructions, and when the instructions are executed by the processor, the method in the first aspect is implemented.

In a fourth aspect, embodiments of the present application provide a computer-readable storage medium on which a computer program is stored, which, when executed by a processor, causes the method of the first aspect to be implemented.

In a fifth aspect, the present application also provides a computer program or a computer program product, which includes instructions that, when executed, make a computer perform the method of the first aspect.

In a sixth aspect, an embodiment of the present application further provides a chip, which includes at least one processor and a communication interface, where the processor is configured to execute the method in the first aspect.

Drawings

Fig. 1a illustrates a hardware structure diagram of a computing device that can execute a method for repairing an uncorrectable error in a memory according to an embodiment of the present application;

Fig. 1b is a schematic hardware structure diagram of another computing device that can execute the method for repairing an uncorrectable error in a memory according to the embodiment of the present application;

Fig. 2 is a signaling interaction diagram of the computing device shown in fig. 1b in the process of implementing the method for repairing an uncorrectable error in a memory according to the embodiment of the present application;

Fig. 3a is a flowchart of a method for repairing an uncorrectable error of a memory according to an embodiment of the present disclosure;

Fig. 3b is a flowchart of another method for repairing an uncorrectable error of a memory according to an embodiment of the present application;

Fig. 4 is a schematic structural diagram of a processing apparatus according to an embodiment of the present application.

Detailed Description

The technical solution of the present application is further described in detail by the accompanying drawings and examples.

For the convenience of understanding, the meanings of the technical terms referred to hereinafter in the text are briefly explained below.

Patrol (Patrol Scrubbing): a RAS technique in computer systems in which the CPU reads the contents of memory while it is idle; if the data read contains a correctable error, the corrected data is written back to memory. Correcting single-bit errors promptly reduces the probability of uncorrectable errors.

Storm (Storm): a behavior that occurs a certain number of times within a certain time.

Patrol CE (Patrol Scrubbing Corrected Error): a correctable error identified by the patrol mechanism.

Patrol UCE (Patrol Scrubbing Uncorrected Error): an uncorrectable error identified by the patrol mechanism; such an error typically does not interrupt server services or bring the system down.

Patrol CE Storm (Patrol Scrubbing Corrected Error Storm): a certain number of correctable errors identified by the patrol mechanism within a certain time.

Patrol UCE Storm (Patrol Scrubbing Uncorrected Error Storm): a certain number of uncorrectable errors identified by the patrol mechanism within a certain time; such errors generally do not directly interrupt server services or bring the system down.

Read CE (Read Corrected Error): correctable errors are produced when the CPU consumes memory data.

Real UCE (True Uncorrected Error): an uncorrectable error generated when the CPU consumes memory data, including Uncorrected No Action (UCNA) errors and Software Recoverable Action Required (SRAR) errors; such an uncorrectable error generally interrupts the services of the server.

Fault pattern (Pattern): a particular form or manifestation of fault behavior.

In order to solve the problem of server downtime caused by memory failure, a first scheme is as follows: the out-of-band management Baseboard Management Controller (BMC) collects the memory patrol UCE information and records it in an FDM log.

However, this scheme only records the memory patrol UCE information and does not repair the memory fault, so the real fault remains in the memory. When the service load increases and the faulty memory address is used, the risk of interrupting the services of the server system increases.

A second scheme is as follows: the out-of-band management BMC collects fault information, defines a fault feature pattern and a repair request according to part of the patrol error information, and sends them to the BIOS, which executes a corresponding isolation mechanism, thereby realizing prediction and self-healing of patrol errors.

This scheme does not take the system service load into account. When patrol faults trigger a large number of self-healing actions on the memory, no self-healing resources remain available when a memory address occupied by a service subsequently fails, and the risk of service interruption increases.

In other words, the first scheme, the second scheme, and the prior-art memory fault self-healing schemes all use self-healing resources unreasonably when handling patrol UCE faults.

To address these problems, the technical solution of the present application provides a method for repairing uncorrectable errors of a memory. The method defines a patrol UCE storm rule, collects patrol UCE storm information, collects real-time service load information, and extracts the fault type information of the patrol UCE storm; when the real-time service load or the load increment exceeds a preset value, a corresponding self-healing mechanism request is executed for the fault types whose number of errors in the patrol UCE storm is larger than a specific number. On the one hand, combining real-time service load information makes the repair time of the patrol UCE more reasonable and reduces the probability of system downtime caused by unreasonable use of self-healing resources. On the other hand, repair is performed only on the patrol UCE fault types whose number of errors in the patrol UCE storm meets the preset value, which prevents sporadic patrol soft faults from consuming self-healing resources unreasonably and thereby leaving no self-healing resources available for later memory failures, which would crash the system.

Fig. 1a illustrates a hardware structure diagram of a computing device that can execute the method for repairing an uncorrectable error in a memory according to an embodiment of the present disclosure. As shown in FIG. 1a, the computing device 100 includes a processor 101, a memory 102, and a BIOS chip 103. The processor 101 is a control center of the computing device, and may be, for example, a Central Processing Unit (CPU). The processor 101 executes program code to implement an operating system for the computing device 100. The memory 102 is used for storing data required for the processor 101 to operate, and can also exchange data with an external storage such as a hard disk in the computing device; for example, the operating system and software application programs may be cached in the memory 102. The BIOS chip 103 is used for detecting various hardware in the computing device 100, such as a CPU, a memory, and a motherboard. The processor 101 may call program code to implement the method for repairing an uncorrectable error of a memory according to the embodiment of the present application; the specific implementation is described below.

Fig. 1b is a schematic hardware structure diagram of another computing device that can execute the method for repairing an uncorrectable error in a memory according to the embodiment of the present application. As shown in FIG. 1b, the computing device 200 includes a processor 201, a memory 202, a BIOS chip 203, and a BMC 204. The processor 201 is a control center of the computing device, and may be, for example, a Central Processing Unit (CPU). The processor 201 executes program code to implement the operating system of the computing device 200. The memory 202 is used for storing data required for the processor 201 to operate, and can also exchange data with an external storage such as a hard disk in the computing device; for example, the operating system and software application programs may be cached in the memory 202. The BIOS chip 203 is used to detect various hardware in the computing device 200, such as a CPU, a memory, and a motherboard. At least one of the BMC 204 or the processor 201 may call program code to implement the method for repairing an uncorrectable error in a memory according to the embodiment of the present application; the specific implementation is described below.

It should be understood that the processor shown in fig. 1a and 1b may be a central processing unit (CPU), and may also be another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general purpose processor may be a microprocessor or any conventional processor or the like.

The memory may be a volatile memory, for example a Random Access Memory (RAM), which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as static random access memory (static RAM, SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchlink dynamic random access memory (SLDRAM), and direct rambus random access memory (DR RAM).

It should be understood that the computing device according to the embodiment of the present application may execute the method for repairing a memory fault provided in the following embodiments of the present application. The implementation of the method is described in detail below and, for brevity, is not repeated here.

The computing device may be a terminal device, such as a personal computer, a smart phone, a smart wearable device, and the like.

The computing device may also be a server, for example, an X86-architecture server, and may specifically be a blade server, a high-density server, a rack server, a high-performance server, or the like.

It should be noted that the BIOS chip may be a memory for storing binary codes, and in this specification, functions implemented by the BIOS chip may be executed by the BIOS chip, or may be executed by a processor calling a program stored in the BIOS chip.

The computing device shown in fig. 1a and fig. 1b is only a schematic diagram of a hardware structure of a computing device that can execute a method for repairing an uncorrectable error in a memory according to an embodiment of the present application, and does not limit the computing device to which the embodiment of the present application is applicable. For example, the computing device may further include a persistent storage medium, a communication interface, and the like, which are not shown in fig. 1a and fig. 1b.

The scheme provided by the embodiment of the present application is described below by taking the computing device shown in fig. 1b as an example, and other computing devices are similar to the above, and are not described again here.

Fig. 2 is a signaling interaction diagram of the computing device shown in fig. 1b in the process of implementing the method for repairing an uncorrectable error in a memory according to the embodiment of the present application. As shown in fig. 2, the server includes a memory, a Basic Input Output System (BIOS), and a BMC for out-of-band management of the server. The BIOS has a patrol module and a fault repair module; the patrol module collects patrol UCEs in the memory and reports them to the patrol UCE fault collection module in the BMC. The patrol UCE fault collection module in the BMC defines the patrol UCE storm rule and the preset value for the number of patrol UCE errors. When this module receives a patrol UCE storm from the BIOS, it identifies the fault types in the storm, extracts the fault types whose error count is greater than the preset value, and sends them to the fault prediction and self-healing module. The service monitoring module in the BMC monitors the real-time service load of the operating system and, when the service load or the service load increment is greater than a preset value, sends service change information to the fault prediction and self-healing module. When the fault prediction and self-healing module in the BMC receives the signal from the service monitoring module, it selects a self-healing mechanism for the patrol UCE addresses according to the fault type information. The command interaction module in the BMC sends the address information of the patrol UCE fault types and the self-healing mechanism to the BIOS, and the fault repair module in the BIOS repairs the corresponding patrol UCE fault cluster addresses in the memory.

Fig. 3a is a flowchart of a method for repairing an uncorrectable error of a memory according to an embodiment of the present disclosure. The method may be performed in the computing device 200 shown in FIG. 1b to enable the repair of an uncorrectable error failure of memory. As shown in fig. 3a, the method for repairing an uncorrectable error of a memory according to an embodiment of the present application includes at least steps S301 to S306.

In step S301, the BIOS acquires patrol uncorrectable error information of the memory.

The BIOS collects the patrol UCE in the memory to obtain patrol UCE information, which may include, for example, number information of the UCEs found in patrol, time information of occurrences of each UCE, location information, and the like.

In step S302, the BIOS sends the patrol uncorrectable error information to the BMC.

The BIOS may periodically send the obtained patrol UCE information to the BMC; for example, the BIOS sends the obtained patrol UCE information to the BMC every 30 minutes. The period may be a default value or may be set by the user.

The BIOS can also send the acquired patrol UCE information to the BMC in real time; for example, the BIOS reports the patrol UCE information to the BMC as soon as it acquires it.

The BIOS may also send the patrol UCE information it has acquired to the BMC after receiving an acquisition request from the BMC. In such an example, the BMC may periodically send an acquisition request to the BIOS; for example, the BMC sends an acquisition request to the BIOS every 20 minutes, and the BIOS sends the patrol UCE information it has obtained to the BMC in response to the request. The BMC can also send an acquisition request to the BIOS in real time. In response to the request, the BIOS sends the patrol UCE information to the BMC if it has acquired any, and otherwise feeds back to the BMC that no UCE exists in the memory.

In step S303, the BMC determines the patrol UCE storm rule.

For example, the patrol UCE storm rule is that the number of patrol UCEs occurring within a certain length of time reaches a certain threshold. That is, the out-of-band management BMC determines that a patrol uncorrectable error storm exists when uncorrectable errors occur at least a preset number of times within a preset length of time. The out-of-band management BMC may define the patrol UCE storm rule by setting the preset length of time (e.g., 1 minute) and the preset number of times (e.g., 10 times). The preset length of time and the preset number of times can be set by the user according to actual conditions or can be default values of the out-of-band management BMC.

Illustratively, the information of the patrol UCE collected by the BIOS in the memory includes time information and quantity information of the UCE occurring in the memory during the patrol, and when the number of times of the UCE occurring in the memory reaches a preset number of times (e.g., 10 times) within a preset time length (e.g., 1 minute), the phenomenon is defined as a patrol UCE storm.
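The storm rule described above can be sketched as a sliding-window check. This is only a minimal illustration, not the BMC implementation: the function name `is_patrol_uce_storm`, the use of timestamps in seconds, and the inclusive window boundary are assumptions; the defaults mirror the example values of 1 minute and 10 times.

```python
from collections import deque


def is_patrol_uce_storm(timestamps, window_seconds=60.0, preset_count=10):
    """Return True if at least `preset_count` patrol UCEs fall inside any
    window of `window_seconds` (hypothetical helper, timestamps in seconds)."""
    window = deque()
    for ts in sorted(timestamps):
        window.append(ts)
        # Drop events older than the window that ends at the current event.
        while ts - window[0] > window_seconds:
            window.popleft()
        if len(window) >= preset_count:
            return True
    return False


# Ten UCEs spread over 45 seconds satisfy the example rule (1 minute, 10 times).
storm_detected = is_patrol_uce_storm([i * 5.0 for i in range(10)])
```

A deque keeps the check linear in the number of events after sorting, which may matter if the BMC evaluates the rule on every report.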

In step S304, the BMC obtains patrol UCE storm information.

The BMC obtains the patrol UCE storm information according to the acquired patrol UCE information and the patrol UCE storm rule, where the patrol UCE storm information includes the fault types in the patrol UCE storm; for example, the fault types occurring in the patrol UCE storm can include Cell faults, Row faults and Bank faults.

For example, the patrol UCE storm information is obtained statistically: the time information and quantity information of the UCEs in the patrol UCE information are counted; when the number of patrol UCEs occurring within a certain time length reaches a certain threshold, that is, when the patrol UCE storm rule is satisfied, the patrol UCE information within that time period is obtained, and the fault types corresponding to the patrol UCEs within that time period are then determined.

In another example, the patrol UCE storm information also includes the number of occurrences of each fault type, e.g., 16, 18, and 10 for Cell, Row, and Bank faults, respectively, in the patrol UCE storm.

The fault types may include Cell faults, Row faults, Col faults, Bank faults, Device faults, Rank faults, Dimm faults, etc.
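Counting the occurrences of each fault type within a storm can be sketched as follows. The record layout (a dictionary with a `fault_type` key) and the function name `count_fault_types` are hypothetical; how a fault type is derived from the UCE location information is not shown here.

```python
from collections import Counter


def count_fault_types(storm_events):
    """Count how many patrol UCEs in a storm belong to each fault type.

    `storm_events` is an assumed list of records, each carrying a
    'fault_type' string such as 'Cell', 'Row' or 'Bank'.
    """
    return dict(Counter(event["fault_type"] for event in storm_events))


# Reproduces the example counts: 16 Cell, 18 Row and 10 Bank faults.
events = (
    [{"fault_type": "Cell"}] * 16
    + [{"fault_type": "Row"}] * 18
    + [{"fault_type": "Bank"}] * 10
)
type_counts = count_fault_types(events)
```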

In step S305, the BMC obtains service load information, where the service load information includes a service load and/or a service load increment.

When the service load information is the CPU occupancy rate, the out-of-band management BMC collects the CPU occupancy rate in the operating system in real time and determines the service load from it; for example, the CPU occupancy rate and the service load have a mapping relationship, so the service load can be derived from the CPU occupancy rate.

Alternatively, when the service load information is the memory occupancy rate, the out-of-band management BMC collects the memory occupancy rate in the operating system in real time and determines the service load from it; for example, the memory occupancy rate and the service load have a mapping relationship, so the service load can be derived from the memory occupancy rate.

In another example, the out-of-band management BMC collects both the CPU occupancy rate and the memory occupancy rate in the operating system in real time and determines the service load from both; for example, the CPU occupancy rate and the memory occupancy rate are weighted and summed, the resulting value has a mapping relationship with the service load, and the service load can be derived from that value.
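The weighted-sum variant can be sketched as below. The equal weights and the identity mapping from the weighted sum to the service load are assumptions made only for illustration; the text states that a mapping relationship exists but does not fix its form, and `service_load` is a hypothetical name.

```python
def service_load(cpu_occupancy, mem_occupancy, cpu_weight=0.5, mem_weight=0.5):
    """Estimate the service load from CPU and memory occupancy (both 0.0-1.0).

    Assumption: the weighted sum is used directly as the service load,
    i.e. the mapping relationship is taken to be the identity.
    """
    return cpu_weight * cpu_occupancy + mem_weight * mem_occupancy


# 80% CPU occupancy and 40% memory occupancy with equal weights.
load = service_load(0.8, 0.4)
```

Setting one weight to 0 reduces this to the CPU-only or memory-only variants described above.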

The out-of-band management BMC may further calculate the CPU occupancy rate increment from the CPU occupancy rate collected in the operating system, and determine the service load increment based on the CPU occupancy rate increment. For example, the CPU occupancy rate increment and the service load increment have a mapping relationship, so the service load increment can be derived from the CPU occupancy rate increment.

Likewise, the out-of-band management BMC may further calculate the memory occupancy rate increment from the memory occupancy rate collected in the operating system, and determine the service load increment based on the memory occupancy rate increment. For example, the memory occupancy rate increment and the service load increment have a mapping relationship, so the service load increment can be derived from the memory occupancy rate increment.

The out-of-band management BMC may also calculate both the CPU occupancy rate increment and the memory occupancy rate increment from the collected CPU occupancy rate and memory occupancy rate, and determine the service load increment based on both. For example, the CPU occupancy rate increment and the memory occupancy rate increment are weighted and summed, the resulting value has a mapping relationship with the service load increment, and the service load increment can be derived from that value.
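A minimal sketch of the increment calculation follows, assuming that the increment is the difference between two consecutive occupancy samples and that the weighted sum is used directly as the service load increment. The name `service_load_increment`, the sampling scheme, and the weights are all hypothetical.

```python
def service_load_increment(prev_cpu, curr_cpu, prev_mem, curr_mem,
                           cpu_weight=0.5, mem_weight=0.5):
    """Estimate the service load increment between two occupancy samples.

    Assumption: each increment is current minus previous occupancy, and the
    weighted sum of the two increments is taken as the load increment.
    """
    cpu_delta = curr_cpu - prev_cpu
    mem_delta = curr_mem - prev_mem
    return cpu_weight * cpu_delta + mem_weight * mem_delta


# CPU occupancy rises from 20% to 60%, memory occupancy from 30% to 50%.
increment = service_load_increment(0.2, 0.6, 0.3, 0.5)
```

A falling load yields a negative value, which never reaches a positive threshold and therefore would not trigger a repair on its own.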

In step S306, the BMC determines to repair the fault based on the service load and/or the service load increment.

When the out-of-band management BMC determines that the service load is greater than or equal to a first threshold, it determines to repair the fault in the patrol UCE storm. For example, the out-of-band management BMC sends a self-healing request to the BIOS for a fault in the patrol UCE storm, and the BIOS repairs the fault in the patrol UCE storm in response to the self-healing request.

In another example, the self-healing request further carries a self-healing policy for each type of fault in the patrol UCE storm, and the BIOS responds to the self-healing request and repairs each type of fault based on the self-healing policy for that type of fault.

The self-healing resources are classified by granularity. From small to large granularity, they may include PCLS self-healing resources, PPR self-healing resources, ADDDC Sparing self-healing resources, Device Sparing self-healing resources, Rank Sparing self-healing resources, and the like.
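The granularity ordering can be represented with an ordered enumeration. The selection rule shown here (pick the smallest-granularity resource that is at least as coarse as a required minimum and still has capacity) is purely an illustrative assumption and is not stated in the text; `SelfHealingResource` and `pick_resource` are hypothetical names.

```python
from enum import IntEnum


class SelfHealingResource(IntEnum):
    """Self-healing resources ordered from small to large granularity."""
    PCLS = 1
    PPR = 2
    ADDDC_SPARING = 3
    DEVICE_SPARING = 4
    RANK_SPARING = 5


def pick_resource(minimum, remaining):
    """Return the smallest-granularity resource >= `minimum` that still has
    capacity in `remaining` (a dict of resource -> free count), else None.
    The selection policy itself is an assumption for illustration."""
    for resource in sorted(SelfHealingResource):
        if resource >= minimum and remaining.get(resource, 0) > 0:
            return resource
    return None


remaining = {
    SelfHealingResource.PCLS: 0,
    SelfHealingResource.PPR: 2,
    SelfHealingResource.RANK_SPARING: 1,
}
chosen = pick_resource(SelfHealingResource.PCLS, remaining)
```

Preferring the finest granularity that suffices is one way to keep coarser resources in reserve for later faults.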

In another example, the out-of-band management BMC determines to repair the fault in the patrol UCE storm when the service load increment is greater than or equal to the second threshold. For example, the out-of-band management BMC sends a self-healing request to the BIOS for a fault in the patrol UCE storm, and the BIOS repairs the fault in the patrol UCE storm in response to the self-healing request.

In another example, the out-of-band management BMC determines to repair the fault in the patrol UCE storm when the service load is greater than or equal to the fourth threshold and the service load increment is greater than or equal to the fifth threshold. For example, the out-of-band management BMC sends a self-healing request to the BIOS for a fault in the patrol UCE storm, and the BIOS repairs the fault in the patrol UCE storm in response to the self-healing request.
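The three decision variants of step S306 (load only, increment only, or both together) can be sketched with one small policy object. The class `RepairPolicy`, its field names, and the threshold values in the usage lines are hypothetical; only the comparisons themselves follow the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RepairPolicy:
    """Hypothetical container for the thresholds used in step S306."""
    load_threshold: Optional[float] = None        # first or fourth threshold
    increment_threshold: Optional[float] = None   # second or fifth threshold
    require_both: bool = False                    # True for the combined variant


def should_repair(load, increment, policy):
    """Decide whether to repair the faults in the patrol UCE storm."""
    load_hit = policy.load_threshold is not None and load >= policy.load_threshold
    inc_hit = (policy.increment_threshold is not None
               and increment >= policy.increment_threshold)
    if policy.require_both:
        return load_hit and inc_hit
    return load_hit or inc_hit


# Load-only variant: repair once the service load reaches the first threshold.
load_only = RepairPolicy(load_threshold=0.7)
# Combined variant: both the fourth and the fifth threshold must be reached.
combined = RepairPolicy(load_threshold=0.7, increment_threshold=0.2, require_both=True)
```

In this sketch, a positive decision corresponds to the BMC sending the self-healing request to the BIOS.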

The method for repairing the uncorrectable errors of the memory provided by the embodiment of the application takes the service load of the system into account, so that the patrol UCE is repaired at a more reasonable time and the probability of system downtime caused by unreasonable use of self-healing resources is reduced. For example, when the service load or the service load increment is greater than a preset value, a decision is made to repair the fault; for example, a self-healing mechanism request is issued for the fault corresponding to the patrol UCE, so that the BIOS repairs the fault. Combining the self-healing of patrol UCE faults with the real-time service load or load increment allows identified, potentially serious memory faults to be repaired at a more appropriate time, uses self-healing resources more appropriately, and reduces the probability of subsequent system downtime.

In another example, in order to further improve the use of self-healing resources, the severity of the fault is also considered when determining to repair the fault of the memory. For example, only faults that occur in the patrol UCE storm a number of times greater than or equal to the third threshold are repaired, which ensures that each repaired fault is a real fault. This effectively avoids unreasonable use of self-healing resources on sporadic patrol soft faults and further reduces the probability of subsequent system downtime.

For example, as shown in fig. 3b, in the above-mentioned method for repairing an uncorrectable error of a memory, the method further includes:

Step S307, the BMC determines a first type fault among a plurality of fault types in the patrol uncorrectable error storm, wherein the first type fault is a fault whose number of occurrences is greater than or equal to a third threshold.

In step S306, the BMC determines, based on the traffic load amount and/or the traffic load increase amount, to repair the first type fault. That is, the first type fault to be repaired is first identified in the patrol UCE storm, and step S306 is then executed for the first type fault, in which whether to repair it is determined according to the traffic load amount and/or the traffic load increase amount.

It is to be understood that there may be one or more first type faults; the number of first type faults depends on how many of the fault types in the patrol UCE storm have a number of occurrences greater than or equal to the third threshold.

In another example, in step S306 of fig. 3a, it is determined, based on the traffic load amount and/or the traffic load increase amount, to repair a fault whose number of occurrences in the patrol UCE storm is greater than or equal to a third threshold. That is, after the traffic load amount and/or the traffic load increase amount is determined to reach the corresponding threshold, the fault types whose number of occurrences is greater than or equal to the third threshold are identified among the plurality of fault types in the patrol UCE storm. When such a fault type exists, a request for repairing the fault of that type is sent to the BIOS; when no such fault type exists, no fault repair request is sent to the BIOS.
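The two orderings described above (identifying the first type faults before the load check, as in fig. 3b, or performing the load check first and then identifying the faults) can be compared in a short sketch. The function names and the Boolean parameter `load_ok`, which stands for the outcome of the traffic load comparison, are assumptions introduced for illustration only.

```python
def first_type_faults(counts, third_threshold):
    """Fault types whose occurrence count reaches the third threshold."""
    return [name for name, n in counts.items() if n >= third_threshold]


def repair_filter_first(counts, third_threshold, load_ok):
    """Fig. 3b ordering: identify first type faults, then check the load."""
    candidates = first_type_faults(counts, third_threshold)
    if not candidates:
        return []          # nothing qualifies, so no request is sent
    return candidates if load_ok else []


def repair_load_first(counts, third_threshold, load_ok):
    """Alternative ordering: check the load, then identify the faults."""
    if not load_ok:
        return []          # load condition not met, so no request is sent
    return first_type_faults(counts, third_threshold)
```

In this sketch both orderings yield the same list of faults to repair; they differ only in which check is evaluated first, and therefore in which work is skipped when a condition is not met.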

It can be understood that a soft fault of the memory is a fault caused by a non-hardware reason, for example, a bit flip caused by a cosmic ray, or a memory fault caused by crosstalk on a data line. A fault caused by hardware is called a real fault of the memory. A soft fault of the memory generally cannot be reproduced, whereas a real fault caused by hardware generally can. Therefore, a fault type whose number of occurrences in the patrol UCE storm is greater than or equal to a certain number (the third threshold) is extracted; a fault of this type is generally a real fault and is a fault that needs to be repaired.

For example, the out-of-band management BMC collects patrol UCE information and, when the amount of patrol UCE information collected within a certain time meets the patrol UCE storm rule, extracts a fault whose number of occurrences in the patrol UCE storm is greater than or equal to a third threshold (e.g., 5 times) as a target fault. The out-of-band management BMC sends a self-healing request to the BIOS only for the target fault, so that the BIOS repairs the target fault and self-healing of the target fault is achieved.

Illustratively, the faults occurring in the patrol UCE storm include a Cell fault, a Row fault, a Col fault, a Bank fault, a Device fault, a Rank fault, and a Dimm fault. The Cell fault occurs 16 times, the Row fault 10 times, the Col fault 8 times, the Bank fault 2 times, and the Device fault, the Rank fault, and the Dimm fault 0 times each, and the third threshold is 5. The faults whose number of occurrences in the patrol UCE storm is greater than or equal to 5 are determined as the target faults, namely the Cell fault, the Row fault, and the Col fault. The out-of-band management BMC only needs to determine self-healing strategies for the Cell fault, the Row fault, and the Col fault, and send self-healing requests for these three faults to the BIOS, so that the BIOS repairs them.
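The numbers in this example can be checked with a few lines of Python. The sketch below is illustrative only: the function names and the dictionary layout of a self-healing request are hypothetical and do not represent an actual BMC or BIOS interface.

```python
# Occurrence counts per fault type, taken from the example above.
fault_counts = {
    "Cell": 16, "Row": 10, "Col": 8, "Bank": 2,
    "Device": 0, "Rank": 0, "Dimm": 0,
}
THIRD_THRESHOLD = 5


def select_target_faults(counts, threshold):
    """Keep the fault types that occurred at least `threshold` times."""
    return [fault for fault, n in counts.items() if n >= threshold]


def build_self_healing_requests(targets):
    """Build one hypothetical request record per target fault."""
    return [{"action": "self_heal", "fault_type": fault} for fault in targets]


targets = select_target_faults(fault_counts, THIRD_THRESHOLD)
requests = build_self_healing_requests(targets)
print(targets)  # ['Cell', 'Row', 'Col']
```

The Bank fault, with 2 occurrences, falls below the third threshold and is treated as a possible sporadic soft fault, so no self-healing resources are spent on it.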

The above-mentioned thresholds (e.g., the first threshold, the second threshold, the third threshold, the fourth threshold, and the fifth threshold) may be system defaults, may be set by an external input, or may be determined based on a fault analysis model and intelligent analysis. The specific manner of determining the thresholds is not limited in the embodiments of the present application, and an appropriate manner may be chosen according to actual circumstances.

As can be seen from the foregoing description, the execution subject of the method for repairing an uncorrectable error of a memory described above is a processing unit in an out-of-band management BMC (the BMC in fig. 1b). In another example, the execution subject of the method provided in the embodiments of the present application may also be a processor (e.g., the processor 101 in fig. 1a) in a computing device (a terminal device or a server). The implementation is similar to the case in which the execution subject is the out-of-band management BMC; reference may be made to the foregoing description, and for brevity, details are not repeated here.

Based on the same concept as the foregoing embodiments of the method for repairing an uncorrectable error of a memory, the embodiments of the present application further provide a processing apparatus 400, where the processing apparatus 400 includes units or modules for implementing each step in the method for repairing an uncorrectable error of a memory shown in fig. 3a and/or fig. 3b.

Fig. 4 is a schematic structural diagram of a processing apparatus according to an embodiment of the present application. The apparatus is applied to the computing device shown in fig. 1a or fig. 1b, and may be, for example, the processor 101 in fig. 1a or a processing unit in the BMC204 in fig. 1b. As shown in fig. 4, the processing apparatus 400 at least includes:

an obtaining module 401, configured to obtain information of a patrol uncorrectable error storm of a memory, where the information of the patrol uncorrectable error storm includes a fault corresponding to the patrol uncorrectable error storm;

a determining module 402, configured to determine to repair the fault based on the traffic load and/or the traffic load increase.

In one possible implementation, the determining module 402 is specifically configured to:

In another possible implementation, the determining module 402 is specifically configured to:

In another possible implementation, the fault includes a first type of fault, and the first type of fault is repaired when the number of times of the first type of fault is greater than a fourth threshold.

In another possible implementation, the processing apparatus 400 further includes:

a setting module 403, configured to determine an uncorrectable error whose occurrence frequency is greater than or equal to a preset frequency within a preset time length as the patrol uncorrectable error storm.
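The storm criterion applied by the setting module (a number of uncorrectable errors greater than or equal to a preset number within a preset time length) can be sketched with a sliding time window. This is a minimal sketch under stated assumptions: the class name `StormDetector`, the use of seconds as the time unit, and the requirement that timestamps arrive in non-decreasing order are all assumptions made for illustration.

```python
from collections import deque


class StormDetector:
    """Sliding-window check for a patrol uncorrectable error storm.

    Assumes timestamps are in seconds and are recorded in
    non-decreasing order (hypothetical simplification).
    """

    def __init__(self, window_seconds, min_count):
        self.window_seconds = window_seconds  # preset time length
        self.min_count = min_count            # preset number of times
        self._timestamps = deque()

    def record(self, timestamp):
        """Record one patrol UCE; return True if the storm criterion is met."""
        self._timestamps.append(timestamp)
        # Drop errors that fall outside the window ending at `timestamp`.
        while timestamp - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.min_count
```

For example, with a 60-second window and a preset number of 3, errors recorded at 0 s, 10 s, and 20 s are treated as a storm on the third error, whereas errors at 0 s, 100 s, and 200 s are not.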

The processing apparatus 400 according to the embodiments of the present application may correspondingly perform the method described in the embodiments of the present application, and the above and other operations and/or functions of the modules in the processing apparatus 400 respectively implement the corresponding flows of the methods in fig. 3a and/or fig. 3b. For brevity, details are not repeated here.

It should be noted that the method for repairing an uncorrectable error of a memory provided in the embodiments of the present application may be executed by the processor 101 in fig. 1a. For example, the processor 101 calls program code that includes one or more software modules, and the computing device executes the program code through the processor 101 to implement the method for repairing an uncorrectable error of a memory provided in the embodiments of the present application.

The method for repairing an uncorrectable error of a memory provided in the embodiment of the present application may be executed by a processing unit in the out-of-band management BMC204 in fig. 1b, for example, the out-of-band management BMC204 stores a program code corresponding to the method for repairing an uncorrectable error of a memory provided in the embodiment of the present application, where the program code includes one or more software modules, and the computing device executes the program code through the processing unit in the out-of-band management BMC, so as to implement the method for repairing an uncorrectable error of a memory provided in the embodiment of the present application.

Embodiments of the present application provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, causes the above-mentioned method of repair of an uncorrectable error for a memory to be implemented.

Embodiments of the present application provide a chip comprising at least one processor and an interface, the at least one processor determining program instructions or data through the interface; the at least one processor is configured to execute the program instructions to implement the above-mentioned method for repairing an uncorrectable error of a memory.

Embodiments of the present application provide a computer program or computer program product comprising instructions which, when executed, cause a computer to perform the above-mentioned method of repairing an uncorrectable error for a memory.

It will be further appreciated by those of ordinary skill in the art that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the examples have been described above generally in terms of their functions. Whether these functions are performed in hardware or software depends on the specific application of the solution and its design constraints. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, a software module executed by a processor, or a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read-Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The foregoing embodiments describe the objects, technical solutions, and advantages of the present application in further detail. It should be understood that the foregoing embodiments are merely exemplary embodiments of the present application and are not intended to limit its scope; any modifications, equivalent substitutions, improvements, and the like made within the spirit and principle of the present application shall fall within the scope of the present application.

Claims

1. A method for repairing an uncorrectable error in a memory, comprising:

acquiring fault information of a memory, wherein the fault information comprises uncorrectable error information of the memory;

based on the information of the uncorrectable errors, when the fault is determined to be an uncorrectable error storm:

and determining to repair the fault based on the service load and/or the service load increment.

2. The method of claim 1, wherein the determining to repair the fault based on the traffic load comprises:

3. The method of claim 1, wherein the determining to repair the fault based on the amount of traffic load increase comprises:

and when the service load increment is greater than or equal to a second threshold value, determining to repair the fault.

4. The method according to any one of claims 1-3, wherein the fault comprises a first type of fault, and wherein the first type of fault is repaired when the number of times of the first type of fault is greater than or equal to a third threshold, wherein the first type of fault comprises at least one of a Cell fault, a Row fault, a Col fault, a Bank fault, a Device fault, a Rank fault, or a Dimm fault.

5. The method of any one of claims 1-4, wherein the fault information includes a time of occurrence of the fault, and the uncorrectable error storm is uncorrectable errors occurring a number of times greater than or equal to a first number of times within a first length of time.

6. A processing apparatus, comprising:

based on the information of the uncorrectable errors, when the fault is determined to be an uncorrectable error storm;

7. The apparatus of claim 6, wherein the determining module is specifically configured to:

8. The apparatus of claim 6, wherein the determining module is specifically configured to:

9. The apparatus of any of claims 6-8, wherein the fault comprises a first type of fault, and wherein the first type of fault is repaired when a number of times the first type of fault is greater than a fourth threshold, the first type of fault comprising at least one of a Cell fault, a Row fault, a Col fault, a Bank fault, a Device fault, a Rank fault, or a Dimm fault.

10. The apparatus of any one of claims 6-9, further comprising:

and the setting module is used for determining the uncorrectable errors with the times larger than or equal to the preset times within the preset time length as the patrol uncorrectable error storm.

11. A computing device comprising a memory and a processor, wherein the memory has stored therein instructions that, when executed by the processor, cause the method of any of claims 1-5 to be implemented.

12. A computing device comprising a memory, a processor, and a BMC, wherein at least one of the processor or the BMC is operable to perform the method of any of claims 1-5.