CN115168088A - Method and device for repairing uncorrectable errors of memory - Google Patents
Method and device for repairing uncorrectable errors of memory
- Publication number
- CN115168088A (application CN202210797617.5A)
- Authority
- CN
- China
- Prior art keywords
- fault
- memory
- information
- service load
- uce
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0706—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
- G06F11/073—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a memory management context, e.g. virtual memory or cache management
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0793—Remedial or corrective actions
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Quality & Reliability (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Techniques For Improving Reliability Of Storages (AREA)
Abstract
The application provides a method for repairing uncorrectable errors of a memory. The method comprises: acquiring fault information of the memory, wherein the fault information comprises uncorrectable error information of the memory; when the fault is determined to be an uncorrectable error storm based on the uncorrectable error information, acquiring service load information, wherein the service load information comprises a service load and/or a service load increment; and determining to repair the fault of the memory based on the service load and/or the service load increment. Because the method takes the service load of the system into account, the patrol UCE is repaired at a more appropriate time, which reduces the probability of system downtime caused by poor use of self-healing resources.
Description
Technical Field
The present application relates to the field of server technologies, and in particular, to a method and an apparatus for repairing an uncorrectable error in a memory.
Background
Dynamic random access memory (DRAM) is a type of random access memory that is widely used in the storage and IT fields. As DRAM integration density increases and manufacturing process nodes shrink, its base failure rate rises. Memory failures have become one of the major causes of server downtime.
Mainstream server systems generally provide memory patrol scrubbing as a reliability, availability, and serviceability (RAS) mechanism. As the base failure rate of memory rises, the probability that a patrol correctable error (CE) or a patrol uncorrectable error (UCE) occurs keeps increasing. Memory typically fails because of UCEs, which in turn brings the server down.
Disclosure of Invention
The embodiments of the present application provide a method for repairing uncorrectable errors of a memory. In addition to handling patrol UCE faults, the method takes the service load of the system into account, so that the patrol UCE is repaired at a more appropriate time and the probability of system downtime caused by poor use of self-healing resources is reduced.
In a first aspect, an embodiment of the present application provides a method for repairing an uncorrectable error of a memory, including obtaining fault information of the memory, where the fault information includes uncorrectable error information of the memory; based on the uncorrectable error information, when the fault is determined to be an uncorrectable error storm, acquiring service load information, wherein the service load information comprises a service load and/or a service load increment; and determining to repair the fault of the memory based on the service load and/or the service load increment.
Because the method provided by the embodiments of the present application takes the service load of the system into account, the patrol UCE is repaired at a more appropriate time, and the probability of system downtime caused by poor use of self-healing resources is reduced.
In one possible implementation, determining to repair the fault based on the load information includes: when the service load is greater than or equal to a first threshold, determining to repair the fault.
In another possible implementation, determining to repair the fault based on the load information includes: when the service load increment is greater than or equal to a second threshold, determining to repair the fault.
In these implementations, once the service load or the service load increment reaches the corresponding threshold, a decision is made to repair the fault; for example, the self-healing request corresponding to the patrol UCE fault is issued so that the BIOS repairs the fault. Tying the self-healing of patrol UCE faults to the real-time service load or load increment allows identified, potentially serious memory faults to be repaired at a more suitable time, uses self-healing resources more appropriately, and lowers the probability of later system downtime.
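The two threshold conditions above can be sketched as follows. This is only an illustration: the function name, the parameter names, and the choice to combine the two conditions with a logical OR are assumptions, since the text leaves the "and/or" combination open.

```python
from typing import Optional


def should_repair(load: Optional[float],
                  increment: Optional[float],
                  first_threshold: float,
                  second_threshold: float) -> bool:
    """Decide whether to request repair of a patrol UCE storm fault.

    `load` and `increment` may be None when that quantity is not collected.
    Combining the two checks with OR is an assumption of this sketch.
    """
    # Condition 1: service load >= first threshold.
    load_hit = load is not None and load >= first_threshold
    # Condition 2: service load increment >= second threshold.
    increment_hit = increment is not None and increment >= second_threshold
    return load_hit or increment_hit
```

For example, with hypothetical thresholds of 0.7 and 0.2, `should_repair(0.5, 0.25, 0.7, 0.2)` triggers a repair through the increment condition alone.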
In another possible implementation, the service load is determined based on CPU occupancy and/or memory occupancy.
In another possible implementation, the service load increment is determined based on an increase in CPU occupancy and/or an increase in memory occupancy.
In another possible implementation, the fault includes a first type of fault, and the first type of fault is repaired when the number of times of the first type of fault is greater than a third threshold, and the first type of fault includes at least one of a Cell fault, a Row fault, a Col fault, a Bank fault, a Device fault, a Rank fault, or a Dimm fault.
That is, only faults that occur more often than a given threshold within the patrol UCE storm are repaired, which ensures that they are real faults. This effectively prevents occasional soft faults found by patrol from consuming self-healing resources unreasonably, and further reduces the probability of later system downtime.
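A minimal sketch of this filtering step is shown below. The function name, the dictionary representation of per-type counts, and the example threshold value are assumptions made for illustration; only the strict "greater than a third threshold" comparison comes from the text.

```python
from typing import Dict, List


def faults_to_repair(counts: Dict[str, int], third_threshold: int) -> List[str]:
    """Return the fault types whose count in the storm exceeds the threshold.

    The comparison is strict (count > third_threshold), as in the text.
    The result is sorted only to make the output deterministic.
    """
    return sorted(t for t, n in counts.items() if n > third_threshold)


# Hypothetical per-type counts observed in one patrol UCE storm.
selected = faults_to_repair({"Cell": 16, "Row": 18, "Bank": 10}, third_threshold=12)
print(selected)  # ['Cell', 'Row']
```

With these illustrative numbers, the Bank fault is treated as a possible soft fault and left alone, so no self-healing resource is spent on it.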
In another possible implementation, the fault includes one or more of a Cell fault, a Row fault, a Col fault, a Bank fault, a Device fault, a Rank fault, or a Dimm fault.
In another possible implementation, the method further includes determining uncorrectable errors that occur a preset number of times or more within a preset length of time to be the patrol uncorrectable error storm. In other words, a patrol UCE storm is defined as the occurrence of at least a certain number of UCEs within a certain time.
In a second aspect, an embodiment of the present application provides a processing apparatus, including:
an acquisition module, configured to acquire fault information of a memory, where the fault information includes uncorrectable error information of the memory;
and, when the fault is determined to be an uncorrectable error storm based on the uncorrectable error information,
to acquire service load information, where the service load information includes a service load and/or a service load increment; and
a determining module, configured to determine to repair the fault based on the service load and/or the service load increment.
In one possible implementation, the determining module is specifically configured to:
when the service load is greater than or equal to a first threshold, determine to repair the fault.
In another possible implementation, the determining module is specifically configured to:
when the service load increment is greater than or equal to a second threshold, determine to repair the fault.
In another possible implementation, the service load is determined based on CPU occupancy and/or memory occupancy.
In another possible implementation, the service load increment is determined based on an increase in CPU occupancy and/or an increase in memory occupancy.
In another possible implementation, the fault includes a first type of fault, and the first type of fault is repaired when the number of times of the first type of fault is greater than a third threshold, and the first type of fault includes at least one of a Cell fault, a Row fault, a Col fault, a Bank fault, a Device fault, a Rank fault, or a Dimm fault.
In another possible implementation, the fault includes one or more of a Cell fault, a Row fault, a Col fault, a Bank fault, a Device fault, a Rank fault, or a Dimm fault.
In another possible implementation, the processing apparatus further includes:
a setting module, configured to determine uncorrectable errors that occur a preset number of times or more within a preset length of time to be the patrol uncorrectable error storm.
In a third aspect, an embodiment of the present application provides a computing device, including a memory and a processor, where the memory stores instructions, and when the instructions are executed by the processor, the method in the first aspect is implemented.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium on which a computer program is stored, which, when executed by a processor, causes the method of the first aspect to be implemented.
In a fifth aspect, the present application also provides a computer program or a computer program product, which includes instructions that, when executed, make a computer perform the method of the first aspect.
In a sixth aspect, an embodiment of the present application further provides a chip, which includes at least one processor and a communication interface, where the processor is configured to execute the method in the first aspect.
Drawings
Fig. 1a illustrates a hardware structure diagram of a computing device that can execute a method for repairing an uncorrectable error in a memory according to an embodiment of the present application;
Fig. 1b is a schematic hardware structure diagram of another computing device that can execute the method for repairing an uncorrectable error in a memory according to the embodiment of the present application;
Fig. 2 is a signaling interaction diagram of the computing device shown in fig. 1b in the process of implementing the method for repairing an uncorrectable error in a memory according to the embodiment of the present application;
Fig. 3a is a flowchart of a method for repairing an uncorrectable error of a memory according to an embodiment of the present disclosure;
Fig. 3b is a flowchart of another method for repairing an uncorrectable error of a memory according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of a processing apparatus according to an embodiment of the present application.
Detailed Description
The technical solution of the present application is further described in detail by the accompanying drawings and examples.
For the convenience of understanding, the meanings of the technical terms referred to hereinafter in the text are briefly explained below.
Patrol (Patrol Scrubbing): a RAS technique in computer systems in which the CPU reads the contents of memory when it is idle; if the data read contains a correctable error, the corrected data is written back to memory. Correcting single-bit errors promptly reduces the probability of uncorrectable errors.
Storm (Storm): a behavior that occurs a certain number of times within a certain period.
Patrol CE (Patrol Scrubbing Corrected Error): a correctable error identified by the patrol mechanism.
Patrol UCE (Patrol Scrubbing Uncorrected Error): an uncorrectable error identified by the patrol mechanism; such an error typically does not interrupt server services or bring the system down.
Patrol CE Storm (Patrol Scrubbing Corrected Error Storm): a certain number of correctable errors identified by the patrol mechanism within a certain time.
Patrol UCE Storm (Patrol Scrubbing Uncorrected Error Storm): a certain number of uncorrectable errors identified by the patrol mechanism within a certain time; such errors generally do not directly interrupt server services or bring the system down.
Read CE (Read Corrected Error): a correctable error generated when the CPU consumes memory data.
True UCE (True Uncorrected Error): an uncorrectable error generated when the CPU consumes memory data, including Uncorrected No Action required (UCNA) errors and Software Recoverable Action Required (SRAR) errors; such uncorrectable errors generally interrupt the services of the server.
Fault pattern (pattern): a particular form or manifestation of fault behavior.
To address server downtime caused by memory failure, a first scheme is as follows: the out-of-band management Baseboard Management Controller (BMC) collects the memory patrol UCE information and records it in an FDM log.
However, this scheme only records the memory patrol UCE information and performs no fault repair on the memory, so the real fault remains in the memory. When the service load increases and the faulty memory address is used, the risk of interrupting the services of the server system rises.
A second scheme is as follows: the out-of-band management BMC collects fault information, defines a fault pattern and a repair request from part of the patrol error information, and sends them to the BIOS, which executes the corresponding isolation mechanism, so that patrol errors are predicted and self-healed.
This scheme does not consider the service load of the system. If patrol faults trigger a large number of self-healing actions on the memory, no self-healing resource may be left when a memory address occupied by a service fails later, which increases the risk of service interruption.
That is, the first scheme, the second scheme, and the memory fault self-healing schemes in the prior art all use self-healing resources unreasonably when handling patrol UCE faults.
To address these problems, the technical solution of the present application provides a method for repairing uncorrectable errors of a memory. The method defines a patrol UCE storm rule, collects patrol UCE storm information, collects real-time service load information, extracts the fault type information of the patrol UCE storm, and, when the real-time service load or load increment exceeds a preset value, executes the corresponding self-healing request for those fault types whose error count in the patrol UCE storm is greater than a specific number. On one hand, combining the decision with real-time service load information makes the timing of patrol UCE repair more reasonable and reduces the probability of system downtime caused by poor use of self-healing resources. On the other hand, repair is performed only on the patrol UCE fault types whose error count in the storm meets the preset value, which prevents occasional soft faults found by patrol from consuming self-healing resources unreasonably, so that self-healing resources remain available for later memory failures and system crashes are avoided.
Fig. 1a illustrates a hardware structure diagram of a computing device that can execute the method for repairing an uncorrectable error in a memory according to an embodiment of the present disclosure. As shown in FIG. 1a, the computing device 100 includes a processor 101, a memory 102, and a BIOS chip 103. The processor 101 is the control center of the computing device, and may be, for example, a Central Processing Unit (CPU). The processor 101 executes program code to implement an operating system for the computing device 100. The memory 102 stores data required for the processor 101 to operate, and can also exchange data with external storage such as a hard disk in the computing device; for example, the operating system and software application programs may be cached in the memory 102. The BIOS chip 103 is used for detecting various hardware in the computing device 100, such as the CPU, the memory, and the motherboard. The processor 101 may call program code to implement the method for repairing an uncorrectable error of a memory according to the embodiment of the present application; the specific implementation is described below.
Fig. 1b is a schematic hardware structure diagram of another computing device that can execute the method for repairing an uncorrectable error in a memory according to the embodiment of the present application. As shown in FIG. 1b, the computing device 200 includes a processor 201, a memory 202, a BIOS chip 203, and a BMC 204. The processor 201 is the control center of the computing device, and may be, for example, a Central Processing Unit (CPU). The processor 201 executes program code to implement the operating system of the computing device 200. The memory 202 stores data required for the processor 201 to operate, and can also exchange data with external storage such as a hard disk in the computing device; for example, the operating system and software application programs may be cached in the memory 202. The BIOS chip 203 is used to detect various hardware in the computing device 200, such as the CPU, the memory, and the motherboard. At least one of the BMC 204 or the processor 201 may call program code to implement the method for repairing an uncorrectable error in a memory according to the embodiment of the present application; the specific implementation is described below.
It should be understood that the processor shown in fig. 1a and 1b may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.
The memory is volatile memory and may be random access memory (RAM), which serves as an external cache. By way of example, and not limitation, many forms of RAM are available, such as static random access memory (static RAM, SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous DRAM (ESDRAM), synchronous link DRAM (SLDRAM), and direct Rambus RAM (DR RAM).
It should be understood that the computing device according to the embodiment of the present application may execute the method for repairing a memory fault provided in the following embodiments; the implementation is described in detail below and, for brevity, is not repeated here.
The computing device may be a terminal device, such as a personal computer, a smart phone, a smart wearable device, and the like.
The computing device may also be a server, for example, an X86-architecture server, and may specifically be a blade server, a high-density server, a rack server, a high-performance server, or the like.
It should be noted that the BIOS chip may be a memory for storing binary codes, and in this specification, functions implemented by the BIOS chip may be executed by the BIOS chip, or may be executed by a processor calling a program stored in the BIOS chip.
The computing devices shown in fig. 1a and fig. 1b are only schematic diagrams of the hardware structure of a computing device that can execute the method for repairing an uncorrectable error in a memory according to an embodiment of the present application, and do not limit the computing devices to which the embodiments are applicable. For example, the computing device may further include a persistent storage medium, a communication interface, and the like, which are not shown in fig. 1a and fig. 1b.
The scheme provided by the embodiment of the present application is described below by taking the computing device shown in fig. 1b as an example. Other computing devices work similarly and are not described again here.
Fig. 2 is a signaling interaction diagram of the computing device shown in fig. 1b in the process of implementing the method for repairing an uncorrectable error in a memory according to the embodiment of the present application. As shown in fig. 2, the server includes a memory, a Basic Input Output System (BIOS), and a BMC that provides out-of-band management of the server. The BIOS has a patrol module and a fault repair module. The patrol module collects patrol UCEs in the memory and reports them to the patrol UCE fault collection module in the BMC. The patrol UCE fault collection module in the BMC defines the patrol UCE storm rule and the preset patrol UCE error count. When this module receives a patrol UCE storm from the BIOS, it determines the fault types in the storm, extracts the fault types whose error count is greater than the preset patrol UCE error count, and sends them to the fault prediction and self-healing module. The service monitoring module in the BMC monitors the real-time service load of the operating system and, when the service load or the service load increment is greater than a preset value, sends service change information to the fault prediction and self-healing module. When the fault prediction and self-healing module in the BMC receives the signal from the service monitoring module, it selects a self-healing mechanism for the patrol UCE addresses according to the fault type information. The command interaction module in the BMC sends the patrol UCE fault type address information and the self-healing mechanism to the BIOS, and the fault repair module in the BIOS repairs the corresponding patrol UCE fault cluster addresses in the memory.
Fig. 3a is a flowchart of a method for repairing an uncorrectable error of a memory according to an embodiment of the present disclosure. The method may be performed in the computing device 200 shown in FIG. 1b to repair uncorrectable error faults of the memory. As shown in fig. 3a, a method for repairing an uncorrectable error of a memory according to an embodiment of the present application at least includes steps S301 to S303.
In step S301, the BIOS acquires the patrol uncorrectable error information of the memory.
The BIOS collects the patrol UCEs in the memory to obtain patrol UCE information, which may include, for example, the number of UCEs found by patrol, the time at which each UCE occurred, location information, and the like.
In step S302, the BIOS sends the patrol uncorrectable error information to the BMC.
The BIOS may send the patrol UCE information it has obtained to the BMC periodically, for example every 30 minutes; the period may be a default value or set by the user.
The BIOS may also send the patrol UCE information to the BMC in real time; for example, the BIOS reports the patrol UCE information to the BMC as soon as it is acquired.
The BIOS may also send the patrol UCE information it has obtained to the BMC after receiving an acquisition request from the BMC. In such an example, the BMC may send an acquisition request to the BIOS periodically, for example every 20 minutes, and the BIOS sends the patrol UCE information it has obtained to the BMC in response. The BMC may also send an acquisition request to the BIOS in real time. In response to the request, the BIOS sends the patrol UCE information to the BMC if it has acquired any; otherwise, the BIOS reports to the BMC that no UCE exists in the memory.
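The request/response reporting mode described above can be illustrated with a small stand-in for the BIOS side. Everything here is hypothetical: the class name `BiosStub`, the record format, and the `status` field are assumptions for illustration only and do not describe any real BIOS or BMC interface.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class BiosStub:
    """Hypothetical stand-in for the BIOS patrol module."""
    # Patrol UCE records collected since the last report.
    pending: List[Dict] = field(default_factory=list)

    def handle_get_request(self) -> Dict:
        """Answer an acquisition request from the BMC."""
        if not self.pending:
            # Nothing collected: report that no UCE exists in the memory.
            return {"status": "no_uce", "records": []}
        # Hand over the collected records and clear the local buffer.
        records, self.pending = self.pending, []
        return {"status": "ok", "records": records}
```

A BMC-side poller would call `handle_get_request()` on its own schedule (for example every 20 minutes, as in the text) and pass any returned records to the storm check.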
In step S303, the BMC determines the patrol UCE storm rule.
For example, the patrol UCE storm rule is that the number of patrol UCEs occurring within a certain length of time reaches a certain threshold. That is, the out-of-band management BMC determines uncorrectable errors that occur a preset number of times or more within a preset length of time to be the patrol uncorrectable error storm. The out-of-band management BMC may define the patrol UCE storm rule by setting the preset length of time (e.g., 1 minute) and the preset number of times (e.g., 10 times). Both values may be set by the user according to actual conditions or default to values chosen by the out-of-band management BMC.
Illustratively, the patrol UCE information collected by the BIOS includes the time and count of the UCEs occurring in the memory during patrol; when the number of UCEs occurring in the memory reaches the preset number of times (e.g., 10 times) within the preset length of time (e.g., 1 minute), the phenomenon is defined as a patrol UCE storm.
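One way to evaluate such a storm rule is a sliding time window over the UCE timestamps, sketched below. The function name, the use of seconds as the time unit, and the inclusive treatment of the window boundary are assumptions; the default values mirror the example of 10 occurrences within 1 minute.

```python
from collections import deque
from typing import Iterable, Optional, Tuple


def find_storm(timestamps: Iterable[float],
               window_s: float = 60.0,
               min_count: int = 10) -> Optional[Tuple[float, float]]:
    """Return (start, end) of the first window holding at least
    `min_count` events spanning no more than `window_s` seconds, else None."""
    window = deque()
    for t in sorted(timestamps):
        window.append(t)
        # Drop events that fell out of the window ending at time t.
        while t - window[0] > window_s:
            window.popleft()
        if len(window) >= min_count:
            return (window[0], t)
    return None


# Ten UCEs 5 s apart fit within one minute, so the rule is met.
print(find_storm([i * 5.0 for i in range(10)]))   # (0.0, 45.0)
# Ten UCEs 10 s apart never reach ten within one minute.
print(find_storm([i * 10.0 for i in range(10)]))  # None
```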
In step S304, the BMC obtains patrol UCE storm information.
The BMC obtains the patrol UCE storm information from the acquired patrol UCE information and the patrol UCE storm rule. The patrol UCE storm information includes the fault types in the patrol UCE storm; for example, the fault types occurring in the storm may include Cell faults, Row faults, and Bank faults.
For example, the patrol UCE storm information is obtained statistically: the time and count information of the UCEs in the patrol UCE information is tallied; when the number of patrol UCEs occurring within a certain length of time reaches a certain threshold, the patrol UCE storm rule is met; the patrol UCE information within that period is obtained, and the fault types corresponding to the patrol UCEs within that period are then determined.
In another example, the patrol UCE storm information also includes the number of occurrences of each fault type, e.g., 16, 18, and 10 for Cell, Row, and Bank faults, respectively, in the patrol UCE storm.
The fault types may include Cell fault, Row fault, Col fault, Bank fault, Device fault, Rank fault, Dimm fault, and the like.
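The statistical step above can be sketched as a per-type tally restricted to the storm period. The event representation (a timestamp paired with a fault type label) and the function name are assumptions; how a fault type is derived from a UCE address is not covered here.

```python
from collections import Counter
from typing import Dict, List, Tuple


def storm_fault_counts(events: List[Tuple[float, str]],
                       start: float,
                       end: float) -> Dict[str, int]:
    """Count patrol UCEs per fault type within the storm period [start, end].

    Each event is a hypothetical (timestamp, fault_type) pair.
    """
    return dict(Counter(ft for ts, ft in events if start <= ts <= end))


events = [(0.0, "Cell"), (1.0, "Row"), (2.0, "Cell"), (100.0, "Bank")]
print(storm_fault_counts(events, 0.0, 60.0))  # {'Cell': 2, 'Row': 1}
```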
In step S305, the BMC obtains traffic load information, where the traffic load information includes a traffic load amount and/or a traffic load increase amount.
When the service load information is the CPU occupancy rate, the out-of-band management BMC determines the service load amount through collecting the CPU occupancy rate in the operating system in real time and through the CPU occupancy rate; for example, the CPU occupancy rate and the traffic load amount have a mapping relationship, and the traffic load amount can be known by knowing the CPU occupancy rate.
Or when the service load information is the memory occupancy rate, the out-of-band management BMC determines the service load amount through the memory occupancy rate by collecting the memory occupancy rate in the operating system in real time; for example, the memory occupancy rate and the traffic load amount have a mapping relationship, and the traffic load amount can be known by knowing the memory occupancy rate.
In another example, the out-of-band management BMC determines the traffic load amount by CPU occupancy and memory occupancy by collecting CPU occupancy and memory occupancy in the operating system in real time; for example, the CPU occupancy rate and the memory occupancy rate are weighted and summed, a value obtained by weighted and summed CPU occupancy rate and memory occupancy rate has a mapping relationship with the traffic load, and the traffic load can be obtained by obtaining the value obtained by weighted and summed CPU occupancy rate and memory occupancy rate.
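As a minimal sketch of the weighted-sum variant, the function below combines the two occupancy rates and then applies a mapping. The function name, the equal default weights, and the identity mapping are all assumptions made for illustration; the description does not specify the weights or the form of the mapping.

```python
def service_load_amount(cpu_occupancy, mem_occupancy,
                        w_cpu=0.5, w_mem=0.5, mapping=lambda x: x):
    """Map a weighted sum of CPU and memory occupancy (each in [0, 1]) to a load amount.

    The weights and the identity mapping are illustrative assumptions only.
    """
    weighted = w_cpu * cpu_occupancy + w_mem * mem_occupancy
    return mapping(weighted)
```

Setting one weight to zero reduces this to the CPU-only or memory-only variants described earlier.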
Alternatively, the out-of-band management BMC calculates the CPU occupancy increase from the collected CPU occupancy rate in the operating system and determines the service load increase amount based on it. For example, the CPU occupancy increase and the service load increase amount have a mapping relationship, so the service load increase amount can be derived from the CPU occupancy increase.
Alternatively, the out-of-band management BMC calculates the memory occupancy increase from the collected memory occupancy rate in the operating system and determines the service load increase amount based on it. For example, the memory occupancy increase and the service load increase amount have a mapping relationship, so the service load increase amount can be derived from the memory occupancy increase.
In another example, the out-of-band management BMC calculates both the CPU occupancy increase and the memory occupancy increase from the collected CPU occupancy rate and memory occupancy rate in the operating system and determines the service load increase amount based on the two. For example, the CPU occupancy increase and the memory occupancy increase are weighted and summed, the weighted sum has a mapping relationship with the service load increase amount, and the service load increase amount can be derived from the weighted sum.
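The increase amount can be sketched as the difference between consecutive occupancy samples, combined by a weighted sum. The names below, the use of the two most recent samples, and the equal weights are assumptions for illustration; the description leaves the sampling interval and the mapping open.

```python
def occupancy_increase(samples):
    """Increase between the two most recent occupancy samples (0.0 if fewer than two)."""
    if len(samples) < 2:
        return 0.0
    return samples[-1] - samples[-2]


def service_load_increase(cpu_samples, mem_samples, w_cpu=0.5, w_mem=0.5):
    """Weighted sum of the CPU and memory occupancy increases (weights are illustrative)."""
    return (w_cpu * occupancy_increase(cpu_samples)
            + w_mem * occupancy_increase(mem_samples))
```

A negative result indicates a falling load, which would not satisfy a threshold test of the form "greater than or equal to" with a positive threshold.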
In step S306, the BMC determines to repair the fault based on the service load amount and/or the service load increase amount.
When the out-of-band management BMC determines that the service load amount is greater than or equal to a first threshold, it determines to repair the fault in the patrol UCE storm. For example, the out-of-band management BMC sends the BIOS a self-healing request for the fault in the patrol UCE storm, and the BIOS repairs the fault in the patrol UCE storm in response to the self-healing request.
In another example, the self-healing request further carries a self-healing policy for each type of fault in the patrol UCE storm, and the BIOS, in response to the self-healing request, repairs each type of fault based on the self-healing policy for that type.
The self-healing resources are divided by granularity; from small to large, they may include PCLS self-healing resources, PPR self-healing resources, ADDDC Sparing self-healing resources, Device Sparing self-healing resources, Rank Sparing self-healing resources, and so on.
In another example, the out-of-band management BMC determines to repair the fault in the patrol UCE storm when the service load increase amount is greater than or equal to the second threshold. For example, the out-of-band management BMC sends the BIOS a self-healing request for the fault in the patrol UCE storm, and the BIOS repairs the fault in the patrol UCE storm in response to the self-healing request.
In another example, the out-of-band management BMC determines to repair the fault in the patrol UCE storm when the service load amount is greater than or equal to the fourth threshold and the service load increase amount is greater than or equal to the fifth threshold. For example, the out-of-band management BMC sends the BIOS a self-healing request for the fault in the patrol UCE storm, and the BIOS repairs the fault in the patrol UCE storm in response to the self-healing request.
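The three decision variants of step S306 can be summarized in a short sketch. The `Thresholds` class, the `mode` strings, and every numeric default are hypothetical; the description names the thresholds but does not give their values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Illustrative threshold values; the real values are not specified."""
    first: float = 0.7   # load amount, used alone
    second: float = 0.2  # load increase, used alone
    fourth: float = 0.5  # load amount, combined rule
    fifth: float = 0.1   # load increase, combined rule


def should_repair(mode, load, increase, t=Thresholds()):
    """Decide whether to request self-healing for the faults in a patrol UCE storm."""
    if mode == "amount":
        return load >= t.first
    if mode == "increase":
        return increase >= t.second
    if mode == "both":
        return load >= t.fourth and increase >= t.fifth
    raise ValueError(f"unknown mode: {mode!r}")
```

When `should_repair` returns True, the BMC would send the self-healing request to the BIOS; otherwise no request is sent at that time.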
The method for repairing uncorrectable errors of a memory provided by the embodiments of the present application takes the system's service load into account, so that patrol UCEs are repaired at a more reasonable time and the probability of system downtime caused by unreasonable use of self-healing resources is reduced. For example, when the service load amount or the service load increase amount is greater than a preset value, a decision is made to repair the fault; for example, a self-healing request for the fault corresponding to the patrol UCE is issued so that the BIOS repairs the fault. Because self-healing of patrol UCE faults is combined with the real-time service load amount or load increase, identified potentially serious memory faults can be repaired at a more appropriate time, self-healing resources are used more appropriately, and the probability of subsequent system downtime is reduced.
In another example, to further improve the utilization of self-healing resources, the severity of the fault is also considered when determining to repair a memory fault. For example, only faults that occur in the patrol UCE storm a number of times greater than or equal to the third threshold are repaired, which ensures that the repaired fault is a real fault. This effectively avoids unreasonable use of self-healing resources caused by sporadic patrol soft faults and further reduces the probability of subsequent system downtime.
For example, as shown in fig. 3b, in the above-mentioned method for repairing an uncorrectable error of a memory, the method further includes:
Step S307: the BMC determines a first type fault among the plurality of fault types in the patrol uncorrectable error storm, where the first type fault is a fault whose number of occurrences is greater than or equal to a third threshold.
In step S306, the fault that the BMC determines to repair based on the service load amount and/or the service load increase amount is the first type fault. That is, the first type fault to be repaired is first identified from the patrol UCE storm, and step S306 is then executed to determine, according to the service load amount and/or the service load increase amount, whether to repair the first type fault.
It is to be understood that there may be one or more first type faults; the number depends on how many of the fault types in the patrol UCE storm occur a number of times greater than or equal to the third threshold. In another example, in step S306 of fig. 3a, the fault that is determined to be repaired based on the service load amount and/or the service load increase amount is a fault whose number of occurrences in the patrol UCE storm is greater than or equal to a third threshold. That is, after the service load amount and/or the service load increase amount is determined to be greater than the threshold, the fault types whose number of occurrences is greater than or equal to the third threshold are identified among the plurality of fault types in the patrol UCE storm. When such a fault type exists, a request for repairing faults of that type is sent to the BIOS; when no such fault type exists, no fault repair request is sent to the BIOS.
It can be understood that a soft fault of the memory is a fault caused by a non-hardware reason, for example, a bit flip caused by cosmic rays or a memory fault caused by crosstalk on a data line. A fault caused by hardware is called a real fault of the memory. A soft fault generally cannot be reproduced, whereas a real fault caused by hardware generally can. Therefore, fault types whose number of occurrences in the patrol UCE storm is greater than or equal to a certain number (the third threshold) are extracted; a fault of such a type is generally a real fault that needs to be repaired.
For example, the out-of-band management BMC collects patrol UCE information and, when the patrol UCE information collected within a certain time satisfies the patrol UCE storm rule, extracts the faults whose number of occurrences in the patrol UCE storm is greater than or equal to a third threshold (e.g., 5) as target faults. The out-of-band management BMC sends the BIOS a self-healing request only for the target faults, so that the BIOS repairs them and self-healing of the target faults is achieved.
Illustratively, the faults occurring in the patrol UCE storm include Cell, Row, Col, Bank, Device, Rank, and Dimm faults. The Cell fault occurs 16 times, the Row fault 10 times, the Col fault 8 times, the Bank fault 2 times, and the Device, Rank, and Dimm faults 0 times each, and the third threshold is 5. The faults whose number of occurrences in the patrol UCE storm is greater than or equal to 5 are determined as target faults, namely the Cell fault, the Row fault, and the Col fault. The out-of-band management BMC only needs to determine self-healing policies for the Cell, Row, and Col faults and send the BIOS a self-healing request for these faults, so that the BIOS repairs them.
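The worked example above can be reproduced with a small filter over the per-type counts. The function name `select_target_faults` is hypothetical; the counts and the third threshold of 5 are the values from the example.

```python
def select_target_faults(type_counts, third_threshold=5):
    """Return the fault types whose occurrence count reaches the third threshold."""
    return [fault for fault, count in type_counts.items()
            if count >= third_threshold]


# Counts taken from the example in the text.
counts = {"Cell": 16, "Row": 10, "Col": 8, "Bank": 2,
          "Device": 0, "Rank": 0, "Dimm": 0}
targets = select_target_faults(counts)
```

Here `targets` contains the Cell, Row, and Col faults. An empty list corresponds to the case in which no self-healing request is sent to the BIOS.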
The above-mentioned thresholds (e.g., the first threshold, the second threshold, the third threshold, the fourth threshold, and the fifth threshold) may be system defaults, may be set through external input, or may be determined intelligently based on a fault analysis model. The embodiments of the present application do not limit the specific manner of determining the thresholds, which may be chosen as appropriate according to actual circumstances.
As can be seen from the foregoing description, the execution subject of the method for repairing an uncorrectable error of a memory described above is a processing unit in an out-of-band management BMC (the BMC in fig. 1b). In another example, the execution subject of the method provided in this embodiment of the present application may also be a processor (e.g., the processor 101 in fig. 1a) in a computing device (a terminal device or a server). The implementation is similar to that in which the execution subject is the out-of-band management BMC; reference may be made to the foregoing description, and for brevity, details are not repeated here.
Based on the same concept as the foregoing embodiments of the method for repairing an uncorrectable error of a memory, in the embodiments of the present application, a processing apparatus 400 is further provided, where the processing apparatus 400 includes a unit or a module for implementing each step in the method for repairing an uncorrectable error of a memory shown in fig. 3a and/or fig. 3 b.
Fig. 4 is a schematic structural diagram of a processing apparatus according to an embodiment of the present application. The apparatus is applied to the computing device shown in fig. 1a or fig. 1b, and may be, for example, the processor 101 in fig. 1a or a processing unit in the BMC204 in fig. 1b, as shown in fig. 4, where the processing apparatus 400 at least includes:
an obtaining module 401, configured to obtain information of a patrol uncorrectable error storm of a memory, where the information of the patrol uncorrectable error storm includes a fault corresponding to the patrol uncorrectable error storm;
and further configured to obtain service load information, where the service load information includes a service load amount and/or a service load increase amount;
a determining module 402, configured to determine to repair the fault based on the service load amount and/or the service load increase amount.
In one possible implementation, the determining module 402 is specifically configured to:
when the service load amount is greater than or equal to a first threshold, determine to repair the fault.
In another possible implementation, the determining module 402 is specifically configured to:
when the service load increase amount is greater than or equal to a second threshold, determine to repair the fault.
In another possible implementation, the service load amount is determined based on CPU occupancy and/or memory occupancy.
In another possible implementation, the service load increase amount is determined based on an increase in CPU occupancy and/or an increase in memory occupancy.
In another possible implementation, the fault includes a first type of fault, and the first type of fault is repaired when the number of times of the first type of fault is greater than a fourth threshold.
In another possible implementation, the fault includes one or more of a Cell fault, a Row fault, a Col fault, a Bank fault, a Device fault, a Rank fault, or a Dimm fault.
In another possible implementation, the processing apparatus 400 further includes:
a setting module 403, configured to determine uncorrectable errors that occur a number of times greater than or equal to a preset number within a preset time length as the patrol uncorrectable error storm.
The processing apparatus 400 according to the embodiment of the present application may correspondingly perform the method described in the embodiment of the present application, and the above and other operations and/or functions of each module in the processing apparatus 400 respectively implement the corresponding flows of the methods in fig. 3a and/or fig. 3b. For brevity, details are not repeated here.
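One way to picture how the modules of the processing apparatus 400 relate to the method steps is the following sketch. It is an assumption-laden illustration, not the apparatus itself: the class and method names are hypothetical, only the first-threshold variant of the determining module is shown, and the threshold values are placeholders.

```python
class ProcessingApparatus:
    """Illustrative stand-in for processing apparatus 400 (names are hypothetical)."""

    def __init__(self, first_threshold, window_s=60.0, min_count=10):
        self.first_threshold = first_threshold
        self.window_s = window_s
        self.min_count = min_count

    def is_storm(self, timestamps):
        """Setting module 403: apply the storm rule to UCE timestamps (seconds)."""
        ts = sorted(timestamps)
        k = self.min_count
        # A storm exists if any k consecutive UCEs fit inside the time window.
        return any(ts[i + k - 1] - ts[i] <= self.window_s
                   for i in range(len(ts) - k + 1))

    def obtain(self, timestamps, load):
        """Obtaining module 401: storm status together with the service load amount."""
        return self.is_storm(timestamps), load

    def determine(self, storm, load):
        """Determining module 402: repair when a storm coincides with a high load."""
        return storm and load >= self.first_threshold
```

For instance, an apparatus created with a first threshold of 0.7 would decide to repair when ten UCEs arrive within one minute while the load amount is 0.8, and would not when the load amount is 0.3.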
It should be noted that the method for repairing an uncorrectable error of a memory provided by the embodiment of the present application may be executed by the processor 101 in fig. 1a. For example, the processor 101 calls program code that includes one or more software modules, and the computing device executes the program code through the processor 101 to implement the method for repairing an uncorrectable error of a memory provided by the embodiment of the present application.
The method for repairing an uncorrectable error of a memory provided in the embodiment of the present application may be executed by a processing unit in the out-of-band management BMC204 in fig. 1b, for example, the out-of-band management BMC204 stores a program code corresponding to the method for repairing an uncorrectable error of a memory provided in the embodiment of the present application, where the program code includes one or more software modules, and the computing device executes the program code through the processing unit in the out-of-band management BMC, so as to implement the method for repairing an uncorrectable error of a memory provided in the embodiment of the present application.
Embodiments of the present application provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, causes the above-mentioned method of repair of an uncorrectable error for a memory to be implemented.
Embodiments of the present application provide a chip comprising at least one processor and an interface, the at least one processor obtaining program instructions or data through the interface; the at least one processor is configured to execute the program instructions to implement the above-mentioned method for repairing an uncorrectable error of a memory.
Embodiments of the present application provide a computer program or computer program product comprising instructions which, when executed, cause a computer to perform the above-mentioned method of repairing an uncorrectable error for a memory.
It will be further appreciated by those of ordinary skill in the art that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the examples have been described above generally in terms of their functionality. Whether these functions are performed in hardware or software depends on the specific application of the solution and its design constraints. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, a software module executed by a processor, or a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read-Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The foregoing embodiments describe the objects, technical solutions, and advantages of the present application in further detail. It should be understood that the foregoing are merely exemplary embodiments of the present application and are not intended to limit its scope; any modifications, equivalent substitutions, improvements, and the like made within the spirit and principle of the present application shall fall within the scope of the present application.
Claims (12)
1. A method for repairing an uncorrectable error in a memory, comprising:
acquiring fault information of a memory, wherein the fault information comprises uncorrectable error information of the memory;
based on the information of the uncorrectable errors, when the fault is determined to be an uncorrectable error storm:
acquiring service load information, wherein the service load information comprises service load and/or service load increase;
and determining to repair the fault based on the service load and/or the service load increment.
2. The method of claim 1, wherein the determining to repair the fault based on the service load comprises:
when the service load is greater than or equal to a first threshold, determining to repair the fault.
3. The method of claim 1, wherein the determining to repair the fault based on the service load increase comprises:
when the service load increase is greater than or equal to a second threshold, determining to repair the fault.
4. The method according to any one of claims 1-3, wherein the fault comprises a first type of fault, and wherein the first type of fault is repaired when the number of times of the first type of fault is greater than or equal to a third threshold, wherein the first type of fault comprises at least one of a Cell fault, a Row fault, a Col fault, a Bank fault, a Device fault, a Rank fault, or a Dimm fault.
5. The method of any one of claims 1-4, wherein the fault information includes a time of occurrence of the fault, and the uncorrectable error storm is uncorrectable errors occurring a number of times greater than or equal to a first number within a first length of time.
6. A processing apparatus, comprising:
an acquisition module, configured to acquire fault information of a memory, wherein the fault information comprises uncorrectable error information of the memory;
based on the information of the uncorrectable errors, when the fault is determined to be an uncorrectable error storm;
acquiring service load information, wherein the service load information comprises service load and/or service load increase;
and the determining module is used for determining to repair the fault based on the service load and/or the service load increment.
7. The apparatus of claim 6, wherein the determining module is specifically configured to:
when the service load is greater than or equal to a first threshold, determine to repair the fault.
8. The apparatus of claim 6, wherein the determining module is specifically configured to:
when the service load increase is greater than or equal to a second threshold, determine to repair the fault.
9. The apparatus of any of claims 6-8, wherein the fault comprises a first type of fault, and wherein the first type of fault is repaired when a number of times the first type of fault is greater than a fourth threshold, the first type of fault comprising at least one of a Cell fault, a Row fault, a Col fault, a Bank fault, a Device fault, a Rank fault, or a Dimm fault.
10. The apparatus of any one of claims 6-9, further comprising:
a setting module, configured to determine uncorrectable errors occurring a number of times greater than or equal to a preset number within a preset time length as the patrol uncorrectable error storm.
11. A computing device comprising a memory and a processor, wherein the memory has stored therein instructions that, when executed by the processor, cause the method of any of claims 1-5 to be implemented.
12. A computing device comprising a memory, a processor, and a BMC, wherein at least one of the processor or the BMC is operable to perform the method of any of claims 1-5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210797617.5A CN115168088A (en) | 2022-07-08 | 2022-07-08 | Method and device for repairing uncorrectable errors of memory |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115168088A (en) | 2022-10-11 |
Family
ID=83490635
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210797617.5A Pending CN115168088A (en) | 2022-07-08 | 2022-07-08 | Method and device for repairing uncorrectable errors of memory |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115168088A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115495278A (en) * | 2022-11-14 | 2022-12-20 | 阿里巴巴(中国)有限公司 | Exception repair method, device and storage medium |
CN115686901A (en) * | 2022-10-25 | 2023-02-03 | 超聚变数字技术有限公司 | Memory fault analysis method and computer equipment |
CN116126581A (en) * | 2023-04-10 | 2023-05-16 | 阿里云计算有限公司 | Memory fault processing method, device, system, equipment and storage medium |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115686901A (en) * | 2022-10-25 | 2023-02-03 | 超聚变数字技术有限公司 | Memory fault analysis method and computer equipment |
CN115686901B (en) * | 2022-10-25 | 2023-08-04 | 超聚变数字技术有限公司 | Memory fault analysis method and computer equipment |
CN115495278A (en) * | 2022-11-14 | 2022-12-20 | 阿里巴巴(中国)有限公司 | Exception repair method, device and storage medium |
CN115495278B (en) * | 2022-11-14 | 2023-03-31 | 阿里巴巴(中国)有限公司 | Exception repair method, device and storage medium |
CN116126581A (en) * | 2023-04-10 | 2023-05-16 | 阿里云计算有限公司 | Memory fault processing method, device, system, equipment and storage medium |
CN116126581B (en) * | 2023-04-10 | 2023-09-01 | 阿里云计算有限公司 | Memory fault processing method, device, system, equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN115168088A (en) | Method and device for repairing uncorrectable errors of memory | |
WO2021253708A1 (en) | Memory fault handling method and apparatus, device and storage medium | |
CN111786818B (en) | Block chain consensus node state monitoring method and device | |
CN114064333A (en) | Memory fault processing method and device | |
CN115168087B (en) | Method and device for determining repair resource granularity of memory failure | |
CN112799923B (en) | System abnormality cause determination method, device, equipment and storage medium | |
US20230185659A1 (en) | Memory Fault Handling Method and Apparatus | |
CN111626498A (en) | Equipment operation state prediction method, device, equipment and storage medium | |
CN113590429A (en) | Server fault diagnosis method and device and electronic equipment | |
CN115328684A (en) | Memory fault reporting method, BMC and electronic equipment | |
CN111147310A (en) | Log tracking processing method, device, server and medium | |
CN115705261A (en) | Memory fault repairing method, CPU, OS, BIOS and server | |
CN117971539A (en) | Memory fault processing method, computing equipment and management platform | |
WO2024066500A1 (en) | Memory error processing method and apparatus | |
CN114221807B (en) | Access request processing method, device, monitoring equipment and storage medium | |
CN116401085A (en) | Memory exception handling method, equipment and storage medium | |
CN115391075A (en) | Memory fault processing method, system and storage medium | |
CN115114066A (en) | Memory fault monitoring method, system, storage medium and equipment | |
CN117093389A (en) | Memory fault judging method, device, medium and electronic equipment | |
CN110489208B (en) | Virtual machine configuration parameter checking method, system, computer equipment and storage medium | |
CN115686901B (en) | Memory fault analysis method and computer equipment | |
CN114520808A (en) | Request processing method and device, electronic equipment and computer readable storage medium | |
WO2021103304A1 (en) | Data backhaul method, device, and apparatus, and computer-readable storage medium | |
CN112256467B (en) | Error type judging system and method thereof | |
CN117271193A (en) | Memory diagnosis method, device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
TA01 | Transfer of patent application right | Effective date of registration: 20231117. Address after: 10/F, Chuangzhi Tiandi Building, Dongshigeng Street, Zhongdao East Road, Longzihu Wisdom Island, Zhengdong New District, Zhengzhou City, Henan Province, 450000. Applicant after: Henan Kunlun Technology Co.,Ltd. Address before: 450000 Floor 9, building 1, Zhengshang Boya Plaza, Longzihu smart Island, Zhengdong New District, Zhengzhou City, Henan Province. Applicant before: xFusion Digital Technologies Co., Ltd. |