CN115391075A - Memory fault processing method, system and storage medium - Google Patents

Memory fault processing method, system and storage medium

Info

Publication number
CN115391075A
Authority
CN
China
Prior art keywords
fault
mode
target
memory
failure
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210871008.XA
Other languages
Chinese (zh)
Inventor
鲍全洋
张光彪
韦炜玮
李胜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Henan Kunlun Technology Co ltd
Original Assignee
XFusion Digital Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by XFusion Digital Technologies Co Ltd filed Critical XFusion Digital Technologies Co Ltd
Priority to CN202210871008.XA priority Critical patent/CN115391075A/en
Publication of CN115391075A publication Critical patent/CN115391075A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/073 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a memory management context, e.g. virtual memory or cache management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793 Remedial or corrective actions
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C29/00 Checking stores for correct operation; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/04 Detection or location of defective memory elements, e.g. cell construction details, timing of test signals
    • G11C29/08 Functional testing, e.g. testing during refresh, power-on self testing [POST] or distributed testing
    • G11C29/12 Built-in arrangements for testing, e.g. built-in self testing [BIST] or interconnection details
    • G11C29/44 Indication or identification of errors, e.g. for repair
    • G11C29/4401 Indication or identification of errors, e.g. for repair for self repair
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C29/00 Checking stores for correct operation; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/70 Masking faults in memories by using spares or by reconfiguring
    • G11C29/76 Masking faults in memories by using spares or by reconfiguring using address translation or modifications
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C29/00 Checking stores for correct operation; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/70 Masking faults in memories by using spares or by reconfiguring
    • G11C29/78 Masking faults in memories by using spares or by reconfiguring using programmable devices
    • G11C29/80 Masking faults in memories by using spares or by reconfiguring using programmable devices with improved layout
    • G11C29/81 Masking faults in memories by using spares or by reconfiguring using programmable devices with improved layout using a hierarchical redundancy scheme

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The embodiments of the application provide a memory fault processing method, a system, and a storage medium, and relate to the technical field of memories. The method includes receiving fault analysis information of a memory, where the fault analysis information includes the fault severity of the memory and at least one physical location in the memory where a fault has occurred, and performing fault repair on the at least one physical location when the fault severity meets a preset condition. Because the fault analysis information includes the fault severity of the memory, and fault repair is performed on the faulty physical locations only when the fault severity meets the preset condition, the method not only helps reduce the influence of memory faults on the system but also helps avoid overuse of fault repair resources. This prevents a situation in which too few fault repair resources remain when the memory is later predicted to have a more severe fault, and it improves the utilization of the fault repair resources.

Description

Memory fault processing method, system and storage medium
Technical Field
The present application relates to the field of memory technologies, and in particular, to a method, a system, and a storage medium for processing a memory failure.
Background
The memory is an important storage unit of the computer device, and temporarily stores operation data in the processor and data exchanged with an external memory such as a hard disk. As the memory capacity increases, the basic failure rate of the memory increases, which results in an increase in the impact of memory failure on the system of the computer device.
In the related art, to reduce the influence of a memory failure on a system, an error correction method such as parity checking or Error Checking and Correcting (ECC) is generally used to check the memory and correct the detected errors. However, the error correction capability of these methods is limited. If that capability is exceeded, error correction fails, which may cause a serious system failure such as downtime and result in loss of the data in the memory.
Therefore, how to more effectively reduce the influence of the memory failure on the system becomes a technical problem to be solved urgently.
Disclosure of Invention
The embodiments of the application provide a memory fault processing method, a system, and a storage medium, which can repair a memory whose health degree is substandard before the memory fault affects the system, thereby heading off the fault and helping to reduce the influence of memory faults on the system more effectively.
In order to achieve the above purpose, the embodiments of the present application adopt the following technical solutions:
In a first aspect, a method for processing a memory fault is provided. The method includes: a Central Processing Unit (CPU) receives fault analysis information of a target memory, where the fault analysis information includes a fault severity of the target memory and at least one piece of fault location information; the at least one piece of fault location information indicates at least one physical location in the target memory where a fault has occurred; and when the fault severity meets a preset condition, the CPU performs fault repair on the at least one faulty physical location in the target memory.
In this scheme, the fault severity of the target memory and at least one faulty physical location in the target memory are obtained by receiving the fault analysis information of the target memory, and fault repair is performed on the at least one physical location when the fault severity meets a preset condition. Because repair is performed only when the fault severity meets the preset condition, the scheme not only reduces the influence of memory faults on the system but also avoids excessive use of fault repair resources. This prevents a situation in which too few fault repair resources remain when the memory of the computer device is later predicted to have a more severe fault, and it improves the utilization of the fault repair resources.
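The first-aspect flow can be illustrated with a minimal sketch. The names `FaultAnalysisInfo` and `handle_fault_info`, the four-field address tuple, and the "severity at or above a limit" form of the preset condition are all assumptions made for illustration; the application does not specify them.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical physical address layout: (rank, bank, row, column).
Location = Tuple[int, int, int, int]


@dataclass
class FaultAnalysisInfo:
    severity: float                  # assumed scale: larger means more severe
    fault_locations: List[Location]  # physical locations where faults occurred


def handle_fault_info(info: FaultAnalysisInfo,
                      severity_limit: float,
                      repair: Callable[[Location], None]) -> List[Location]:
    """Repair the reported locations only if the preset condition is met."""
    # Assumed preset condition: severity at or above a configured limit.
    if info.severity < severity_limit:
        return []  # not severe enough, so repair resources are saved
    repaired = []
    for loc in info.fault_locations:
        repair(loc)  # platform-specific repair action supplied by the caller
        repaired.append(loc)
    return repaired
```

Passing the repair action in as a callable keeps the decision logic separate from whatever isolation mechanism the platform actually provides.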
In a possible implementation, the fault analysis information further includes at least one fault mode to which the at least one physical location belongs, and each fault mode is configured with a health degree threshold. The fault severity included in the fault analysis information is specifically a health degree value of the target memory, which indicates the fault severity of the target memory. The fault severity meeting the preset condition includes: the health degree value does not reach the health degree threshold of a target fault mode among the at least one fault mode. Performing fault repair on the at least one faulty physical location in the target memory specifically includes: performing fault repair on a target physical location among the at least one faulty physical location, where the target physical location is a physical location that belongs to the target fault mode.
In this implementation, a health degree threshold is set for each fault mode in advance, and the at least one fault mode to which the faulty physical locations in the target memory belong is obtained. Each fault mode present in the target memory therefore has its own health degree threshold, and whether to repair the target physical location belonging to the target fault mode can be decided by comparing the health degree threshold of the target fault mode with the health degree value of the target memory. On the one hand, configuring a health degree threshold for each fault mode effectively configures a threshold for each physical location, so the target physical locations whose health degree is substandard can be screened out by comparing each location's threshold with the health degree value of the target memory. Screening per location in this way improves the accuracy of determining the target physical location and ensures that it is the location with the highest fault severity, that is, the one most urgently in need of repair. Repairing it reduces the influence of the memory fault on the system more than repairing a non-target physical location would.
On the other hand, only the target physical location is repaired rather than all faulty physical locations, which improves the utilization of fault repair resources, uses the limited resources reasonably, and avoids having no repair resources left when the fault severity of the target memory becomes more serious in the future. In addition, because fault modes and fault repair manners are in one-to-one correspondence, the CPU can quickly and accurately determine the fault repair manner for the target physical location from the fault mode to which that location belongs in the fault analysis information. This improves repair efficiency and the match between the repair manner and the physical location, and avoids invalid repairs and wasted repair resources caused by redundant resources that do not match the physical location.
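The per-mode screening described above can be sketched as follows. The function name `select_targets`, the dictionary layout, and the numeric health scale are illustrative assumptions; "does not reach the threshold" is read here as "strictly less than".

```python
from typing import Dict, List, Tuple

Location = Tuple[int, ...]


def select_targets(health: float,
                   locations_by_mode: Dict[str, List[Location]],
                   thresholds: Dict[str, float]) -> Dict[str, List[Location]]:
    """Return, per fault mode, the locations that should be repaired.

    A mode becomes a target mode when the memory's health degree value
    does not reach (is below) that mode's health degree threshold.
    """
    targets: Dict[str, List[Location]] = {}
    for mode, locations in locations_by_mode.items():
        if health < thresholds[mode]:
            targets[mode] = list(locations)
    return targets
```

With a health value of 60, a mode whose threshold is 70 would be selected for repair, while a mode whose threshold is 40 would be left alone, so only part of the faulty locations consume repair resources.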
In another possible implementation, the at least one failure mode includes a first failure mode, and the method further includes: the CPU determines the health degree threshold of the first failure mode according to the repair cost of the first failure mode.
In this implementation, an appropriate health degree threshold can be set according to the repair cost of the first failure mode. For example, a relatively low threshold is set when the repair cost is relatively high, to reduce the probability of performing fault repair, and a relatively high threshold is set when the repair cost is relatively low, to increase that probability. This avoids paying a high cost to repair a memory fault while still allowing the target memory to be repaired at low cost, which helps reduce, at low cost, the probability of an uncorrectable error occurring in the target memory in the future.
In another possible implementation, the CPU determining the health degree threshold of the first failure mode according to the repair cost of the first failure mode includes: the CPU obtains a first number of available repair resources of the first failure mode; and the CPU determines the health degree threshold of the first failure mode according to the first number, where the first number is inversely proportional to the repair cost.
In this implementation, the health degree threshold of the first failure mode is determined according to the first number of available repair resources of the first failure mode, which helps use the limited available repair resources reasonably. For example, when the first failure mode has more available repair resources, a larger health degree threshold is determined for it, and when it has fewer, a smaller threshold is determined. This improves the utilization of the available repair resources and prevents them from being over-used, which would otherwise leave too few resources for fault repair when the target memory is later predicted to have a more severe fault.
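One way to realize "more available resources, larger threshold" is a linear interpolation between a lower and an upper bound. This is only a sketch: the function name, the bounds `low` and `high`, and the linear form are assumptions, since the application states only the direction of the relationship.

```python
def threshold_from_resources(available: int, total: int,
                             low: float = 20.0, high: float = 80.0) -> float:
    """Map the number of available repair resources to a health threshold.

    More remaining resources give a larger threshold (repair is triggered
    more readily); fewer remaining resources give a smaller one.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    # Clamp the fraction so inconsistent counts cannot leave [low, high].
    fraction = min(max(available / total, 0.0), 1.0)
    return low + (high - low) * fraction
```

For example, with 4 of 8 spare resources left and the default bounds, the threshold sits halfway between 20 and 80.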
In another possible implementation, the CPU determining the health degree threshold of the first failure mode according to the repair cost of the first failure mode includes: the CPU determines the health degree threshold of the first failure mode according to the degree to which the first failure mode affects the performance of the computer device to which the target memory belongs, where the degree of influence is directly proportional to the repair cost.
In this implementation, an appropriate health degree threshold is set for the first failure mode according to its influence on the performance of the computer device. For example, when the first failure mode has a large influence on performance, a smaller health degree threshold is determined for it, and when the fault repair manner used for the first failure mode has a small influence on performance, a larger threshold is determined. The physical locations on the target memory whose repair affects performance least can thus be repaired, which helps avoid degrading the performance of the computer device in order to repair a memory fault.
In another possible implementation, the at least one failure mode includes a first failure mode, and the method further includes: the CPU obtains a second number of available repair resources of the first failure mode; and the CPU updates the health degree threshold of the first failure mode according to the second number.
In this implementation, the health degree threshold of the first failure mode is updated according to the second number of available repair resources of the first failure mode, so the threshold accurately reflects the current cost of repairing physical locations that belong to the first failure mode. This helps decide more accurately whether to perform fault repair on such locations.
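The update step can be sketched as a small stateful helper that recomputes the threshold whenever repair resources are consumed. The class `ModeBudget`, its linear formula, and the bounds are hypothetical; the application only says that the threshold is updated from the second number of available resources.

```python
class ModeBudget:
    """Tracks spare repair resources for one failure mode (hypothetical)."""

    def __init__(self, total_spares: int, low: float = 20.0, high: float = 80.0):
        if total_spares <= 0:
            raise ValueError("total_spares must be positive")
        self.total = total_spares
        self.available = total_spares
        self.low = low
        self.high = high
        self.threshold = high  # all spares available at start

    def consume(self, count: int = 1) -> None:
        """Record that `count` spares were used, then refresh the threshold."""
        if count > self.available:
            raise ValueError("not enough spare resources")
        self.available -= count
        self.refresh()

    def refresh(self) -> None:
        # Assumed rule: the threshold shrinks linearly as spares are used up.
        span = self.high - self.low
        self.threshold = self.low + span * self.available / self.total
```

As spares run out the threshold falls, so later repairs of this mode are triggered only when the memory's health value is correspondingly lower.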
In another possible implementation, the fault analysis information further includes at least one fault repair manner used by the at least one physical location, and each fault repair manner is configured with a health degree threshold. The fault severity included in the fault analysis information is specifically a health degree value of the target memory, which indicates the fault severity of the target memory. The fault severity meeting the preset condition includes: the health degree value does not reach the health degree threshold of a target fault repair manner among the at least one fault repair manner. Performing fault repair on the at least one faulty physical location in the target memory specifically includes: performing fault repair on a target physical location among the at least one faulty physical location, where the target physical location is a physical location that uses the target fault repair manner.
In this implementation, a health degree threshold is set for each fault repair manner in advance, and the at least one fault repair manner used by the faulty physical locations in the target memory is obtained. Each fault repair manner required by the target memory therefore has its own health degree threshold, and whether to repair the target physical location that uses the target fault repair manner can be decided by comparing the health degree threshold of the target fault repair manner with the health degree value. On the one hand, the target physical locations whose health degree is substandard are screened out by comparing each location's health degree threshold with the health degree value of the target memory. Screening per location in this way improves the accuracy of determining the target physical location and ensures that it is the location with the highest fault severity, that is, the one most urgently in need of repair. Repairing it reduces the influence of the memory fault on the system more than repairing a non-target physical location would. On the other hand, only the target physical location is repaired rather than all faulty physical locations, which improves the utilization of fault repair resources, uses the limited resources reasonably, and avoids having no repair resources left when the fault severity of the target memory becomes more serious in the future.
In addition, the CPU can repair the target physical location according to the fault repair manner specified for that location in the fault analysis information. This improves repair efficiency and the match between the repair manner and the physical location, and avoids invalid repairs and wasted repair resources caused by redundant resources that do not match the physical location.
In another possible implementation, the fault processing method further includes: outputting alarm information when the fault severity meets an alarm condition, where the alarm information indicates that the target memory is at risk of uncorrectable errors.
In this implementation, an alarm condition is set, and when the fault severity of the target memory meets the alarm condition, alarm information is output to warn the user that the target memory is at risk of failure, so that the user can decide, based on the actual situation, whether to take action on the target memory.
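A minimal sketch of the alarm check follows. It assumes the alarm condition is "health degree value below an alarm threshold" and that the alarm is a text message; both the function name `check_alarm` and the message wording are illustrative, not taken from the application.

```python
from typing import Optional


def check_alarm(health: float, alarm_threshold: float) -> Optional[str]:
    """Return an alarm message if the alarm condition is met, else None."""
    # Assumed alarm condition: health value below the alarm threshold.
    if health < alarm_threshold:
        return (f"WARNING: memory health {health:.0f} is below "
                f"{alarm_threshold:.0f}; risk of uncorrectable errors")
    return None
```

Returning the message instead of printing it leaves the choice of output channel (log, management interface, and so on) to the caller.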
In a second aspect, a method for processing a memory fault is provided. The method includes: an out-of-band controller acquires error information of a target memory; the out-of-band controller determines fault analysis information of the target memory according to the error information, where the fault analysis information includes a fault severity of the target memory and at least one piece of fault location information, and the at least one piece of fault location information indicates at least one physical location in the target memory where a fault has occurred; and the out-of-band controller sends the fault analysis information to a Central Processing Unit (CPU). The fault analysis information is used by the CPU to perform fault repair on the at least one faulty physical location in the target memory when the fault severity meets a preset condition.
In this scheme, the out-of-band controller determines the fault analysis information of the target memory according to the error information of the target memory, where the fault analysis information includes the fault severity of the target memory and at least one faulty physical location in the target memory, and sends the fault analysis information to the CPU. The CPU can then decide, based on whether the fault severity in the fault analysis information meets a preset condition, whether to repair the at least one faulty physical location indicated by the fault location information. This not only helps reduce the influence of memory faults on the system but also helps the CPU optimize how it performs fault repair, for example by repairing only when the fault severity meets the preset condition, which improves the CPU's utilization of fault repair resources.
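The division of work between the out-of-band controller and the CPU can be sketched as two functions exchanging a message. Everything concrete here is an assumption for illustration: the JSON encoding, the record fields, the function names `bmc_analyze` and `cpu_receive`, and the toy rule that health starts at 100 and drops 5 points per recorded error.

```python
import json
from typing import Callable, Dict, List, Tuple

Location = Tuple[int, int, int, int]  # assumed (rank, bank, row, col)


def bmc_analyze(error_records: List[Dict[str, int]]) -> str:
    """Out-of-band side: turn raw error records into fault analysis info."""
    locations = sorted({(e["rank"], e["bank"], e["row"], e["col"])
                        for e in error_records})
    # Toy severity model (assumption): lose 5 health points per error.
    health = max(0, 100 - 5 * len(error_records))
    return json.dumps({"health": health, "fault_locations": locations})


def cpu_receive(message: str, threshold: float,
                repair: Callable[[Location], None]) -> bool:
    """CPU side: repair only if health does not reach the threshold."""
    info = json.loads(message)
    if info["health"] >= threshold:
        return False
    for loc in info["fault_locations"]:
        repair(tuple(loc))  # JSON lists are converted back to tuples
    return True
```

In this sketch the analysis runs entirely on the out-of-band side, and the CPU only compares one number against its threshold before acting.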
In a possible implementation, the fault severity included in the fault analysis information is specifically a health degree value of the target memory, which indicates the fault severity of the target memory. The fault analysis information further includes at least one fault mode to which the at least one physical location belongs, and each fault mode is configured with a health degree threshold. The fault analysis information is specifically used by the CPU to perform fault repair on a target physical location among the at least one faulty physical location in the target memory when the health degree value does not reach the health degree threshold of a target fault mode among the at least one fault mode, where the target physical location is a physical location that belongs to the target fault mode.
In this implementation, the at least one fault mode to which the at least one physical location belongs is analyzed and sent to the CPU. On the one hand, the CPU can use the health degree threshold set in advance for each fault mode to configure a health degree threshold for each physical location, so that whether the health degree of each physical location is up to standard can be judged individually. Fault repair can then be performed only on the target physical locations on the target memory whose health degree is substandard, which makes fault repair more reasonable and more cost-effective. On the other hand, because fault modes and fault repair manners are in one-to-one correspondence, the CPU can quickly and accurately determine the fault repair manner for a physical location from the fault mode to which it belongs. This improves repair efficiency and the match between the repair manner and the physical location, and avoids invalid repairs and wasted repair resources caused by redundant resources that do not match the physical location.
In another possible implementation, the method further includes: the out-of-band controller performs clustering on the at least one piece of fault location information to obtain the at least one fault mode.
In this implementation, clustering the fault location information clusters the faulty physical locations in the target memory and yields at least one fault mode. One fault mode can therefore indicate the physical area in which the physical locations belonging to that mode lie, so that when fault repair is performed according to the fault mode, the physical locations in that area can be repaired together. This improves repair efficiency and increases the number of faulty physical locations covered by each repair.
In another possible implementation, the failure mode includes any one of the following: a page fault, a bit fault, a row fault, and a storage array fault.
In this implementation, because the failure mode is one of a page fault, a bit fault, a row fault, and a storage array fault, the failure mode to which a physical location belongs can be predicted accurately from the fault location information that indicates the physical location, which improves the accuracy and convenience of predicting that failure mode.
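As a rough illustration of grouping fault locations into modes, the sketch below uses a simple rule-based grouping per bank as a stand-in for the clustering step, which the application does not specify. The three-field address `(bank, row, col)`, the function name `classify_faults`, and the labeling rules are assumptions; page faults are omitted because they depend on the physical-to-page mapping.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Cell = Tuple[int, int, int]  # assumed (bank, row, col)


def classify_faults(locations: Iterable[Cell]) -> Dict[int, str]:
    """Group faulty cells by bank and label each group with a fault mode."""
    by_bank: Dict[int, Set[Tuple[int, int]]] = defaultdict(set)
    for bank, row, col in locations:
        by_bank[bank].add((row, col))

    modes: Dict[int, str] = {}
    for bank, cells in by_bank.items():
        rows = {row for row, _ in cells}
        if len(cells) == 1:
            modes[bank] = "bit"            # a single faulty cell
        elif len(rows) == 1:
            modes[bank] = "row"            # several cells in one row
        else:
            modes[bank] = "storage_array"  # faults spread across rows
    return modes
```

Each label covers a whole group of locations, which is what lets a single repair action address every faulty cell in the indicated area.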
In another possible implementation, the fault severity included in the fault analysis information is specifically a health degree value of the target memory, which indicates the fault severity of the target memory. The fault analysis information further includes at least one fault repair manner used by the at least one physical location, and each fault repair manner is configured with a health degree threshold. The fault analysis information is specifically used by the CPU to perform fault repair on a target physical location among the at least one faulty physical location in the target memory when the health degree value does not reach the health degree threshold of a target fault repair manner among the at least one fault repair manner, where the target physical location is a physical location that uses the target fault repair manner.
In this implementation, the at least one fault repair manner used by the at least one physical location is analyzed and sent to the CPU. On the one hand, the CPU can use the health degree threshold set in advance for each fault repair manner to configure a health degree threshold for each physical location, so that whether the health degree of each physical location is up to standard can be judged individually. Fault repair can then be performed only on the target physical locations on the target memory whose health degree is substandard, which makes fault repair more reasonable and more cost-effective. On the other hand, the CPU can quickly and accurately repair a physical location according to the fault repair manner that the location requires. This improves repair efficiency and the match between the repair manner and the physical location, and avoids invalid repairs and wasted repair resources caused by redundant resources that do not match the physical location.
In another possible implementation, the method further includes: the out-of-band controller performs clustering on the at least one piece of fault location information to obtain the at least one fault repair manner.
In this implementation, clustering the fault location information clusters the faulty physical locations in the target memory and yields at least one fault repair manner. One fault repair manner can therefore indicate the physical area in which the physical locations using that manner lie, so that when fault repair is performed according to the fault repair manner, the physical locations in that area can be repaired together. This improves repair efficiency and increases the number of faulty physical locations covered by each repair.
In another possible implementation manner, the fault repairing manner includes any one of the following: page isolation, bit isolation, row isolation, and storage array isolation.
In this implementation, the fault repair mode is limited to one of page isolation, bit isolation, row isolation, and storage array isolation, so that the fault repair mode to be used for a physical location can be accurately predicted from the fault location information indicating that physical location, which improves the accuracy and convenience of predicting the fault repair mode used for the physical location.
In another possible implementation manner, the determining, by the out-of-band controller, the fault analysis information of the target memory according to the error information of the target memory includes: the out-of-band controller inputs the error information of the target memory into the machine learning model to obtain the fault analysis information of the target memory output by the machine learning model.
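The clustering of fault locations into repair modes described above can be illustrated with a minimal sketch. The rule used here (a bank with at least a threshold number of faulty rows is handled by storage array isolation, otherwise each row is handled by row isolation), the threshold value, and all names are assumptions made for illustration only; the application does not specify the clustering rule.

```python
from collections import defaultdict

# Hypothetical threshold: number of faulty rows in one bank that justifies
# isolating the whole storage array (bank) instead of individual rows.
BANK_ROW_THRESHOLD = 3


def cluster_repair_modes(locations, bank_row_threshold=BANK_ROW_THRESHOLD):
    """Group faulty rows by bank and pick one repair mode per cluster.

    `locations` is an iterable of (bank_id, row) tuples. The result maps a
    repair mode name to the physical areas that mode should cover.
    """
    rows_by_bank = defaultdict(set)
    for bank_id, row in locations:
        rows_by_bank[bank_id].add(row)

    modes = {"storage_array_isolation": [], "row_isolation": []}
    for bank_id in sorted(rows_by_bank):
        rows = rows_by_bank[bank_id]
        if len(rows) >= bank_row_threshold:
            # Many faulty rows in one bank: repair the bank as one area.
            modes["storage_array_isolation"].append(bank_id)
        else:
            # Few faulty rows: repair each row on its own.
            modes["row_isolation"].extend((bank_id, r) for r in sorted(rows))
    return modes
```

With this rule, three faulty rows in bank 0 are covered by a single storage array isolation, which is the sense in which one repair mode can cover several faulty physical locations at once.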
In the implementation mode, because the machine learning model is trained in advance, the fault analysis information of the target memory can be obtained by inputting the error information of the target memory into the machine learning model without the participation of a user in calculation and data processing, so that the speed of fault prediction is increased, manual errors caused by user operation can be avoided, and the accuracy of the fault analysis information is improved.
In a third aspect, a memory fault processing apparatus is provided. The apparatus includes functional units for executing any one of the methods provided by the first aspect, where the actions performed by the respective functional units are implemented by hardware or by hardware executing corresponding software. For example, the memory fault processing apparatus may be processor firmware and specifically includes a receiving unit and a processing unit. The receiving unit is configured to receive fault analysis information of a target memory, where the fault analysis information includes the fault severity of the target memory and at least one fault location information, and the at least one fault location information indicates at least one physical location in the target memory where a fault occurs. The processing unit is configured to perform fault repair on the at least one physical location where a fault occurs in the target memory when the fault severity meets a preset condition.
In a fourth aspect, a memory fault processing apparatus is provided. The apparatus includes functional units for executing any one of the methods provided by the second aspect, where the actions performed by the respective functional units are implemented by hardware or by hardware executing corresponding software. For example, the memory fault processing apparatus may be an out-of-band management module and specifically includes an obtaining unit, a determining unit, and a sending unit. The obtaining unit is configured to obtain error information of a target memory. The determining unit is configured to determine fault analysis information of the target memory according to the error information, where the fault analysis information includes the fault severity of the target memory and at least one fault location information, and the at least one fault location information indicates at least one physical location in the target memory where a fault occurs. The sending unit is configured to send the fault analysis information to a central processing unit (CPU), where the fault analysis information is used by the CPU to perform fault repair on the at least one physical location where a fault occurs in the target memory when the fault severity meets a preset condition.
In a fifth aspect, a memory failure handling system is provided, including: processor firmware and out-of-band management modules. The processor firmware is configured to perform any one of the methods provided by the first aspect, and the out-of-band management module is configured to perform any one of the methods provided by the second aspect.
In a sixth aspect, a computer device is provided, including a processor and a memory, where the processor is connected to the memory. The memory is configured to store computer-executable instructions, and the processor executes the computer-executable instructions stored in the memory, thereby implementing any one of the methods provided by the first aspect, or implementing any one of the methods provided by the second aspect.
In a seventh aspect, a chip is provided, for example, a chip of processor firmware or a chip of an out-of-band management module, where the chip includes: a processor and an interface circuit; the interface circuit is used for receiving the code instruction and transmitting the code instruction to the processor; a processor for executing code instructions to perform any of the methods provided by the first aspect above, or to perform any of the methods provided by the second aspect above.
In an eighth aspect, a computer-readable storage medium is provided, which stores computer-executable instructions that, when executed on a computer, cause the computer to perform any one of the methods provided by the first aspect, or perform any one of the methods provided by the second aspect.
In a ninth aspect, there is provided a computer program product comprising computer executable instructions which, when executed on a computer, cause the computer to perform any one of the methods provided by the first aspect above, or to perform any one of the methods provided by the second aspect above.
For technical effects brought by any one of the design manners in the third aspect to the ninth aspect, reference may be made to technical effects brought by different design manners in the first aspect or the second aspect, and details are not described here.
Drawings
Fig. 1 is an architecture diagram of a computer device according to an embodiment of the present application;
Fig. 2 is a flowchart of a memory fault processing method according to an embodiment of the present disclosure;
Fig. 3 is a flowchart of another memory fault processing method according to an embodiment of the present disclosure;
Fig. 4 is a flowchart of another memory fault processing method according to an embodiment of the present disclosure;
Fig. 5 is a flowchart of another memory fault processing method according to an embodiment of the present disclosure;
Fig. 6 is a schematic diagram of a memory fault processing apparatus according to an embodiment of the present application;
Fig. 7 is a schematic diagram of another memory fault processing apparatus according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
In the description of the present application, unless otherwise stated, "/" indicates an "or" relationship between the associated objects; for example, A/B may indicate A or B. In the present application, "and/or" merely describes an association relationship between associated objects and means that three relationships may exist; for example, A and/or B may mean: A exists alone, both A and B exist, or B exists alone, where A and B may be singular or plural.
Also, in the description of the present application, "a plurality" means two or more unless otherwise specified. "At least one of the following" or similar expressions refer to any combination of these items, including any combination of single or plural items. For example, at least one of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, and c may each be single or multiple.
In addition, to facilitate a clear description of the technical solutions of the embodiments of the present application, terms such as "first" and "second" are used in the embodiments of the present application to distinguish identical or similar items having substantially the same functions and actions. Those skilled in the art will appreciate that the terms "first," "second," and the like do not denote any quantity, order, or importance. Also, in the embodiments of the present application, the words "exemplary" or "such as" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "such as" is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, use of the word "exemplary" or "such as" is intended to present relevant concepts in a concrete fashion for ease of understanding.
First, an application scenario of the embodiment of the present application is exemplarily described.
The memory fault processing method in the embodiments of the present application can be applied to memories such as dynamic random access memory (DRAM) and static random access memory (SRAM); the embodiments of the present application do not limit the type of the memory.
At present, most processors of computer devices support an error correcting code (ECC) error correction method: each time the memory executes a read/write task, the processor performs ECC error correction on the memory. However, ECC can only correct 1-bit errors and detect 2-bit errors; errors of more than 1 bit cannot be corrected, and errors of more than 2 bits are not guaranteed to be detected. Specifically, errors that can be corrected are called correctable errors (CE). If the capability of the error correction algorithm is exceeded, for example, when multiple bits fail over a large range of the memory, error correction fails and an uncorrectable error (UCE) is generated, which can cause a serious system failure of the computer device, such as downtime, and loss of the data in the memory.
In view of this, the following embodiments of the present application provide a memory fault processing method, which obtains the fault severity of a target memory and at least one physical location where a fault occurs in the target memory by receiving fault analysis information of the target memory, and performs fault repair on the at least one physical location when the fault severity meets a preset condition. Because the fault analysis information includes the fault severity of the target memory, and fault repair is performed on the at least one physical location where a fault occurs only when the fault severity meets the preset condition, the method not only helps reduce the possibility that uncorrectable errors occur in the target memory in the future and thereby reduces the impact of memory faults on the system, but also helps avoid excessive use of fault repair resources, which would otherwise leave insufficient fault repair resources available when the fault severity of the target memory later becomes more serious, thereby improving the utilization of the fault repair resources.
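The limit described above (correct one bit, detect two bits) can be illustrated with a toy single-error-correcting, double-error-detecting code. The sketch below uses an extended Hamming (8,4) code over a 4-bit value; this is a swapped-in teaching example and an assumption, not the code used by any particular processor, and real server ECC operates on much wider words.

```python
def encode(nibble):
    """Encode a 4-bit value as an 8-bit extended Hamming codeword.

    Index 0 holds the overall parity bit; indexes 1..7 are the classic
    Hamming positions (parity bits at 1, 2, 4; data bits at 3, 5, 6, 7).
    """
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    for bit in code[1:]:
        code[0] ^= bit  # overall parity over positions 1..7
    return code


def _data(code):
    return code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)


def decode(code):
    """Return ("OK" | "CE" | "UCE", value); value is None for a UCE."""
    syndrome = 0
    overall = 0
    for pos, bit in enumerate(code):
        overall ^= bit
        if bit and pos:
            syndrome ^= pos  # XOR of the positions of all set bits
    if syndrome == 0 and overall == 0:
        return "OK", _data(code)
    if overall == 1:
        # Odd number of flips: treat as a single-bit error and correct it.
        fixed = list(code)
        fixed[syndrome] ^= 1  # syndrome 0 means the overall parity bit flipped
        return "CE", _data(fixed)
    # Even number of flips with a non-zero syndrome: detected, not correctable.
    return "UCE", None
```

Flipping any one bit of a codeword yields a CE with the original value restored, and flipping any two bits yields a UCE. Three or more flips can be miscorrected silently, which matches the statement that errors of more than 2 bits are not guaranteed to be detected.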
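As a minimal sketch of the flow just described, the following code receives fault analysis information and repairs the reported locations only when the fault severity meets a preset condition. The field names, the numeric severity scale (larger means more severe), the threshold comparison, and the `repair` callback are all hypothetical; the application leaves the concrete preset condition open.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class FaultAnalysisInfo:
    # Hypothetical scale: a larger value means a more severe fault.
    severity: int
    # Each entry identifies one physical location where a fault occurs.
    fault_locations: List[str] = field(default_factory=list)


def handle_fault_analysis(info: FaultAnalysisInfo,
                          severity_threshold: int,
                          repair: Callable[[str], None]) -> List[str]:
    """Repair every reported location only if the preset condition holds."""
    if info.severity < severity_threshold:
        # Condition not met: keep the repair resources in reserve.
        return []
    repaired = []
    for location in info.fault_locations:
        repair(location)
        repaired.append(location)
    return repaired
```

Gating the repair on severity is what keeps limited repair resources available for a later, more serious fault.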
Next, an exemplary description is given of a system architecture according to an embodiment of the present application.
Fig. 1 is a schematic diagram of a system architecture according to an embodiment of the present disclosure. The system architecture is illustrated by taking a computer device as an example. Referring to fig. 1, the hardware portion of the computer device mainly includes a processor, an out-of-band controller, and a memory, and the software portion mainly includes an out-of-band management module, processor firmware, and an operating system (OS) management unit. The out-of-band management module is located in the out-of-band controller, the OS management unit is located in the processor, and the processor firmware may be located in the processor (as shown in fig. 1) or in a firmware chip outside the processor (not shown in fig. 1). The out-of-band management module may be a management unit of a non-service module. For example, the out-of-band management module may be completely independent of the operating system of the computer device, and may communicate with a basic input output system (BIOS) and the OS (or the OS management unit) through an out-of-band management interface of the computer device.
Illustratively, the out-of-band management module may include a monitoring management unit external to the computer device, a management system in a management chip outside the processor, a baseboard management controller (BMC) of the computer device, a system management module (SMM), and the like. It should be noted that the specific form of the out-of-band management module is not limited in the embodiments of the present application, and the above description is only an example. In the following embodiments, the out-of-band management module being a BMC is used only as an example.
It should be noted that, the out-of-band management module described in the following embodiments performs a certain step (e.g., the following S201), and it is understood that: the out-of-band controller calls the out-of-band management module to execute the step.
For example, the processor firmware (also referred to as a processor firmware program) may be firmware, a basic input output system (BIOS), a management engine (ME), microcode, or an intelligent management unit (IMU). It should be noted that the specific form of the processor firmware is not limited in the embodiments of the present application, and the above is only an exemplary description. In the following embodiments, the processor firmware being a BIOS is used only as an example.
It should be noted that, the processor firmware described in the following embodiments executes a certain step (e.g., S301 below), which may be understood as: the CPU calls the processor firmware to perform this step.
The memory, also called internal memory or main memory, is installed in a memory slot on a motherboard of the computer device, and the memory and the memory controller communicate with each other through a memory channel (channel). The memory has at least one memory rank (rank), each memory rank is located on one side of the memory, and each memory rank includes at least one sub-rank (subrank). A memory rank or sub-rank includes a plurality of memory chips (device), each memory chip is divided into a plurality of memory array groups (bank group), each memory array group includes a plurality of memory arrays (bank), each memory array is divided into a plurality of memory cells (cell), each memory cell has a row (row) address and a column (column) address, and each memory cell includes one or more bits. In one division mode, the memory can be divided, from upper level to lower level, into memory chips, memory array groups, memory arrays, memory rows/memory columns, memory cells, and bits, where the addresses of the memory chips, memory array groups, memory arrays, memory rows, memory columns, memory cells, and bits in the memory are real physical addresses. In another division mode, the CPU divides the memory chip into a plurality of memory pages (page) based on a paging mechanism, where the addresses of the memory pages are virtual addresses, and the virtual addresses can be converted into real physical addresses.
It should be noted that the system architecture and the application scenario described in the embodiment of the present application are for more clearly illustrating the technical solution of the embodiment of the present application, and do not constitute a limitation to the technical solution provided in the embodiment of the present application, and as a person having ordinary skill in the art knows that along with the evolution of the system architecture and the appearance of a new service scenario, the technical solution provided in the embodiment of the present application is also applicable to similar technical problems.
For convenience of understanding, the memory failure handling method provided by the present application is exemplarily described below with reference to the accompanying drawings.
The following embodiments of the present application will be described in an exemplary manner by dividing into two parts.
First, a process of determining, by the out-of-band management module, fault analysis information of a target memory based on error information of the target memory is described with reference to fig. 2, where the fault analysis information includes a fault severity and at least one physical location of a fault occurring in the target memory.
The second part, with reference to fig. 3 to 5, describes a process in which the processor firmware receives the failure analysis information of the first part, and determines whether to repair at least one physical location in the target memory where a failure occurs based on the severity of the failure of the target memory.
Fig. 2 is a flow chart illustrating a memory failure handling method according to an example embodiment. Illustratively, the method includes S201-S203.
S201: The out-of-band management module acquires the error information of the target memory.
Optionally, the error information of the target memory includes at least one of: the category of the CE, the occurrence time of the CE, the number of CE errors, the physical address information of the CE, the number of memory patrol errors, the memory patrol error row address, the row address with the most memory patrol errors, the type of the uncorrectable error, the state of the uncorrectable error, the occurrence time of the uncorrectable error, the number of uncorrectable errors, the physical address information of the uncorrectable error, ECC error correction register information, machine check architecture (MCA) register information, MCA report information, and mode register (MR) information.
The CE category includes patrol correctable errors, read-write correctable errors, move correctable errors, mirror image write-back correctable errors, and the like.
The types of uncorrectable errors include burst fatal errors, software recoverable action optional (SRAO) errors, uncorrected no action (UCNA) errors, software recoverable action required (SRAR) errors, and patrol uncorrectable errors.
Because correctable errors occur many times in a memory, and correctable errors usually occur before uncorrectable errors occur in the memory, generating the fault analysis information of the target memory based on the correctable error information of the target memory helps increase the amount of base data available for prediction, and thus helps improve the accuracy of the fault analysis information of the target memory.
In addition, the location of the memory where the uncorrectable error has occurred is more prone to generate the uncorrectable error than other locations, and therefore, the generation of the failure analysis information of the target memory based on the uncorrectable error information of the target memory is helpful for improving the accuracy of the failure analysis information of the target memory.
In some embodiments, the error information of the target memory is the error information that occurred in the target memory within a first time period, where the duration of the first time period may be 5 days, 15 days, 30 days, or the like; that is, the first time period may be the 5 days, 15 days, 30 days, or the like preceding the time point at which the error information is acquired.
The error information of the target memory can be obtained in various ways, which are exemplarily described below as mode 1 and mode 2.
Mode 1: When an error is detected in the target memory, the out-of-band management module acquires the error information of the target memory.
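The first time period can be read as a look-back window ending at the acquisition time. A minimal sketch follows, assuming each error record is a dictionary with a `time` field holding a `datetime`; the record layout and function name are hypothetical.

```python
from datetime import datetime, timedelta


def errors_in_window(records, acquired_at, window_days):
    """Keep the error records that fall within the look-back window.

    The window covers the `window_days` days that end at `acquired_at`
    (for example 5, 15, or 30 days).
    """
    start = acquired_at - timedelta(days=window_days)
    return [r for r in records if start <= r["time"] <= acquired_at]
```

For example, with an acquisition time of 31 January and a 15-day window, a record dated 20 January is kept and a record dated 2 January is dropped.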
In the related art, each time the memory executes a read/write task, the CPU or the processor firmware performs error detection on the memory based on an ECC error correction method, and corrects the detected error if the error is detected.
In some embodiments, upon detecting the error, error information for the target memory is sent by the CPU or processor firmware to the out-of-band management module. In other embodiments, after an error is detected and corrected, the error information of the target memory is sent by the CPU or processor firmware to the out-of-band management module.
It is understood that in mode 1, the target memory is the memory in which the error is detected on the computer device.
Mode 2: the out-of-band management module periodically acquires error information of the target memory according to a first preset period.
In some embodiments, the first preset period may be 5 days, 15 days, 30 days, and the like, which is not limited by the embodiment of the present application.
In some embodiments, the error information of the target memory is acquired periodically either by the out-of-band management module actively acquiring it from the CPU or the processor firmware, or by the CPU or the processor firmware periodically collecting the error information of the target memory and then actively sending it to the out-of-band management module, which is not limited in this application.
It can be understood that in mode 2, the target memory may be any memory on the computer device.
In some embodiments, when the out-of-band management module obtains the error information of the target memory, it may also obtain target prediction information, so as to determine the fault analysis information of the target memory based on the error information of the target memory and the target prediction information. The target prediction information includes at least one of running state information of the target memory, inherent parameter information of the target memory, inherent parameter information of a CPU of the computer device where the target memory is located, running state information of the CPU, and configuration information of the processor firmware.
The running state information of the target memory comprises at least one item of occupancy rate information, temperature information and running program information of the target memory. The intrinsic parameter information of the target memory comprises at least one of the type, manufacturer identification, capacity, process generation, main frequency, serial number, minimum voltage, memory column number and bit width of the memory. The inherent parameter information of the CPU comprises at least one item of manufacturer identification, category, model, dominant frequency and process generation of the CPU. The running state information of the CPU comprises at least one of CPU occupancy rate information, temperature information and running program information. The configuration information of the processor firmware includes configuration items, for example, the configuration items may be single refresh frequency, double refresh frequency, etc.
In some embodiments, when the fault analysis information is determined according to the error information of the target memory and the target prediction information, a two-dimensional or multidimensional combination parameter may be constructed from the error information and the target prediction information. The combination parameter may take a form such as (error information, inherent parameter information of the target memory, ..., running state information of the CPU) or (CE occurrence time, number of CE errors, ..., CE category), and the fault analysis information is then determined according to the combination parameter. In this way, the health of the target memory with respect to faults can be evaluated in multiple dimensions, which further helps improve the accuracy of the fault analysis information.
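One way to picture the combination parameter is as a tuple assembled from selected fields of the error information and the target prediction information. The sketch below is an assumption made for illustration: the field names are hypothetical, and the application does not prescribe how the parameter is encoded.

```python
def build_combination_parameter(error_info, prediction_info, fields):
    """Build a multidimensional combination parameter as a tuple.

    `error_info` and `prediction_info` are dictionaries; `fields` lists, in
    order, the dimensions to include. A missing field raises KeyError so
    that an incomplete parameter is never passed on silently.
    """
    merged = {**prediction_info, **error_info}
    return tuple(merged[name] for name in fields)


# Hypothetical example values.
error_info = {"ce_time": "2022-07-01T10:00", "ce_count": 42, "ce_category": "patrol"}
prediction_info = {"memory_temperature_c": 61, "cpu_occupancy": 0.35}

parameter = build_combination_parameter(
    error_info, prediction_info,
    fields=("ce_time", "ce_count", "ce_category", "memory_temperature_c"),
)
```

Here `parameter` combines three error dimensions with one running state dimension, in the same spirit as the (CE occurrence time, number of CE errors, ..., CE category) form mentioned above.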
It should be noted that the principle of obtaining the target prediction information is the same as the principle of obtaining the error information of the target memory, and therefore, how to obtain the target prediction information, reference may be made to the obtaining process and the related description of the error information, and details are not repeated here.
In this embodiment, obtaining the target prediction information and using it to generate the fault analysis information enriches the prediction parameters and allows the fault severity of the target memory to be evaluated from multiple aspects. This improves the accuracy of the predicted fault severity and of the at least one physical location where a fault occurs in the target memory, helps improve the accuracy of fault repair on the target memory, and further helps avoid the impact of memory faults on the system.
S202: The out-of-band management module determines fault analysis information of the target memory according to the error information, where the fault analysis information includes the fault severity and at least one fault location information.
Optionally, the fault analysis is performed on the target memory based on the error information of the target memory to obtain the fault severity of the target memory. The failure severity represents the influence degree of the memory failure of the target memory on the system, specifically, the more serious the failure degree is, the worse the health condition is, the higher the possibility of the target memory for generating an uncorrectable error in the future is, the greater the influence on the system is, and conversely, the smaller the influence on the system is.
Optionally, fault analysis is performed on the target memory based on the error information of the target memory to obtain at least one fault location information, for example, fault location information 1, ..., fault location information N shown in fig. 2. In the target memory, the physical location indicated by the fault location information is more prone to uncorrectable errors than other physical locations, so performing fault repair on the at least one physical location can more effectively avoid the impact of a memory fault on the system.
In some embodiments, the target memory is divided, from upper level to lower level, into memory chips, memory array groups, memory arrays, rows/columns, memory cells, and bits. The physical locations may be objects of different levels on the target memory, such as pages, bits, rows, and storage arrays. The level of a physical location is determined by the object with the smallest granularity in the fault location information. For example, when the fault location information includes a processor identifier (CPU ID), a channel identifier (Channel ID), a memory identifier (Dimm ID), a rank identifier (Rank ID), a sub-rank identifier (SubRank ID), a memory chip identifier (Device ID), a memory array group identifier (Bank group ID), a memory array identifier (Bank ID), and a row (row) address, the object with the smallest granularity is the row corresponding to the row address, and the physical location indicated by the fault location information is then a row.
It should be noted that the physical location indicated by the fault location information may also be a memory cell (cell), a column (column), a memory array group (bank group), a memory chip (device), or the like on the target memory.
It should be noted that the levels of the physical locations indicated by different fault location information may be the same, for example, the level of each physical location is a row, or may also be different, for example, the level of a part of the physical locations is a row (row), and the level of another part of the physical locations is a storage array (bank), which is not limited in this embodiment of the present application.
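The rule that the level of a physical location is given by the smallest-granularity object present in the fault location information can be sketched as follows. The field names and the coarse-to-fine ordering are assumptions that mirror the identifier list in the example above; finer levels such as column or bit could be appended in the same way.

```python
# Hypothetical field names, ordered from the coarsest to the finest level.
LEVELS = [
    "cpu_id", "channel_id", "dimm_id", "rank_id", "subrank_id",
    "device_id", "bank_group_id", "bank_id", "row",
]


def physical_location_level(fault_location):
    """Return the finest-granularity field present in the location info."""
    level = None
    for name in LEVELS:
        if fault_location.get(name) is not None:
            level = name  # later entries are finer, so keep overwriting
    if level is None:
        raise ValueError("fault location information contains no known field")
    return level
```

A location that carries every identifier down to a row address is a row-level location, while one that stops at the memory array identifier is a storage-array-level location.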
In some embodiments, the fault analysis information of the target memory may be obtained by clustering the error information of the target memory. For example, the physical locations where the errors occurred on the target memory are clustered, so as to obtain at least one fault location information and fault severity.
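As a simplified stand-in for the clustering step, the sketch below groups error records by (bank, row) address and reports the rows in which errors recur. Grouping by exact address is a deliberately simple substitute for a real clustering algorithm, and the record fields and the `min_errors` parameter are hypothetical.

```python
from collections import Counter


def cluster_error_locations(error_records, min_errors=2):
    """Group error records by (bank, row) and keep the recurring rows.

    Each record is a dictionary with `bank_id` and `row` fields. A row is
    reported as a fault location when it has at least `min_errors` errors.
    """
    counts = Counter((r["bank_id"], r["row"]) for r in error_records)
    return [
        {"bank_id": bank_id, "row": row, "error_count": n}
        for (bank_id, row), n in sorted(counts.items())
        if n >= min_errors
    ]
```

The error count attached to each location could also feed a severity estimate, although how severity is derived is left open here.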
In other embodiments, S202 includes: inputting the error information of the target memory into the machine learning model to obtain the fault analysis information of the target memory output by the machine learning model.
The machine learning model is trained in advance. The training process may be iterative training using training samples and sample labels, where the training samples include error information of a plurality of memories, and the sample labels include the fault severity of each training sample.
In this embodiment, since the machine learning model is trained in advance, the fault analysis information of the target memory can be obtained by inputting the error information of the target memory into the machine learning model, and a user does not need to participate in calculation and data processing, which is not only helpful for improving the speed of fault prediction, but also can avoid manual errors caused by user operation, thereby improving the accuracy of the fault analysis information.
In some embodiments, the machine Learning model may be a hierarchical threshold algorithm, or one or more of a machine Learning algorithm such as a random forest, gradient descent decision tree (GBDT), extreme gradient ascent (XGBoost), naive bayes, support Vector Machine (SVM), or one or more of a deep Learning algorithm such as a Convolutional Neural Network (CNN), long-short-term neural network (LSTM), or a Federated Optimization algorithm such as a mean average (FedAvg), a near-end Optimization algorithm (federoptimized Optimization algorithm, fedx), a user scenario-based Federated Learning algorithm (federal Learning for federal sustained release, federal learned classes(s).
In some embodiments, the fault severity and the at least one fault location information may be derived based on different parameter information. For example, when determining the severity of the fault, the error information and the target prediction information of the target memory are used, so that the severity of the fault of the target memory can be comprehensively evaluated from multiple aspects such as error conditions, operating conditions, inherent performance and the like, and the prediction accuracy of the severity of the fault is further improved. When at least one piece of fault location information is predicted, correctable error information and uncorrectable error information of a target memory are used, because both the correctable error information and the uncorrectable error information belong to historical error information of the target memory, and compared with other information, such as running state information, inherent parameter information and the like, the historical error information has higher correlation with a physical location of an uncorrectable error to be generated on the target memory, therefore, at least one physical location of the target memory where a fault is generated is determined by using the correctable error information and the uncorrectable error information of the target memory, and under the condition of ensuring the prediction accuracy, the complexity of parameters can be simplified, the determination difficulty can be reduced, the processing speed of the parameters can be increased, and the prediction speed can be increased.
Optionally, the fault severity included in the fault analysis information is specifically a health value: the fault analysis information includes a health value that indicates the fault severity of the target memory.
In one example, the health value is a probability value, e.g., 50%, 80%, 90%, etc. In another example, the health value is a score, e.g., 1, 2, 3, 4, 5, etc. For example, score 1 may characterize ultra-high risk, score 2 high risk, score 3 medium risk, score 4 low risk, and score 5 healthy.
It is understood that a larger health value may indicate either a milder fault or a more severe fault; this embodiment of the present application does not limit the relationship. Hereinafter, the present embodiment is described by taking as an example the case where a larger health value indicates a milder fault.
Optionally, the fault analysis information further includes at least one fault mode to which the at least one physical location belongs, and each fault mode is configured with a health threshold; the fault analysis information is specifically used for the processor firmware to perform fault repair on a target physical location in at least one physical location under the condition that the health degree value does not reach a health degree threshold of a target fault mode in at least one fault mode, wherein the target physical location is a physical location belonging to the target fault mode in the at least one physical location.
By analyzing the at least one fault mode to which the at least one physical location belongs and sending it to the processor firmware, two benefits are obtained. On the one hand, the processor firmware can set a health threshold for each fault mode in advance, so that each physical location is matched with a health threshold and can be judged individually as to whether its health reaches the standard. Fault repair can then be performed only on the target physical locations of the target memory whose health does not reach the standard, which makes fault repair more reasonable and cost-effective. On the other hand, fault modes and fault repair modes correspond one to one, so the processor firmware can quickly and accurately determine the fault repair mode for a physical location from its fault mode. This improves repair efficiency and the match between the fault repair mode and the physical location, and avoids invalid repairs and wasted repair resources caused by redundant resources that do not match the physical location.
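The per-mode comparison described above can be sketched as follows. This is a minimal illustration, not the firmware's actual logic: the `FaultLocation` type, the mode names, the numeric thresholds, and the reading of "does not reach the threshold" as `health_value <= threshold` are all assumptions made for the example, with a larger health value taken to mean a milder fault.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultLocation:
    address: int       # hypothetical physical address of the predicted fault
    failure_mode: str  # hypothetical mode label, e.g. "bit", "row", "array"


def select_repair_targets(health_value, locations, mode_thresholds):
    """Return the locations whose failure-mode health threshold is not reached.

    Assumes a larger health value means a milder fault, and treats
    "does not reach the threshold" as health_value <= threshold.
    """
    return [
        loc for loc in locations
        if health_value <= mode_thresholds[loc.failure_mode]
    ]


locations = [
    FaultLocation(0x1000, "bit"),
    FaultLocation(0x2000, "row"),
    FaultLocation(0x3000, "array"),
]
# Illustrative thresholds only: cheaper repairs get a larger threshold.
mode_thresholds = {"bit": 80, "row": 60, "array": 40}
targets = select_repair_targets(70, locations, mode_thresholds)
```

With a health value of 70, only the location in the bit failure mode is selected in this sketch, because 70 is above the row and array thresholds.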
Optionally, the memory failure processing method further includes: the out-of-band management module performs clustering on the at least one piece of fault location information to obtain the at least one failure mode, where one failure mode indicates the area in which the plurality of physical locations belonging to that failure mode are located.
In some embodiments, clustering the at least one piece of fault location information clusters the at least one physical location of the target memory at which an uncorrectable error is predicted to occur, yielding at least one failure mode. One failure mode can thus indicate the physical area in which the physical locations belonging to that failure mode are located, so that when fault repair is performed according to the failure mode, the physical locations in the indicated area can be repaired together. This improves repair efficiency and allows each repair to cover the physical locations that are prone to failure.
Optionally, the failure mode comprises any of: page faults, bit faults, row faults, and storage array faults.
In the above embodiment, by setting the failure mode to include any one of a page failure, a bit failure, a row failure, and a storage array failure, the failure mode to which the physical location belongs can be accurately determined according to the first address information indicating the physical location, which is helpful for improving accuracy and convenience of prediction of the failure mode to which the physical location belongs.
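The text does not specify the clustering algorithm. As a rough sketch only, the example below replaces it with a simple grouping rule: predicted fault locations are grouped by row, and a row holding several faulty cells is labelled a row fault, otherwise a bit fault. The `(bank, row, column)` address layout, the `row_threshold` parameter, and the rule itself are assumptions for illustration.

```python
from collections import defaultdict


def cluster_failure_modes(locations, row_threshold=2):
    """Group predicted fault locations and label each group with a failure mode.

    locations: iterable of (bank, row, column) tuples (hypothetical layout).
    A row with at least `row_threshold` faulty cells is labelled a row fault;
    otherwise its single cell is labelled a bit fault. Both rules are
    assumptions standing in for the unspecified clustering step.
    """
    by_row = defaultdict(list)
    for bank, row, column in locations:
        by_row[(bank, row)].append(column)

    modes = []
    for (bank, row), columns in sorted(by_row.items()):
        if len(columns) >= row_threshold:
            # One row fault indicates the whole area (the row) to be repaired.
            modes.append(("row", {"bank": bank, "row": row}))
        else:
            modes.append(("bit", {"bank": bank, "row": row, "column": columns[0]}))
    return modes


predicted = [(0, 5, 1), (0, 5, 9), (0, 5, 30), (1, 2, 7)]
modes = cluster_failure_modes(predicted)
```

Here the three locations in bank 0, row 5 collapse into a single row fault, so one repair can cover all of them, while the isolated cell in bank 1 stays a bit fault.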
Optionally, the fault analysis information further includes at least one fault repair mode used by the at least one physical location, and each fault repair mode is configured with a health threshold. The fault analysis information is specifically used by the processor firmware to perform fault repair on a target physical location among the at least one physical location when the health value does not reach the health threshold of a target fault repair mode among the at least one fault repair mode, where the target physical location is a physical location that uses the target fault repair mode.
By analyzing the at least one fault repair mode used by the at least one physical location and sending it to the processor firmware, two benefits are obtained. On the one hand, the processor firmware can match each physical location with a health threshold through the health threshold set in advance for each fault repair mode, so that whether the health of each physical location reaches the standard can be judged individually. Fault repair can then be performed only on the target physical locations of the target memory whose health does not reach the standard, which makes fault repair more reasonable and cost-effective. On the other hand, the processor firmware can quickly and accurately repair a physical location according to the fault repair mode it requires, which improves repair efficiency and the match between the fault repair mode and the physical location, and avoids invalid repairs and wasted repair resources caused by redundant resources that do not match the physical location.
Optionally, the memory failure processing method further includes: the out-of-band management module performs clustering on the at least one piece of fault location information to obtain the at least one fault repair mode, where one fault repair mode indicates the area in which the plurality of physical locations using that fault repair mode are located.
By clustering the at least one piece of fault location information, the at least one faulty physical location in the target memory is clustered to obtain at least one fault repair mode. One fault repair mode can thus indicate the physical area in which the physical locations using that fault repair mode are located, so that when fault repair is performed according to the fault repair mode, the physical locations in the indicated area can be repaired together. This improves repair efficiency and allows each repair to cover the physical locations that are prone to failure.
Optionally, the fault repair mode includes any one of: page isolation, bit isolation, row isolation, and storage array isolation.
In the above embodiment, because the fault repair mode is any one of page isolation, bit isolation, row isolation, and storage array isolation, the fault repair mode used by a physical location can be accurately predicted from the fault location information indicating that physical location, which improves the accuracy and convenience of predicting the fault repair mode.
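The one-to-one correspondence between failure modes and fault repair modes can be pictured as a lookup table. The sketch below pairs the four modes by name, which matches the storage array and bit examples given in the text; the pairing for page and row, the string labels, and the function name are assumptions.

```python
# Hypothetical labels; the pairing by name is an assumption for page and row.
REPAIR_MODE_FOR = {
    "page fault": "page isolation",
    "bit fault": "bit isolation",
    "row fault": "row isolation",
    "storage array fault": "storage array isolation",
}


def repair_mode_for(failure_mode):
    """Look up the fault repair mode paired with a failure mode.

    Raises KeyError for an unknown failure mode rather than guessing.
    """
    return REPAIR_MODE_FOR[failure_mode]


chosen = repair_mode_for("row fault")
```

Because each repair mode appears exactly once in the table, the mapping can also be inverted when only the repair mode is reported.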
S203: The out-of-band management module sends the fault analysis information to the processor firmware. The fault analysis information is used by the processor firmware to perform fault repair on the at least one faulty physical location in the target memory when the fault severity meets a preset condition.
In the above embodiment, the fault analysis information of the target memory is determined from the fault information of the target memory and sent to the processor firmware, where the fault analysis information includes the fault severity of the target memory and at least one physical location of the target memory at which a fault occurs. The processor firmware can then decide, based on whether the fault severity meets the preset condition, whether to repair the at least one faulty physical location indicated by the at least one piece of fault location information. This not only reduces the possibility that an uncorrectable error occurs in the target memory in the future, but also helps the processor firmware optimize its repair strategy, for example by performing fault repair only when the fault severity meets the preset condition, which improves the processor firmware's utilization of fault repair resources.
Fig. 3 is a flow chart illustrating a method of memory fault handling in accordance with an exemplary embodiment. Illustratively, the method includes S301-S303.
S301: The processor firmware receives the fault analysis information sent by the out-of-band management module, where the fault analysis information includes the fault severity of the target memory and at least one piece of fault location information.
It should be noted that, for a description of the fault severity and the at least one piece of fault location information, reference may be made to S202 above; details are not repeated here.
S302: When the fault severity meets a preset condition, the processor firmware performs fault repair on the at least one faulty physical location in the target memory.
Optionally, the fault severity meeting the preset condition includes: the health value is less than or equal to a target health threshold, where the target health threshold is the health threshold of the target memory.
The target health degree threshold is used for representing the minimum requirement of the health degree value, when the health degree value is smaller than or equal to the target health degree threshold, the health degree value is represented not to reach the standard, and when the health degree value is larger than the target health degree threshold, the health degree value is represented to reach the standard.
In some embodiments, the processor firmware determines the target health threshold based on the repair cost of the target memory. Specifically, the higher the cost of repairing the target memory, the smaller the target health threshold should be set, so as to reduce the number of repairs to the target memory. Conversely, the lower the repair cost, the larger the target health threshold should be set, so as to increase the number of repairs and thereby reduce, at a low repair cost, the risk that an uncorrectable error occurs in the target memory in the future.
The target health threshold can be determined from the repair cost of the target memory in various ways, two of which are described below as examples.
In some embodiments, the target health threshold is determined based on the number of fault repair resources currently remaining in the computer device. If few fault repair resources remain and the target memory is repaired now, there may be too few resources left to repair a more serious fault that later occurs in the target memory or in another memory of the computer device, for example one with a smaller predicted health value. In this case the cost of repairing the target memory is relatively high, so the target health threshold should be set relatively small. Conversely, if many fault repair resources remain, the cost of repairing the target memory is low, and the target health threshold should be set relatively large.
In other embodiments, the target health threshold is determined based on the degree to which repairing the target memory affects the performance of the computer device. The greater the impact of the fault repair mode required by the target memory on the performance (such as bandwidth) of the computer device, the smaller the target health threshold should be set; conversely, the smaller the impact, the larger the target health threshold should be set.
In some embodiments, the processor firmware performs fault repair on the at least one physical location, and in particular, the processor firmware performs isolated replacement on the at least one physical location using a redundancy resource (i.e., a fault repair resource), wherein the redundancy resource may be a redundancy page, a redundancy bit, a redundancy row, a redundancy array, or the like.
In some embodiments, the processor firmware may perform fault repair on all of the at least one physical location, or on only some of them, which is not limited in this application.
In the above embodiment, the fault severity of the target memory and the at least one faulty physical location in the target memory are obtained by receiving the fault analysis information of the target memory, and the at least one physical location is repaired when the fault severity meets the preset condition. Because repair is performed only when the fault severity meets the preset condition, this not only reduces the possibility that an uncorrectable error occurs in the target memory in the future, and thus the impact of memory faults on the system, but also avoids overusing fault repair resources, which would otherwise leave too few resources for repair when a more serious fault of the target memory is predicted later. The utilization of fault repair resources is thereby improved.
S303: When the fault severity meets an alarm condition, the processor firmware outputs alarm information, where the alarm information indicates that the target memory is at risk of an uncorrectable error.
It is understood that S303 is an optional step. That is, after the processor firmware executes S302, S303 may be skipped and the process may end directly.
The fault severity meeting the alarm condition includes: the health value falls within an alarm threshold interval, that is, the health value is greater than the target health threshold and less than or equal to the alarm threshold. The alarm threshold interval is the interval between the target health threshold and the alarm threshold, and the alarm threshold is greater than the target health threshold.
In some embodiments, the out-of-band management module is configured with a multi-level decision condition comprising a preset condition, an alarm condition, and an end condition. The preset condition is used to decide whether to send a repair request, the alarm condition is used to decide whether to output alarm information, and the end condition is used to decide whether to end the process directly. For example, when the fault severity meets the preset condition, only the repair request is sent; when the fault severity does not meet the preset condition but meets the alarm condition, the alarm information is output; and when the fault severity meets neither the preset condition nor the alarm condition but meets the end condition, the process ends directly, that is, no repair request is sent and no alarm information is output.
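The three-level decision can be sketched as a single function. This is an illustrative reading, assuming a larger health value means a milder fault, that the preset condition is "health value less than or equal to the repair threshold", and that the alarm interval lies between the repair threshold and the alarm threshold; the function name and the returned labels are hypothetical.

```python
def decide_action(health_value, repair_threshold, alarm_threshold):
    """Return "repair", "alarm" or "end" for a health value.

    Assumes a larger health value means a milder fault and that
    alarm_threshold > repair_threshold, as stated in the text.
    """
    if alarm_threshold <= repair_threshold:
        raise ValueError("alarm threshold must exceed the repair threshold")
    if health_value <= repair_threshold:
        return "repair"  # preset condition met: send the repair request
    if health_value <= alarm_threshold:
        return "alarm"   # inside the alarm threshold interval
    return "end"         # neither condition met: end directly


actions = [decide_action(h, 40, 70) for h in (30, 55, 90)]
```

Checking the repair condition first mirrors the example in the text, where the alarm is only considered when the preset condition is not met.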
In some embodiments, the alarm information includes the health value (or fault severity) and/or the at least one piece of fault location information. Outputting the health value helps the user understand the health state of the target memory and choose an appropriate operation accordingly, for example whether to repair the target memory, migrate data, or replace the memory. Outputting the fault location information has two benefits. On the one hand, the user can handle only the fault locations (that is, the physical locations indicated by the fault location information), for example by migrating only the data stored at those locations, which reduces the workload of fault handling. On the other hand, the user can choose to repair only some of the fault locations, which gives the user more options and allows more comprehensive fault prevention for the target memory.
In the embodiment, the alarm condition is set, and when the fault severity meets the alarm condition, the alarm information is output to prompt the user that the target memory has the fault risk, so that the user can conveniently determine whether to process the target memory according to the actual condition.
In the memory fault processing method shown in fig. 3, whether to perform fault repair is determined from the fault severity of the target memory and the preset condition. The present application also provides another memory fault processing method, shown in fig. 4, in which whether to perform fault repair is determined from the health value of the target memory and the health threshold of the failure mode. Hereinafter, only the differences between the two schemes are described in detail; the common points are not repeated.
Fig. 4 is a flow chart illustrating another memory failure handling method according to an example embodiment. Illustratively, the method includes S401-S403.
S401: the processor firmware receives fault analysis information of the target memory, which is sent by the out-of-band management module, wherein the fault analysis information comprises a health value, at least one piece of fault location information and at least one fault mode to which at least one physical location indicated by the at least one piece of fault location information belongs.
A failure mode refers to the pattern or form in which a fault manifests. A failure mode indicates the physical area in which the physical locations belonging to it are located. For example, if multiple physical locations lie in the same row of the target memory, they belong to a row fault, and the row fault indicates the physical area in which they are located (that is, the row containing them).
It is understood that each of the at least one fault location information may correspond to a fault mode, and the fault modes corresponding to different fault location information may be the same or different.
It should be noted that, when the out-of-band management module predicts that no physical location on the target memory will have an uncorrectable error, the fault analysis information of the target memory may include only a health value, or may include a health value and a failure mode indicating that there is no fault information to be isolated.
In some embodiments, at least one failure mode in the failure analysis information may be characterized by a failure identification, wherein different failure modes have different identifications. For example, the fault identifier "0" represents that there is no fault information to be isolated, the fault identifier "1" represents a page fault, the fault identifier "2" represents a bit fault, the fault identifier "3" represents a row fault, and the fault identifier "4" represents a storage array fault.
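The identifier scheme above maps naturally onto an enumeration. The sketch below uses the identifier values given in the text; the class name, member names, and the `decode_fault_ids` helper are hypothetical.

```python
from enum import IntEnum


class FaultId(IntEnum):
    NONE = 0           # no fault information to be isolated
    PAGE = 1           # page fault
    BIT = 2            # bit fault
    ROW = 3            # row fault
    STORAGE_ARRAY = 4  # storage array fault


def decode_fault_ids(raw_ids):
    """Convert raw identifiers to FaultId members, dropping the "no fault" marker.

    An identifier outside 0..4 raises ValueError from the FaultId constructor.
    """
    return [FaultId(i) for i in raw_ids if FaultId(i) is not FaultId.NONE]


decoded = decode_fault_ids([0, 3, 2])
```

Using an enumeration rather than bare integers makes an unexpected identifier fail loudly instead of being silently treated as a known failure mode.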
In some embodiments, the processor firmware sets a health threshold for each failure mode in advance and, after receiving the at least one failure mode existing in the target memory, determines the health threshold corresponding to each failure mode on the target memory according to the preset health threshold of each failure mode. As shown in fig. 4, each of the at least one failure mode, for example failure mode 1, …, failure mode N, may be configured with a health threshold, and different failure modes may have the same or different health thresholds.
Hereinafter, how to determine the health threshold of the failure mode will be described by taking as an example that the severity of the failure is more slight as the health value is larger.
Optionally, the at least one failure mode includes a first failure mode, and the memory failure processing method further includes: the processor firmware determines a health threshold of the first failure mode according to the repair cost of the first failure mode, wherein the health threshold of the first failure mode is inversely proportional to the repair cost. Wherein the first failure mode is any one of the at least one failure mode.
It should be noted that the repair cost of the first failure mode refers to the cost of repairing the physical location belonging to the first failure mode.
In this embodiment, by setting that the health degree threshold of the first failure mode is inversely proportional to the repair cost, the larger the repair cost of the first failure mode is, the smaller the health degree threshold of the first failure mode is, that is, the higher the possibility that the health degree value of the target memory reaches the standard (that is, is greater than the health degree threshold), so that the probability of performing fault repair on the physical location belonging to the first failure mode can be reduced, the number of times of repairing the target memory is reduced, and it is beneficial to avoid repairing the target memory at a higher cost. Conversely, the smaller the repair cost of the first failure mode is, the larger the threshold of the health degree of the first failure mode is, that is, the smaller the possibility that the health degree value of the target memory reaches the standard is, so that the probability of performing fault repair on the target memory can be increased, the number of times of repairing the target memory is increased, the potential fault of the target memory can be repaired at a lower cost, the probability of an uncorrectable error occurring in the future of the target memory is reduced, and the influence of a memory fault on the system is reduced.
In the above embodiment, the health threshold of the first failure mode is determined according to the repair cost of the first failure mode, so that an appropriate health threshold can be set for that cost. For example, a relatively low health threshold is set when the repair cost is high, so as to reduce the probability of performing fault repair, and a relatively high health threshold is set when the repair cost is low, so as to increase that probability. This avoids paying a high cost to repair a memory fault while still allowing the target memory to be repaired at a low cost, which reduces the probability of a future uncorrectable error in the target memory and thus the impact of memory faults on the system.
In addition, the processor firmware serves as the execution subject that determines the health thresholds of the different failure modes or fault repair modes. If the out-of-band management module were the execution subject instead, it would need to obtain the parameters for determining the health thresholds through the processor firmware. Having the processor firmware determine the thresholds therefore reduces the communication load between the processor firmware and the out-of-band management module and avoids interaction errors during communication that could affect the accuracy of the health thresholds.
The health threshold can be determined from the repair cost in various ways, which are illustrated below by Mode A and Mode B.
Mode A: The processor firmware determines the health threshold of the first failure mode based on a first number of available repair resources of the first failure mode. The first number is inversely proportional to the repair cost.
In some embodiments, the processor firmware obtains the first number and determines the health threshold of the first failure mode based on it. Specifically, the larger the first number, the smaller the impact on the target memory of repairing the physical location belonging to the first failure mode: even if some of the available repair resources are used now to repair that physical location, enough available repair resources remain to repair the first failure mode of the target memory if its health later becomes worse (for example, its health value becomes smaller). Therefore, when the first number is large, the cost of repairing the physical location belonging to the first failure mode is low, and a larger health threshold can be set. This raises the health standard for the target memory, lowers the probability that the health value of the target memory reaches the standard, and so raises the probability that the target memory is repaired.
In this mode, the health threshold of the first failure mode is determined according to the first number of available repair resources of the first failure mode, which helps use the limited available repair resources reasonably. For example, a larger health threshold is determined for the first failure mode when it has more available repair resources, and a smaller health threshold when it has fewer. This improves the utilization of the available repair resources and prevents them from being overused, which would leave too few resources for fault repair when the health of the target memory is predicted to become worse in the future.
Mode B: The processor firmware determines the health threshold of the first failure mode according to the degree to which the first failure mode affects the performance of the computer device to which the target memory belongs. The degree of influence is proportional to the repair cost.
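One possible way to realize Mode A is a monotonic mapping from the number of available repair resources to the threshold. The text only requires that more resources yield a larger threshold; the linear form, the `capacity` parameter, and the `low`/`high` bounds below are assumptions for illustration.

```python
def threshold_from_resources(available, capacity, low=20.0, high=80.0):
    """Map the number of available repair resources to a health threshold.

    More available resources -> lower repair cost -> larger threshold.
    The linear mapping and the [low, high] range are illustrative assumptions.
    """
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    # Clamp the fraction so out-of-range counts cannot leave [low, high].
    fraction = min(max(available / capacity, 0.0), 1.0)
    return low + (high - low) * fraction


half_full = threshold_from_resources(8, 16)
```

Any other non-decreasing function of the resource count would satisfy the same requirement; the linear one is simply the easiest to reason about.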
It should be noted that the influence degree of the first failure mode on the performance of the computer device to which the target memory belongs refers to the influence degree of repairing the physical location belonging to the first failure mode on the performance of the computer device.
In some embodiments, taking as an example the case where a first fault repair mode is used to repair the physical location belonging to the first failure mode, the processor firmware obtains the degree to which the first fault repair mode affects the performance of the computer device to which the target memory belongs, and determines the health threshold of the first failure mode according to that degree. Specifically, the greater the performance impact of the first fault repair mode, the greater the cost of repairing the physical location with it, and the smaller the health threshold of the first failure mode. Conversely, the smaller the performance impact, the larger the health threshold of the first failure mode.
For example, a physical location belonging to the storage array failure mode (the physical location is a storage array) needs to be repaired using storage array isolation. Performing storage array isolation on the target memory has a large impact on the performance of the computer device, so the health threshold of the storage array failure mode should be set relatively small. A physical location belonging to the bit failure mode (the physical location is a bit) needs to be repaired using bit isolation. Performing bit isolation on the target memory has a small impact on the performance of the computer device, so the health threshold of the bit failure mode should be set relatively large.
It will be appreciated that the larger the area of a physical location in the target memory, the greater the impact of repairing that physical location on the performance of the computer device. The performance impact of the row failure mode is smaller than that of the storage array failure mode and larger than that of the bit failure mode, so the health threshold of the row failure mode is larger than that of the storage array failure mode and smaller than that of the bit failure mode.
It should be noted that the affected performance of the computer device may be, for example, bandwidth, delay, jitter, or packet loss.
In this mode, the health threshold of the first failure mode is determined according to the degree to which the first failure mode affects the performance of the computer device, so that an appropriate health threshold can be set for the first failure mode. For example, a smaller health threshold is determined for the first failure mode when the fault repair mode it uses has a large impact on performance, and a larger health threshold when that impact is small. Fault repair is thereby performed preferentially on the physical locations of the target memory whose repair affects performance least, which helps prevent memory fault repair from degrading the performance of the computer device.
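Mode B can be sketched by ranking failure modes by performance impact and assigning thresholds in the opposite order. The impact scores, the linear scaling, and the `low`/`high` bounds are hypothetical; only the ordering (bit, then row, then storage array) comes from the text.

```python
# Hypothetical relative performance impact of repairing each failure mode.
IMPACT = {"bit": 1, "row": 2, "storage_array": 3}


def thresholds_from_impact(impact, low=20.0, high=80.0):
    """Assign larger health thresholds to modes with smaller performance impact.

    The linear scaling between `low` and `high` is an illustrative assumption.
    """
    lo_i, hi_i = min(impact.values()), max(impact.values())
    span = (hi_i - lo_i) or 1  # avoid dividing by zero when all impacts match
    return {
        mode: high - (high - low) * (value - lo_i) / span
        for mode, value in impact.items()
    }


thresholds = thresholds_from_impact(IMPACT)
```

The resulting thresholds decrease from the bit failure mode to the storage array failure mode, so the most disruptive repair is triggered only when the health value is lowest.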
In some embodiments, the health threshold of the first failure mode may be a fixed value, i.e., after the health threshold of the first failure mode is first set, the health threshold of the first failure mode is not changed. In other embodiments, the health threshold of the first failure mode is a dynamic value, that is, after the health threshold of the first failure mode is set for the first time, the health threshold of the first failure mode may be updated.
Optionally, the memory failure processing method further includes: the processor firmware updates the health threshold of the first failure mode based on a second number of available repair resources of the first failure mode.
In some embodiments, after the processor firmware performs fault repair on the target memory, it obtains the second number of available repair resources of the first failure mode and updates the health threshold of the first failure mode according to the obtained second number.
In other embodiments, the processor firmware periodically obtains a second number of available repair resources for the first failure mode according to a second preset period, and updates the health threshold for the first failure mode according to the second number.
It should be noted that the implementation principle of the processor firmware updating the health threshold of the first failure mode according to the second number is the same as the implementation principle of the processor firmware determining the health threshold of the first failure mode according to the first number, and therefore, reference may be made to the above-mentioned manner a for the implementation process and the related description of the processor firmware updating the health threshold of the first failure mode according to the second number, which is not described in detail herein.
In the above embodiment, the health degree threshold of the first failure mode is updated according to the second number of the available repair resources of the first failure mode, so that the expected health degree of the physical location is dynamically adjusted, and thus the health degree threshold of the first failure mode can accurately represent the current cost for repairing the physical location belonging to the first failure mode, which is helpful for more accurately measuring whether to perform fault repair on the physical location belonging to the first failure mode, and is further helpful for maximally utilizing the repair resources to perform fault repair.
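A minimal sketch of such a dynamic update follows. The application does not give the formula here, so the linear interpolation, the bounds `low` and `high`, and the function name `threshold_from_resources` are assumptions; the sketch only encodes the assumed relation that fewer remaining repair resources mean a higher repair cost and therefore a lower health threshold.

```python
def threshold_from_resources(available: int, total: int,
                             low: float = 20.0, high: float = 80.0) -> float:
    """Derive a health threshold from the number of available repair resources.

    Assumption: the threshold grows linearly from `low` (no resources left,
    repair is expensive) to `high` (all resources left, repair is cheap).
    """
    if total <= 0:
        raise ValueError("total must be positive")
    fraction = min(max(available / total, 0.0), 1.0)
    return low + (high - low) * fraction


# Hypothetical usage: recompute the threshold after each repair consumes a resource.
thresholds = {"row": threshold_from_resources(available=8, total=8)}
thresholds["row"] = threshold_from_resources(available=4, total=8)
```

The same function can serve both the initial determination from the first number and the later update from the second number, which mirrors the statement that the two share one implementation principle.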
S402: For each of the at least one failure mode, the processor firmware determines whether the health value reaches the health threshold of that failure mode.
If not, S403 is executed; if yes, S404 is executed.
S403: The processor firmware determines the failure mode as a target failure mode and repairs the target physical location using the fault repairing mode applicable to the target failure mode.
The target physical location is a physical location belonging to the target failure mode in the at least one physical location.
It should be noted that there may be one target physical location, several target physical locations, or all of the at least one physical location may be target physical locations. For example, when the health value is less than or equal to the health threshold of the failure mode of only one physical location, only that physical location is the target physical location. When the health value is less than or equal to the health threshold of the failure mode to which every physical location belongs, every one of the at least one physical location is a target physical location.
Each of the at least one physical location corresponds to one failure mode, and each of the at least one failure mode is configured with one health threshold. Whether the health of the at least one physical location is up to standard can therefore be judged from the magnitude relationship between the health value and the health threshold of each failure mode. When the health value does not reach the health threshold of a failure mode, that is, when the health of the physical locations belonging to that failure mode is not up to standard, that failure mode is determined as the target failure mode, and fault repair is performed on the physical locations belonging to it (namely, the target physical locations). It should be noted that this failure mode may be any one of the at least one failure mode.
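The screening in S402 and S403 can be sketched as follows. The `FaultLocation` type, its `address` field, and the threshold values in the usage example are hypothetical; the comparison itself (a failure mode becomes a target when the health value is less than or equal to its threshold) follows the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultLocation:
    address: str        # hypothetical identifier of the faulty physical location
    failure_mode: str   # failure mode to which the location belongs


def select_targets(health_value: float,
                   locations: list[FaultLocation],
                   thresholds: dict[str, float]) -> list[FaultLocation]:
    """Return the locations whose failure mode threshold is not reached."""
    target_modes = {
        loc.failure_mode
        for loc in locations
        if health_value <= thresholds[loc.failure_mode]
    }
    return [loc for loc in locations if loc.failure_mode in target_modes]


# Hypothetical usage with illustrative thresholds.
locations = [
    FaultLocation("loc-a", "bit"),
    FaultLocation("loc-b", "row"),
    FaultLocation("loc-c", "storage_array"),
]
thresholds = {"bit": 90, "row": 70, "storage_array": 50}
targets = select_targets(75, locations, thresholds)
```

In this example only the location in the bit fault mode is selected, because 75 is below the bit threshold but above the row and storage array thresholds.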
In some embodiments, the failure analysis information may further include at least one failure recovery manner that needs to be used by the at least one physical location, where the at least one failure recovery manner may be indicated by an identifier of the failure recovery manner.
The processor firmware performs fault repair on the target physical location belonging to the target fault mode by performing a corresponding isolation self-healing action (i.e., an isolation repair manner) on the target physical location, for example, bit-level isolation (such as PCLS/ACLS), row isolation (PPR), storage-array-level isolation (ADDDC/ADDEC), page isolation (page offline), or another repair action, so as to repair the target physical location.
It should be noted that the failure modes correspond to the failure recovery modes one to one: for example, a page fault corresponds to page isolation, a bit fault corresponds to bit isolation, a row fault corresponds to row isolation, and a storage array fault corresponds to storage array isolation. After the processor firmware acquires the fault identifier of a failure mode, it can repair the target physical location using the failure recovery mode that corresponds to the failure mode indicated by the fault identifier.
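The one-to-one correspondence can be expressed as a lookup table. The string identifiers and the function name `repair_mode_for` are hypothetical; the four pairs are the ones listed above.

```python
# Failure mode identifier -> failure recovery mode (identifiers are illustrative).
REPAIR_FOR_MODE = {
    "page": "page_isolation",
    "bit": "bit_isolation",
    "row": "row_isolation",
    "storage_array": "storage_array_isolation",
}

# The mapping is one to one: no two failure modes share a recovery mode.
assert len(set(REPAIR_FOR_MODE.values())) == len(REPAIR_FOR_MODE)


def repair_mode_for(fault_id: str) -> str:
    """Look up the recovery mode for the failure mode named by `fault_id`."""
    try:
        return REPAIR_FOR_MODE[fault_id]
    except KeyError:
        raise ValueError(f"unknown failure mode identifier: {fault_id!r}") from None
```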
S404: When the fault severity is less than or equal to the alarm threshold, the processor firmware outputs alarm information, where the alarm information is used to prompt that the target memory is at risk of uncorrectable errors.
For the related description of S404, reference may be made to the description in S303 above, which is not described herein again.
In the above embodiment, a health threshold is set in advance for each fault mode, and the at least one fault mode to which the at least one faulty physical location in the target memory belongs is acquired, so that a health threshold is configured for each fault mode present in the target memory. Whether to repair the target physical location belonging to the target fault mode can then be determined from the magnitude relationship between the health threshold of the target fault mode and the health value. On the one hand, configuring a health threshold for each of the at least one fault mode amounts to configuring a health threshold for each of the at least one physical location, so the target physical locations whose health is not up to standard can be screened out by comparing the health threshold of each physical location with the health value of the target memory. Screening the physical locations to be repaired in this way improves the accuracy of determining the target physical location and ensures that the target physical location is the one with the highest fault severity among the at least one physical location, that is, the one most urgently in need of repair; compared with repairing a non-target physical location, this further reduces the impact of memory faults on the system.
On the other hand, fault repair is performed only on the target physical location rather than on all physical locations, which improves the utilization of fault repair resources and makes reasonable use of the limited fault repair resources. This avoids the situation in which excessive use of fault repair resources leaves none available when the fault severity of the target memory worsens in the future, and thus reduces the risk of future uncorrectable errors in the target memory at low cost.
In the memory fault processing method described in the embodiment shown in fig. 4, whether the health value is up to standard is determined according to the health threshold of the fault mode. An embodiment of the present application further provides another memory fault processing method, shown in fig. 5, which differs from the solution of the embodiment shown in fig. 4. In the following, only the differences between the two schemes are described specifically, and the common parts are not described in detail.
Fig. 5 is a flow chart illustrating another method of memory fault handling in accordance with an example embodiment. Illustratively, the method includes S501-S504.
S501: the processor firmware receives fault analysis information of the target memory, which is sent by the out-of-band management module, wherein the fault analysis information comprises a health value, at least one piece of fault location information and at least one fault repair mode used by at least one physical location indicated by the at least one piece of fault location information.
The fault repairing mode refers to an isolation repairing mode suitable for being used in a physical position. A failure recovery method is used to indicate a physical area where a plurality of physical locations using the failure recovery method are located, for example, a plurality of physical locations are located in a same row of the target memory, and the plurality of physical locations are suitable for using a row isolation method, which may indicate a physical area where the plurality of physical locations are located (i.e., a row where the plurality of physical locations are located).
It can be understood that each piece of fault location information corresponds to one fault repairing mode, and the fault repairing modes corresponding to different pieces of fault location information may be the same or different.
In some embodiments, the processor firmware sets a health threshold for each fault repairing mode in advance and, after acquiring the at least one fault repairing mode that the target memory needs to use, determines the health threshold corresponding to each of those fault repairing modes from the preset thresholds. For example, as shown in fig. 5, among the at least one fault repairing mode (fault repairing mode 1, ..., fault repairing mode N), each fault repairing mode is matched with a health threshold, and different fault repairing modes may have the same or different health thresholds.
It should be noted that the implementation principle of S501 is the same as the implementation principle of S401, and therefore, reference may be made to S401 as to the implementation process and the related description of S501, for example, as to how the health threshold of the first failure recovery mode in the at least one failure recovery mode is determined, reference may be made to the determination process of the health threshold of the first failure mode, and details are not described here.
S502: For each of the at least one fault repairing mode, the processor firmware determines whether the health value reaches the health threshold of that fault repairing mode.
If not, S503 is executed; if yes, S504 is executed.
S503: The processor firmware determines the fault repairing mode as a target fault repairing mode and repairs the target physical location based on the target fault repairing mode.
The target physical location is a physical location which needs to use a target fault repairing mode to repair faults in at least one physical location.
It should be noted that the implementation principle of S503 is the same as that of S403 described above, and therefore, reference may be made to S403 described above for the implementation process of S503 and related description, and details will not be described herein.
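A minimal sketch of S501 to S503 follows, keyed by fault repairing mode rather than by fault mode. The `FaultAnalysisInfo` container, its field names, and the threshold values are assumptions for illustration; the application specifies only that the information carries a health value, the fault locations, and one fault repairing mode per location.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FaultAnalysisInfo:
    health_value: float
    # (location identifier, fault repairing mode) pairs; one mode per location.
    entries: list[tuple[str, str]]


def plan_repairs(info: FaultAnalysisInfo,
                 thresholds: dict[str, float]) -> dict[str, list[str]]:
    """Group the locations to repair by their target fault repairing mode."""
    plan: dict[str, list[str]] = defaultdict(list)
    for location, repair_mode in info.entries:
        # A repair mode is a target when the health value does not reach
        # (is less than or equal to) that mode's health threshold.
        if info.health_value <= thresholds[repair_mode]:
            plan[repair_mode].append(location)
    return dict(plan)


# Hypothetical usage with illustrative values.
info = FaultAnalysisInfo(
    health_value=60,
    entries=[("loc-a", "bit_isolation"), ("loc-b", "row_isolation"),
             ("loc-c", "row_isolation"), ("loc-d", "storage_array_isolation")],
)
thresholds = {"bit_isolation": 90, "row_isolation": 70, "storage_array_isolation": 50}
plan = plan_repairs(info, thresholds)
```

Grouping by repair mode reflects the remark that a fault repairing mode indicates the physical area shared by the locations that use it, such as a single row.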
S504: When the health value is less than or equal to the alarm threshold, the processor firmware outputs alarm information, where the alarm information is used to prompt that the target memory is at risk of uncorrectable errors, and the alarm threshold is larger than the target health threshold.
For the related description of S504, reference may be made to the description in S303 above, which is not described herein again.
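The branch between S503 and S504 can be summarized as a three-way decision. The function name `decide` and the returned labels are hypothetical; the comparisons and the requirement that the alarm threshold exceed the target health threshold come from the steps above.

```python
def decide(health_value: float, health_threshold: float, alarm_threshold: float) -> str:
    """Return "repair", "alarm", or "none" for one fault repairing mode."""
    if alarm_threshold <= health_threshold:
        raise ValueError("alarm threshold must be larger than the health threshold")
    if health_value <= health_threshold:
        return "repair"   # S503: the threshold is not reached, so repair
    if health_value <= alarm_threshold:
        return "alarm"    # S504: threshold reached, but still at risk
    return "none"         # healthy enough; no action (label is an assumption)
```

Because the alarm threshold is the larger of the two, the alarm band sits above the repair band, so a memory that is degrading raises an alarm before repair resources are spent on it.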
In the above embodiment, a health threshold is set in advance for each fault repairing mode, and the at least one fault repairing mode used by the at least one faulty physical location in the target memory is obtained, so that a health threshold is configured for each fault repairing mode that the target memory needs. Whether to repair the target physical location that uses the target fault repairing mode can then be determined from the magnitude relationship between the health threshold of the target fault repairing mode and the health value. On the one hand, matching a health threshold to each of the at least one fault repairing mode amounts to configuring a health threshold for each of the at least one physical location, so the target physical locations whose health is not up to standard can be screened out by comparing the health threshold of each physical location with the health value of the target memory. Screening in this way improves the accuracy of determining the target physical location and ensures that the target physical location is the one with the highest fault severity among the at least one physical location, that is, the one most urgently in need of repair; compared with repairing a non-target physical location, this reduces the impact of memory faults on the system.
On the other hand, only the target physical location is subjected to fault repair, but not all the physical locations are subjected to fault repair, so that the utilization rate of fault repair resources is increased, limited fault repair resources are reasonably utilized, the condition that no fault repair resources are available for repairing when the fault severity of the target memory is more serious in the future due to excessive use of the fault repair resources is avoided, and the risk that uncorrectable errors occur in the target memory in the future is reduced at low cost.
The foregoing mainly describes the scheme provided by the embodiments of the present application from the perspective of the method. To implement the above functions, the memory fault processing apparatus (e.g., processor firmware or an out-of-band management module) includes hardware structures and/or software modules corresponding to the respective functions. Those skilled in the art will readily appreciate that the illustrative units and algorithm steps described in connection with the embodiments disclosed herein may be implemented as hardware or as a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the particular application and the design constraints of the solution. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementation decisions should not be interpreted as departing from the scope of the present application.
According to the foregoing method, the memory fault processing apparatus (such as the processor firmware) may be divided into functional modules. For example, the memory fault processing apparatus may include one functional module for each function, or two or more functions may be integrated into one processing module. The integrated module may be implemented in hardware or as a software functional module. It should be noted that the division of modules in the embodiments of the present application is schematic and is only a logical function division; other division manners are possible in actual implementation.
Fig. 6 shows a schematic diagram of a possible structure of the memory fault processing apparatus in the foregoing embodiment (referred to as the memory fault processing apparatus 600). For example, the memory fault processing apparatus 600 may be processor firmware, and specifically includes a receiving unit 601 and a processing unit 602. The receiving unit 601 is configured to receive fault analysis information of a target memory, where the fault analysis information includes the fault severity of the target memory and at least one piece of fault location information; the at least one piece of fault location information indicates at least one physical location in the target memory where a fault occurred. Examples are S301 shown in fig. 3, S401 shown in fig. 4, and S501 shown in fig. 5. The processing unit 602 is configured to perform fault repair on the at least one physical location where the fault occurs in the target memory when the fault severity meets a preset condition. Examples are S302 shown in fig. 3, S402-S403 shown in fig. 4, and S502-S503 shown in fig. 5.
Optionally, the fault analysis information further includes at least one fault mode to which the at least one physical location belongs, and each fault mode is configured with a health threshold. That the fault analysis information includes the fault severity of the target memory specifically means that the fault analysis information includes a health value of the target memory, where the health value is used to indicate the fault severity of the target memory. That the fault severity satisfies the preset condition includes: the health value does not reach the health threshold of a target fault mode of the at least one fault mode. Performing fault repair on the at least one physical location where a fault occurs in the target memory specifically includes: performing fault repair on a target physical location among the at least one physical location where a fault occurs in the target memory, where the target physical location is a physical location belonging to the target fault mode.
Optionally, the at least one failure mode includes a first failure mode, and the processing unit 602 is further configured to determine a health threshold of the first failure mode according to the repair cost of the first failure mode.
Optionally, the processing unit 602 is further specifically configured to: obtain a first number of available repair resources of the first failure mode; and determine the health threshold of the first failure mode based on the first number, where the first number is inversely proportional to the repair cost.
Optionally, the processing unit 602 is further specifically configured to determine the health threshold of the first failure mode according to the degree to which the first failure mode affects the performance of the computer device to which the target memory belongs, where the degree of influence is directly proportional to the repair cost.
Optionally, the at least one failure mode includes a first failure mode, and the processing unit 602 is further configured to: obtain a second number of available repair resources of the first failure mode; and update the health threshold of the first failure mode based on the second number.
Optionally, the fault analysis information further includes at least one fault repairing mode used by the at least one physical location, and each fault repairing mode is configured with a health threshold. That the fault analysis information includes the fault severity of the target memory specifically means that the fault analysis information includes a health value of the target memory, where the health value is used to indicate the fault severity of the target memory. That the fault severity satisfies the preset condition includes: the health value does not reach the health threshold of a target fault repairing mode of the at least one fault repairing mode. Performing fault repair on the at least one physical location where a fault occurs in the target memory specifically includes: performing fault repair on a target physical location among the at least one physical location where a fault occurs in the target memory, where the target physical location is a physical location that uses the target fault repairing mode.
Optionally, the processing unit 602 is further configured to output an alarm message when the severity of the fault meets an alarm condition, where the alarm message is used to prompt that the target memory has a risk of an uncorrectable error.
For the detailed description of the above alternative modes, reference may be made to the foregoing method embodiments, which are not described herein again. In addition, for any explanation and beneficial effect description of the memory fault handling apparatus 600 provided above, reference may be made to the corresponding method embodiment described above, and details are not repeated.
According to the foregoing method, the memory failure processing apparatus (such as an out-of-band management module) may be divided into functional modules. For example, the memory failure processing apparatus may include one functional module for each function, or two or more functions may be integrated into one processing module. The integrated module may be implemented in hardware or as a software functional module. It should be noted that the division of modules in the embodiments of the present application is schematic and is only a logical function division; other division manners are possible in actual implementation.
Fig. 7 exemplarily shows a schematic diagram of a possible structure of the memory failure processing apparatus (referred to as the memory failure processing apparatus 700) according to the foregoing embodiment, for example, the memory failure processing apparatus 700 may be an out-of-band management module, and specifically includes an obtaining unit 701, a determining unit 702, and a sending unit 703. An obtaining unit 701 is configured to obtain error information of a target memory. For example, S201 shown in fig. 2. A determining unit 702, configured to determine, according to the error information, fault analysis information of the target memory, where the fault analysis information includes fault severity of the target memory and at least one fault location information; the at least one fault location information indicates at least one physical location in the target memory where the fault occurred. For example, S202 shown in fig. 2. A sending unit 703, configured to send failure analysis information to the processor firmware; the fault analysis information is used for performing fault repair on at least one physical position with a fault in the target memory by the processor firmware under the condition that the fault severity meets the preset condition. For example, S203 shown in fig. 2.
Optionally, that the fault analysis information includes the fault severity of the target memory specifically means that the fault analysis information includes a health value of the target memory, where the health value is used to indicate the fault severity of the target memory. The fault analysis information further includes at least one fault mode to which the at least one physical location belongs, and each fault mode is configured with a health threshold. The fault analysis information is specifically used by the processor firmware to perform fault repair on a target physical location among the at least one physical location where a fault occurs in the target memory when the health value does not reach the health threshold of a target fault mode of the at least one fault mode, where the target physical location is a physical location belonging to the target fault mode.
Optionally, the determining unit 702 is specifically configured to cluster the at least one piece of fault location information according to the error information to obtain the at least one fault mode.
Optionally, the failure mode comprises any of: page faults, bit faults, row faults, and storage array faults.
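As an illustration of how fault locations might be clustered into failure modes, the sketch below groups error addresses by bank and applies simple geometric rules. The `(bank, row, column)` address form, the rules themselves, and the function names are assumptions; the application does not specify the clustering algorithm, and page faults are omitted because they would require page-level addresses.

```python
from collections import defaultdict


def classify_bank(cells: set[tuple[int, int]]) -> str:
    """Classify the distinct faulty (row, column) cells of one bank.

    Hypothetical rules: one cell is a bit fault, several cells in a single
    row are a row fault, and cells spread over several rows are treated as
    a storage array fault.
    """
    rows = {row for row, _ in cells}
    if len(cells) == 1:
        return "bit"
    if len(rows) == 1:
        return "row"
    return "storage_array"


def cluster_failure_modes(errors: list[tuple[int, int, int]]) -> dict[int, str]:
    """Map each bank to a failure mode from (bank, row, column) error records."""
    by_bank: dict[int, set[tuple[int, int]]] = defaultdict(set)
    for bank, row, column in errors:
        by_bank[bank].add((row, column))  # repeated errors at one cell collapse
    return {bank: classify_bank(cells) for bank, cells in by_bank.items()}


# Hypothetical error records.
errors = [(0, 5, 1), (0, 5, 1), (1, 7, 2), (1, 7, 9), (2, 1, 1), (2, 3, 4)]
modes = cluster_failure_modes(errors)
```

A production implementation could replace these fixed rules with the machine learning model that the description mentions as an alternative way to produce the fault analysis information.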
Optionally, that the failure analysis information includes the failure severity of the target memory specifically means that the failure analysis information includes a health value of the target memory, where the health value is used to indicate the failure severity of the target memory. The failure analysis information further includes at least one fault repairing mode used by the at least one physical location, and each fault repairing mode is configured with a health threshold. The failure analysis information is specifically used by the processor firmware to perform fault repair on a target physical location among the at least one physical location where a fault occurs in the target memory when the health value does not reach the health threshold of a target fault repairing mode of the at least one fault repairing mode, where the target physical location is a physical location that uses the target fault repairing mode.
Optionally, the determining unit 702 is specifically configured to cluster the at least one piece of fault location information according to the error information to obtain the at least one fault repairing mode.
Optionally, the fault recovery means includes any one of: page isolation, bit isolation, row isolation, and storage array isolation.
Optionally, the determining unit 702 is specifically configured to input the error information of the target memory into a machine learning model to obtain the fault analysis information of the target memory output by the machine learning model.
For the detailed description of the above alternative modes, reference may be made to the foregoing method embodiments, which are not described herein again. In addition, for any explanation and beneficial effect description of the memory fault handling apparatus 700 provided above, reference may be made to the corresponding method embodiments described above, and details are not repeated.
An embodiment of the present application further provides a computer device, which includes processor firmware and an out-of-band management module. The processor firmware is configured to execute the embodiments shown in fig. 3 to 5, and the out-of-band management module is configured to execute the embodiment shown in fig. 2. The embodiments of the present application do not limit the specific form of the computer device. For example, the computer device may be a terminal device or a network device. The terminal device may also be referred to as a terminal, user equipment (UE), access terminal, subscriber unit, subscriber station, mobile station, remote terminal, mobile device, user terminal, wireless communication device, user agent, or user device. The terminal device may be a mobile phone, an augmented reality (AR) device, a virtual reality (VR) device, a tablet computer, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (PDA), or the like. The network device may specifically be a server or the like. The server may be one physical or logical server, or two or more physical or logical servers that share different responsibilities and cooperate to realize the functions of the server.
Embodiments of the present application also provide a computer-readable storage medium having a computer program stored thereon, which, when run on a computer, causes the computer to perform any one of the methods provided above.
For the explanation and the description of the beneficial effects of any one of the computer-readable storage media provided above, reference may be made to the corresponding embodiments described above, and details are not repeated here.
Embodiments of the present application also provide a chip, which may be, for example, a chip of processor firmware or a chip of an out-of-band management module. The chip integrates control circuitry and one or more ports for implementing the functions of the computer device described above. Optionally, for the functions supported by the chip, reference may be made to the above description, which is not repeated here. Those skilled in the art will appreciate that all or part of the steps of the above embodiments may be implemented by a program instructing the associated hardware. The program may be stored in a computer-readable storage medium. The storage medium may be a read-only memory, a random access memory, or the like. The processing unit or processor may be a central processing unit, a general purpose processor, an application-specific integrated circuit (ASIC), a digital signal processor (DSP), a field-programmable gate array (FPGA) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof.
Embodiments of the present application further provide a computer program product containing instructions which, when executed on a computer, cause the computer to execute any one of the methods in the foregoing embodiments. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the present application are generated in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable device. The computer instructions may be stored on a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or a wireless manner (e.g., infrared, radio, microwave). The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid-state drive (SSD)), among others.
It should be noted that the devices for storing computer instructions or computer programs provided in the embodiments of the present application, such as, but not limited to, the above memories, computer-readable storage media, and communication chips, are all non-transitory.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented using a software program, the implementation may take the form, in whole or in part, of a computer program product as described above.
While the present application has been described in connection with various embodiments, other variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed application, from a review of the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps, and the word "a" or "an" does not exclude a plurality. A single processor or other unit may fulfill the functions of several items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
Although the present application has been described in conjunction with specific features and embodiments thereof, it will be evident that various modifications and combinations can be made thereto without departing from the spirit and scope of the application. Accordingly, the specification and figures are merely exemplary of the present application as defined in the appended claims and are intended to cover any and all modifications, variations, combinations, or equivalents within the scope of the present application. It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (19)

1. A memory fault handling method, characterized by comprising:
a Central Processing Unit (CPU) receives fault analysis information of a target memory, wherein the fault analysis information includes a fault severity of the target memory and at least one piece of fault location information, and the at least one piece of fault location information indicates at least one physical location in the target memory where a fault occurred;
in a case where the fault severity meets a preset condition, the CPU performs fault repair on the at least one physical location in the target memory where a fault occurred.
2. The method of claim 1, wherein the fault analysis information further includes at least one fault mode to which the at least one physical location belongs, each fault mode being configured with a health threshold;
the fault analysis information including the fault severity of the target memory specifically means that the fault analysis information includes a health value of the target memory, the health value being used to indicate the fault severity of the target memory;
the fault severity meeting the preset condition includes: the health value does not reach the health threshold of a target fault mode among the at least one fault mode;
the performing fault repair on at least one physical location in the target memory where a fault occurred specifically includes: performing fault repair on a target physical location among the at least one physical location in the target memory where a fault occurred, wherein the target physical location is a physical location, among the at least one physical location, that belongs to the target fault mode.
3. The method of claim 2, wherein the at least one fault mode comprises a first fault mode, the method further comprising:
the CPU determines a health threshold of the first fault mode according to a repair cost of the first fault mode.
4. The method of claim 3, wherein the determining, by the CPU, the health threshold of the first fault mode according to the repair cost of the first fault mode comprises:
the CPU obtains a first number of available repair resources of the first fault mode;
the CPU determines the health threshold of the first fault mode according to the first number, wherein the first number is inversely proportional to the repair cost.
5. The method of claim 3, wherein the determining, by the CPU, the health threshold of the first fault mode according to the repair cost of the first fault mode comprises:
the CPU determines the health threshold of the first fault mode according to a degree of influence of the first fault mode on the performance of the computer device to which the target memory belongs, wherein the degree of influence is directly proportional to the repair cost.
6. The method of any one of claims 2-5, wherein the at least one fault mode comprises a first fault mode, the method further comprising:
the CPU obtains a second number of available repair resources of the first fault mode;
the CPU updates the health threshold of the first fault mode according to the second number.
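As a non-limiting illustration of claims 4 and 6, the following Python sketch derives a health threshold from the number of available repair resources and recomputes it when that number changes. The function name `health_threshold`, the `total` and `max_threshold` parameters, and the linear mapping are assumptions made for illustration only; the claims do not fix a formula or the direction of the mapping. The sketch further assumes that repair is triggered when the health value falls below the threshold, so that fewer available resources (a higher repair cost) yield a lower threshold and repair is reserved for more severe faults.

```python
def health_threshold(available: int, total: int, max_threshold: float = 100.0) -> float:
    """Map the number of available repair resources to a health threshold.

    Assumption (not from the claims): the threshold scales linearly with the
    fraction of repair resources still available for the fault mode.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= available <= total:
        raise ValueError("available must be between 0 and total")
    return max_threshold * available / total


# Hypothetical numbers: 8 spare rows in total for a row fault mode.
initial = health_threshold(available=6, total=8)  # first number of resources
updated = health_threshold(available=2, total=8)  # second number, after repairs
```

Calling the same function again with the second number plays the role of the update in claim 6; a real implementation could use any monotonic mapping.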
7. The method of claim 1, wherein the fault analysis information further includes at least one fault repair mode used by the at least one physical location, each fault repair mode being configured with a health threshold;
the fault analysis information including the fault severity of the target memory specifically means that the fault analysis information includes a health value of the target memory, the health value being used to indicate the fault severity of the target memory;
the fault severity meeting the preset condition includes: the health value does not reach the health threshold of a target fault repair mode among the at least one fault repair mode;
the performing fault repair on at least one physical location in the target memory where a fault occurred specifically includes: performing fault repair on a target physical location among the at least one physical location in the target memory where a fault occurred, wherein the target physical location is a physical location, among the at least one physical location, that uses the target fault repair mode.
8. The method according to any one of claims 1-7, further comprising:
in a case where the fault severity meets an alarm condition, the CPU outputs alarm information, wherein the alarm information is used to indicate that the target memory is at risk of uncorrectable errors.
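As a non-limiting illustration of claims 1, 2 and 8, the following Python sketch shows how the CPU-side logic might select the physical locations to repair and decide whether to raise an alarm. The names `FaultLocation`, `FaultAnalysisInfo`, `select_repair_targets` and `should_alarm`, the numeric scale, and the fault mode labels are hypothetical. The sketch assumes that a lower health value means a more severe fault and interprets "does not reach" as "is strictly less than"; the claims do not state either point.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class FaultLocation:
    address: int  # hypothetical physical address of the faulty location
    mode: str     # fault mode the location belongs to, e.g. "row"


@dataclass
class FaultAnalysisInfo:
    health_value: float             # indicates the fault severity (assumed: lower is worse)
    locations: List[FaultLocation]  # fault location information


def select_repair_targets(info: FaultAnalysisInfo,
                          thresholds: Dict[str, float]) -> List[FaultLocation]:
    """Return the locations belonging to a fault mode whose threshold is not reached."""
    target_modes = {mode for mode, limit in thresholds.items()
                    if info.health_value < limit}
    return [loc for loc in info.locations if loc.mode in target_modes]


def should_alarm(info: FaultAnalysisInfo, alarm_threshold: float) -> bool:
    """Assumed alarm condition: the health value is below a separate alarm threshold."""
    return info.health_value < alarm_threshold


info = FaultAnalysisInfo(
    health_value=60.0,
    locations=[FaultLocation(0x1000, "row"), FaultLocation(0x2000, "bit")],
)
thresholds = {"row": 70.0, "bit": 50.0}  # one health threshold per fault mode
targets = select_repair_targets(info, thresholds)
```

With these hypothetical values only the row fault mode has a threshold above the health value, so only the location at `0x1000` is selected for repair. Keying `thresholds` by fault repair mode instead of fault mode gives the variant of claim 7.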
9. A memory fault handling method, characterized by comprising:
an out-of-band controller acquires error information of a target memory;
the out-of-band controller determines fault analysis information of the target memory according to the error information, wherein the fault analysis information includes a fault severity of the target memory and at least one piece of fault location information, and the at least one piece of fault location information indicates at least one physical location in the target memory where a fault occurred;
the out-of-band controller sends the fault analysis information to a Central Processing Unit (CPU), wherein the fault analysis information is used by the CPU to perform fault repair on the at least one physical location in the target memory where a fault occurred in a case where the fault severity meets a preset condition.
10. The method of claim 9, wherein
the fault analysis information including the fault severity of the target memory specifically means that the fault analysis information includes a health value of the target memory, the health value being used to indicate the fault severity of the target memory;
the fault analysis information further includes at least one fault mode to which the at least one physical location belongs, each fault mode being configured with a health threshold; the fault analysis information is specifically used by the CPU to perform fault repair on a target physical location among the at least one physical location in the target memory where a fault occurred, in a case where the health value does not reach the health threshold of a target fault mode among the at least one fault mode, wherein the target physical location is a physical location, among the at least one physical location, that belongs to the target fault mode.
11. The method of claim 10, further comprising:
the out-of-band controller clusters the at least one piece of fault location information to obtain the at least one fault mode.
12. The method according to claim 10 or 11, wherein the fault mode comprises any one of: a page fault, a bit fault, a row fault, and a storage array fault.
13. The method of claim 9, wherein
the fault analysis information including the fault severity of the target memory specifically means that the fault analysis information includes a health value of the target memory, the health value being used to indicate the fault severity of the target memory;
the fault analysis information further includes at least one fault repair mode used by the at least one physical location, each fault repair mode being configured with a health threshold; the fault analysis information is specifically used by the CPU to perform fault repair on a target physical location among the at least one physical location in the target memory where a fault occurred, in a case where the health value does not reach the health threshold of a target fault repair mode among the at least one fault repair mode, wherein the target physical location is a physical location, among the at least one physical location, that uses the target fault repair mode.
14. The method of claim 13, further comprising:
the out-of-band controller clusters the at least one piece of fault location information to obtain the at least one fault repair mode.
15. The method according to claim 13 or 14, wherein the fault repair mode comprises any one of: page isolation, bit isolation, row isolation, and storage array isolation.
16. The method according to any one of claims 9-15, wherein the determining, by the out-of-band controller, the fault analysis information of the target memory according to the error information comprises:
the out-of-band controller inputs the error information of the target memory into a machine learning model to obtain the fault analysis information of the target memory output by the machine learning model.
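Claims 11 and 16 leave the clustering method and the machine learning model unspecified. As a stand-in for illustration only, the following Python sketch uses a simple rule-based grouping, which is neither the claimed clustering nor a trained model, to assign one of the fault modes named in claim 12 to each memory bank. The `(bank, row, column)` tuple layout, the function name `classify_by_bank`, and the classification rules are assumptions; page faults are omitted because they would require an address-to-page mapping that is not described here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Cell = Tuple[int, int, int]  # hypothetical fault location: (bank, row, column)


def classify_by_bank(cells: Iterable[Cell]) -> Dict[int, str]:
    """Group fault locations by bank and label each group with a fault mode.

    Assumed rules: one distinct cell -> "bit"; several cells in one row -> "row";
    cells spread over several rows -> "storage array".
    """
    by_bank: Dict[int, Set[Tuple[int, int]]] = defaultdict(set)
    for bank, row, col in cells:
        by_bank[bank].add((row, col))  # repeated errors on one cell collapse

    modes: Dict[int, str] = {}
    for bank, positions in by_bank.items():
        rows = {row for row, _ in positions}
        if len(positions) == 1:
            modes[bank] = "bit"
        elif len(rows) == 1:
            modes[bank] = "row"
        else:
            modes[bank] = "storage array"
    return modes


errors = [(0, 5, 1), (0, 5, 9), (1, 3, 3), (1, 3, 3), (2, 1, 1), (2, 7, 4)]
modes = classify_by_bank(errors)
```

In this hypothetical input, bank 0 has two faulty columns in the same row, bank 1 has repeated errors on a single cell, and bank 2 has faults in different rows, so the three banks are labeled as a row fault, a bit fault, and a storage array fault respectively.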
17. A memory fault handling system, comprising:
processor firmware for performing the method of any one of claims 1-8;
an out-of-band management module for performing the method of any one of claims 9-16.
18. A computer device, comprising: a processor;
the processor is coupled to a memory for storing computer-executable instructions, the processor executing the computer-executable instructions stored in the memory to cause the computer device to implement the method of any one of claims 1-8 or to implement the method of any one of claims 9-16.
19. A computer-readable storage medium storing computer instructions which, when executed on a computer device, cause the computer device to perform the method of any one of claims 1-8 or the method of any one of claims 9-16.
CN202210871008.XA 2022-07-22 2022-07-22 Memory fault processing method, system and storage medium Pending CN115391075A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210871008.XA CN115391075A (en) 2022-07-22 2022-07-22 Memory fault processing method, system and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210871008.XA CN115391075A (en) 2022-07-22 2022-07-22 Memory fault processing method, system and storage medium

Publications (1)

Publication Number Publication Date
CN115391075A true CN115391075A (en) 2022-11-25

Family

ID=84115830

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210871008.XA Pending CN115391075A (en) 2022-07-22 2022-07-22 Memory fault processing method, system and storage medium

Country Status (1)

Country Link
CN (1) CN115391075A (en)

Similar Documents

Publication Publication Date Title
JP7158586B2 (en) Hard disk failure prediction method, apparatus and storage medium
CN114064333A (en) Memory fault processing method and device
EP3979079A1 (en) Memory fault handling method and apparatus, device and storage medium
US20230385141A1 (en) Multi-factor cloud service storage device error prediction
US8108724B2 (en) Field replaceable unit failure determination
CN114968652A (en) Fault processing method and computing device
WO2022028209A1 (en) Memory failure processing method and apparatus
US9069819B1 (en) Method and apparatus for reliable I/O performance anomaly detection in datacenter
Du et al. Predicting uncorrectable memory errors for proactive replacement: An empirical study on large-scale field data
CN115168087A (en) Method and device for determining granularity of repair resources of memory failure
US8176388B1 (en) System and method for soft error scrubbing
WO2024027325A1 (en) Memory fault handling methods and apparatuses, and storage medium
Zhang et al. Predicting dram-caused node unavailability in hyper-scale clouds
CN117668706A (en) Method and device for isolating memory faults of server, storage medium and electronic equipment
CN115421947A (en) Memory fault processing method and device and storage medium
CN115080331A (en) Fault processing method and computing device
CN115391075A (en) Memory fault processing method, system and storage medium
CN116775436A (en) Chip fault prediction method, device, computer equipment and storage medium
CN115391072A (en) Memory fault processing method, system and storage medium
CN115391074A (en) Memory fault processing method, system and storage medium
CN115269245B (en) Memory fault processing method and computing device
CN115495301A (en) Fault processing method, device, equipment and system
CN115658358A (en) Memory fault processing method and computer equipment
CN115686901A (en) Memory fault analysis method and computer equipment
CN116302740A (en) Memory fault repair capability assessment method and device and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20231112

Address after: 450046, 10th Floor, North Chuangzhi Tiandi Building, Shigeng Street, Longzihu Wisdom Island Middle Road East, Zhengdong New District, Zhengzhou City, Henan Province

Applicant after: Henan Kunlun Technology Co.,Ltd.

Address before: 450046 Floor 9, building 1, Zhengshang Boya Plaza, Longzihu wisdom Island, Zhengdong New Area, Zhengzhou City, Henan Province

Applicant before: XFusion Digital Technologies Co., Ltd.