CN114860487A - Memory fault identification method and memory fault isolation method - Google Patents

Info

Publication number
CN114860487A
Authority
CN
China
Prior art keywords
fault
memory
failure
memory unit
page
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210351887.3A
Other languages
Chinese (zh)
Inventor
马旭华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd
Priority to CN202210351887.3A
Publication of CN114860487A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/07: Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703: Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706: Error or fault processing not based on redundancy, the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/073: Error or fault processing not based on redundancy, the processing taking place in a memory management context, e.g. virtual memory or cache management
    • G06F11/079: Root cause analysis, i.e. error or fault diagnosis
    • G06F11/0793: Remedial or corrective actions

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Techniques For Improving Reliability Of Storages (AREA)

Abstract

The present specification provides a memory fault identification method and a memory fault isolation method. A fault database is configured in advance; it holds fault records covering the full life cycle of all memory units, and the fault record of each memory unit includes at least the number of times a correctable error (CE) has occurred during the full life cycle of that memory unit. The fault database is updated when a memory unit generates a CE. For each memory unit, a hard fault is determined to exist in the memory unit when the number of CEs recorded for its full life cycle in the fault database exceeds a count threshold. Because memory faults are isolated using a fault database that records the full-life-cycle CE count of each memory unit, isolation accuracy and coverage can be improved: the probability that a soft fault is mistakenly identified as a hard fault is reduced, and the probability that a hard fault is isolated is increased.

Description

Memory fault identification method and memory fault isolation method
Technical Field
One or more embodiments of the present disclosure relate to the field of terminal technologies, and in particular, to a memory fault identification method and a memory fault isolation method.
Background
When a memory fault is found, the memory page corresponding to the faulty address needs to be taken offline (that is, the operating system can no longer write data to the memory page; a page that has been taken offline is an isolated memory page), so that the computer does not access the page and its operating performance is not affected.
Memory faults are classified into permanent faults (hard faults) and transient faults (soft faults). A soft fault is transient and recovers after a period of time, so it does not need to be isolated; a hard fault recurs and needs to be isolated.
From the perspective of the operating system, a correctable error (CE) is reported whenever a memory unit fails. In other words, the operating system reports a CE for both hard faults and soft faults, so it cannot distinguish between the two kinds of fault from the CE alone.
In the related art, hard faults are generally identified by a daemon process (daemon). Because hard faults are repeatable, the daemon considers that a hard fault exists on a memory page if a CE occurs twice on that page within a certain time (for example, 24 hours).
However, the hard fault identification method in the related art is not accurate. In fields with high requirements on memory capacity or performance, it can degrade operating performance or leave the memory capacity unable to meet requirements.
Disclosure of Invention
In view of this, one or more embodiments of the present disclosure provide a memory fault identification method and a memory fault isolation method.
According to a first aspect of one or more embodiments of the present specification, a memory fault identification method is provided. A fault database is configured in advance; fault records covering the full life cycle of all memory units are recorded in the fault database, and the fault record of each memory unit includes at least the number of times a CE has occurred during the full life cycle of that memory unit. The method comprises the following steps:
updating the fault database when a memory unit generates a correctable error (CE);
and, for each memory unit, determining that a hard fault exists in the memory unit when the number of CEs recorded for the full life cycle of the memory unit in the fault database exceeds a count threshold.
According to a second aspect of one or more embodiments of the present specification, there is provided a memory fault isolation method, including:
determining a fault result for each memory unit according to the above memory fault identification method, wherein the fault result of each memory unit indicates whether a hard fault exists in that memory unit;
determining a faulty memory page according to the fault results of the memory units corresponding to each memory page, wherein the probability that the fault in the faulty memory page causes an uncorrectable error (UE) is greater than a probability threshold;
isolating the faulty memory page.
According to a third aspect of the embodiments of the present specification, a memory fault identification apparatus is provided. A fault database is configured in advance; fault records covering the full life cycle of all memory units are recorded in the fault database, and the fault record of each memory unit includes at least the number of times a CE has occurred during the full life cycle of that memory unit. The apparatus comprises:
a fault database updating module, configured to update the fault database when a memory unit generates a correctable error (CE);
and a memory fault identification module, configured to determine, for each memory unit, that a hard fault exists in the memory unit when the number of CEs recorded for the full life cycle of the memory unit in the fault database exceeds a count threshold.
According to a fourth aspect of embodiments herein, there is provided a memory fault isolation apparatus, the apparatus comprising:
a fault result determining module, configured to determine a fault result for each memory unit according to the above memory fault identification method, wherein the fault result of each memory unit indicates whether a hard fault exists in that memory unit;
a faulty memory page determining module, configured to determine a faulty memory page according to the fault results of the memory units corresponding to each memory page, wherein the probability that the fault in the faulty memory page causes an uncorrectable error (UE) is greater than a probability threshold;
and a memory page isolation module, configured to isolate the faulty memory page.
According to a fifth aspect of embodiments herein, there is provided a memory fault isolation system, including a plurality of servers, and a central device for managing the plurality of servers;
each server reports a CE to the central device when the CE occurs;
the central device is at least configured to identify memory units with hard faults by the above memory fault identification method, or to determine the memory pages that need to be isolated by the above memory fault isolation method and to notify the server to isolate those memory pages.
According to a sixth aspect of embodiments herein, there is provided a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the above memory fault identification method or the above memory fault isolation method.
According to a seventh aspect of embodiments herein, there is provided an electronic apparatus comprising:
a processor;
a memory for storing processor-executable instructions;
the processor executes the executable instructions to implement the memory fault identification method or the memory fault isolation method.
The present specification provides a memory fault identification method and a memory fault isolation method. A fault database is configured in advance; it holds fault records covering the full life cycle of all memory units, and the fault record of each memory unit includes at least the number of times a CE has occurred during the full life cycle of that memory unit. The fault database is updated when a memory unit generates a correctable error (CE). For each memory unit, a hard fault is determined to exist in the memory unit when the number of CEs recorded for its full life cycle in the fault database exceeds a count threshold.
Because memory faults are isolated using a fault database that records the full-life-cycle CE count of each memory unit, isolation accuracy and coverage can be improved: the probability that a soft fault is mistakenly identified as a hard fault is reduced, and the probability that a hard fault is isolated is increased.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the specification.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present specification and together with the description, serve to explain the principles of the specification.
Fig. 1 is a flowchart illustrating a memory fault isolation method according to the related art.
Fig. 2A is a flowchart illustrating a memory failure identification method according to an exemplary embodiment of the present disclosure.
Fig. 2B is a schematic diagram of a fault type according to an exemplary embodiment of the present disclosure.
Fig. 3 is a flow chart illustrating a method of memory fault isolation according to an exemplary embodiment of the present disclosure.
Fig. 4A is a schematic diagram of a memory fault isolation method according to an embodiment of the present disclosure.
Fig. 4B is a flow chart illustrating a method of memory fault isolation according to an embodiment of the present disclosure.
Fig. 5 is a block diagram illustrating a memory failure recognition apparatus according to an exemplary embodiment of the present disclosure.
Fig. 6 is a block diagram illustrating a memory fault isolation device according to an exemplary embodiment of the present disclosure.
Fig. 7 is a hardware structure diagram of an electronic device in which a memory fault recognition apparatus or a memory fault isolation apparatus is located according to an exemplary embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with one or more embodiments of the present specification. Rather, they are merely examples of apparatus and methods consistent with certain aspects of one or more embodiments of the specification, as detailed in the claims which follow.
It should be noted that: in other embodiments, the steps of the corresponding methods are not necessarily performed in the order shown and described herein. In some other embodiments, the method may include more or fewer steps than those described herein. Moreover, a single step described in this specification may be broken down into multiple steps for description in other embodiments; multiple steps described in this specification may be combined into a single step in other embodiments.
Memory is one of the important components of a computer and serves as the bridge to the processor (CPU). All programs execute in memory, so memory performance has a great influence on the computer. Memory, also called internal memory, temporarily stores the CPU's operation data and exchanges data with external storage (such as a hard disk). While the computer is running, the CPU transfers the data to be operated on into memory, performs the operation, and reads out the result after the operation is completed.
A memory fault may occur, that is, a memory cell stays at a high level or a low level and its state cannot be modified (in other words, the smallest unit of a memory fault is a memory cell). Memory faults are generally divided into hard faults and soft faults. A hard fault is a fault that can occur repeatedly at the same address, meaning that the memory unit corresponding to the address is damaged. A soft fault is a random fault that generally occurs only once at an address, and is usually transient damage to a memory cell caused by high-energy radiation or electromagnetic interference.
When a memory unit is damaged and the operating system uses it, the operating system cannot access the memory unit normally. When this condition is detected, a CE is reported and an error correction algorithm is used to find other available memory units. If there are too many faults, uncorrectable errors (UEs) may also result, and if the memory page corresponding to a UE is being used in kernel mode, the system may go down.
It can be seen that a damaged memory unit affects the performance of the operating system, and that too many memory faults may bring the system down. Therefore, when a memory fault occurs, the faulty memory unit needs to be isolated, that is, the operating system no longer accesses it, so as to preserve the execution efficiency of the operating system and limit the impact of UEs on it.
A soft fault is a random fault that recovers after a period of time, so it does not need to be isolated; what needs to be isolated is a hard fault, which reflects substantial damage. The operating system therefore needs to distinguish hard faults from soft faults in order to isolate memory faults. Because a soft fault is transient and generally occurs only once in a given memory unit, whereas a hard fault is permanent and recurs frequently, the two can be distinguished by the number of CEs reported.
In the related art, as shown in Fig. 1, when the hardware reports a CE, a daemon determines whether the number of CEs on the memory page corresponding to the reported CE within a predetermined time (generally 24 hours) is greater than or equal to 2. If so, the page offline interface of the kernel is called to take the memory page offline.
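The related-art check described above can be sketched as a per-page sliding window. This is only an illustrative sketch: the class name, the use of a page number as the key, and timestamps in seconds are assumptions, not details taken from the related art.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 24 * 3600  # predetermined time from the text (24 hours)
PAGE_CE_THRESHOLD = 2       # offline once a page has seen >= 2 CEs in the window


class PageWindowDetector:
    """Hypothetical per-page CE counter over a sliding time window."""

    def __init__(self, window=WINDOW_SECONDS, threshold=PAGE_CE_THRESHOLD):
        self.window = window
        self.threshold = threshold
        self._events = defaultdict(deque)  # page number -> CE timestamps

    def on_ce(self, page, timestamp):
        """Record a CE; return True if the page would be taken offline."""
        events = self._events[page]
        events.append(timestamp)
        # Forget CEs that have fallen out of the window.
        while events and timestamp - events[0] > self.window:
            events.popleft()
        return len(events) >= self.threshold
```

Two CEs on the same page one hour apart trigger the offline decision, while two CEs 25 hours apart do not, regardless of which memory units produced them.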
This method has the following problems. In the related art, the daemon runs in the kernel. To limit its effect on system execution efficiency, the daemon cannot occupy too many resources, so fault isolation relies only on simple threshold identification, which may lead to low identification accuracy and low coverage.
Specifically, first, the lookup is based on memory pages, and one memory page contains many memory units. Two CEs that appear on one memory page within the predetermined time may therefore have been reported by different memory units, and there is a certain probability that each is the first CE of its memory unit, so that soft faults are mistakenly identified as a hard fault. If memory pages are isolated on this basis, pages are isolated unnecessarily and the host's available memory is wasted.
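The granularity problem can be illustrated with a small comparison. In this sketch a physical address stands in for a memory unit, the page size is assumed to be 4096 bytes, the time window is ignored, and the same threshold (at least 2 CEs) is applied at both granularities so that only the effect of granularity is visible; all names are hypothetical.

```python
from collections import Counter

PAGE_SIZE = 4096  # assumed page size in bytes


def pages_flagged(ce_addresses, threshold=2):
    """Page granularity: flag a page once it has accumulated `threshold` CEs."""
    counts = Counter(addr // PAGE_SIZE for addr in ce_addresses)
    return {page for page, n in counts.items() if n >= threshold}


def cells_flagged(ce_addresses, threshold=2):
    """Cell granularity: flag an address only if that same address repeats."""
    counts = Counter(ce_addresses)
    return {addr for addr, n in counts.items() if n >= threshold}


# Two first-time CEs from different cells that happen to share page 1.
ces = [0x1000, 0x1040]
print(pages_flagged(ces))  # page 1 is flagged
print(cells_flagged(ces))  # no cell is flagged
```

Counting per page flags the page even though neither cell has repeated, which is the misidentification described above.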
Second, because the lookup is based on memory pages, a long predetermined time leads to a high error rate (the longer the time, the more soft faults occur across the memory units of a page). To reduce the error rate and the memory occupied by the daemon, the predetermined time is therefore generally short (generally 24 hours). However, a CE event is triggered only when a memory unit with a hard fault is accessed, and access is a relatively random process: the interval between accesses may be minutes, days, or even months. Because the predetermined time cannot be long, a large number of hard faults go undiscovered.
Furthermore, if hard faults are handled by isolating memory pages after they are found, the following problem arises. Memory page isolation is generally implemented by calling the page offline interface of the kernel. Page offline is a best-effort interface that is not guaranteed to succeed, so a memory page cannot be isolated in some cases, for example when the page is being used by a process in the kernel. It can be seen that the success rate of memory page isolation is not high with the methods in the related art.
For most application scenarios, the memory fault isolation method in the related art, although not highly accurate, has little influence on system operation and is therefore still effective. However, in scenarios that demand high memory stability and expect the actual memory capacity to be close to the nominal value, the method identifies soft faults as hard faults, so that memory pages that do not need isolation are isolated and the actual memory capacity is reduced. The method also misses some hard faults, so that faulty memory units impair memory stability and thus system operating performance.
Take the cloud computing scenario of a public cloud as an example. First, the requirement on system operating performance is high: in large-scale cloud computing, memory faults rank in the top two causes of service impairment (service unavailability and service degradation) and are a key factor affecting the stability of elastic computing. Second, each memory module of a single server has a large capacity, and because of the limitations of the memory manufacturing process, the fault rate of each module is higher, so a single server sees more CEs and UEs. In addition, one server carries many services, so a CE or UE on a server may affect many services; the scenario is therefore more sensitive to memory faults and requires that all hard faults be identified. Third, in a public cloud the computing service is leased to different users, who expect the service they purchase to be close to its nominal value. It is therefore more desirable that the actual memory capacity be close to the nominal value; that is, it is unacceptable for a soft fault to be mistakenly identified as a hard fault and for the memory page containing it to be isolated.
For the foregoing reasons, the cloud computing scenario of a public cloud needs a memory fault identification method and handling method with higher accuracy, coverage (that is, the time range over which CEs are examined), and success rate.
Based on this, to address the first two shortcomings of the related art, the present specification may perform fault isolation analysis in user mode instead of setting up a daemon in the kernel. That is, CE data is collected to form a full-life-cycle fault database, and the number of CEs that have occurred in each memory unit is determined from the fault database, thereby determining whether a hard fault has occurred in each memory unit. Although this process may be more time-consuming than the related art, and the fault database occupies some storage space, the method accepts the cost of some memory and storage space to identify hard faults, because the impact of a UE is far greater.
In addition, regarding the handling of faults with the page isolation approach of the related art: if only one memory unit in a memory page has a hard fault, the probability that the page causes a UE is low. Moreover, the memory units of one memory page may be distributed across different memory banks, so it is difficult to analyze whether a fault may cause a UE from the faults of the memory units corresponding to the page alone. Faults may therefore be further classified at different levels. For example, at the level of a memory bank (bank), the fault type of the bank is determined according to the faults of the memory units in the bank and the physical relationship among those memory units, and whether to isolate the memory pages mapped to the bank can then be determined according to the bank's fault type. Furthermore, the fault type of each dual in-line memory module (DIMM) can be determined according to the fault types of its memory banks, and used to decide whether to carry out maintenance on the memory.
In addition, in this specification, the severity of a fault may also be determined according to the fault results of the memory units in each memory page in order to decide on page isolation. Memory pages with little impact need not be isolated at first, so that more memory remains available to users of the leased service.
Finally, because page offline is a best-effort interface, it can be called in a loop until isolation succeeds, which improves the isolation success rate.
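The retry loop can be sketched as follows. The `try_offline` callable is a hypothetical stand-in for whatever page offline interface the platform exposes, and the attempt cap and delay are assumptions added so that the sketch cannot spin forever; the text itself only says to repeat until isolation succeeds.

```python
import time


def offline_until_success(page, try_offline, max_attempts=10, delay_seconds=0.0):
    """Call a best-effort offline function repeatedly.

    try_offline(page) is a caller-supplied (hypothetical) function that
    returns True when the page was isolated and False otherwise.
    Returns the number of attempts used, or None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        if try_offline(page):
            return attempt
        time.sleep(delay_seconds)  # give the current user of the page time to release it
    return None
```

Returning the attempt count rather than a plain boolean makes it easy to log how often the interface had to be retried.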
In other words, the present specification provides a memory fault identification method and a memory fault isolation method. A fault database is configured in advance; it holds fault records covering the full life cycle of all memory units, and the fault record of each memory unit includes at least the number of times a CE has occurred during the full life cycle of that memory unit. The fault database is updated when a memory unit generates a correctable error (CE). For each memory unit, a hard fault is determined to exist in the memory unit when the number of CEs recorded for its full life cycle in the fault database exceeds a count threshold.
Because memory faults are isolated using a fault database that records the full-life-cycle CE count of each memory unit, isolation accuracy and coverage can be improved: the probability that a soft fault is mistakenly identified as a hard fault is reduced, and the probability that a hard fault is isolated is increased.
In other words, compared with the related art, which identifies hard faults at memory page granularity, the method provided by this specification identifies hard faults at the granularity of the memory unit (cell), which reduces the probability that a soft fault is isolated by mistake. Whether a hard fault exists in each memory unit is determined from a fault database that records the full-life-cycle CE count of each memory unit, so more historical data is taken into account, and a hard fault can be identified even if the memory unit is accessed again only after a long time.
The memory fault identification method provided in the present specification is described in detail below. Fig. 2A is a flowchart of a memory fault identification method according to an exemplary embodiment of the present disclosure, which includes the following steps:
Step 201: updating the fault database when a memory unit generates a correctable error (CE).
Before the method is executed, a fault database is configured in advance. Fault records covering the full life cycle of all memory units are recorded in the fault database, and the fault record of each memory unit includes at least the number of times a CE has occurred during the full life cycle of that memory unit.
In other words, the fault database stores full-life-cycle fault records at memory unit granularity. The historical fault records of each memory unit can be obtained from the fault database, which makes it possible to determine which memory unit is damaged.
Next, several terms used in step 201 are explained.
First, the memory unit. A memory unit is a memory cell. Memory cells are arranged in a matrix in the memory; each row corresponds to a row address line (word line) and each column to a column address line (bit line), so each cell corresponds to one row address line and one column address line, and the address of a cell in the memory is determined by the numbers of its row and column.
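One possible way to represent such a cell address in code is a small immutable record, sketched below. The row and column fields follow the description above; the `dimm` and `bank` fields are assumptions added so that cells on different modules or banks do not collide, and the type name is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellAddress:
    """Hypothetical key identifying one memory cell."""
    dimm: str    # assumed module label, e.g. a slot name
    bank: int    # assumed bank index within the module
    row: int     # row (word line) number
    column: int  # column (bit line) number


# Frozen dataclasses are hashable, so a cell address can key a fault record.
ce_counts = {}
cell = CellAddress(dimm="DIMM_A0", bank=3, row=0x1A2, column=0x3F)
ce_counts[cell] = ce_counts.get(cell, 0) + 1
```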
CEs and memory faults are described in detail above. Note that in this document a fault refers to a failure that occurs physically in the memory, whereas an error refers to a problem encountered during operation of the operating system.
Next, the fault database. The full life cycle refers to all records since the server started running or was first put into use. The fault database is a database newly added by the method of this specification for storing fault records. It can store the number of CEs that have occurred in each memory unit over its full life cycle, and can also store the original CE log files or CE records output in a specified format. After it is determined whether each memory unit has a hard fault, that result can be stored as well, and after the fault type (error pattern) of each memory unit is determined, the fault type can also be stored. In addition, a CE counts toward the full-life-cycle CE count of a memory unit when the reported CE was caused by a fault of that memory unit.
Updating the fault database means updating the data stored in it according to the CE. If the fault database stores only the full-life-cycle CE count of each memory unit, then whenever a memory unit causes a CE, the count of that memory unit is incremented by 1. Other cases (in which the fault database also stores other data) are similar and are not described again here.
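The count-only case can be sketched with Python's built-in `sqlite3` module. The table name, column names, and the use of a text key for the memory unit are assumptions for illustration; the specification does not prescribe a storage engine or schema.

```python
import sqlite3


def open_fault_db(path=":memory:"):
    """Open (or create) a hypothetical fault database with one row per cell."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cell_faults ("
        " cell TEXT PRIMARY KEY,"
        " ce_count INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def record_ce(conn, cell):
    """Increment the full-life-cycle CE count of `cell`; return the new count."""
    with conn:  # commit both statements together
        conn.execute(
            "INSERT OR IGNORE INTO cell_faults (cell, ce_count) VALUES (?, 0)",
            (cell,),
        )
        conn.execute(
            "UPDATE cell_faults SET ce_count = ce_count + 1 WHERE cell = ?",
            (cell,),
        )
    row = conn.execute(
        "SELECT ce_count FROM cell_faults WHERE cell = ?", (cell,)
    ).fetchone()
    return row[0]
```

Passing a file path instead of `":memory:"` would keep the counts across restarts, which matters because the counts are meant to span the full life cycle.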
It should further be noted that when a memory module is replaced, the corresponding fault database should be reset along with the replacement, so that the records in the fault database remain accurate.
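A minimal sketch of such a reset, assuming the CE counts are held in a dictionary keyed by tuples whose first element is a module label (both the key layout and the function name are hypothetical):

```python
def reset_module(fault_counts, dimm):
    """Drop all records for cells on the replaced module.

    fault_counts maps (dimm, bank, row, column) tuples to CE counts.
    Returns the number of records removed.
    """
    stale = [key for key in fault_counts if key[0] == dimm]
    for key in stale:
        del fault_counts[key]
    return len(stale)
```

Records for other modules are left untouched, so their full-life-cycle history is preserved.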
Now that the terms involved in step 201 have been explained, a specific implementation of step 201 is described next; of course, the implementation of step 201 is not limited to this.
In the related art, the daemon process runs in the kernel and can both read and write kernel data. Writing kernel data is relatively risky and can easily bring the system down. To avoid this, the method of this specification can run in user mode: it only needs to read kernel data, the fault database is not located in the kernel, and nothing needs to be written to the kernel. So that a process running in user mode can learn of CEs, the method further includes monitoring the reported CEs and outputting a CE record in a specified format for each one.
In other words, step 201 specifically includes: acquiring the log file corresponding to the CE generated by a memory unit; parsing the acquired log file and outputting a fault record in a preset format, where the fault record at least includes the address of the memory unit in which the CE occurred; and storing the fault record in the fault database and updating the number of CE occurrences of the corresponding memory unit in the fault database.
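The three sub-steps of step 201 can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions: the log line layout in `LOG_PATTERN`, the fields of `CeRecord`, and the in-memory `FaultDatabase` class are all hypothetical names invented for the example, not a format defined by this specification or by any real logging tool.

```python
import re
from collections import Counter
from dataclasses import dataclass

# Hypothetical log layout, assumed purely for illustration.
LOG_PATTERN = re.compile(
    r"CE dimm=(?P<dimm>\d+) bank=(?P<bank>\d+) row=(?P<row>\d+) col=(?P<col>\d+)"
)


@dataclass(frozen=True)
class CeRecord:
    """Fault record in a preset format: the address of the memory unit."""
    dimm: int
    bank: int
    row: int
    col: int

    @property
    def unit(self):
        return (self.dimm, self.bank, self.row, self.col)


class FaultDatabase:
    """User-mode store of full-life-cycle CE records (kept in memory here)."""

    def __init__(self):
        self.records = []          # every formatted CE record ever seen
        self.ce_counts = Counter() # memory unit -> lifetime CE count

    def update(self, record):
        self.records.append(record)
        self.ce_counts[record.unit] += 1


def parse_ce_log(line):
    """Parse one log line; return None if it does not describe a CE."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    return CeRecord(**{k: int(v) for k, v in match.groupdict().items()})


def handle_log_line(db, line):
    """Acquire -> parse -> store and count, as in step 201."""
    record = parse_ce_log(line)
    if record is not None:
        db.update(record)
    return record
```

Because the sketch only reads log text and keeps its state in an ordinary object, nothing in it writes to the kernel, which matches the user-mode design described above.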
Step 203, for each memory unit, determining that a hard fault exists in the memory unit when the number of CE occurrences in the full life cycle of the memory unit recorded in the failure database exceeds a count threshold.
In other words, hard faults can be identified by comparing the number of CE occurrences of each memory unit in the full life cycle with the count threshold, and determining that any memory unit exceeding the threshold has a hard fault.
The reason for comparing the number of CEs with the count threshold is explained first. For a memory fault, it cannot be known whether the fault of a given memory unit is a hard fault or a soft fault unless the memory module is removed from the host and tested by a machine. A soft fault is transient damage, whereas a hard fault is permanent damage. Because a soft fault recovers, the probability that the system accesses the affected memory unit several times while the fault lasts is low, and the probability that the same memory unit suffers several soft faults is also low. Therefore, a memory unit in which only a few CEs have occurred can be regarded as having a soft fault, and a memory unit in which many CEs have occurred is determined to have a hard fault.
The count threshold may be set as needed, for example to 2. When the server has been running for a long time, the count threshold may be set to a value greater than 2, since the probability that each memory unit has experienced soft faults increases with running time.
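The comparison in step 203 amounts to a single filter over the per-unit counters. The following is a minimal sketch, assuming the counts are available as a mapping from a memory unit identifier to its full-life-cycle CE count; the function name `find_hard_fault_units` and the default threshold of 2 (taken from the example value above) are illustrative choices.

```python
def find_hard_fault_units(ce_counts, count_threshold=2):
    """Return the memory units whose lifetime CE count exceeds the threshold."""
    return {unit for unit, count in ce_counts.items() if count > count_threshold}
```

"Exceeds" is read here as strictly greater than the threshold; raising `count_threshold` for long-running servers only changes the argument, not the logic.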
Regarding the execution timing of step 201 and step 203, step 203 may be executed after the failure database is updated in step 201, or the execution timing of step 203 may be unrelated to step 201, for example, step 203 is executed according to a period independent from step 201.
It should be noted that, in this specification, the more accurate detection of hard faults results from the combination of detection over the full life cycle and detection at memory-unit granularity; if only one of the two is used, the identification accuracy of hard faults cannot be improved.
Specifically, if detection is performed at memory-page granularity based on full-life-cycle fault records, soft faults are relatively likely to occur in different memory units of the same memory page, so more soft faults are easily misidentified as hard faults, which further reduces the available memory space. If hard faults are detected at memory-unit granularity but based only on the fault records within a fixed time window, some hard faults may not be identified because the window is too short, which results in missed detection of hard faults.
In the method of this specification, the number of CE occurrences of each memory unit over the full life cycle is counted and hard faults are identified from that count. Because the CE count is recorded at memory-unit granularity, the probability of identifying a soft fault as a hard fault is reduced. And because hard faults are detected over the full life cycle, the coverage of hard fault identification is improved and hard faults are identified more easily.
In addition, after the hard fault is identified, in order to further enrich the records of the fault database so as to judge the fault degree of each memory unit and prepare for subsequent fault processing, fault types of other levels can be further determined.
Physically, a plurality of memory units form a memory bank (bank), and the physical location of each faulty memory unit within the bank also affects the severity of the fault. For example, for the same number of hard fault memory units in a bank, the severity may differ because the faulty memory units are distributed differently. Therefore, the fault type of each memory bank can be determined at the bank level from the fault result of each memory unit in the bank and the physical location of each memory unit. The fault type of the memory bank can then be used to determine how likely a fault in each memory page of that bank is to affect system operation, and further to evaluate the probability that the fault causes a UE.
In other words, the method further comprises: for each memory bank, determining the fault type of the memory bank according to the full-life-cycle fault records, in the fault database, of the memory units contained in the memory bank and the physical location relationship of those memory units; the fault type of the memory bank is used to characterize the severity of the fault of the memory bank.
How to determine the fault type of a memory bank is described with reference to fig. 2B as an example. In fig. 2B, each large square box represents a memory bank, each small square box represents a memory cell in which a CE has occurred, the number next to a small square box is the number of times the CE has occurred, and n is an arbitrary value greater than 1.
If only a small number (for example, 1) of memory units in a memory bank have had a CE, and none of those memory units has a hard fault, it may be determined that the fault type of the memory bank is a single fault, which indicates that the fault of the memory bank is not serious. This case, with a count threshold of 1, is shown in fig. 2B (a).
If only a small number of memory cells in a memory bank have had a CE (and there is no faulty row or faulty column), but at least one of those memory cells has a hard fault, it may be determined that the fault type of the memory bank is a repeated fault, which is slightly more serious than a single fault. This case, with a count threshold of 1, is shown in fig. 2B (b).
In other words, if the number of memory cells in the memory bank in which a CE has occurred is smaller than the first number threshold and at least 1 of those memory cells has a hard fault, it is determined that the fault type of the memory bank is a repeated fault. The first number threshold may be set according to actual requirements, for example to 2 or to another value, which is not limited in this specification.
If a plurality of memory cells in one row or one column of a memory bank have had a CE and none of those memory cells has a hard fault, it is determined that the memory bank has a faulty row or faulty column. If a memory bank has a faulty row or faulty column (and no seriously faulty row or column), the fault type of the bank is regarded as a row fault/column fault. A row/column fault is somewhat more serious than a repeated fault. This case, with a count threshold of 1, is shown in fig. 2B (c).
In other words, if there is a faulty row or faulty column in the memory bank, the fault type of the memory bank is determined to be a row fault or column fault; in a faulty row or faulty column, the number of memory units in which a CE has occurred is greater than or equal to the first number threshold, and no memory unit has a hard fault.
If a plurality of memory units in one row or one column have had a CE and at least one of them has a hard fault, the memory bank is considered to have a seriously faulty row or seriously faulty column. If the number of seriously faulty rows or columns does not exceed the second number threshold, the fault type of the memory bank is regarded as a serious row fault/serious column fault, which is more serious than a row fault/column fault. This case, with a count threshold of 1, is shown in fig. 2B (d).
In other words, if there is a seriously faulty row or seriously faulty column in the memory bank and the number of seriously faulty rows and/or columns is less than the second number threshold, it is determined that the fault type of the memory bank is a serious row fault or serious column fault; in a seriously faulty row or column, the number of memory units in which a CE has occurred is greater than or equal to the first number threshold, and at least one memory unit has a hard fault.
In addition, if a large number of memory cells in the memory bank (for example, more than 100) have had a CE, and the total number of rows and columns over which those memory cells are distributed exceeds a large value (for example, more than 50), the memory bank is considered to have a serious fault, which is more serious than a serious row fault/serious column fault. This case, with a count threshold of 1, is shown in fig. 2B (e).
In other words, when the number of memory cells in the memory bank where the CE has occurred is greater than the third number threshold and the total number of rows and columns where the CE has occurred is greater than the fourth number threshold, it is determined that the failure type of the memory bank is a serious failure; the third quantity threshold is greater than the first quantity threshold, and the fourth quantity threshold is greater than the second quantity threshold.
The method of determining the failure type of the bank is not limited to this, and the failure type of the bank may be determined based on other methods.
Therefore, the fault type of each memory bank with faults can be determined, and the severity of the faults of the memory units can be judged better.
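The bank-level rules can be approximated by the following sketch. It is a simplified reading, not the definitive procedure: `classify_bank` and its parameter names are hypothetical; `first`, `third` and `fourth` stand for the first, third and fourth number thresholds with the example values 2, 100 and 50; the second number threshold is left out because the text does not say how a bank at or above it is classified; and a bank whose CE cells share no row or column is treated as a single or repeated fault regardless of how many cells there are, which is an assumption.

```python
from collections import Counter


def classify_bank(ce_counts, count_threshold=2, first=2, third=100, fourth=50):
    """Classify one bank; ce_counts maps (row, col) -> lifetime CE count."""
    if not ce_counts:
        return None  # no CE recorded for this bank
    # Units whose CE count exceeds the count threshold have a hard fault.
    hard = {unit for unit, n in ce_counts.items() if n > count_threshold}
    rows = Counter(row for row, _ in ce_counts)
    cols = Counter(col for _, col in ce_counts)
    # Most serious case first: many CE cells spread over many rows/columns.
    if len(ce_counts) > third and len(rows) + len(cols) > fourth:
        return "serious fault"
    # Rows/columns holding at least `first` cells that have had a CE.
    faulty_rows = {row for row, n in rows.items() if n >= first}
    faulty_cols = {col for col, n in cols.items() if n >= first}
    # A faulty row/column containing a hard fault is a seriously faulty one.
    if any(row in faulty_rows or col in faulty_cols for row, col in hard):
        return "serious row/column fault"
    if faulty_rows or faulty_cols:
        return "row/column fault"
    return "repeated fault" if hard else "single fault"
```

Checking the most serious type first keeps each bank in exactly one category, in line with the severity ordering given above.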
In addition, since a plurality of memory banks form a DIMM, the fault type can be further identified at the DIMM level, so that the whole DIMM can be handled when its fault is serious.
In other words, the method further comprises: for each dual in-line memory module (DIMM), determining the fault type of the DIMM according to the fault type of each memory bank contained in the DIMM; the fault type of the DIMM is used to characterize the severity of the fault of the DIMM.
Specifically, given that the fault types of a memory bank include single fault, repeated fault, row fault, column fault, serious row fault, serious column fault, and serious fault, the fault type of the DIMM may be determined according to the following scheme:
When only 1 memory bank on the whole DIMM has a fault and the fault type of that memory bank is a single fault, the fault type of the DIMM is determined to be a simple single fault.
When a plurality of memory banks on the whole DIMM have faults and the fault types of those memory banks are all single faults, the fault type of the DIMM is determined to be a mixed single fault. A mixed single fault is more serious than a simple single fault.
When only 1 memory bank on the whole DIMM has a fault and the fault type of that memory bank is a repeated fault, the fault type of the DIMM is determined to be a simple repeated fault. A simple repeated fault is more serious than a mixed single fault.
When a plurality of memory banks on the whole DIMM have faults and the most serious fault type among those memory banks is a repeated fault, the fault type of the DIMM is determined to be a mixed repeated fault. A mixed repeated fault is more serious than a simple repeated fault.
When only 1 memory bank on the whole DIMM has a fault and the fault type of that memory bank is a row fault or a column fault, the fault type of the DIMM is determined to be a simple row-column fault. A simple row-column fault is more serious than a mixed repeated fault.
When a plurality of memory banks on the whole DIMM have faults and the most serious fault type among those memory banks is a row fault or a column fault, the fault type of the DIMM is determined to be a mixed row-column fault. A mixed row-column fault is more serious than a simple row-column fault.
When only 1 memory bank on the whole DIMM has a fault and the fault type of that memory bank is a serious row fault/serious column fault, the fault type of the DIMM is determined to be a single serious row-column fault. A single serious row-column fault is more serious than a mixed row-column fault.
When a plurality of memory banks on the whole DIMM have faults and the most serious fault type among those memory banks is a serious row fault or a serious column fault, the fault type of the DIMM is determined to be a mixed serious row-column fault. A mixed serious row-column fault is more serious than a single serious row-column fault.
When only 1 memory bank on the whole DIMM has a fault and the fault type of that memory bank is a serious fault, the fault type of the DIMM is determined to be a single serious fault. A single serious fault is more serious than a mixed serious row-column fault.
When a plurality of memory banks on the whole DIMM have faults and the most serious fault type among those memory banks is a serious fault, the fault type of the DIMM is determined to be a mixed serious fault. A mixed serious fault is more serious than a single serious fault.
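The ten DIMM-level cases above follow one pattern: take the most serious bank fault type on the DIMM, then choose the first or second name of the pair depending on whether one or several banks are faulty. A minimal sketch of that pattern follows; the string labels, the merging of row and column faults into one label, and the names `BANK_SEVERITY`, `DIMM_TYPES`, `DIMM_SEVERITY` and `classify_dimm` are assumptions made for illustration.

```python
# Bank fault types, ordered from least to most serious.
BANK_SEVERITY = [
    "single fault",
    "repeated fault",
    "row/column fault",
    "serious row/column fault",
    "serious fault",
]

# Worst bank fault type -> (DIMM type with 1 faulty bank, with several).
DIMM_TYPES = {
    "single fault": ("simple single fault", "mixed single fault"),
    "repeated fault": ("simple repeated fault", "mixed repeated fault"),
    "row/column fault": ("simple row-column fault", "mixed row-column fault"),
    "serious row/column fault": (
        "single serious row-column fault",
        "mixed serious row-column fault",
    ),
    "serious fault": ("single serious fault", "mixed serious fault"),
}

# DIMM fault types, ordered from least to most serious.
DIMM_SEVERITY = [name for bank in BANK_SEVERITY for name in DIMM_TYPES[bank]]


def classify_dimm(bank_fault_types):
    """bank_fault_types holds one entry per bank; None means no fault."""
    faulty = [t for t in bank_fault_types if t is not None]
    if not faulty:
        return None
    worst = max(faulty, key=BANK_SEVERITY.index)
    one_bank, several_banks = DIMM_TYPES[worst]
    return one_bank if len(faulty) == 1 else several_banks
```

The position of a result in `DIMM_SEVERITY` can then serve as a numeric severity rank when deciding whether the whole DIMM should be handled.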
After identifying the hard fault, the hard fault needs to be processed so as not to affect the normal operation of the system.
For hard fault processing, memory fault isolation is generally used to avoid affecting the normal operation of the operating system. Of course, there may be other processing manners for the memory unit or memory page with hard fault, and this description is only given by taking memory isolation as an example.
For memory fault isolation, it is possible to isolate only the memory cells in which a hard fault has occurred. In the related art, however, the existing kernel already provides a page-level fault isolation interface (page offline). Based on this interface, a memory page with a serious fault may be isolated according to the faults of the memory units contained in each memory page.
Next, a memory fault isolation method in this specification will be described in detail, as shown in fig. 3, where fig. 3 is a flowchart of a memory fault isolation method according to an exemplary embodiment of this specification, and includes the following steps:
step 301, determining a failure result of each memory cell according to the memory failure identification method.
The fault result of each memory unit indicates whether a hard fault exists in that memory unit.
In other words, according to the memory fault identification method, the memory unit with the hard fault can be determined, so that fault isolation is further completed.
Step 303, determining a failed memory page according to the memory unit failure result corresponding to each memory page.
Here, the probability that a fault of the faulty memory page causes an uncorrectable error (UE) is greater than a probability threshold.
Specifically, for a single memory unit, the more often a CE is repeatedly reported, the more seriously the CE affects system operation, so the severity of the impact on system operation needs to be determined according to the number of CE occurrences of the memory unit.
In addition, memory pages are divided by the operating system, so two memory units with consecutive addresses in a memory page may not be physically adjacent in a memory bank. Therefore, a faulty memory page may be determined based on the physical distribution of the faulty memory units.
For example, in some cases, the number of memory cells having a hard fault in two memory pages is the same, but the memory cells having a hard fault in one memory page are concentrated in a few banks, and the memory cells having a hard fault in another memory page are dispersed in multiple banks, in which case the probability of causing the UE by the two memory pages is different, so that the failed memory page needs to be determined according to the failure type of the bank.
In addition, not every memory page that contains a faulty memory unit needs to be isolated. Because memory faults are isolated page by page, if only one memory unit of a memory page has a hard fault, that memory unit has little influence on system operation, and isolating the whole memory page would waste memory space and reduce the available memory. Therefore, what needs to be isolated are the memory pages that have a large impact on the operation of the operating system, such as memory pages that are more likely to cause a UE or memory pages that cause frequent CEs.
Specifically, how a faulty memory page is determined from the fault type of the memory bank may be set according to actual conditions. For example, if an application has a low tolerance for memory faults, the memory pages related to a faulty row or faulty column may be isolated as soon as a row fault or column fault exists in the memory bank; if an application has a high tolerance for memory faults, the related memory pages may be isolated only when the fault in the memory bank is more serious. This specification does not limit the method of determining a faulty memory page; any method can be used as long as the probability that the determined faulty memory page causes a UE is greater than the probability threshold.
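One possible policy of this kind is sketched below. It is only an assumption-laden illustration: the bank fault severity is used as a stand-in for "probability of causing a UE exceeds the probability threshold", and the function name `pages_to_isolate`, the `SEVERITY` table and the input mappings are hypothetical.

```python
# Bank fault types ranked by severity (higher is more serious).
SEVERITY = {
    "single fault": 0,
    "repeated fault": 1,
    "row/column fault": 2,
    "serious row/column fault": 3,
    "serious fault": 4,
}


def pages_to_isolate(page_banks, bank_fault_types, min_bank_severity):
    """Select pages touching a bank at least as faulty as min_bank_severity.

    page_banks:        page address -> banks holding that page's CE units
    bank_fault_types:  bank id -> fault type of that bank
    min_bank_severity: lowest bank fault type that triggers isolation;
                       a low-tolerance application passes a milder type.
    """
    limit = SEVERITY[min_bank_severity]
    selected = []
    for page, banks in page_banks.items():
        worst = max(
            (SEVERITY[bank_fault_types[b]] for b in banks if b in bank_fault_types),
            default=-1,  # page has no faulty bank
        )
        if worst >= limit:
            selected.append(page)
    return sorted(selected)
```

With this shape, the tolerance of the application is expressed purely through `min_bank_severity`, so the same records in the fault database can serve applications with different requirements.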
Step 305, isolating the failed memory page.
It should be noted that isolating a memory page means taking the memory page offline through the page offline interface, so that the operating system can no longer use the memory page and the memory page is separated from the available memory pages. Specifically, the operating system stores an accessible page table that records the pages the operating system can access. When a memory page is taken offline, it is deleted from the accessible page table, or it is marked as a bad page in the accessible page table, so that the operating system can no longer access it.
In addition, when a memory page is taken offline, the page offline operation may fail if a process is using that memory page. To make the page offline succeed, it may be retried in a loop whenever it fails, until the memory page has been successfully taken offline.
In other words, step 305 includes: calling a preset isolation memory page interface to isolate the fault memory page; under the condition that the isolation of the fault memory page fails, circularly executing the following steps until the isolation of the fault memory page is successful: and calling a preset isolation memory page interface to isolate the fault memory page when the specified period is reached.
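The retry loop of step 305 can be sketched as follows. The isolation interface is passed in as a callable so that the sketch stays independent of any particular kernel interface; `isolate_with_retry`, `offline_page`, `period_s` and the optional `max_attempts` cap are hypothetical names, and the cap is an addition for safety that the text itself does not require.

```python
import time


def isolate_with_retry(page, offline_page, period_s=1.0, max_attempts=None,
                       sleep=time.sleep):
    """Call offline_page(page) until it succeeds, waiting period_s between tries.

    offline_page must return True on success and False on failure.
    Returns the number of attempts used, or None if max_attempts was
    reached without success.
    """
    attempts = 0
    while True:
        attempts += 1
        if offline_page(page):
            return attempts
        if max_attempts is not None and attempts >= max_attempts:
            return None
        sleep(period_s)  # wait for the specified period before retrying
```

Leaving `max_attempts` as `None` reproduces the behavior described above, in which the call is repeated until the faulty memory page is isolated.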
With this method, only the memory pages that have a larger influence on system operation are isolated, so more memory pages remain available; at the same time, the memory pages with more serious faults are isolated, so system operation is not affected. It should be noted that the method in this specification needs to store a fault database. In a cloud computing scenario, however, the accuracy and coverage of memory fault identification matter more, so although the method requires a fault database that the related art does not, it solves the above problems encountered in the cloud computing field well and is worthwhile for that scenario.
After the two methods provided in this specification are explained, it is also necessary to explain execution subjects of the two methods. The method can be operated by a single machine, and can also be applied to a central device for managing a plurality of electronic devices.
Specifically, in a scenario with multiple electronic devices and a central device managing them, either or both of the two methods may be applied on a single electronic device, to identify the memory units with hard faults on that device and/or to take memory pages on that device offline. This makes hard fault identification and memory isolation faster and more efficient, because having a remote device determine the hard fault requires communication and is slower.
However, if an electronic device fails during operation, the fault database stored on it may be lost, which compromises the accuracy of hard fault identification. Thus, to improve the coverage of hard fault identification, either or both of the two methods may instead be performed by the central device.
It should be noted that, if the central device executes the two methods, the fault database stores the full-life-cycle fault records, at memory-unit granularity, of the multiple devices managed by the central device; the fault database update in step 201 is performed after the CEs reported by each device have been obtained; and isolating a memory page in step 305 means issuing a memory page isolation instruction to the corresponding electronic device, which then performs the isolation.
In other words, the present specification also provides a memory fault isolation system including a plurality of servers, and a central device for managing the plurality of servers. And the server reports the CE to the central equipment under the condition that the CE occurs. The central device is at least used for identifying the memory unit with the hard fault by the memory fault identification method, or determining the memory page needing to be isolated by the memory fault isolation method, and informing the server of isolating the memory page.
In addition, the method can be operated by the central equipment and the single machine together, so that the fault isolation effect is better.
The memory fault identification method and the memory fault isolation method provided in the present specification will be described in detail through an embodiment.
The method is applied to a central device for managing a plurality of servers, as shown in fig. 4A, each server has an error data analysis component and a fault isolation execution component, and the central device has a fault type calculation component, a fault database and a fault isolation scheduling component, and each component has the following functions.
The error data analysis component 401 is responsible for monitoring the error data reported on the server through in-band and (partly) out-of-band channels, parsing the original logs and register data of the memory fault, and sending the fault data to the central device in a unified structured format, so that the error pattern can be calculated.
In-band refers to CE data uploaded by an operating system, and out-of-band refers to CE data uploaded by a Baseboard Management Controller (BMC), and since a CE may be uploaded by an operating system or by a BMC, the CE data reported by the operating system and the BMC needs to be monitored.
The original log of a memory fault and the data in the registers are both CE data, but they contain different information, so the two need to be combined to output the fault data.
The raw log and the register data can be obtained through EDAC or mcelog. The fault data output in the unified format at least includes the location information of the memory unit corresponding to the CE, the range affected by the CE, and the like.
The failure type calculating component 411 receives the CE records uploaded by the error data analysis component, updates the failure database 413 on the central device, and determines, based on all the fault data of the server, whether a hard fault exists in the memory unit in which the CE occurred. Based on the fault record of each memory unit, it also determines the fault type of each memory bank, from which the fault distribution, the range affected by the fault and the like can be determined. After the fault type of each memory bank has been determined, it is also written to the fault database 413.
The fault isolation scheduling component 412 has three functions. First, it determines the fault type of each memory page according to the fault type of each memory unit contained in that memory page, and thereby determines the memory pages to be isolated.
Second, it is responsible for scheduling page offline, that is, issuing page offline instructions to the server. Because page offline in the kernel is a best-effort operation, there is no guarantee that a given page can always be isolated. The scheduler is therefore designed to support both event-triggered and periodic scheduling: the component triggers isolation according to the fault type of the memory page stored in the fault database, determines the reason for the failure when an isolation attempt fails, and schedules the isolation operation again at a suitable time. If the isolation succeeds, the isolation status of the memory page may also be recorded in the fault database 413. In addition, when the server restarts and comes back online, the memory pages that the server had isolated in the past can be isolated again.
Third, it maintains the fault database; for example, after a memory bank is repaired, the fault data corresponding to that memory bank needs to be cleared from the fault database.
The failure database 413 is configured to store the failure records of the full life cycle of each memory unit. A failure record may include the number of CE occurrences, the fault data output by the error data analysis component 401, the fault type of the memory unit, the number of memory units of each fault type contained in each memory page, the isolation status of the memory page, and the like.
Finally, the fault isolation execution component 402 is configured to: call the page offline interface of the kernel to isolate the memory page when a page offline instruction issued by the central device is received; report the reason for the failure to the central device when the memory isolation fails; and report the success to the central device when the isolation succeeds.
It should be further noted that different components in the same device may be executed in parallel or in series, and the components are separated in this specification to explain how each step is executed for convenience, and do not represent a limitation on the components in this specification.
Now that the methods of this specification have been explained through the functions of the components, the two methods are described in detail below through the processing flow.
The interaction between any server and the central device is shown in FIG. 4B:
step 431, the server parses the original fault data, outputs the formatted fault data and sends it to the central device.
Step 441, after receiving the reported failure data, the central device determines the failure type of the memory unit corresponding to the failure data, and updates the failure data, the CE number of the memory unit, and the failure type to a failure database.
Step 442, for each memory page, the central device determines whether the memory fault isolation criterion is met, and if so, adds the isolation task to the fault isolation scheduling queue.
Step 443, the central device sends a page offline instruction to the server according to the fault isolation scheduling queue.
In step 432, the server invokes a page offline interface of the kernel according to the instruction sent by the central device, so that the corresponding memory page is offline.
In step 433, the server feeds back to the central device whether the offline succeeded or failed and, in the case of failure, the reason for the failure.
Step 444, the central device updates the offline state of the corresponding memory page in the fault database if the offline succeeded, and adds the offline task to the fault isolation scheduling queue again if the offline failed.
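The queue handling on the central device in steps 442 to 444 can be sketched as follows. This is a simplified, single-process illustration: `IsolationScheduler`, `send_offline` and the state strings are hypothetical, and the exchange with the server (steps 443, 432 and 433) is collapsed into one callable that returns the result and the failure reason.

```python
from collections import deque


class IsolationScheduler:
    """Central-device side of the fault isolation scheduling queue."""

    def __init__(self, send_offline):
        # send_offline(server, page) -> (ok, reason); stands in for issuing
        # the page offline instruction and receiving the server's feedback.
        self.send_offline = send_offline
        self.queue = deque()
        self.page_state = {}  # (server, page) -> state kept in the fault database

    def enqueue(self, server, page):
        """Step 442: add an isolation task for a page that meets the criterion."""
        self.queue.append((server, page))

    def run_once(self):
        """Dispatch every task currently queued; failed tasks are re-queued."""
        for _ in range(len(self.queue)):
            server, page = self.queue.popleft()
            ok, reason = self.send_offline(server, page)
            if ok:
                self.page_state[(server, page)] = "offline"
            else:
                self.page_state[(server, page)] = f"failed: {reason}"
                self.queue.append((server, page))  # step 444: try again later
```

Calling `run_once` from both an event handler and a timer would give the event-triggered and periodic scheduling described for the fault isolation scheduling component.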
Corresponding to the embodiments of the foregoing method, the present specification further provides embodiments of a memory fault identification device, a memory fault isolation device, and an electronic device applied thereto.
As shown in fig. 5, fig. 5 is a block diagram of a memory failure recognition apparatus shown in this specification according to an exemplary embodiment, a failure database is configured in advance, where failure records of a full life cycle of all memory units are recorded in the failure database, and each memory unit failure record at least includes a number of times of occurrence of a CE in the full life cycle of the memory unit; the device comprises:
a failure database updating module 510, configured to update the failure database in case of a correctable error CE occurring in the memory unit.
A memory failure identifying module 520, configured to determine, for each memory unit, that a hard fault exists in the memory unit when the number of times that a CE occurs in the full life cycle of the memory unit in the failure database exceeds a threshold.
The failure database updating module 510 is specifically configured to: acquiring a log file corresponding to CE generated by a memory unit; analyzing the acquired log file, and outputting a fault record in a preset format; the fault record with the preset format at least comprises the address of the memory unit of the CE; and storing the fault records with the preset format into a fault database, and updating the times of CE occurrence of the corresponding memory unit in the fault database.
The apparatus may further include a memory failure type determining module 530 (not shown in the figure), configured to determine, for each memory, a failure type of the memory according to a failure record of a full life cycle of a memory unit included in the memory in a failure database and a physical location relationship of each memory unit; the failure type of the memory bank is used for representing the severity of the failure of the memory bank.
The memory bank fault type determining module 530 is specifically configured to: determining that the fault type of the memory bank is a repeated fault if the number of the memory units with CE in the memory bank is smaller than a first number threshold and at least 1 memory unit with CE has a hard fault; if a fault row or fault column exists in the memory bank, determining the fault type of the memory bank as a row or column fault; the number of the memory units of which the CEs occur in the fault row or the fault column is greater than or equal to a first number threshold, and hard faults do not exist in the fault row or the fault column; if a row or a column with serious faults exists in the memory bank and the number of the row or the column with serious faults is less than a second number threshold, determining that the fault type of the memory bank is a row or a column fault; the number of the memory units in the serious fault row or the serious fault column, in which the CE occurs, is greater than or equal to a first number threshold, and at least one memory unit in the serious fault row or the serious fault column has hard fault; determining that the fault type of the memory bank is a serious fault under the condition that the number of the memory cells of which the CE occurs in the memory bank is greater than a third number threshold and the total number of the rows and the columns of which the CE occurs is greater than a fourth number threshold; the third quantity threshold is greater than the first quantity threshold, and the fourth quantity threshold is greater than the second quantity threshold.
On the basis of the memory bank fault type determining module 530, the apparatus may further include a DIMM fault type determining module 540 (not shown in the figure), configured to: for each dual in-line memory module (DIMM), determine the fault type of the DIMM according to the fault types of the memory banks contained in the DIMM; the fault type of the DIMM is used to characterize the severity of the DIMM's fault.
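This passage does not state how the fault types of the memory banks are combined into a fault type for the module. One plausible reading, sketched below in Python purely as an assumption, is that the module takes the most severe fault type found among its banks. The `Severity` enumeration, its ordering (repeated fault, then row or column fault, then serious fault), and the function name `dimm_fault_type` are hypothetical.

```python
from enum import IntEnum
from typing import Iterable


class Severity(IntEnum):
    """Hypothetical severity ordering of bank fault types (higher is worse)."""
    NONE = 0
    REPEATED_FAULT = 1
    ROW_OR_COLUMN_FAULT = 2
    SERIOUS_FAULT = 3


def dimm_fault_type(bank_types: Iterable[Severity]) -> Severity:
    """Assumed aggregation rule: the module is as bad as its worst bank."""
    return max(bank_types, default=Severity.NONE)
```

Other aggregation rules, such as counting how many banks reach a given fault type, would fit the same interface.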
As shown in fig. 6, fig. 6 is a block diagram of a memory fault isolation apparatus according to an exemplary embodiment, where the memory fault isolation apparatus includes:
a failure result determining module 610, configured to determine a failure result of each memory unit according to the memory failure identification method described above; the failure result of each memory unit indicates whether the memory unit has a hard fault;
a faulty memory page determining module 620, configured to determine faulty memory pages according to the failure results of the memory units corresponding to each memory page, wherein the probability that a fault in a faulty memory page causes an uncorrectable error (UE) is greater than a probability threshold;
a memory page isolation module 630, configured to isolate the faulty memory pages.
The memory page isolation module 630 is specifically configured to call a preset memory page isolation interface to isolate the faulty memory page and, if isolation of the faulty memory page fails, to repeat the following step until isolation succeeds: each time a specified period elapses, call the preset memory page isolation interface again to isolate the faulty memory page.
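The periodic retry performed by the memory page isolation module can be sketched as follows. This is a minimal Python illustration under assumptions: `isolate_page` stands for the preset memory page isolation interface and is passed in as a callable that returns `True` on success, `period_s` is the specified period in seconds, and `max_attempts` is an optional safety limit that is not part of the specification (with the default `None`, the loop runs until isolation succeeds, as described).

```python
import time
from typing import Callable, Optional


def isolate_with_retry(page_addr: int,
                       isolate_page: Callable[[int], bool],
                       period_s: float,
                       sleep: Callable[[float], None] = time.sleep,
                       max_attempts: Optional[int] = None) -> int:
    """Call the isolation interface until it succeeds; return the attempt count.

    isolate_page is a hypothetical stand-in for the preset isolation interface.
    """
    attempts = 1
    while not isolate_page(page_addr):
        if max_attempts is not None and attempts >= max_attempts:
            raise RuntimeError(f"page {page_addr:#x} not isolated after {attempts} attempts")
        sleep(period_s)  # wait for the specified period before the next call
        attempts += 1
    return attempts
```

Passing `sleep` as a parameter keeps the waiting behavior replaceable, which makes the loop easy to exercise without real delays.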
The implementation process of the functions and actions of each module in the above device is detailed in the implementation process of the corresponding steps in the above method, and is not described herein again.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, wherein the modules described as separate parts may or may not be physically separate, and the parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution in the specification. One of ordinary skill in the art can understand and implement it without inventive effort.
As shown in fig. 7, fig. 7 is a hardware structure diagram of an electronic device in which a memory fault recognition apparatus or a memory fault isolation apparatus according to an embodiment is located, where the electronic device may include: a processor 1010, a memory 1020 for storing processor-executable instructions, an input/output interface 1030, a communication interface 1040, and a bus 1050. Wherein the processor 1010, memory 1020, input/output interface 1030, and communication interface 1040 are communicatively coupled to each other within the device via bus 1050.
The processor 1010 may be implemented by a general-purpose CPU, a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits, and is configured to run the executable instructions to implement the memory fault identification method or the memory fault isolation method.
The Memory 1020 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1020 may store an operating system and other application programs, and when the technical solution provided by the embodiments of the present specification is implemented by software or firmware, the relevant program codes are stored in the memory 1020 and called to be executed by the processor 1010.
The input/output interface 1030 is used for connecting an input/output module to input and output information. The input/output module may be configured as a component within the device (not shown in the figure) or may be external to the device to provide corresponding functionality. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output devices may include a display, a speaker, a vibrator, an indicator light, etc.
The communication interface 1040 is used for connecting a communication module (not shown in the drawings) to implement communication interaction between the present apparatus and other apparatuses. The communication module can realize communication in a wired mode (such as USB, network cable and the like) and also can realize communication in a wireless mode (such as mobile network, WIFI, Bluetooth and the like).
Bus 1050 includes a path that transfers information between various components of the device, such as processor 1010, memory 1020, input/output interface 1030, and communication interface 1040.
It should be noted that although the above-mentioned device only shows the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040 and the bus 1050, in a specific implementation, the device may also include other components necessary for normal operation. In addition, those skilled in the art will appreciate that the above-described apparatus may also include only those components necessary to implement the embodiments of the present description, and not necessarily all of the components shown in the figures.
Embodiments of the present specification further provide a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the method for identifying a memory fault or the method for isolating a memory fault is implemented.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.

Claims (12)

1. A memory failure identification method, characterized in that a failure database is configured in advance, full-life-cycle failure records of all memory units are recorded in the failure database, and the failure record of each memory unit at least comprises the number of times a correctable error (CE) has occurred in the full life cycle of the memory unit; the method comprises the following steps:
updating the failure database when a CE occurs in a memory unit;
and, for each memory unit, determining that the memory unit has a hard fault when the number of times a CE has occurred in the full life cycle of the memory unit, as recorded in the failure database, exceeds a count threshold.
2. The method of claim 1, wherein the updating the failure database when a CE occurs in a memory unit comprises:
acquiring a log file corresponding to the CE that occurred in the memory unit;
parsing the acquired log file, and outputting a failure record in a preset format; the failure record in the preset format at least comprises the address of the memory unit in which the CE occurred;
and storing the failure record in the preset format into the failure database, and updating the number of times a CE has occurred in the corresponding memory unit in the failure database.
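The flow of claims 1 and 2 (parse a CE log entry, store a record in a preset format, update the per-unit CE count, and compare the count with a threshold) can be illustrated with the following Python sketch. It is not the claimed implementation. The log line layout (`CE dimm=... bank=... row=... col=...`), the `FaultRecord` fields, the in-memory `FaultDatabase` class, and all method names are hypothetical and were chosen only for illustration; a real deployment would parse the platform's actual error log and use a persistent database.

```python
import re
from collections import Counter
from typing import Iterable, NamedTuple, Optional


class FaultRecord(NamedTuple):
    """Hypothetical preset record format: the address of the memory unit."""
    dimm: int
    bank: int
    row: int
    col: int


# Hypothetical log layout, e.g. "2022-04-02 CE dimm=0 bank=3 row=418 col=16".
_CE_LINE = re.compile(r"\bCE\b.*?dimm=(\d+)\s+bank=(\d+)\s+row=(\d+)\s+col=(\d+)")


def parse_ce_line(line: str) -> Optional[FaultRecord]:
    """Return the fault record for a CE log line, or None for other lines."""
    m = _CE_LINE.search(line)
    return FaultRecord(*map(int, m.groups())) if m else None


class FaultDatabase:
    """In-memory stand-in for the failure database (full-life-cycle CE counts)."""

    def __init__(self, count_threshold: int) -> None:
        self.count_threshold = count_threshold
        self.ce_counts: Counter = Counter()

    def ingest(self, lines: Iterable[str]) -> None:
        """Parse log lines and update the CE count of each affected unit."""
        for line in lines:
            record = parse_ce_line(line)
            if record is not None:
                self.ce_counts[record] += 1

    def has_hard_fault(self, unit: FaultRecord) -> bool:
        """A unit has a hard fault once its CE count exceeds the threshold."""
        return self.ce_counts[unit] > self.count_threshold
```

A caller would feed new log lines to `ingest` whenever CEs are reported and then query `has_hard_fault` for each memory unit of interest.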
3. The method of claim 1, further comprising:
for each memory bank, determining the fault type of the memory bank according to the full-life-cycle fault records, in the fault database, of the memory units contained in the memory bank and the physical location relationship of those memory units; the fault type of the memory bank is used to characterize the severity of the memory bank's fault.
4. The method according to claim 3, wherein the determining the fault type of the memory bank according to the fault record of the memory unit full life cycle included in the memory bank in the fault database and the physical location relationship of each memory unit includes:
determining that the fault type of the memory bank is a repeated fault if the number of the memory units with CE in the memory bank is smaller than a first number threshold and at least 1 memory unit with CE has a hard fault;
if a fault row or fault column exists in the memory bank, determining the fault type of the memory bank as a row or column fault; the number of the memory units of which the CEs occur in the fault row or the fault column is greater than or equal to a first number threshold, and hard faults do not exist in the fault row or the fault column;
if a row or a column with serious faults exists in the memory bank and the number of the row or the column with serious faults is less than a second number threshold, determining that the fault type of the memory bank is a row or a column fault; the number of the memory units in the serious fault row or the serious fault column, in which the CE occurs, is greater than or equal to a first number threshold, and at least one memory unit in the serious fault row or the serious fault column has hard fault;
determining that the fault type of the memory bank is a serious fault under the condition that the number of the memory cells of which the CE occurs in the memory bank is greater than a third number threshold and the total number of the rows and the columns of which the CE occurs is greater than a fourth number threshold; the third quantity threshold is greater than the first quantity threshold, and the fourth quantity threshold is greater than the second quantity threshold.
5. The method of claim 3, further comprising:
for each dual in-line memory module (DIMM), determining the fault type of the DIMM according to the fault types of the memory banks contained in the DIMM; the fault type of the DIMM is used to characterize the severity of the DIMM's fault.
6. A method of memory fault isolation, the method comprising:
determining a failure result for each memory unit according to the method of any of claims 1-5; the failure result of each memory unit indicates whether the memory unit has a hard fault;
determining faulty memory pages according to the failure results of the memory units corresponding to each memory page, wherein the probability that a fault in a faulty memory page causes an uncorrectable error (UE) is greater than a probability threshold;
isolating the failed memory page.
7. The method of claim 6, the isolating the failed memory page, comprising:
calling a preset isolation memory page interface to isolate the fault memory page;
under the condition that the isolation of the fault memory page fails, circularly executing the following steps until the isolation of the fault memory page is successful:
and calling a preset isolation memory page interface to isolate the fault memory page when the specified period is reached.
8. A memory failure recognition device is provided with a failure database in advance, wherein failure records of the full life cycle of all memory units are recorded in the failure database, and each memory unit failure record at least comprises the times of CE occurrence in the full life cycle of the memory unit; the device comprises:
the failure database updating module is used for updating the failure database under the condition that the memory unit generates a correctable error CE;
and the memory failure identification module is used for determining, for each memory unit, that the memory unit has a hard fault when the number of times a CE has occurred in the full life cycle of the memory unit, as recorded in the failure database, exceeds a count threshold.
9. A memory fault isolation apparatus, the apparatus comprising:
a failure result determination module for determining a failure result of each memory unit according to the method of any one of claims 1 to 5; the failure result of each memory unit indicates whether the memory unit has a hard fault;
a faulty memory page determining module, configured to determine faulty memory pages according to the failure results of the memory units corresponding to each memory page, wherein the probability that a fault in a faulty memory page causes an uncorrectable error (UE) is greater than a probability threshold;
and the memory page isolation module is used for isolating the fault memory page.
10. A memory fault isolation system comprises a plurality of servers and a central device for managing the servers;
the server reports the CE to the central equipment under the condition that the CE occurs;
the central device is at least configured to identify a memory unit with a hard fault according to the memory fault identification method in any one of claims 1 to 5, or determine a memory page that needs to be isolated according to the memory fault isolation method in claim 6 or 7, and notify the server of isolating the memory page.
11. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor implements the memory fault identification method according to any one of claims 1 to 5 or implements the memory fault isolation method according to claim 6 or 7 by executing the executable instructions.
12. A computer readable storage medium storing computer instructions which, when executed by a processor, implement the memory fault identification method of any one of claims 1-5 or implement the memory fault isolation method of claim 6 or 7.
CN202210351887.3A 2022-04-02 2022-04-02 Memory fault identification method and memory fault isolation method Pending CN114860487A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210351887.3A CN114860487A (en) 2022-04-02 2022-04-02 Memory fault identification method and memory fault isolation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210351887.3A CN114860487A (en) 2022-04-02 2022-04-02 Memory fault identification method and memory fault isolation method

Publications (1)

Publication Number Publication Date
CN114860487A true CN114860487A (en) 2022-08-05

Family

ID=82630592

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210351887.3A Pending CN114860487A (en) 2022-04-02 2022-04-02 Memory fault identification method and memory fault isolation method

Country Status (1)

Country Link
CN (1) CN114860487A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115686901A (en) * 2022-10-25 2023-02-03 超聚变数字技术有限公司 Memory fault analysis method and computer equipment
CN115686901B (en) * 2022-10-25 2023-08-04 超聚变数字技术有限公司 Memory fault analysis method and computer equipment
CN115629905A (en) * 2022-12-21 2023-01-20 苏州浪潮智能科技有限公司 Memory fault early warning method and device, electronic equipment and readable medium

Similar Documents

Publication Publication Date Title
US8108724B2 (en) Field replaceable unit failure determination
US8250543B2 (en) Software tracing
CN108536548B (en) Method and device for processing bad track of disk and computer storage medium
CN114860487A (en) Memory fault identification method and memory fault isolation method
US20090199056A1 (en) Memory diagnosis method
CN115629905B (en) Memory fault early warning method and device, electronic equipment and readable medium
WO2017079454A1 (en) Storage error type determination
JP2015106334A (en) Fault symptom detection method, information processing apparatus, and program
US10789148B2 (en) Electronic device and method for event logging
CN113625945A (en) Distributed storage slow disk processing method, system, terminal and storage medium
WO2024082844A1 (en) Fault detection apparatus and detection method for random access memory
CN110989938A (en) Fault disk identification method, device, equipment and computer readable storage medium
CN112579327B (en) Fault detection method, device and equipment
JP2017091077A (en) Pseudo-fault generation program, generation method, and generator
CN111221775B (en) Processor, cache processing method and electronic equipment
CN113590405A (en) Hard disk error detection method and device, storage medium and electronic device
CN105868038B (en) Memory error processing method and electronic equipment
CN113625957B (en) Method, device and equipment for detecting hard disk faults
CN115509786A (en) Method, device, equipment and medium for reporting fault
US10389660B2 (en) Identifying reports to address network issues
US20230025081A1 (en) Model training method, failure determining method, electronic device, and program product
CN114461436A (en) Memory fault processing method and device and computer readable storage medium
CN114003612A (en) Processing method and processing system for abnormal conditions of database
CN117407207B (en) Memory fault processing method and device, electronic equipment and storage medium
CN110544504A (en) test method, system and equipment for memory ADDDC function

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination