CN118467224A - Memory fault prediction method, device, equipment and readable storage medium - Google Patents

Info

Publication number
CN118467224A
Authority
CN
China
Prior art keywords
fault
information
record information
memory
physical address
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410599032.1A
Other languages
Chinese (zh)
Inventor
金澄锦
王越
陈昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
New H3C Technologies Co Ltd
Original Assignee
New H3C Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Landscapes

  • Debugging And Monitoring (AREA)

Abstract

The present specification provides a memory failure prediction method, apparatus, device, and readable storage medium. The method includes: acquiring physical location information of the minimum storage unit corresponding to a memory fault; establishing fault record information carrying the system physical address and the physical location information, and storing the fault record information in groups according to preset rules; if a group whose fault value exceeds a preset threshold exists, predicting a system physical address set of a target isolation area through a preset algorithm according to the fault record information of the group; and isolating the memory space of the target isolation area according to the predicted system physical address set. According to this technical scheme, fault records are grouped according to preset rules and the fault trend of each group is monitored; once the fault value of a group is detected to exceed the set threshold, the memory area to be isolated is predicted and determined by the preset algorithm, which improves the accuracy of fault prediction, the efficiency of memory fault handling, and the stability of the system.

Description

Memory fault prediction method, device, equipment and readable storage medium
Technical Field
The present disclosure relates to the field of communications technologies, and in particular, to a method, an apparatus, a device, and a readable storage medium for predicting a memory failure.
Background
With the rapid growth of cloud computing technology, servers are becoming increasingly important as the infrastructure supporting cloud services. Efficient and accurate server management is critical for maintenance personnel: it requires the baseboard management controller (BMC) to monitor and accurately identify hardware states in real time, and to respond to and process fault information quickly, so as to ensure the continuity and stability of services. Memory is a key component of the server, and its health directly affects the running efficiency of the operating system and the normal operation of services. Therefore, ensuring the reliability of memory, and being able to locate problems quickly and take isolation measures when they occur, has become an indispensable part of server reliability engineering.
In one technical scheme, in the field of memory management, the operating system isolates a potentially risky area through the Page Offline function to prevent a fault from spreading, and this function depends on precise control of physical memory addresses. Typically, the physical address (Physical Address) under the operating system is closely tied to the page frame (Page Frame) of memory, by which the operating system organizes and manages memory, each page having a default size of 4KB. When a specific memory area needs to be isolated, the operating system performs a page-level isolation operation according to the physical address. However, this presupposes that the actual physical location information of the memory (e.g., the Row, Column, and Bank of the DRAM) and the system address used by the operating system can be accurately translated into each other.
In one embodiment, for example, some Intel platform processors provide an out-of-band fault collection mechanism, and the BIOS enables bidirectional resolution between memory location information and system addresses. Although effective, this approach has several significant limitations: first, it relies on complex memory address translation algorithms that are constrained by the specific BIOS configuration (e.g., memory interleaving, NUMA settings, etc.), which increases implementation complexity; second, the conversion method is not universal and adapts poorly to different processor platforms, often requiring the CPU manufacturer to provide a dedicated conversion algorithm; and if the CPU platform does not support bidirectional conversion, the scheme fails entirely, which limits its wider application.
Disclosure of Invention
In view of the foregoing, the present disclosure provides a memory failure prediction method, apparatus, electronic device, and readable storage medium, so as to address at least one of the above technical problems.
The technical scheme is as follows:
The specification provides a memory failure prediction method, which is applied to computer equipment, and comprises the following steps: acquiring physical position information of a minimum storage unit Cell corresponding to the memory fault according to a system physical address corresponding to the memory fault; establishing fault record information carrying system physical address and physical position information, and storing the fault record information in groups according to preset rules; monitoring fault record information of each group, and if the group with the fault value exceeding a preset threshold exists, predicting a system physical address set of a target isolation area through a preset algorithm according to the fault record information of the group; and isolating the memory space of the target isolation region according to the predicted system physical address set of the target isolation region.
As a technical solution, the obtaining physical location information of the minimum storage unit Cell corresponding to the memory failure according to the system physical address corresponding to the memory failure includes: parsing, according to the system physical address PHYSICALADDRESS corresponding to the memory failure, the corresponding address information NormalizedAddress, interface information NormalizedSocketid, particle information NormalizedDieId and channel information NormalizedChannelid; parsing the corresponding Bank information and Row/Column information according to the address information; and summarizing the above information to obtain the physical location information of the minimum storage unit Cell corresponding to the memory failure.
As a technical solution, the establishing fault record information carrying system physical address and physical location information, and storing the fault record information in groups according to preset rules includes: locating, according to the interface information, channel information, particle information, Bank information and Row/Column information included in the physical location information, the Row-Column position on the memory hardware at which the memory failure occurred; establishing fault record information that includes a mapping relation between the Row-Column position and the system physical address; and storing the fault record information in groups by Row position or Column position.
As a technical solution, the monitoring fault record information of each group, and if a group whose fault value exceeds a preset threshold exists, predicting a system physical address set of a target isolation area through a preset algorithm according to the fault record information of the group, includes: monitoring the fault record information of each group; if a group whose fault value exceeds the preset threshold exists, predicting, from the system physical addresses mapped to the Row position or Column position in the fault record information of that group, all other system physical addresses mapped to the same Row position or Column position; and taking all the predicted system physical addresses, together with the system physical addresses recorded in the fault record information of the group, as the system physical address set of the target isolation area.
As a technical solution, the monitoring fault record information of each group, and if a group whose fault value exceeds a preset threshold exists, predicting a system physical address set of a target isolation area through a preset algorithm according to the fault record information of the group, includes: setting a fault value for each group according to the number of fault records included in the group; and if a group whose fault value exceeds the preset threshold exists, predicting the system physical address set of the target isolation area through a preset algorithm according to the fault record information of the group.
As a technical solution, the monitoring fault record information of each group, and if a group whose fault value exceeds a preset threshold exists, predicting a system physical address set of a target isolation area through a preset algorithm according to the fault record information of the group, includes: setting different fault values for fault records according to the fault type of the memory failure and/or the accumulated number of failures at the same physical location; accumulating the fault values of the fault records in the same group to obtain the fault value of the group; and if a group whose fault value exceeds the preset threshold exists, predicting the system physical address set of the target isolation area through a preset algorithm according to the fault record information of the group.
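As an illustration of the weighted fault value described above, the following is a minimal sketch and not the claimed implementation. The specification does not give concrete weights or a threshold, so `FAULT_TYPE_WEIGHT`, `THRESHOLD`, and the helper names `record_value` and `group_value` are assumptions made for this example. The sketch also assumes that a record's value is its type weight multiplied by the accumulated fault count of the same physical location, which is only one possible reading of the text.

```python
from collections import Counter

# Hypothetical weights: an uncorrectable error (UCE) counts more than a correctable one (CE).
FAULT_TYPE_WEIGHT = {"CE": 1, "UCE": 5}

def record_value(fault_type, repeat_count):
    """Fault value of one record: type weight scaled by how often the same cell has failed."""
    return FAULT_TYPE_WEIGHT[fault_type] * repeat_count

def group_value(faults):
    """Sum record values for one group; `faults` is a list of (fault_type, cell) tuples."""
    repeats = Counter()
    total = 0
    for fault_type, cell in faults:
        repeats[cell] += 1  # accumulated fault count for this physical location
        total += record_value(fault_type, repeats[cell])
    return total

# Three records in one Row group; cells are (row, column) pairs chosen for illustration.
row_group = [("CE", (5, 7)), ("CE", (5, 7)), ("UCE", (5, 9))]
THRESHOLD = 6  # assumed preset threshold
print(group_value(row_group), group_value(row_group) > THRESHOLD)
```

With these assumed weights, a repeated correctable error at one cell and a single uncorrectable error elsewhere in the same group are enough to push the group over the threshold.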
The present disclosure also provides a memory failure prediction apparatus, applied to a computer device, where the apparatus includes: the first module is used for acquiring physical position information of a minimum storage unit Cell corresponding to the memory fault according to the system physical address corresponding to the memory fault; the second module is used for establishing fault record information carrying the system physical address and the physical position information and storing the fault record information in groups according to preset rules; a third module, configured to monitor fault record information of each packet, and if a packet whose fault value exceeds a preset threshold exists, predict a system physical address set of the target isolation area according to the fault record information of the packet by a preset algorithm; and a fourth module, configured to isolate the memory space of the target isolation area according to the predicted system physical address set of the target isolation area.
As a technical solution, the obtaining physical location information of the minimum storage unit Cell corresponding to the memory failure according to the system physical address corresponding to the memory failure includes: parsing, according to the system physical address PHYSICALADDRESS corresponding to the memory failure, the corresponding address information NormalizedAddress, interface information NormalizedSocketid, particle information NormalizedDieId and channel information NormalizedChannelid; parsing the corresponding Bank information and Row/Column information according to the address information; and summarizing the above information to obtain the physical location information of the minimum storage unit Cell corresponding to the memory failure.
As a technical solution, the establishing fault record information carrying system physical address and physical location information, and storing the fault record information in groups according to preset rules includes: locating, according to the interface information, channel information, particle information, Bank information and Row/Column information included in the physical location information, the Row-Column position on the memory hardware at which the memory failure occurred; establishing fault record information that includes a mapping relation between the Row-Column position and the system physical address; and storing the fault record information in groups by Row position or Column position.
As a technical solution, the monitoring fault record information of each group, and if a group whose fault value exceeds a preset threshold exists, predicting a system physical address set of a target isolation area through a preset algorithm according to the fault record information of the group, includes: monitoring the fault record information of each group; if a group whose fault value exceeds the preset threshold exists, predicting, from the system physical addresses mapped to the Row position or Column position in the fault record information of that group, all other system physical addresses mapped to the same Row position or Column position; and taking all the predicted system physical addresses, together with the system physical addresses recorded in the fault record information of the group, as the system physical address set of the target isolation area.
As a technical solution, the monitoring fault record information of each group, and if a group whose fault value exceeds a preset threshold exists, predicting a system physical address set of a target isolation area through a preset algorithm according to the fault record information of the group, includes: setting a fault value for each group according to the number of fault records included in the group; and if a group whose fault value exceeds the preset threshold exists, predicting the system physical address set of the target isolation area through a preset algorithm according to the fault record information of the group.
As a technical solution, the monitoring fault record information of each group, and if a group whose fault value exceeds a preset threshold exists, predicting a system physical address set of a target isolation area through a preset algorithm according to the fault record information of the group, includes: setting different fault values for fault records according to the fault type of the memory failure and/or the accumulated number of failures at the same physical location; accumulating the fault values of the fault records in the same group to obtain the fault value of the group; and if a group whose fault value exceeds the preset threshold exists, predicting the system physical address set of the target isolation area through a preset algorithm according to the fault record information of the group.
The present specification also provides an electronic device comprising a processor and a readable storage medium storing machine executable instructions executable by the processor, the processor executing the machine executable instructions to implement the foregoing memory failure prediction method.
The present specification also provides a readable storage medium storing machine-executable instructions that, when invoked and executed by a processor, cause the processor to implement the memory failure prediction method described above.
The technical scheme provided by the specification at least brings the following beneficial effects:
The fault records are intelligently grouped according to preset rules, fault trends of all groups are monitored, and once a certain group of faults are detected to exceed a set threshold value, the memory area needing to be isolated is predicted and determined by utilizing an advanced algorithm, so that the accuracy of fault prediction is improved, and the efficiency of memory fault processing and the stability of a system are improved.
Drawings
In order to illustrate the embodiments of the present specification or the technical solutions in the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present specification, and those skilled in the art may obtain other drawings from them.
FIG. 1 is a flow chart of a memory failure prediction method in one embodiment of the present disclosure;
FIG. 2 is a schematic diagram of data transfer in one embodiment of the present description;
FIG. 3 is a schematic diagram of address resolution in one embodiment of the present description;
FIG. 4 is a schematic diagram of a memory structure according to an embodiment of the present disclosure;
FIG. 5 is a block diagram of a memory failure prediction apparatus according to an embodiment of the present specification;
FIG. 6 is a hardware configuration diagram of an electronic device in one embodiment of the present specification.
Reference numerals: a first module 21, a second module 22, a third module 23, and a fourth module 24.
Detailed Description
The terminology used in the embodiments of the description presented herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the description presented herein. As used in this specification and the claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to any or all possible combinations including one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in the present specification to describe various information, the information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, the first information may also be referred to as second information, and similarly, the second information may also be referred to as first information, without departing from the scope of the present description. Furthermore, depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to a determination".
In a technical scheme, a memory is used as a key component of a server, the monitoring of health conditions and fault prediction of the memory are of importance to the operation and maintenance of the server, and the monitoring of hardware information of the server and the identification of fault states can be realized through cooperation of a Baseboard Management Controller (BMC) and a basic input/output system (BIOS).
However, on some processor platforms, such as Intel(R) platforms, although the conversion between memory location information and system addresses can be implemented by a bidirectional resolution algorithm provided by the BIOS, this algorithm is highly complex and depends on BIOS option configurations such as memory interleaving and NUMA, which increases implementation difficulty.
Moreover, existing memory address conversion algorithms port poorly across processor platforms and usually depend on the CPU manufacturer to provide complete conversion algorithm support. If the manufacturer does not provide such support, the prior art will not work.
In addition, in a system that does not support bidirectional resolution, the conversion between memory location information and system addresses becomes a difficult problem, resulting in a separation of failure prediction and repair processes, and thus, an effective cooperation is not possible.
Aiming at the limitation of the prior art, the specification aims to provide a new memory fault prediction and matching method, which can provide system address information required by an Operating System (OS) for memory repair under the condition of no explicit conversion algorithm. By the method, even on a platform which lacks support of a bidirectional conversion algorithm, effective prediction and intelligent repair of memory faults can be realized, so that the operation and maintenance efficiency and the system stability of the server are improved.
The specific technical scheme is as follows.
In one embodiment, the present disclosure provides a memory failure prediction method applied to a computer device, where the method includes: acquiring physical position information of a minimum storage unit Cell corresponding to the memory fault according to a system physical address corresponding to the memory fault; establishing fault record information carrying system physical address and physical position information, and storing the fault record information in groups according to preset rules; monitoring fault record information of each group, and if the group with the fault value exceeding a preset threshold exists, predicting a system physical address set of a target isolation area through a preset algorithm according to the fault record information of the group; and isolating the memory space of the target isolation region according to the predicted system physical address set of the target isolation region.
Specifically, as shown in fig. 1, the method comprises the following steps:
Step S11, according to the system physical address corresponding to the memory fault, the physical position information of the minimum storage unit Cell corresponding to the memory fault is obtained.
When the memory subsystem of the server detects a correctable error (CE, Correctable Error) or an uncorrectable error (UCE, Uncorrectable Error), fault information is first collected by the Baseboard Management Controller (BMC). The information includes the system physical address (Physical Address) at the time of the failure. Based on this address, the physical location information of the Cell, the smallest storage unit in the memory, can be traced back by using the one-way parsing capability provided by the BIOS, i.e., determining in which Bank, which Row and which Column the fault occurred.
The core of this step is to determine the exact location at which the memory failure occurred. In the server, the DRAM memory is composed of a plurality of memory cells, each Cell corresponding to a specific physical location. When the memory fails, the system physical address corresponding to the failure is identified first. Through cooperation of the BMC and the BIOS, the system physical address can then be converted into the physical location information of the memory, which includes the Row, Column and Bank.
For example, assume a memory failure with a physical address of 0x0012345678 is detected. Through BIOS analysis, the memory physical location corresponding to the address may be the 3rd Bank, the 5th Row and the 7th Column.
Step S12, establishing fault record information carrying system physical address and physical position information, and storing the fault record information in groups according to a preset rule.
For convenience of management and analysis, the BMC integrates the fault information and creates a fault record information base containing the system physical address and the corresponding physical location of the memory Cell. The records are grouped according to certain rules, for example by the Bank or Row in which the fault is located, or by fault frequency. Such a grouping strategy helps to identify failure modes, such as whether a particular Bank or Row fails frequently, thereby providing a basis for subsequent failure prediction.
Once the physical location of the fault is determined, a fault record is created that contains the system physical address and the physical location information. The records are grouped according to preset rules, such as by Bank, Row, or fault type. The purpose of the grouping is to better analyze the failure mode and predict the likely failure trend. For example, all fault records from the same Bank may be placed in one group, or records of similar fault types may be grouped together. In this way, the behavior of faults within a particular group can be analyzed together, so that future faults can be predicted more accurately.
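The record creation and grouping of step S12 can be sketched as follows. This is a simplified illustration under stated assumptions: the `FaultRecord` fields, the `GROUPING_RULES` table, and the `group_records` helper are hypothetical names, and every address other than 0x0012345678 is made up for the example.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class FaultRecord:
    physical_address: int  # system physical address reported with the fault
    bank: int
    row: int
    column: int
    fault_type: str        # e.g. "CE" or "UCE"

# Hypothetical grouping rules: each maps a record to the key of its group.
GROUPING_RULES = {
    "bank": lambda r: r.bank,
    "row": lambda r: (r.bank, r.row),
    "fault_type": lambda r: r.fault_type,
}

def group_records(records, rule):
    """Store records in groups according to the selected preset rule."""
    groups = defaultdict(list)
    key_of = GROUPING_RULES[rule]
    for record in records:
        groups[key_of(record)].append(record)
    return dict(groups)

records = [
    FaultRecord(0x0012345678, bank=3, row=5, column=7, fault_type="CE"),
    FaultRecord(0x0012345A78, bank=3, row=5, column=9, fault_type="CE"),   # illustrative address
    FaultRecord(0x0012400000, bank=1, row=2, column=0, fault_type="UCE"),  # illustrative address
]
by_row = group_records(records, "row")
print({key: len(group) for key, group in by_row.items()})
```

Keeping the rule as a key function means the same records can be regrouped by Bank, Row, or fault type without changing how they are stored.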
Step S13, monitoring fault record information of each group, and if a group whose fault value exceeds a preset threshold exists, predicting a system physical address set of the target isolation area through a preset algorithm according to the fault record information of the group.
The fault record information of the groups is continuously monitored, and the data is analyzed through a preset algorithm. If the failure rate of a certain group is found to exceed a preset safety threshold, this indicates that the memory area represented by the group may have a latent hardware defect or be aging. In that case, the algorithm performs spatio-temporal feature analysis on the physical location information of the high-failure-rate region and predicts one or more target isolation regions that are highly likely to keep producing errors in the future. The algorithm outputs a set of system physical addresses representing the memory pages that require special attention.
After the fault records are stored in groups, the fault record information of each group is continuously monitored. If the number of failures in a certain group exceeds a preset threshold, the group is considered to have a potential systematic problem. The preset algorithm is then used to predict, based on the current fault record information, the memory area that may need to be isolated. For example, if a sudden increase in the number of fault records for a certain Row is detected, the whole Row or part of it may be predicted for isolation to prevent the fault from spreading.
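The count-based monitoring described above can be sketched as follows. This is a minimal illustration only: the function name `groups_over_threshold`, the use of `(bank, row)` tuples as group keys, and the threshold value are assumptions, since the specification does not fix them.

```python
from collections import Counter

def groups_over_threshold(fault_rows, threshold):
    """Return the group keys whose fault-record count exceeds the threshold."""
    counts = Counter(fault_rows)
    return sorted(key for key, count in counts.items() if count > threshold)

# Each entry is the (bank, row) group key of one fault record.
observed = [(3, 5), (3, 5), (1, 2), (3, 5), (3, 5)]
print(groups_over_threshold(observed, threshold=3))
```

Only groups returned here would be handed to the prediction step; groups at or below the threshold continue to be monitored.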
Step S14, according to the predicted system physical address set of the target isolation region, isolating the memory space of the target isolation region.
According to the predicted system physical address set of the target isolation area, the operating system executes the isolation operation on the memory pages. This means that the system marks the memory pages corresponding to these addresses as unavailable or moves them out of the active memory pool, so that the operating system no longer allocates data in this area, thereby reducing the occurrence of errors. In this way, effective isolation of the at-risk memory area can be achieved even without an explicit bidirectional conversion algorithm between memory location and system address, and the stability and reliability of the system are improved.
Once the target isolation region is predicted, an isolation operation of the memory space is performed. This typically involves the Page Offline function of the operating system that can safely remove the predicted failed memory region from the system to prevent the failure from affecting other parts of the system.
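Because isolation works at page granularity, the predicted address set first has to be reduced to the pages that contain it. The sketch below assumes the 4 KB default page size mentioned in the background; the helper name `pages_to_isolate` and the sample addresses are hypothetical, and the actual offlining call is operating-system specific and is deliberately not shown.

```python
PAGE_SIZE = 4096  # 4 KB pages, the default size stated in the background section

def pages_to_isolate(addresses):
    """Map system physical addresses to the sorted, de-duplicated page base addresses."""
    return sorted({addr - (addr % PAGE_SIZE) for addr in addresses})

# Two of these illustrative addresses fall in the same page and collapse to one entry.
predicted = [0x12345010, 0x12345FF0, 0x12346010]
for base in pages_to_isolate(predicted):
    print(hex(base))
```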
For example, in one Bank, row1 frequently reports CE errors, particularly in the column10 to column100 range. Through the above flow, the BMC collects the information, stores it in groups, and finds that the error rate of row1 is abnormally high. Based on the known PHYSICALADDRESS of column10 in row1 (e.g., 0x00012345010) and the known PHYSICALADDRESS of column110 in row1 (0x00012346010), the algorithm predicts that other columns in row1 that have not directly reported errors (e.g., column210) may also be at risk, and adds the corresponding PHYSICALADDRESS (e.g., 0x00012347010) to the isolation list. The OS then marks the corresponding memory pages as unavailable according to the list, preventing data from being written into the potentially dangerous memory areas again and thus avoiding errors in advance.
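The numbers in this example are consistent with a simple linear relation between column index and address within one row. The sketch below reproduces them under that assumption. The specification does not state which preset algorithm is used, so linear extrapolation and the function name `predict_address` are illustrative choices only, and on platforms with memory interleaving the relation may not be linear.

```python
def predict_address(known_a, known_b, target_column):
    """Linearly extrapolate the address of target_column from two (column, address) pairs."""
    (col_a, addr_a), (col_b, addr_b) = known_a, known_b
    # Multiply before dividing so the integer arithmetic stays exact for this example.
    return addr_a + (target_column - col_a) * (addr_b - addr_a) // (col_b - col_a)

known_10 = (10, 0x00012345010)    # column10 of row1, from the example
known_110 = (110, 0x00012346010)  # column110 of row1, from the example
predicted = predict_address(known_10, known_110, 210)
print(hex(predicted))
```

Note that only observed address and location pairs are used; no reverse conversion from Row and Column back to an address is required.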
In one embodiment, starting from a memory fault report, the system physical address is captured by a hardware monitoring mechanism. The system-level physical address is then mapped, in combination with the physical structure information of the memory (such as the Bank, Row and Column of the DIMM module), to the physical location of the smallest addressable unit inside the memory, the memory cell (Cell). This conversion depends on knowledge of the memory hardware layout, ensures the accuracy of fault location, and lays the foundation for the subsequent steps.
Based on the obtained location information, a detailed fault record database is established, in which each record contains both the system physical address and the physical location information of the corresponding memory cell. The records are organized into predetermined logical groups, for example by Column, by Row, or by the time sequence in which the faults occurred. Such grouping facilitates statistical analysis of failure modes and provides more intuitive data support for prediction.
The fault records of each group are continuously monitored, and the fault frequency and distribution trend are analyzed by a built-in algorithm model. Once the failure rate of a group is found to exceed the preset safety threshold, the algorithm is started immediately and predicts the memory area that may be at risk based on the failure mode and historical data of the group. The algorithm predicts the system physical address set of the target isolation region by means such as time-series analysis and cluster recognition. This step is the key to the intelligence of the scheme, and gives a device applying this embodiment the ability to anticipate potential faults.
The isolation of memory pages is then executed based on the predicted system physical address set of the target isolation region. Through the operating system interface, the memory pages corresponding to these addresses are marked as unavailable or moved directly out of the active memory pool, so that the operating system no longer allocates data to the potentially faulty areas and the risk of data loss or system crash is avoided. This isolation process is automatic and accurate, and significantly improves the robustness and fault recovery capability of the system.
This embodiment thus forms a complete memory health management flow, from accurate fault location through intelligent prediction to preventive isolation. It not only raises the automation level of server operation and maintenance, but also achieves effective management of memory faults without the support of a direct bidirectional conversion algorithm between memory location and system address, providing a strong guarantee for stable server operation.
In one embodiment, the obtaining, according to the system physical address corresponding to the memory failure, physical location information of the minimum storage unit Cell corresponding to the memory failure includes: parsing, according to the system physical address PHYSICALADDRESS corresponding to the memory failure, the corresponding address information NormalizedAddress, interface information NormalizedSocketid, particle information NormalizedDieId and channel information NormalizedChannelid; parsing the corresponding Bank information and Row/Column information according to the address information; and summarizing the above information to obtain the physical location information of the minimum storage unit Cell corresponding to the memory failure.
A system physical address of the failed memory is obtained from the operating system (PHYSICALADDRESS). This address is a unique identification that the operating system uses to access memory. Using the functions of the BIOS, PHYSICALADDRESS is converted into a series of standardized information, including:
Address information (NormalizedAddress): this is a standardized representation of the original physical address;
interface information (NormalizedSocketid): identifying a CPU slot or interface in which the memory is located;
particle information (NormalizedDieId): refers to a specific chip or particle inside the memory module;
channel information (NormalizedChannelid): representing the channel for memory data transfer.
Further resolving a specific physical location according to the standardized address information, including:
library information (Bank): a memory block in the memory module;
row information (Row) and Column information (Column) are used to determine in which particular Row or Column of the memory matrix a fault occurred.
This resolution is generally one-way: Bank, Row and Column can be obtained by resolving the address information, but the address information, and hence the exact system physical address (PHYSICALADDRESS), cannot be recovered by resolving Bank, Row and Column in reverse.
And summarizing all the analyzed information to form a complete fault location description, so as to obtain the accurate physical location information of the fault Cell, and provide accurate references for subsequent fault processing and memory repair.
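The summarizing step can be sketched as a small record type. This is an illustration under stated assumptions, not the BIOS interface itself: the CellLocation type, the summarize helper, and the dictionary layout are hypothetical, and the input values are assumed to be supplied by the one-way resolution described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellLocation:
    """Physical location of a faulty Cell (hypothetical record layout)."""
    socket_id: int   # from NormalizedSocketid
    channel_id: int  # from NormalizedChannelid
    die_id: int      # from NormalizedDieId
    bank: int
    row: int
    column: int


def summarize(normalized: dict, decoded: dict) -> CellLocation:
    """Merge the two one-way resolution stages into one location record.

    `normalized` holds the fields resolved from the system physical
    address; `decoded` holds the Bank/Row/Column resolved from the
    NormalizedAddress. Both are assumed inputs provided by firmware.
    """
    return CellLocation(
        socket_id=normalized["NormalizedSocketid"],
        channel_id=normalized["NormalizedChannelid"],
        die_id=normalized["NormalizedDieId"],
        bank=decoded["Bank"],
        row=decoded["Row"],
        column=decoded["Column"],
    )


loc = summarize(
    {"NormalizedSocketid": 0, "NormalizedChannelid": 1, "NormalizedDieId": 0},
    {"Bank": 3, "Row": 1, "Column": 10},
)
```

Making the record immutable and hashable lets it be used directly as a dictionary key when fault records are later grouped by location.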
In one embodiment, the establishing fault record information carrying system physical address and physical location information, and storing the fault record information in groups according to a preset rule, includes: and positioning to a Row-Column position of a physical position where a memory fault occurs on memory hardware according to interface information, channel information, particle information, bank information and Row-Column information included in the physical position information, establishing fault record information comprising a mapping relation between the Row-Column position and a system physical address, and storing each fault record information according to a Row position or a Column position in a grouping way.
A systematic fault record information base is established for storing and managing the fault information.
For each detected memory failure, a fault log is created. The fault record information not only comprises the system physical address corresponding to the fault, but also comprises physical position information of the fault, such as specific information of interfaces, channels, particles, banks and rows and columns.
And establishing a mapping relation in the fault record information, and associating the row and column information of the physical position with the system physical address. Based on the mapping relation generated according to the memory fault, the corresponding system physical address can be traced back through the corresponding physical position information, and vice versa.
And grouping the fault records according to a certain rule. This grouping may be based on physical location information, such as the location of a Row (Row) or Column (Column). For example, all faults occurring in the same row or column may be stored in the same group.
The fault record information stored in groups can be indexed according to the physical positions of rows or columns, so that an operation and maintenance person or an automatic system can quickly access all fault records of a specific row or column, and fault analysis and trend prediction are facilitated.
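Grouping and indexing by row can be sketched as follows. This is a minimal example, assuming each fault record is a dictionary; the field names, the sample values, and the helpers row_key and group_by_row are hypothetical. Grouping by column would follow the same pattern with a different key.

```python
from collections import defaultdict

# Each record pairs a system physical address ("spa") with its physical
# location. Field names and values are illustrative only.
records = [
    {"spa": 0x00012345010, "socket": 0, "channel": 1, "die": 0,
     "bank": 3, "row": 1, "column": 10},
    {"spa": 0x00012346010, "socket": 0, "channel": 1, "die": 0,
     "bank": 3, "row": 1, "column": 110},
    {"spa": 0x00020000040, "socket": 0, "channel": 1, "die": 0,
     "bank": 3, "row": 7, "column": 10},
]


def row_key(rec):
    """Identify one physical Row: everything except the Column."""
    return (rec["socket"], rec["channel"], rec["die"], rec["bank"], rec["row"])


def group_by_row(recs):
    """Store fault records in groups indexed by their Row position."""
    groups = defaultdict(list)
    for rec in recs:
        groups[row_key(rec)].append(rec)
    return dict(groups)


groups = group_by_row(records)
# All recorded system physical addresses of row 1 in bank 3:
row1_addresses = [rec["spa"] for rec in groups[(0, 1, 0, 3, 1)]]
```

Because each record keeps both the address and the location, the group also serves as the mapping from a Row position back to the system physical addresses observed on it.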
In one embodiment, the monitoring the fault record information of each packet, if there is a packet with a fault value exceeding a preset threshold, predicting a system physical address set of the target isolation area according to the fault record information of the packet by a preset algorithm, including: and if the fault record information of each group exceeds the preset threshold value, predicting all other system physical addresses mapped to the same Row position or Column position according to the system physical address mapped by the Row position or Column position according to the fault record information of each group, and taking all the predicted system physical addresses and the system physical address recorded in the fault record information of the group as a system physical address set of a target isolation area.
Memory faults are handled by dynamic monitoring and prediction: all fault records stored in groups by physical location (such as row or column) are continuously monitored. These records include the system physical addresses and the corresponding physical location information. To determine which groups may have problems, a fault threshold is set for each group. The threshold is a preset upper limit on the number of faults; a group that exceeds it is considered to have a potentially serious problem. The number of fault records in each group is checked periodically, and if the number of faults in a group exceeds the threshold, the group is marked for further fault prediction. For a group that exceeds the threshold, the system physical addresses of all fault records in the group are looked up; all of them are associated with the physical location of the same row or column. From these associated system physical addresses, a preset algorithm predicts other system physical addresses on the same row or column that have not yet been recorded but are at risk of failure. The preset algorithm may use a trained model. The predicted addresses, together with the system physical addresses of the fault records already in the group, form the system physical address set of the target isolation area. After the target isolation area is determined, an isolation operation is performed on the memory areas associated with the predicted and recorded fault addresses, to prevent the fault from spreading and to keep the system running stably.
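The periodic check and the construction of the target set can be sketched as below. This is only an illustration: the function names, the group keys, the sample addresses, and the threshold value are hypothetical, and the predictor is a placeholder that returns nothing, standing in for whatever preset algorithm or trained model is actually used.

```python
def groups_over_threshold(groups, threshold):
    """Return the keys of groups whose fault-record count exceeds threshold."""
    return [key for key, recorded in groups.items() if len(recorded) > threshold]


def target_isolation_set(recorded, predict):
    """Union of the recorded addresses and the addresses `predict` returns."""
    return set(recorded) | set(predict(recorded))


def no_prediction(recorded):
    """Placeholder predictor: a real one would infer further addresses."""
    return []


# Recorded system physical addresses per group (illustrative values).
groups = {
    ("bank3", "row1"): [0x1000, 0x2000, 0x3000],
    ("bank3", "row7"): [0x9000],
}

flagged = groups_over_threshold(groups, threshold=2)
targets = target_isolation_set(groups[flagged[0]], no_prediction)
```

Passing the predictor in as a parameter keeps the threshold check independent of how the prediction is made.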
In one embodiment, the monitoring the fault record information of each packet, if there is a packet with a fault value exceeding a preset threshold, predicting a system physical address set of the target isolation area according to the fault record information of the packet by a preset algorithm, including: setting fault values for each group according to the number of fault record information included in the group, and predicting a system physical address set of a target isolation area through a preset algorithm according to the fault record information of the group if the group with the fault value exceeding a preset threshold exists.
In one embodiment, the monitoring the fault record information of each packet, if there is a packet with a fault value exceeding a preset threshold, predicting a system physical address set of the target isolation area according to the fault record information of the packet by a preset algorithm, including: setting different fault values for the fault record information of different memory faults of different fault types and/or different accumulated fault times of the same physical position information, accumulating the fault values of each fault record information of the same packet to obtain the fault value of the packet, and if the packet with the fault value exceeding the preset threshold exists, predicting the system physical address set of the target isolation area through a preset algorithm according to the fault record information of the packet.
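A weighted fault value of this kind can be sketched as follows. The weights (1 for CE, 10 for UCE) and the escalation rule (the n-th fault at the same physical location counts n times its base weight) are assumptions chosen purely for illustration; the embodiment does not fix particular values.

```python
from collections import Counter

# Assumed base weights per fault type (not specified by the embodiment).
TYPE_WEIGHT = {"CE": 1, "UCE": 10}


def group_fault_value(records):
    """Accumulate the fault value of one group.

    `records` is a list of (fault_type, location) tuples in arrival order.
    The n-th fault seen at the same location is weighted n times the base
    weight of its type (an assumed escalation rule).
    """
    seen = Counter()
    total = 0
    for fault_type, location in records:
        seen[location] += 1
        total += TYPE_WEIGHT[fault_type] * seen[location]
    return total


records = [
    ("CE", (1, 10)),    # first CE at row 1, column 10  -> 1
    ("CE", (1, 10)),    # second CE at the same cell    -> 2
    ("CE", (1, 110)),   # first CE at another cell      -> 1
    ("UCE", (1, 210)),  # uncorrectable error           -> 10
]
value = group_fault_value(records)
exceeds = value > 12  # 12 is an arbitrary example threshold
```

With all weights set to 1 and no escalation, the same function reduces to the simpler count-based fault value.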
In one embodiment, relying on the capabilities of the BMC and the BIOS, data is integrated and optimized: information is consolidated from the collected historical data, the data set to be targeted for isolation is separated out, and intelligent memory repair is performed.
As shown in fig. 2, the BIOS reports the collected information to the BMC item by item. The BMC classifies the data according to the memory configuration characteristics, groups memory faults in the same Row of the same Bank by the block range into which they fall, screens the location information falling into each block together with the matching system addresses, and selects page frames that meet expectations and are not duplicated for marking. The marked addresses are the target isolation addresses.
Memory physical location information: besides being installed under a particular CPU, channel and slot on the server, a DRAM module itself has parameters such as Rank, Bank, Column and Row, and the specific hardware position can be determined from this location information. The system physical address of each reported fault (e.g., CE, UCE) can be associated with physical location information.
System physical address under the operating system (physical_address can be seen in the dmesg output of the OS): when the Linux kernel performs memory page isolation, it isolates the memory page frame obtained from the physical address. Under the operating system, memory is divided into pages by page frame, and the number of addresses in one page is determined by the page size (Pagesize) of the operating system. Commonly, with the default of 4 KB per page, a page frame covers 4 × 1024 = 4096 addresses. The relationship between a physical address and its page frame is, for example: for the address 0x0012345010, the page frame number is 0x0012345 and the offset is 0x010; the offset ranges over 0x000-0xFFF, a total of 4096 (2^12) addresses.
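The page-frame arithmetic in this example can be checked with a short sketch. It assumes the 4 KB page size used in the example; the helper names split_address and page_range are hypothetical.

```python
PAGE_SIZE = 4096                         # 4 KB per page, as in the example
PAGE_SHIFT = PAGE_SIZE.bit_length() - 1  # 12, since 4096 == 2**12


def split_address(addr):
    """Split a physical address into (page frame number, offset)."""
    return addr >> PAGE_SHIFT, addr & (PAGE_SIZE - 1)


def page_range(pfn):
    """First and last physical address covered by a page frame."""
    base = pfn << PAGE_SHIFT
    return base, base + PAGE_SIZE - 1


pfn, offset = split_address(0x0012345010)
first, last = page_range(pfn)
print(hex(pfn), hex(offset), hex(first), hex(last))
```

Every address between `first` and `last` belongs to the same page frame, which is why isolating one page removes all 4096 addresses at once.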
Fault prediction depends on collecting fault information that carries physical locations. The distribution characteristics of neighboring row and column information are predicted jointly in the time and space dimensions, and after prediction the target address is passed to the OS for isolation and repair. However, as shown in FIG. 3, the physical location information cannot be converted directly into the system physical address used by the OS.
To address this problem, the BIOS's ability to resolve and pass on characteristics such as Column and Row in one direction is used, so that the target physical address of a prediction result changes from an unknown that cannot be resolved in reverse to a known with a bounded range, which is sufficient for the function. The BMC maps the Column and Row information recorded when a CE occurs to the PHYSICALADDRESS, and considers a memory Row to be at risk once faults from the same Row have accumulated to a certain degree. The PHYSICALADDRESS values recorded in this risk area are extracted, the PHYSICALADDRESS values of the complete Row are inferred, and the OS is notified to isolate them. Note that one Row necessarily contains multiple PHYSICALADDRESS values, possibly located in different page frames. In the existing scheme (Intel), an exact address can be obtained by bidirectional resolution; here, inference is used to find the next most probable valid address for isolation.
For example: in row1, PHYSICALADDRESS1 = 0x00012345010 is located at column10 and PHYSICALADDRESS2 = 0x00012346010 is located at column110; it can then be inferred that the unknown column210 most probably corresponds to PHYSICALADDRESS3 = 0x00012347010. Because of features such as memory interleaving (a CPU function that improves memory access efficiency through parallelism), the result may not be fully determined, but it can satisfy most application scenarios.
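The inference in this example can be reproduced with a simple linear extrapolation. This is a sketch under a strong assumption, namely that within one Row the system physical address advances by a constant amount per Column step; the embodiment leaves the actual preset algorithm open, and memory interleaving can break this assumption. The function name extrapolate_address is hypothetical.

```python
def extrapolate_address(known, target_column):
    """Infer the address of `target_column` from two known points.

    `known` is a list of two (column, address) pairs from the same Row.
    A constant address stride per Column is assumed; integer arithmetic
    is used so the result stays an exact address value.
    """
    (c1, a1), (c2, a2) = sorted(known)
    return a1 + (target_column - c1) * (a2 - a1) // (c2 - c1)


known = [(10, 0x00012345010), (110, 0x00012346010)]
guess = extrapolate_address(known, 210)
print(hex(guess))
```

For the two recorded points above, the extrapolated address for column 210 matches the 0x00012347010 given in the example; in practice such a guess would be treated as a probable address rather than a certain one.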
Regarding the preset algorithm, prediction by a pre-trained model may be used. As for the memory structure, as shown in fig. 4, a Rank contains many chips (large particles); under each Rank the chips can be further divided into the concepts of Bank and BankGroup, and the X and Y dimensions under each Rank can be understood as X Rows and Y Columns. The number of rows and columns can be read from the memory SPD (roughly, an EEPROM non-volatile storage medium). Faults caused by hardware aging tend to show typical patterns, such as CEs concentrated in one Row, in one Column, or in both. This distribution is not absolute: there are also accidental fault types (such as faults caused by cosmic rays, which are not easily reproduced). The prediction model only predicts the former kind of fault scenario, which produces a certain number of CEs, and does not predict sporadic occurrences. Based on the row and column characteristics, the distribution of typical row and column CE faults is judged and labeled using collected information such as whether the memory was replaced during after-sales maintenance and whether the memory generated a UCE. These serve as the input conditions for training, which ultimately yields an offline model for diagnosis.
In one embodiment, as shown in fig. 5, the present disclosure also provides a memory failure prediction apparatus, applied to a computer device, where the apparatus includes: the first module is used for acquiring physical position information of a minimum storage unit Cell corresponding to the memory fault according to the system physical address corresponding to the memory fault; the second module is used for establishing fault record information carrying the system physical address and the physical position information and storing the fault record information in groups according to preset rules; a third module, configured to monitor fault record information of each packet, and if a packet whose fault value exceeds a preset threshold exists, predict a system physical address set of the target isolation area according to the fault record information of the packet by a preset algorithm; and a fourth module, configured to isolate the memory space of the target isolation area according to the predicted system physical address set of the target isolation area.
In one embodiment, the obtaining, according to the system physical address corresponding to the memory failure, physical location information of the minimum storage unit Cell corresponding to the memory failure includes: and analyzing the corresponding address information NormalizedAddress, interface information NormalizedSocketid, particle information NormalizedDieId and channel information NormalizedChannelid according to the system physical address PHYSICALADDRESS corresponding to the memory fault, analyzing the corresponding library information and Row Column information according to the address information, and summarizing the information to obtain the physical position information of the minimum storage unit Cell corresponding to the memory fault.
In one embodiment, the establishing fault record information carrying system physical address and physical location information, and storing the fault record information in groups according to a preset rule, includes: and positioning to a Row-Column position of a physical position where a memory fault occurs on memory hardware according to interface information, channel information, particle information, bank information and Row-Column information included in the physical position information, establishing fault record information comprising a mapping relation between the Row-Column position and a system physical address, and storing each fault record information according to a Row position or a Column position in a grouping way.
In one embodiment, the monitoring the fault record information of each packet, if there is a packet with a fault value exceeding a preset threshold, predicting a system physical address set of the target isolation area according to the fault record information of the packet by a preset algorithm, including: and if the fault record information of each group exceeds the preset threshold value, predicting all other system physical addresses mapped to the same Row position or Column position according to the system physical address mapped by the Row position or Column position according to the fault record information of each group, and taking all the predicted system physical addresses and the system physical address recorded in the fault record information of the group as a system physical address set of a target isolation area.
In one embodiment, the monitoring the fault record information of each packet, if there is a packet with a fault value exceeding a preset threshold, predicting a system physical address set of the target isolation area according to the fault record information of the packet by a preset algorithm, including: setting fault values for each group according to the number of fault record information included in the group, and predicting a system physical address set of a target isolation area through a preset algorithm according to the fault record information of the group if the group with the fault value exceeding a preset threshold exists.
In one embodiment, the monitoring the fault record information of each packet, if there is a packet with a fault value exceeding a preset threshold, predicting a system physical address set of the target isolation area according to the fault record information of the packet by a preset algorithm, including: setting different fault values for the fault record information of different memory faults of different fault types and/or different accumulated fault times of the same physical position information, accumulating the fault values of each fault record information of the same packet to obtain the fault value of the packet, and if the packet with the fault value exceeding the preset threshold exists, predicting the system physical address set of the target isolation area through a preset algorithm according to the fault record information of the packet.
The device embodiments are the same as or similar to the corresponding method embodiments and are not described in detail herein.
In one embodiment, the present disclosure provides an electronic device, including a processor and a readable storage medium storing machine executable instructions capable of being executed by the processor, where the processor executes the machine executable instructions to implement the foregoing memory failure prediction method, and from a hardware level, a schematic diagram of a hardware architecture may be shown in fig. 6.
In one embodiment, the present specification provides a readable storage medium storing machine-executable instructions that, when invoked and executed by a processor, cause the processor to implement the aforementioned memory failure prediction method.
Here, a readable storage medium may be any electronic, magnetic, optical, or other physical storage device that can contain or store information, such as executable instructions, data, or the like. For example, the readable storage medium may be: RAM (Random Access Memory), volatile memory, non-volatile memory, flash memory, a storage drive (e.g., hard drive), a solid state disk, any type of storage disk (e.g., optical disk, DVD, etc.), or a similar storage medium, or a combination thereof.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. A typical implementation device is a computer, which may be in the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email device, game console, tablet computer, wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being functionally divided into various units, respectively. Of course, the functions of each element may be implemented in one or more software and/or hardware elements when implemented in the present specification.
It will be appreciated by those skilled in the art that embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, the present specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present description embodiments may take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The present description is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the specification. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Moreover, these computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is merely an embodiment of the present specification and is not intended to limit the present specification. Various modifications and alterations to this specification will become apparent to those skilled in the art. Any modifications, equivalent substitutions, improvements, or the like, which are within the spirit and principles of the present description, are intended to be included within the scope of the claims of the present description.

Claims (14)

1. A memory failure prediction method, applied to a computer device, the method comprising:
acquiring physical position information of a minimum storage unit Cell corresponding to the memory fault according to a system physical address corresponding to the memory fault;
establishing fault record information carrying system physical address and physical position information, and storing the fault record information in groups according to preset rules;
monitoring fault record information of each group, and if the group with the fault value exceeding a preset threshold exists, predicting a system physical address set of a target isolation area through a preset algorithm according to the fault record information of the group;
and isolating the memory space of the target isolation region according to the predicted system physical address set of the target isolation region.
2. The method of claim 1, wherein the obtaining the physical location information of the minimum storage unit Cell corresponding to the memory failure according to the system physical address corresponding to the memory failure comprises:
and analyzing the corresponding address information NormalizedAddress, interface information NormalizedSocketid, particle information NormalizedDieId and channel information NormalizedChannelid according to the system physical address PHYSICALADDRESS corresponding to the memory fault, analyzing the corresponding library information and Row Column information according to the address information, and summarizing the information to obtain the physical position information of the minimum storage unit Cell corresponding to the memory fault.
3. The method of claim 2, wherein creating fault record information carrying system physical address and physical location information and storing the fault record information in groups according to a predetermined rule comprises:
And positioning to a Row-Column position of a physical position where a memory fault occurs on memory hardware according to interface information, channel information, particle information, bank information and Row-Column information included in the physical position information, establishing fault record information comprising a mapping relation between the Row-Column position and a system physical address, and storing each fault record information according to a Row position or a Column position in a grouping way.
4. A method according to claim 3, wherein monitoring the fault record information of each packet, and if there is a packet whose fault value exceeds a preset threshold, predicting the set of system physical addresses of the target isolation area according to the fault record information of the packet by a preset algorithm, includes:
And if the fault record information of each group exceeds the preset threshold value, predicting all other system physical addresses mapped to the same Row position or Column position according to the system physical address mapped by the Row position or Column position according to the fault record information of each group, and taking all the predicted system physical addresses and the system physical address recorded in the fault record information of the group as a system physical address set of a target isolation area.
5. The method according to claim 1, wherein monitoring the fault record information of each packet, if there is a packet whose fault value exceeds a preset threshold, predicting the system physical address set of the target isolation area according to the fault record information of the packet through a preset algorithm, includes:
Setting fault values for each group according to the number of fault record information included in the group, and predicting a system physical address set of a target isolation area through a preset algorithm according to the fault record information of the group if the group with the fault value exceeding a preset threshold exists.
6. The method according to claim 1, wherein monitoring the fault record information of each packet, if there is a packet whose fault value exceeds a preset threshold, predicting the system physical address set of the target isolation area according to the fault record information of the packet through a preset algorithm, includes:
Setting different fault values for the fault record information of different memory faults of different fault types and/or different accumulated fault times of the same physical position information, accumulating the fault values of each fault record information of the same packet to obtain the fault value of the packet, and if the packet with the fault value exceeding the preset threshold exists, predicting the system physical address set of the target isolation area through a preset algorithm according to the fault record information of the packet.
7. A memory failure prediction apparatus for use in a computer device, the apparatus comprising:
The first module is used for acquiring physical position information of a minimum storage unit Cell corresponding to the memory fault according to the system physical address corresponding to the memory fault;
the second module is used for establishing fault record information carrying the system physical address and the physical position information and storing the fault record information in groups according to preset rules;
A third module, configured to monitor fault record information of each packet, and if a packet whose fault value exceeds a preset threshold exists, predict a system physical address set of the target isolation area according to the fault record information of the packet by a preset algorithm;
And a fourth module, configured to isolate the memory space of the target isolation area according to the predicted system physical address set of the target isolation area.
8. The apparatus of claim 7, wherein the obtaining the physical location information of the minimum storage unit Cell corresponding to the memory failure according to the system physical address corresponding to the memory failure comprises:
and analyzing the corresponding address information NormalizedAddress, interface information NormalizedSocketid, particle information NormalizedDieId and channel information NormalizedChannelid according to the system physical address PHYSICALADDRESS corresponding to the memory fault, analyzing the corresponding library information and Row Column information according to the address information, and summarizing the information to obtain the physical position information of the minimum storage unit Cell corresponding to the memory fault.
9. The apparatus of claim 8, wherein creating fault record information carrying system physical address and physical location information and storing the fault record information in groups according to a predetermined rule comprises:
And positioning to a Row-Column position of a physical position where a memory fault occurs on memory hardware according to interface information, channel information, particle information, bank information and Row-Column information included in the physical position information, establishing fault record information comprising a mapping relation between the Row-Column position and a system physical address, and storing each fault record information according to a Row position or a Column position in a grouping way.
10. The apparatus of claim 9, wherein the monitoring the fault record information of each packet, if there is a packet with a fault value exceeding a preset threshold, predicting the system physical address set of the target isolation area according to the fault record information of the packet through a preset algorithm, includes:
And if the fault record information of each group exceeds the preset threshold value, predicting all other system physical addresses mapped to the same Row position or Column position according to the system physical address mapped by the Row position or Column position according to the fault record information of each group, and taking all the predicted system physical addresses and the system physical address recorded in the fault record information of the group as a system physical address set of a target isolation area.
11. The apparatus of claim 7, wherein the monitoring the fault record information of each packet, if there is a packet with a fault value exceeding a preset threshold, predicting the system physical address set of the target isolation area according to the fault record information of the packet through a preset algorithm, comprises:
Setting fault values for each group according to the number of fault record information included in the group, and predicting a system physical address set of a target isolation area through a preset algorithm according to the fault record information of the group if the group with the fault value exceeding a preset threshold exists.
12. The apparatus of claim 7, wherein monitoring the fault record information of each group and, if there is a group whose fault value exceeds a preset threshold, predicting the system physical address set of the target isolation area through a preset algorithm according to the fault record information of that group comprises:
setting different fault values for fault record information according to the fault type of the memory fault and/or the accumulated number of faults at the same physical position; accumulating the fault values of all fault record information in the same group to obtain the fault value of the group; and, if there is a group whose fault value exceeds the preset threshold, predicting the system physical address set of the target isolation area through a preset algorithm according to the fault record information of that group.
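One possible reading of this weighted scoring is sketched below. The weight table `FAULT_TYPE_WEIGHT`, the fault type names, and the choice to multiply the type weight by the repeat count are all assumptions; the source only says that fault values differ by fault type and/or accumulated fault count and are summed per group.

```python
# Hypothetical weights per fault type (not from the source).
FAULT_TYPE_WEIGHT = {"correctable": 1, "uncorrectable": 5}


def record_fault_value(fault_type, repeat_count):
    """Assumed rule: type weight multiplied by the accumulated fault count."""
    return FAULT_TYPE_WEIGHT[fault_type] * repeat_count


def group_fault_value(records):
    """Sum the fault values of all (fault_type, repeat_count) records."""
    return sum(record_fault_value(t, n) for t, n in records)


def groups_over_threshold(groups, threshold):
    """Return the keys of groups whose fault value exceeds the threshold."""
    return [key for key, recs in groups.items()
            if group_fault_value(recs) > threshold]


groups = {
    "row7": [("correctable", 3), ("uncorrectable", 1)],  # 3 + 5 = 8
    "row9": [("correctable", 2)],                         # 2
}
flagged = groups_over_threshold(groups, threshold=6)
```

Setting every weight to 1 and every repeat count to 1 reduces this to the simpler count-based fault value of claim 11.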
13. An electronic device, comprising: a processor and a readable storage medium storing machine executable instructions executable by the processor to implement the method of any one of claims 1-6.
14. A readable storage medium storing machine executable instructions which, when invoked and executed by a processor, cause the processor to implement the method of any one of claims 1-6.

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202410599032.1A / CN118467224A (en) | 2024-05-14 | 2024-05-14 | Memory fault prediction method, device, equipment and readable storage medium

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202410599032.1A / CN118467224A (en) | 2024-05-14 | 2024-05-14 | Memory fault prediction method, device, equipment and readable storage medium

Publications (1)

Publication Number | Publication Date
CN118467224A (en) | 2024-08-09

Family

ID=92160366

Family Applications (1)

Application Number | Status | Title | Priority Date | Filing Date
CN202410599032.1A / CN118467224A (en) | Pending | Memory fault prediction method, device, equipment and readable storage medium | 2024-05-14 | 2024-05-14

Country Status (1)

Country | Link
CN (1) | CN118467224A (en)

Similar Documents

Publication Publication Date Title
Hwang et al. Cosmic rays don't strike twice: Understanding the nature of DRAM errors and the implications for system design
Shang et al. Automated detection of performance regressions using regression models on clustered performance counters
CN109918196B (en) System resource allocation method, device, computer equipment and storage medium
CN113672415B (en) Disk fault processing method, device, equipment and storage medium
Du et al. Predicting uncorrectable memory errors for proactive replacement: An empirical study on large-scale field data
CN110647472A (en) Breakdown information statistical method and device, computer equipment and storage medium
CN116502166B (en) Method, device, equipment and medium for predicting faults of target equipment
CN114461436A (en) Memory fault processing method and device and computer readable storage medium
CN110737924A (en) method and equipment for data protection
CN115640174A (en) Memory fault prediction method and system, central processing unit and computing equipment
CN114860487A (en) Memory fault identification method and memory fault isolation method
CN106844166B (en) Data processing method and device
CN115168087A (en) Method and device for determining granularity of repair resources of memory failure
CN115793990A (en) Memory health state determination method and device, electronic equipment and storage medium
US8261137B2 (en) Apparatus, a method and a program thereof
CN115705261A (en) Memory fault repairing method, CPU, OS, BIOS and server
CN117472623A (en) Method, device, equipment and storage medium for processing memory fault
CN118467224A (en) Memory fault prediction method, device, equipment and readable storage medium
CN116501705A (en) RAS-based memory information collecting and analyzing method, system, equipment and medium
JP6993472B2 (en) Methods, devices, electronic devices, and computer storage media for detecting deep learning chips
CN115686909A (en) Memory fault prediction method and device, storage medium and electronic device
CN115080331A (en) Fault processing method and computing device
CN110858167B (en) Memory fault isolation method, device and equipment
CN111581044A (en) Cluster optimization method, device, server and medium
CN112346932B (en) Method and device for positioning hidden bad disk, electronic equipment and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination