CN117971547A - Memory fault prediction method, device, equipment, storage medium and program product - Google Patents

Publication number
CN117971547A
Authority
CN
China
Prior art keywords
storage unit
determining
memory
target
coordinate point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410372877.7A
Other languages
Chinese (zh)
Inventor
孔涛
李锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Metabrain Intelligent Technology Co Ltd
Original Assignee
Suzhou Metabrain Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Metabrain Intelligent Technology Co Ltd
Priority to CN202410372877.7A
Publication of CN117971547A

Abstract

The invention relates to the technical field of computers and discloses a memory fault prediction method, apparatus, device, storage medium and program product. The method comprises: determining a target storage unit to be predicted in a plurality of storage units included in a memory to be tested according to a reference prediction parameter and first correctable error data of the memory to be tested; dividing the target storage unit into at least one storage unit set; determining storage unit distribution characteristics of each storage unit set based on the reference prediction parameter; and performing first fault prediction processing according to the storage unit distribution characteristics to obtain a first fault prediction result. The invention achieves effective prediction of memory faults and reduces risks such as data loss and service interruption caused by memory faults.

Description

Memory fault prediction method, device, equipment, storage medium and program product
Technical Field
The invention relates to the technical field of computers, in particular to a memory fault prediction method, a device, equipment, a storage medium and a program product.
Background
With the advent of the big data era, memory must store more and more data, and its operating frequency keeps increasing. However, the higher the operating frequency, the more susceptible signals are to interference, which leads to memory errors. Memory errors include correctable errors (Correctable Error, CE) and uncorrectable errors (Uncorrectable Error, UCE). Memory has a certain error checking and correcting capability, abbreviated as ECC (Error Checking and Correcting). Each time the memory performs an ECC correction, a CE is reported; if the errors generated by the memory exceed its error correction capability, the CE turns into a UCE. When a UCE occurs in the memory, the device in which the memory is located goes down or restarts, causing problems such as service data loss and service interruption.
Disclosure of Invention
In view of the above, the present invention provides a memory fault prediction method, apparatus, device, storage medium and program product, so as to solve problems such as data loss and service interruption caused by memory errors.
In a first aspect, the present invention provides a memory failure prediction method, where the method includes:
determining a target storage unit to be predicted in a plurality of storage units included in the memory to be tested according to the reference prediction parameter and first correctable error data of the memory to be tested; the reference prediction parameters represent the prediction range of the storage unit;
Dividing the target storage unit into at least one storage unit set;
determining storage unit distribution characteristics of each storage unit set based on the reference prediction parameters;
And performing first fault prediction processing according to the distribution characteristics of the storage units to obtain a first fault prediction result.
In the memory fault prediction method provided by this embodiment, a target storage unit to be predicted in the plurality of storage units included in the memory to be tested is determined according to the reference prediction parameter and the first correctable error data of the memory to be tested; the target storage unit is divided into at least one storage unit set, and the storage unit distribution characteristics of each storage unit set are determined based on the reference prediction parameters; first fault prediction processing is then performed according to the storage unit distribution characteristics to obtain a first fault prediction result. The more concentrated the distribution of the storage units that generate correctable errors, the higher the risk that ECC error correction fails. Therefore, by determining the storage unit distribution characteristics of each storage unit set, memory faults can be predicted accurately based on those characteristics, which reduces risks such as data loss and service interruption caused by memory faults.
In an alternative embodiment, determining a target storage unit to be predicted in a plurality of storage units included in the memory to be tested according to the reference prediction parameter and the first correctable error data of the memory to be tested includes:
mapping address information of each storage unit included in the memory to be tested from a three-dimensional space to a two-dimensional space;
marking at least one first storage unit corresponding to the first correctable error data in a two-dimensional space;
and determining the target storage unit to be predicted according to the reference prediction parameters and the two-dimensional coordinates of each first storage unit in the two-dimensional space.
In the memory fault prediction method provided by this embodiment, mapping the address information of each storage unit from the three-dimensional space to the two-dimensional space allows the distribution of the first storage units that generate correctable errors to be displayed visually in the two-dimensional space afterwards, and makes it convenient to determine the data required for fault prediction.
In an alternative embodiment, marking at least one first storage unit corresponding to the first correctable error data in the two-dimensional space includes:
Acquiring, from the first correctable error data, target address information in the three-dimensional space of at least one first storage unit generating a correctable error;
Determining target two-dimensional coordinates of each first storage unit according to the target address information;
and marking coordinate points corresponding to the two-dimensional coordinates of the target in the two-dimensional space.
In the memory failure prediction method provided in this embodiment, since the distribution situation of the first storage units is closely related to the occurrence of the memory failure, by marking each first storage unit in the two-dimensional space, the target storage unit can be accurately determined based on the distribution situation of the first storage units later.
In an alternative embodiment, marking at least one first storage unit corresponding to the first correctable error data in the two-dimensional space includes:
Acquiring first correctable error data of the memory to be tested at a first preset time interval;
Determining whether an unlabeled candidate first storage unit exists in at least one first storage unit corresponding to the first correctable error data;
if yes, the candidate first storage unit is marked in the two-dimensional space.
In the memory failure prediction method provided by the embodiment, the candidate first storage units are determined and marked in the two-dimensional space, so that repeated marking of the first storage units is avoided, and marking efficiency is improved.
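The marking step above can be sketched as follows. This is a minimal illustration, assuming that each first storage unit is already represented by its two-dimensional coordinates and that the marked points are held in an in-memory set; the name `mark_new_units` and the sample coordinates are hypothetical and are not taken from the source.

```python
from typing import Iterable, Set, Tuple

Coordinate = Tuple[int, int]  # (x, y) in the two-dimensional space


def mark_new_units(marked: Set[Coordinate], batch: Iterable[Coordinate]) -> Set[Coordinate]:
    """Mark only coordinates that are not marked yet and return the new ones."""
    candidates = set(batch) - marked  # unlabeled candidate first storage units
    marked |= candidates              # each unit is marked exactly once
    return candidates


# Two acquisitions of correctable error data, one per preset time interval.
marked: Set[Coordinate] = set()
first = mark_new_units(marked, [(3, 7), (3, 8), (3, 7)])
second = mark_new_units(marked, [(3, 8), (10, 2)])
```

In this sketch the second acquisition marks only `(10, 2)`, because `(3, 8)` was already marked by the first one.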
In an alternative embodiment, determining the target storage unit to be predicted according to the reference prediction parameter and the two-dimensional coordinates of each first storage unit in the two-dimensional space includes:
Determining discrete information of each first storage unit according to the reference prediction parameters and the two-dimensional coordinates of each first storage unit in the two-dimensional space;
And determining a target storage unit to be predicted according to the discrete information.
In the memory failure prediction method provided by the embodiment, since the discrete information can represent whether the coordinate point corresponding to the corresponding first storage unit is a discrete point or not, and the influence of the discrete point on the health state of the memory is small, the target storage unit is determined according to the discrete information, the accuracy of the target storage unit is ensured, and the accuracy of the prediction result is further ensured.
In an alternative embodiment, determining the discrete information of each first storage unit according to the reference prediction parameter and the two-dimensional coordinates of each first storage unit in the two-dimensional space includes:
According to the two-dimensional coordinates of each first storage unit in the two-dimensional space, determining first local distribution information of each first storage unit in a neighborhood corresponding to the reference prediction parameter;
and determining the discrete information of each first storage unit according to the first local distribution information.
According to the two-dimensional coordinates of each first storage unit in the two-dimensional space, determining first local distribution information of each first storage unit in a neighborhood corresponding to the reference prediction parameter comprises the following steps:
determining a current first coordinate point and a current second coordinate point; the second coordinate point is a coordinate point except the first coordinate point in the coordinate points corresponding to the two-dimensional coordinates;
determining a first distance between the first coordinate point and each second coordinate point;
determining a neighborhood of the first coordinate point according to the first distance and the reference prediction parameter;
determining a first reachable distance from the first coordinate point to each second coordinate point in the neighborhood;
Determining a first local reachable density of the first coordinate point according to the first reachable distance;
the first local distribution information of the first storage unit corresponding to the first coordinate point includes a first local reachable density.
In the memory failure prediction method provided by the embodiment, each first reachable distance of the first coordinate point in the neighborhood corresponding to the reference prediction parameter is determined, and the first local reachable density of the first coordinate point is determined according to each first reachable distance. The health state of the memory can be accurately predicted from the perspective of the local area based on the first local reachable density.
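The steps above resemble the local reachability density used in the well-known Local Outlier Factor (LOF) technique. The following is a minimal sketch under that assumption; the source does not give the exact formulas in this section, so the reachable distance `max(k-distance of the neighbor, distance)` and the density `1 / mean reachable distance` are the standard LOF definitions, not confirmed details of the invention. The parameter `k` plays the role of the reference prediction parameter, and it must be smaller than the number of points.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def k_neighborhood(points: Sequence[Point], i: int, k: int) -> Tuple[float, List[int]]:
    """Return the k-distance of point i and the indices inside its k-th neighborhood."""
    dists = sorted(
        (math.dist(points[i], points[j]), j) for j in range(len(points)) if j != i
    )
    k_dist = dists[k - 1][0]
    neighbors = [j for d, j in dists if d <= k_dist]  # ties are kept
    return k_dist, neighbors


def local_reachable_density(points: Sequence[Point], k: int) -> List[float]:
    """Local reachable density of every point (standard LOF definition, assumed)."""
    info = [k_neighborhood(points, i, k) for i in range(len(points))]
    densities = []
    for i, (_, neighbors) in enumerate(info):
        # Reachable distance from point i to neighbor o.
        reach = [max(info[o][0], math.dist(points[i], points[o])) for o in neighbors]
        total = sum(reach)
        densities.append(len(reach) / total if total > 0 else math.inf)
    return densities


# Four clustered points and one isolated point (illustrative data only).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
densities = local_reachable_density(points, k=2)
```

In this toy data the four clustered points receive a much higher density than the isolated point, which is the property the embodiment relies on.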
In an alternative embodiment, determining the discrete information of each first storage unit according to the first local distribution information includes:
Determining a first average local reachable density of the first local reachable density of each coordinate point in the neighborhood of the first coordinate point;
determining a first local relative density of the first coordinate point according to the first average local reachable density and the first local reachable density of the first coordinate point;
the discrete information of the first storage unit corresponding to the first coordinate point includes the first local relative density.
In the memory fault prediction method provided by this embodiment, the first storage units that generate correctable errors are not uniformly distributed in the two-dimensional space, and first storage units that are discretely distributed within a local range often have little or no effect on the health of the memory. Therefore, determining the first local relative density of the coordinate point corresponding to each first storage unit makes it possible to measure more accurately whether each coordinate point is a discrete point, so that the target storage unit can be determined more accurately.
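One way to picture the local relative density is as the ratio used by the Local Outlier Factor technique: the average local reachable density of the neighbors divided by the point's own density. The sketch below assumes that ratio; the source only states that the relative density is determined from these two quantities. The threshold `1.5` used to separate discrete points, and the names `local_relative_density` and `non_discrete_points`, are hypothetical.

```python
from typing import List, Sequence


def local_relative_density(
    lrd: Sequence[float], neighbors: Sequence[Sequence[int]]
) -> List[float]:
    """Average density of each point's neighbors divided by its own density."""
    result = []
    for i, nbrs in enumerate(neighbors):
        average = sum(lrd[j] for j in nbrs) / len(nbrs)
        result.append(average / lrd[i])
    return result


def non_discrete_points(relative: Sequence[float], threshold: float = 1.5) -> List[int]:
    """Indices of points not treated as discrete (threshold is an assumption)."""
    return [i for i, r in enumerate(relative) if r <= threshold]


# Illustrative densities and neighborhoods: point 4 is far from the others.
lrd = [1.0, 1.0, 1.0, 1.0, 0.08]
neighbors = [[1, 2], [0, 3], [0, 3], [1, 2], [1, 2, 3]]
relative = local_relative_density(lrd, neighbors)
targets = non_discrete_points(relative)
```

With these values the isolated point has a relative density far above 1 and is excluded, while the four clustered points remain as candidates for target storage units.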
In an alternative embodiment, determining storage unit distribution characteristics for each storage unit set based on the reference prediction parameters includes:
Determining second local distribution information of each target storage unit in the storage unit set according to the reference prediction parameters and the two-dimensional coordinates of each target storage unit in the storage unit set in the two-dimensional space;
and determining the storage unit distribution characteristics of each storage unit set according to the second local distribution information.
In the memory failure prediction method provided by the embodiment, the prediction of the local storage unit distribution is realized by determining the second local distribution information of each storage unit in each storage unit set and determining the storage unit distribution characteristics of each storage unit set according to the second local distribution information.
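As a minimal sketch, assuming the second local reachable densities have already been computed for the target storage units of each set, the distribution characteristic of a set can be taken as the mean of those densities. The function name and the sample values are hypothetical.

```python
from statistics import fmean
from typing import List, Sequence


def set_distribution_features(set_densities: Sequence[Sequence[float]]) -> List[float]:
    """Average local reachable density of each storage unit set."""
    return [fmean(densities) for densities in set_densities]


# Two storage unit sets with illustrative second local reachable densities.
features = set_distribution_features([[1.0, 3.0], [0.5]])
```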
In an alternative embodiment, the second local distribution information includes a second local reachable density, and the storage unit distribution characteristics include a second average local reachable density of the second local reachable densities; performing first fault prediction processing according to the storage unit distribution characteristics to obtain a first fault prediction result includes:
determining the largest of the second average local reachable densities as a first target average local reachable density;
determining, among a plurality of preset density ranges, the target density range to which the first target average local reachable density belongs;
And determining a first fault prediction result according to the target density range.
In the memory fault prediction method provided by this embodiment, the largest of the second average local reachable densities in the storage unit distribution characteristics, namely the first target average local reachable density, indicates the storage unit set whose target storage units are most concentrated and for which the risk of ECC error correction failure is therefore greatest. Determining the first fault prediction result based on the first target average local reachable density thus ensures the accuracy of the first fault prediction result and enables accurate prediction of faults in the memory to be tested.
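The range lookup can be sketched as below. The boundary values in `BOUNDS` and the mapping of ranges to the state labels in `LABELS` are hypothetical; the source only states that a plurality of density ranges is preset.

```python
from bisect import bisect_right
from typing import Sequence

# Hypothetical upper bounds of the preset density ranges and their labels.
BOUNDS = [0.5, 1.0, 2.0]
LABELS = ["healthy", "sub-healthy", "about to fail", "failed"]


def predict_fault(average_densities: Sequence[float]) -> str:
    """Map the largest average local reachable density to a preset range."""
    target = max(average_densities)  # first target average local reachable density
    return LABELS[bisect_right(BOUNDS, target)]


result = predict_fault([0.2, 0.7, 0.4])
```

Here the largest average density is `0.7`, which falls in the second assumed range.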
In an alternative embodiment, before determining a target storage unit to be predicted in a plurality of storage units included in the memory to be tested according to the reference prediction parameter and the first correctable error data of the memory to be tested, the method further includes:
acquiring initial prediction parameters to be optimized;
and carrying out optimization processing on the initial prediction parameters according to a preset optimization rule to obtain reference prediction parameters.
Optimizing the initial prediction parameters according to the preset optimization rule to obtain the reference prediction parameters includes:
Acquiring second correctable error data of the target memory and real fault data corresponding to the second correctable error data;
Performing second fault prediction processing by using the initial prediction parameters and the second correctable error data to obtain a second fault prediction result;
determining whether the initial prediction parameters meet the optimization conditions according to the second fault prediction result and the real fault data;
If the optimization condition is met, optimizing the initial prediction parameters, and performing second fault prediction processing according to the optimized initial prediction parameters and second correctable error data until the initial prediction parameters are determined to not meet the optimization condition;
If the optimization condition is not satisfied, determining the initial prediction parameter when the optimization condition is not satisfied as a reference prediction parameter.
In the memory failure prediction method provided by the embodiment, the initial prediction parameters are optimized to obtain the reference prediction parameters, and the failure prediction processing is performed on the memory to be detected based on the reference prediction parameters, so that the accuracy of the first failure prediction result is ensured.
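The optimization loop can be sketched as follows. The source does not specify the optimization condition or the optimization rule in this section, so this sketch assumes that the condition is "prediction accuracy below a minimum value" and that the rule is "increase the parameter by 1"; `optimize_parameter`, `toy_predict`, `min_accuracy` and `max_k` are hypothetical names.

```python
from typing import Callable, Sequence


def optimize_parameter(
    initial_k: int,
    predict: Callable[[int], Sequence[str]],
    truth: Sequence[str],
    min_accuracy: float = 0.9,
    max_k: int = 50,
) -> int:
    """Adjust k until the optimization condition is no longer met."""
    k = initial_k
    while k < max_k:
        predicted = predict(k)  # second fault prediction processing
        accuracy = sum(p == t for p, t in zip(predicted, truth)) / len(truth)
        if accuracy >= min_accuracy:  # optimization condition not met: stop
            break
        k += 1  # assumed optimization rule
    return k


# Real fault data and a toy predictor that only becomes correct for k >= 4.
truth = ["failed", "healthy", "healthy", "failed"]


def toy_predict(k: int) -> Sequence[str]:
    return truth if k >= 4 else ["healthy"] * len(truth)


reference_k = optimize_parameter(2, toy_predict, truth)
```

In this toy run the loop stops at `k = 4`, which would then serve as the reference prediction parameter.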
In a second aspect, the present invention provides a memory failure prediction apparatus, including:
The first determining module is used for determining a target storage unit to be predicted in a plurality of storage units included in the memory to be tested according to the reference prediction parameter and first correctable error data of the memory to be tested; the reference prediction parameters characterize the prediction range of the storage unit;
the dividing module is used for dividing the target storage unit into at least one storage unit set;
The second determining module is used for determining storage unit distribution characteristics of each storage unit set based on the reference prediction parameters;
And the prediction module is used for performing first fault prediction processing according to the distribution characteristics of the storage units to obtain a first fault prediction result.
In a third aspect, the present invention provides a computer device comprising: the memory and the processor are in communication connection, computer instructions are stored in the memory, and the processor executes the computer instructions, so that the memory failure prediction method of the first aspect or any implementation manner corresponding to the first aspect is executed.
In a fourth aspect, the present invention provides a computer readable storage medium having stored thereon computer instructions for causing a computer to execute the memory failure prediction method of the first aspect or any one of the embodiments corresponding thereto.
In a fifth aspect, the present invention provides a computer program product comprising computer instructions for causing a computer to perform the memory failure prediction method of the first aspect or any of its corresponding embodiments.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a memory failure prediction method according to an embodiment of the invention;
FIG. 2 is a schematic diagram illustrating a connection between a memory and a CPU according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an internal structure of a memory according to an embodiment of the invention;
FIG. 4 is a flowchart of another memory failure prediction method according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of marking first storage units according to an embodiment of the invention;
FIG. 6 is a schematic diagram of a p-th neighborhood of point m according to an embodiment of the present invention;
FIG. 7 is a flowchart of a method for predicting a memory failure according to an embodiment of the present invention;
FIG. 8 is a flowchart of a method for predicting a memory failure according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of the variation of initial prediction parameters during an optimization process according to an embodiment of the present invention;
FIG. 10 is a block diagram illustrating a memory failure prediction apparatus according to an embodiment of the present invention;
FIG. 11 is a schematic diagram of a hardware structure of a computer device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
To predict memory faults, the related art collects at least one of correctable error (CE) event information of the memory, dynamic features of the server, and static features of the server, and performs prediction processing on it. The dynamic features of the server include at least one of dynamic information of the server memory and dynamic information of the server processor (CPU); the static features of the server include at least one of the memory vendor, memory manufacturing date, memory capacity, memory rate, CPU type, and server motherboard type. However, the features collected in this prediction mode are related to the services running on the server and cannot truly reflect the actual health of the memory, so the accuracy of memory fault prediction is low. Based on this, the invention provides a memory fault prediction method, which determines a target storage unit to be predicted in a plurality of storage units included in a memory to be tested according to a reference prediction parameter and first correctable error data of the memory to be tested; divides the target storage unit into at least one storage unit set, and determines storage unit distribution characteristics of each storage unit set based on the reference prediction parameters; and performs first fault prediction processing according to the storage unit distribution characteristics to obtain a first fault prediction result. The more concentrated the distribution of the storage units that generate correctable errors, the higher the risk that ECC error correction fails. Therefore, by determining the storage unit distribution characteristics of each storage unit set, memory faults can be predicted accurately based on those characteristics, which reduces risks such as data loss and service interruption caused by memory faults.
In accordance with an embodiment of the present invention, there is provided an embodiment of a memory failure prediction method, it being noted that the steps illustrated in the flowchart of the figures may be performed in a computer system, such as a set of computer executable instructions, and that, although a logical order is illustrated in the flowchart, in some cases, the steps illustrated or described may be performed in an order other than that illustrated herein.
This embodiment provides a memory fault prediction method that may be used in the above-mentioned computer devices, such as a mobile phone, a tablet computer, a desktop computer, or a server. FIG. 1 is a flowchart of a memory fault prediction method according to an embodiment of the present invention. As shown in FIG. 1, the flow includes the following steps:
Step S101, determining a target storage unit to be predicted in a plurality of storage units included in the memory to be tested according to the reference prediction parameter and first correctable error data of the memory to be tested; the reference prediction parameters characterize a prediction range for the storage units.
In a device, as shown in FIG. 2, at least one memory is connected to the central processing unit (CPU) through a channel; FIG. 2 shows 4 memories as an example. The internal hierarchy of the memory may be divided into: rank (rank) > group (chip) > region (bank) > row (row)/column (column) > storage unit (cell) > data unit (bit). As shown in FIG. 3, a memory may include a plurality of ranks, each rank including a plurality of groups, each group including a plurality of regions, and each region including a plurality of rows and a plurality of columns, from which a storage unit can be uniquely determined; each storage unit includes a plurality of data units. Error checking and correction (ECC) of the memory is implemented by dividing each storage unit (cell) into data bits and check bits. When erroneous data occurs in the data bits, it can be corrected using the check code stored in the check bits to ensure data accuracy; the memory then reports a CE to the CPU through the corresponding pipeline and generates log data.
The memory to be tested in the invention may be the memory of a target device, and the target device may be the computer device described above or another device. Accordingly, step S101 may include: obtaining log data of the target device, obtaining the first correctable error data of the memory to be tested from the log data, and determining the target storage unit to be predicted in the plurality of storage units included in the memory to be tested according to the reference prediction parameters and the first correctable error data. The first correctable error data may include target address information of the first storage unit generating the CE, where the target address information may include the rank, group, region, row, column and similar information of the first storage unit.
The reference prediction parameter may be an integer greater than 1 and indicates the neighborhood of a storage unit within which the relevant features of that storage unit are predicted. For example, if the reference prediction parameter is 3, the relevant features of a storage unit are predicted within the 3rd neighborhood of that storage unit.
Step S102, dividing the target storage unit into at least one storage unit set.
In order to accurately predict the memory failure, the target memory unit is divided into at least one memory unit set, and the subsequent processing is performed based on the memory unit set.
Step S103, based on the reference prediction parameters, determining storage unit distribution characteristics of each storage unit set.
The more concentrated the storage units that generate CEs are within a region, the higher the risk of ECC error correction failure tends to be. Based on this, in the present invention, the storage unit distribution characteristics of each storage unit set are determined based on the reference prediction parameters.
And step S104, performing first fault prediction processing according to the distribution characteristics of the storage units to obtain a first fault prediction result.
The first fault prediction result represents the current state of the memory to be tested, such as healthy, sub-healthy, about to fail, or failed.
According to the memory fault prediction method provided by this embodiment, a target storage unit to be predicted in a plurality of storage units included in the memory to be tested is determined according to the reference prediction parameter and the first correctable error data of the memory to be tested; the target storage unit is divided into at least one storage unit set, and the storage unit distribution characteristics of each storage unit set are determined based on the reference prediction parameters; first fault prediction processing is then performed according to the storage unit distribution characteristics to obtain a first fault prediction result. The more concentrated the distribution of the storage units that generate correctable errors, the higher the risk that ECC error correction fails. Therefore, by determining the storage unit distribution characteristics of each storage unit set, memory faults can be predicted accurately based on those characteristics, which reduces risks such as data loss and service interruption caused by memory faults.
This embodiment provides another memory fault prediction method that may be used in the above-mentioned computer devices, such as a mobile phone, a tablet computer, a desktop computer, or a server. FIG. 4 is a flowchart of another memory fault prediction method according to an embodiment of the present invention. As shown in FIG. 4, the flow includes the following steps:
Step S401, determining a target storage unit to be predicted in a plurality of storage units included in the memory to be tested according to the reference prediction parameter and the first correctable error data of the memory to be tested; the reference prediction parameters characterize a prediction range for the storage units.
As shown in the memory structure of FIG. 2, the storage space of the memory is a three-dimensional space, and the smallest unit of the memory for storing data is a storage unit (cell). To facilitate accurate determination of the target storage unit, in some embodiments each storage unit of the memory to be tested may be regarded as a point and mapped from the three-dimensional space corresponding to the memory to a two-dimensional space, so that the target storage unit is determined based on the two-dimensional space. That is, step S401 may include the following steps S4011 to S4013:
In step S4011, address information of each storage unit included in the memory to be tested is mapped from the three-dimensional space to the two-dimensional space.
Specifically, the address information of each storage unit in the three-dimensional space corresponding to the memory to be tested is determined; the two-dimensional coordinates of each storage unit are determined according to the address information; and the address information is mapped from the three-dimensional space to the two-dimensional space according to the two-dimensional coordinates. To determine the two-dimensional coordinates of each storage unit according to the address information, the address information of each storage unit may be converted using formula one and formula two below. The address information includes the rank, group, region, row, column and similar information of the storage unit.
Formula one and formula two (not reproduced in this text) convert the address information into the two-dimensional coordinates.
Wherein the quantities involved are the abscissa, the ordinate, the rank, the group, the area, the row and the column, together with three constants whose specific values can be set as needed in practical applications.
By mapping the address information of each storage unit from the three-dimensional space to the two-dimensional space, the distribution of the first storage units that generate correctable errors can subsequently be displayed visually in the two-dimensional space, which makes it convenient to determine the data required for fault prediction.
In step S4012, at least one first storage unit corresponding to the first correctable error data is marked in the two-dimensional space.
Specifically, target address information, in the three-dimensional space corresponding to the memory, of at least one first storage unit that generated a correctable error is obtained from the first correctable error data; target two-dimensional coordinates of each first storage unit are determined according to the target address information; and the coordinate points corresponding to the target two-dimensional coordinates are marked in the two-dimensional space. The process of determining the target two-dimensional coordinates of each first storage unit according to the target address information is the same as the process of determining the two-dimensional coordinates of each storage unit according to the address information; reference may be made to the related description above, which is not repeated here.
To accurately determine the memory cell distribution characteristics, in some embodiments, the first correctable error data may be acquired multiple times and marked only once in two-dimensional space for each first memory cell. That is, step S4012 may include: acquiring first correctable error data of the memory to be tested at intervals of a first preset time interval; determining whether an unlabeled candidate first storage unit exists in at least one first storage unit corresponding to the first correctable error data; if yes, the candidate first storage unit is marked in the two-dimensional space.
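The "mark only once" behavior described above can be illustrated with a minimal Python sketch. It assumes that each acquisition of first correctable error data has already been converted into a batch of two-dimensional coordinates; the function name `mark_new_units` and the sample coordinates are hypothetical and do not come from the embodiment.

```python
from typing import Iterable, List, Set, Tuple

Coord = Tuple[int, int]

def mark_new_units(marked: Set[Coord], batch: Iterable[Coord]) -> List[Coord]:
    """Mark the coordinates in `batch` that are not marked yet and return them."""
    newly_marked = []
    for coord in batch:
        if coord not in marked:      # candidate first storage unit not yet marked
            marked.add(coord)
            newly_marked.append(coord)
    return newly_marked

# Two acquisitions, e.g. one per first preset time interval (sample data is made up).
marked: Set[Coord] = set()
first_batch = mark_new_units(marked, [(1, 2), (3, 4), (1, 2)])
second_batch = mark_new_units(marked, [(3, 4), (5, 6)])
```

Keeping the marked coordinates in a set means that a storage unit reporting correctable errors repeatedly still contributes a single point to the later density calculations.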
In some embodiments, the target storage unit may be determined once every second preset time interval, that is, step S4013 is performed. The second preset time interval is greater than the first preset time interval, for example, the second preset time interval is 30 minutes, and the first preset time interval is 5 minutes.
In other embodiments, step S4013 may be performed each time the first correctable error data has been acquired a preset number of times. For example, if the first preset time interval is 3 minutes and the preset number of times is 5, the target storage unit is determined every 15 minutes.
Since the distribution of the first storage units is closely related to the occurrence of the memory failure, by marking each first storage unit in the two-dimensional space, the target storage unit can be accurately determined based on the distribution of the first storage units later.
In step S4013, a target storage unit to be predicted is determined according to the reference prediction parameter and the two-dimensional coordinates of the first storage unit in the two-dimensional space.
As an example, a schematic diagram of the first storage units marked in step S4012 is shown in fig. 5. As can be seen from fig. 5, some of the first storage units are distributed in a relatively concentrated manner and gather into one or more clusters, such as cluster b1 and cluster b2 in fig. 5, while other first storage units are scattered and lie far from every cluster. Considering that, in practical applications, such scattered (discrete) points often have little effect on the health status of the memory, the target storage unit may be determined according to discrete information of the first storage units in some embodiments. That is, step S4013 may include the following steps a1 and a2:
Step a1, determining discrete information of each first storage unit according to the reference prediction parameter and the two-dimensional coordinates of each first storage unit in the two-dimensional space.
In some embodiments, step a1 may include the following steps a11 and a12:
Step a11, determining first local distribution information of each first storage unit in a neighborhood corresponding to the reference prediction parameter according to the two-dimensional coordinates of each first storage unit in the two-dimensional space.
Specifically, determining a current first coordinate point and a second coordinate point, wherein the second coordinate point is a coordinate point except the first coordinate point in all coordinate points corresponding to the marked two-dimensional coordinates; determining a first distance between the first coordinate point and each second coordinate point; determining a neighborhood of the first coordinate point according to the first distance and the reference prediction parameter; determining a first reachable distance from the first coordinate point to each second coordinate point in the neighborhood; determining a first local reachable density of the first coordinate point according to the first reachable distance; the first local distribution information of the first storage unit corresponding to the first coordinate point includes a first local reachable density.
In some embodiments, each coordinate point marked in the two-dimensional space may be sequentially determined as a current first coordinate point, and each coordinate point other than the first coordinate point among the marked coordinate points may be determined as a second coordinate point.
In some embodiments, determining the neighborhood of the first coordinate point according to the first distance and the reference prediction parameter may include: calculating the first distance between the first coordinate point and each second coordinate point according to their two-dimensional coordinates; sorting the first distances to obtain a sorting result; acquiring, from the sorting result, the target first distance corresponding to the reference prediction parameter, and determining the target first distance as the proximity distance of the first coordinate point corresponding to the reference prediction parameter; and drawing a circle with the first coordinate point as its center and the proximity distance as its radius, and determining the area within the circle as the neighborhood of the first coordinate point corresponding to the reference prediction parameter.
The first distance may be any one of a Euclidean distance, a Manhattan distance, a Chebyshev distance, a normalized Euclidean distance, a Hamming distance, a Mahalanobis distance, or the like. The reference prediction parameter is denoted $p$, the first coordinate point is denoted $m$, the point corresponding to the target first distance is denoted $n$, and the first distance between point $m$ and point $n$ is denoted $d(m, n)$. The proximity distance corresponding to the reference prediction parameter may be referred to as the $p$-th proximity distance, or simply the $p$-th distance. The $p$-th proximity distance of $m$ can be expressed as $d_p(m)$, i.e. $d_p(m) = d(m, n)$. The neighborhood corresponding to the reference prediction parameter may be referred to as the $p$-distance neighborhood, or the $p$-th neighborhood. The $p$-th neighborhood of $m$ is denoted $N_p(m)$; it consists of all points within the $p$-th distance of $m$, including the point $n$ that lies exactly at the $p$-th distance. As shown in fig. 6, denoting the number of points in the $p$-th neighborhood of $m$ as $|N_p(m)|$, we have $|N_p(m)| \geq p$; there are at least $p$ points $q$ such that $d(m, q) \leq d(m, n)$, and at most $p - 1$ points $q$ such that $d(m, q) < d(m, n)$.
It can be understood that the way the target first distance is acquired depends on how the first distances are sorted. As an example, let the reference prediction parameter be 10. When the first distances are sorted in ascending order, the 10th first distance in the sorting result is determined as the proximity distance of the first coordinate point corresponding to the reference prediction parameter. When the first distances are sorted in descending order, the 10th first distance counted from the end of the sorting result is determined as that proximity distance. This proximity distance may also be referred to as the 10th proximity distance.
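The neighborhood determination described above can be sketched in Python as follows. The sketch assumes the Euclidean distance, ascending sorting, and at least `p + 1` distinct marked points; the function name `p_neighborhood` and the sample points are illustrative only.

```python
import math

def p_neighborhood(points, i, p):
    """Return the p-th proximity distance of points[i] and the indices in its p-th neighborhood."""
    # First distances from points[i] to every other marked point, in ascending order.
    dists = sorted((math.dist(points[i], q), j) for j, q in enumerate(points) if j != i)
    d_p = dists[p - 1][0]                        # p-th smallest first distance
    inside = [j for d, j in dists if d <= d_p]   # points tied at distance d_p are all kept
    return d_p, inside

# Made-up coordinates; point 0 has two neighbors at the same distance 1.
points = [(0, 0), (1, 0), (0, 1), (2, 0), (5, 5)]
d_p, inside = p_neighborhood(points, 0, p=1)
```

Because ties at the $p$-th distance are kept, the neighborhood may contain more than `p` points, which matches the statement that the number of points in the neighborhood is at least the reference prediction parameter.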
Further, determining the first reachable distance from the first coordinate point to each second coordinate point in the neighborhood may include: acquiring, from the first distances, the target first distance from the first coordinate point to each second coordinate point in the neighborhood; determining the proximity distance, corresponding to the reference prediction parameter, of each second coordinate point in the neighborhood; for each second coordinate point in the neighborhood of the first coordinate point, comparing the target first distance with the proximity distance of that second coordinate point to obtain the larger of the two; and determining this maximum distance as the first reachable distance from the first coordinate point to the corresponding second coordinate point. The process of determining the proximity distance of each second coordinate point is the same as the process of determining the proximity distance of the first coordinate point; reference may be made to the related description above, which is not repeated here. That is, the first reachable distance from point $m$ to point $n$ can be expressed as $\mathrm{reach}_p(m, n) = \max\{d_p(n),\, d(m, n)\}$, where $\mathrm{reach}_p(m, n)$ represents the first reachable distance from point $m$ to point $n$, $d_p(n)$ represents the $p$-th proximity distance of point $n$, and $d(m, n)$ represents the first distance between point $m$ and point $n$.
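As a small illustration of the first reachable distance, the following Python sketch takes the larger of the first distance and the neighbor's own proximity distance. It assumes the Euclidean distance; the helper names `p_distance` and `reach_distance` and the sample points are hypothetical.

```python
import math

def p_distance(points, i, p):
    """p-th smallest first distance from points[i] to the other marked points."""
    others = sorted(math.dist(points[i], q) for j, q in enumerate(points) if j != i)
    return others[p - 1]

def reach_distance(points, m, n, p):
    """First reachable distance from point m to point n: max of n's p-distance and d(m, n)."""
    return max(p_distance(points, n, p), math.dist(points[m], points[n]))

points = [(0, 0), (1, 0), (0, 1), (2, 0), (5, 5)]
# d(1, 3) is 1, but the 2nd proximity distance of point 3 is 2, so the larger value is used.
r_13 = reach_distance(points, 1, 3, p=2)
r_01 = reach_distance(points, 0, 1, p=2)
```

Taking the maximum smooths out very small distances inside a dense cluster, so that the density computed from these values is less sensitive to individual close pairs.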
Further, in some embodiments, determining the first local reachable density of the first coordinate point according to the first reachable distance may include: an average value of each first reachable distance of the first coordinate point is determined, and the reciprocal of the average value is determined as the first local reachable density of the first coordinate point. The first coordinate point is noted as m, and the first local reachable density of the first coordinate point can be expressed as the following formula three.
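Formula three:
$$\mathrm{lrd}_p(m) = \frac{1}{\left( \sum_{o \in N_p(m)} \mathrm{reach}_p(m, o) \right) / |N_p(m)|}$$
Wherein $\mathrm{lrd}_p(m)$ represents the first local reachable density of $m$; $o$ is any second coordinate point within the $p$-th neighborhood $N_p(m)$; $\sum_{o \in N_p(m)} \mathrm{reach}_p(m, o)$ represents the sum of the first reachable distances from $m$ to the second coordinate points in its $p$-th neighborhood $N_p(m)$; $|N_p(m)|$ represents the total number of second coordinate points in the $p$-th neighborhood $N_p(m)$; that is, $\left( \sum_{o \in N_p(m)} \mathrm{reach}_p(m, o) \right) / |N_p(m)|$ represents the average value of the first reachable distances of the first coordinate point.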
Considering that the first storage units generating correctable errors are not uniformly distributed in the two-dimensional space, the first reachable distances of each first coordinate point within the neighborhood corresponding to the reference prediction parameter are determined, and the first local reachable density of the first coordinate point is determined from these first reachable distances. Based on the first local reachable density, the health state of the memory can be accurately predicted from the perspective of a local area.
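The first local reachable density, i.e. the reciprocal of the average first reachable distance, can be sketched in Python as below. The sketch assumes the Euclidean distance and distinct marked points (so that the sum of reachable distances is non-zero); the function names and sample coordinates are illustrative.

```python
import math

def neighbors(points, i, p):
    """Return (p-th proximity distance, indices in the p-th neighborhood) of points[i]."""
    dists = sorted((math.dist(points[i], q), j) for j, q in enumerate(points) if j != i)
    d_p = dists[p - 1][0]
    return d_p, [j for d, j in dists if d <= d_p]

def local_reachable_density(points, m, p):
    """Reciprocal of the average first reachable distance from m to its neighbors."""
    _, nbrs = neighbors(points, m, p)
    reach = [max(neighbors(points, n, p)[0], math.dist(points[m], points[n])) for n in nbrs]
    return len(reach) / sum(reach)   # equals 1 / (sum(reach) / len(reach))

# Four clustered points and one far-away point (made-up data).
points = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10)]
dense = local_reachable_density(points, 0, p=2)
sparse = local_reachable_density(points, 4, p=2)
```

In this example the clustered point has a density of 1.0, while the isolated point has a density well below 0.1, which is the contrast the later steps rely on.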
Step a12, determining the discrete information of each first storage unit according to the first local distribution information.
Specifically, a first average local reachable density, i.e. the average of the first local reachable densities of the coordinate points in the neighborhood of the first coordinate point, is determined; a first local relative density of the first coordinate point is determined according to the first average local reachable density and the first local reachable density of the first coordinate point; the discrete information of the first storage unit corresponding to the first coordinate point includes the first local relative density.
It will be appreciated that, before step a12, since each coordinate point of the mark in the two-dimensional space is determined as the first coordinate point in turn, the first local reachable density of each coordinate point of the mark in the two-dimensional space can be obtained. Therefore, in step a12, for each first coordinate point, a first average local reachable density of the first local reachable densities of each coordinate point in the neighborhood corresponding to the reference prediction parameter of the first coordinate point may be determined. In some embodiments, the first local relative density of the first coordinate point may be determined according to the following equation four.
Formula four:
$$\mathrm{rel}_p(m) = \frac{\left( \sum_{f \in N_p(m)} \mathrm{lrd}_p(f) \right) / |N_p(m)|}{\mathrm{lrd}_p(m)}$$
Wherein $\mathrm{rel}_p(m)$ represents the first local relative density of the first coordinate point $m$; $f$ is any coordinate point in the $p$-th neighborhood $N_p(m)$, that is, $f$ may be the first coordinate point $m$ or a second coordinate point in the $p$-th neighborhood $N_p(m)$ of $m$; $\mathrm{lrd}_p(f)$ represents the first local reachable density of any coordinate point within the $p$-th neighborhood $N_p(m)$; $\left( \sum_{f \in N_p(m)} \mathrm{lrd}_p(f) \right) / |N_p(m)|$ represents the first average local reachable density of the first local reachable densities of the coordinate points in the $p$-th neighborhood $N_p(m)$; and $\mathrm{lrd}_p(m)$ represents the first local reachable density of the first coordinate point $m$.
The first storage units generating correctable errors are not uniformly distributed in the two-dimensional space. Even within a local range there may be discretely distributed first storage units that lie far from the other first storage units, which lowers the value of their first local reachable density. However, such discretely distributed first storage units often have little or no effect on the health of the memory. Therefore, to adapt to first storage units that are unevenly distributed with varying density, in some embodiments the first local relative density of the coordinate point corresponding to each first storage unit may be determined, so as to measure more accurately whether each coordinate point is a discrete point and thereby determine the target storage unit more accurately.
Step a2, determining a target storage unit to be predicted according to the discrete information.
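The first local relative density can be sketched in Python as the ratio between the average density of the neighboring points and the density of the point itself. The sketch below assumes the Euclidean distance and averages over the neighboring second coordinate points only, as in the standard local outlier factor; whether the point itself is also counted in the average is left open here. All function names and sample coordinates are illustrative.

```python
import math

def neighbors(points, i, p):
    """Return (p-th proximity distance, indices in the p-th neighborhood) of points[i]."""
    dists = sorted((math.dist(points[i], q), j) for j, q in enumerate(points) if j != i)
    d_p = dists[p - 1][0]
    return d_p, [j for d, j in dists if d <= d_p]

def lrd(points, m, p):
    """First local reachable density: reciprocal of the average first reachable distance."""
    _, nbrs = neighbors(points, m, p)
    reach = [max(neighbors(points, n, p)[0], math.dist(points[m], points[n])) for n in nbrs]
    return len(reach) / sum(reach)

def local_relative_density(points, m, p):
    """Average density of m's neighbors divided by the density of m itself."""
    _, nbrs = neighbors(points, m, p)
    average = sum(lrd(points, f, p) for f in nbrs) / len(nbrs)
    return average / lrd(points, m, p)

points = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10)]   # made-up data
clustered = local_relative_density(points, 0, p=2)
isolated = local_relative_density(points, 4, p=2)
```

A value close to 1 means the point is about as dense as its neighbors, while a value much greater than 1 means the point is far sparser than its neighbors and is therefore a candidate discrete point.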
Specifically, it is determined whether the first local relative density is greater than a density threshold; if not, the first storage unit corresponding to the first local relative density is determined as a target storage unit to be predicted. A first local relative density greater than the density threshold indicates that the local reachable density of the corresponding coordinate point is lower than that of its neighboring points, that is, the coordinate point is a discrete point that has very little influence on the health of the memory and can therefore be discarded.
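The selection in step a2 amounts to a simple filter, sketched below in Python. The relative density values and the threshold of 1.5 are made-up numbers for illustration; the embodiment does not specify a threshold value.

```python
def select_target_units(relative_density, density_threshold):
    """Keep the first storage units whose first local relative density does not exceed the threshold."""
    return [unit for unit, value in relative_density.items() if value <= density_threshold]

# Hypothetical first local relative densities keyed by two-dimensional coordinates.
relative_density = {(0, 0): 1.0, (1, 0): 1.0, (10, 10): 13.2}
targets = select_target_units(relative_density, density_threshold=1.5)
```

Here the point at (10, 10) exceeds the threshold and is discarded as a discrete point, while the two clustered points are kept as target storage units.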
In step S402, the target storage unit is divided into at least one storage unit set.
In some embodiments, the first distances that are less than a preset distance may be determined, and the target storage units corresponding to those first distances may be divided into one storage unit set.
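One possible reading of this division is sketched below in Python: target storage units are grouped transitively, so that two units end up in the same set when they are linked by a chain of first distances below the preset distance. The transitive grouping, the Euclidean distance, the function name `divide_into_sets` and the sample data are all assumptions made for illustration.

```python
import math

def divide_into_sets(units, preset_distance):
    """Group units whose pairwise first distance is below preset_distance (transitively)."""
    unvisited = set(range(len(units)))
    groups = []
    while unvisited:
        stack = [unvisited.pop()]          # start a new storage unit set
        group = []
        while stack:
            i = stack.pop()
            group.append(units[i])
            near = [j for j in unvisited
                    if math.dist(units[i], units[j]) < preset_distance]
            for j in near:                 # pull every close, not yet grouped unit into the set
                unvisited.remove(j)
                stack.append(j)
        groups.append(sorted(group))
    return sorted(groups)

units = [(0, 0), (1, 0), (0, 1), (8, 8), (9, 8)]   # made-up target storage units
unit_sets = divide_into_sets(units, preset_distance=2)
```

With these sample points the result is two storage unit sets, one around the origin and one around (8, 8).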
Step S403, determining a storage unit distribution characteristic of each storage unit set based on the reference prediction parameter.
Step S404, performing first fault prediction processing according to the distribution characteristics of the storage units to obtain a first fault prediction result.
The specific implementation manner of step S402 to step S404 may be referred to the related description above, and the repetition is not repeated here.
In the memory fault prediction method provided by the embodiment, a first local reachable density is determined by determining a first reachable distance of each coordinate point marked in a two-dimensional space and based on the first reachable distance, so as to obtain first local distribution information of a corresponding first storage unit; and then determining the first local relative density according to the first local distribution information to obtain first discrete information of the first storage unit. Whether the coordinate point corresponding to the corresponding first storage unit is a discrete point can be accurately determined based on the first discrete information, so that the accuracy of the determined target storage unit is improved.
In this embodiment, a memory failure prediction method is provided, which may be used in the above-mentioned computer devices, such as a mobile phone, a tablet computer, a desktop computer or a server. Fig. 7 is a flowchart of a memory failure prediction method according to an embodiment of the present invention. As shown in fig. 7, the flow includes the following steps:
Step S701, determining a target storage unit to be predicted among a plurality of storage units included in the memory to be tested according to the reference prediction parameter and first correctable error data of the memory to be tested; the reference prediction parameter characterizes a prediction range for the storage units.
In step S702, the target storage unit is divided into at least one storage unit set.
In step S703, storage unit distribution characteristics of each storage unit set are determined based on the reference prediction parameters.
In some embodiments, step S703 may include the following steps S7031 and S7032:
Step S7031, determining second local distribution information of each storage unit in the storage unit set according to the reference prediction parameter and the two-dimensional coordinates of each storage unit in the storage unit set in the two-dimensional space.
Wherein the second local distribution information comprises a second local reachable density. The determining manner of the second local distribution information is the same as that of the first local distribution information, and reference may be made to the foregoing related description, and the details are not repeated here.
Step S7032, determining storage unit distribution characteristics of each storage unit set according to the second local distribution information.
Specifically, for each storage unit set, the second average local reachable density of the storage unit set is determined according to the second local reachable density of each storage unit in the set and the first number of storage units in the set; and the second average local reachable density is determined as the storage unit distribution characteristic of the storage unit set.
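As a brief illustration, the distribution characteristic of each storage unit set can be computed as a plain average, as in the following Python sketch. The set names and density values are made up, and the function name `set_distribution_features` is hypothetical.

```python
def set_distribution_features(second_lrd_by_set):
    """Map each storage unit set to the average of its second local reachable densities."""
    return {
        name: sum(values) / len(values)   # sum of densities divided by the first number
        for name, values in second_lrd_by_set.items()
    }

# Hypothetical second local reachable densities for two storage unit sets.
second_lrd_by_set = {"b1": [0.5, 1.0, 1.5], "b2": [0.5, 0.5]}
features = set_distribution_features(second_lrd_by_set)
```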
Step S704, performing first fault prediction processing according to the storage unit distribution characteristics to obtain a first fault prediction result.
In some embodiments, step S704 may include the following steps S7041 to S7043:
In step S7041, the largest of the second average local reachable densities included in the storage unit distribution characteristics of the storage unit sets is determined as the first target average local reachable density.
Specifically, the second average local reachable densities of the storage unit sets are compared to obtain the maximum first target average local reachable density.
In step S7042, the target density range to which the first target average local reachable density belongs is determined among a plurality of preset density ranges.
In order to accurately measure the health status of the memory to be measured, in some embodiments, a plurality of fault detection results of different types, including health, sub-health, faults, etc. are preset. Each fault detection result corresponds to a density range. After the first target average local reachable density is determined, the first target average local reachable density is matched with each preset density range, and the preset density range successfully matched is determined as the target density range to which the first target average local reachable density belongs.
Step S7043, determining a first failure prediction result according to the target density range.
Specifically, the fault detection result corresponding to the target density range is determined as the first fault prediction result. In some embodiments, the preset density ranges are defined with respect to a reference local reachable density that is determined based on the reference prediction parameter. If the first target average local reachable density falls within the density range corresponding to the healthy state, a first fault prediction result representing that the memory to be tested is in a healthy state is obtained; if it falls within the density range corresponding to the sub-health state, a first fault prediction result representing that the memory to be tested is in a sub-health state is obtained; if it falls within the density range corresponding to the fault state, a first fault prediction result representing that the memory to be tested is about to fail or has failed is obtained.
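Steps S7041 to S7043 can be sketched in Python as follows. The boundaries of the preset density ranges are not given in this text, so the sketch assumes, purely for illustration, that they are multiples of the reference local reachable density (`sub_health_factor` and `fault_factor` are hypothetical parameters with made-up defaults).

```python
def first_fault_prediction(set_features, reference_density,
                           sub_health_factor=1.0, fault_factor=2.0):
    """Match the largest set density against three assumed density ranges."""
    target = max(set_features)   # first target average local reachable density
    if target < sub_health_factor * reference_density:
        return "healthy"
    if target < fault_factor * reference_density:
        return "sub-healthy"
    return "fault"               # about to fail or already failed

# Hypothetical second average local reachable densities of two storage unit sets.
result = first_fault_prediction([0.2, 1.5], reference_density=1.0)
```

Only the densest storage unit set decides the outcome, which reflects the idea that the most concentrated group of correctable errors carries the highest risk.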
In practical applications there may be a plurality of target devices. So that an administrator can intuitively see the health state of the memory to be tested of each target device and take corresponding measures, in some embodiments each first fault prediction result may be displayed in a different color according to its type. As an example, when the first fault prediction result represents that the memory to be tested is in a healthy state, it may be displayed in green, and the administrator does not need to pay attention to it; when the first fault prediction result represents that the memory to be tested is in a sub-health state, it may be displayed in orange, and the administrator needs to track and observe it; when the first fault prediction result represents that the memory to be tested is about to fail or has failed, it may be displayed in red, and the administrator replaces or otherwise handles the corresponding memory to be tested, so as to prevent the corresponding target device from going down or restarting due to the memory failure and thereby causing data loss, service interruption and the like.
In the memory failure prediction method provided by this embodiment, the second local distribution information of each storage unit in each storage unit set is determined, and the storage unit distribution characteristic of each storage unit set is determined according to the second local reachable density included in the second local distribution information, so that the local distribution of the storage units is captured. Furthermore, since the largest first target average local reachable density among the storage unit distribution characteristics indicates that the target storage units in the corresponding storage unit set are the most concentrated and that the risk of ECC error correction failure is the greatest, matching the first target average local reachable density against the plurality of preset density ranges ensures the accuracy of the first fault prediction result and realizes accurate prediction of failures of the memory to be tested.
In this embodiment, a memory failure prediction method is provided, which may be used in the above-mentioned computer devices, such as a mobile phone, a tablet computer, a desktop computer or a server. Fig. 8 is a flowchart of a memory failure prediction method according to an embodiment of the present invention. As shown in fig. 8, the flow includes the following steps:
Step S801, obtaining the initial prediction parameters to be optimized.
In some embodiments, the initial prediction parameters may be initialized at any time.
Step S802, optimizing the initial prediction parameters according to a preset optimization rule to obtain reference prediction parameters.
In some embodiments, step S802 may include the following steps S8021 to S8025:
In step S8021, second correctable error data of the target memory and real fault data corresponding to the second correctable error data are obtained.
Specifically, log data of the device where the target memory exists is obtained, and second correctable error data is obtained from the log data. And acquiring real fault data corresponding to the second correctable error data from the work order system. The real fault data in the work order system are fault data recorded manually according to the actually-occurring memory faults.
Step S8022, performing a second fault prediction process by using the initial prediction parameters and the second correctable error data to obtain a second fault prediction result.
Specifically, input data is determined according to the initial prediction parameters and the second correctable error data; and inputting the input data into a fault prediction model to be trained for prediction processing, and obtaining a second fault prediction result.
Wherein determining the input data based on the initial prediction parameters and the second correctable error data may include: and determining a second target average local reachable density according to the initial prediction parameters and the second correctable error data, and determining the second target average local reachable density as input data. The determination method of the average local reachable density of the second target is the same as that of the first target, and the above description is referred to, and the repetition is omitted here.
The fault prediction model may perform the fault prediction processing using any one of prediction algorithms such as logistic regression, decision trees, support vector machines and random forests. For the specific implementation of these prediction algorithms, reference may be made to the related art, which is not described in detail in the present disclosure. The second fault prediction result represents whether a memory fault will occur in the target memory within a preset time period.
Step S8023, determining whether the initial prediction parameters meet the optimization conditions according to the second fault prediction result and the real fault data.
Specifically, the second fault prediction result and the real fault data are compared. If the second fault prediction result indicates that the target memory will fail within a future preset period and the real fault data indicate that a memory fault did occur within that period, the second fault prediction result is marked as a positive correlation. If the second fault prediction result indicates that the target memory will fail within the future preset period but the real fault data indicate that no memory fault occurred within that period, the second fault prediction result is marked as a negative correlation. If the second fault prediction result indicates that the target memory is in a healthy state within the future preset period but the real fault data indicate that a memory fault occurred within that period, the second fault prediction result is likewise marked as a negative correlation. When the second fault prediction result is marked as a negative correlation, it is determined that the initial prediction parameter satisfies the optimization condition.
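The marking rule above can be written as a small Python function. The three cases listed in the text are covered directly; the remaining case, in which a healthy prediction coincides with no real fault, is not described and is treated here as a positive correlation by assumption. The function names are hypothetical.

```python
def correlation_label(predicted_fault: bool, actual_fault: bool) -> str:
    """Label a second fault prediction result by comparing it with the real fault data."""
    # Agreement is labelled positive; the healthy/no-fault case is an assumption.
    return "positive" if predicted_fault == actual_fault else "negative"

def satisfies_optimization_condition(predicted_fault: bool, actual_fault: bool) -> bool:
    """The initial prediction parameter needs further optimization on a negative correlation."""
    return correlation_label(predicted_fault, actual_fault) == "negative"

labels = [
    correlation_label(True, True),     # predicted fault, fault occurred
    correlation_label(True, False),    # predicted fault, no fault occurred
    correlation_label(False, True),    # predicted healthy, fault occurred
]
```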
Step S8024, if the optimization condition is met, optimizing the initial prediction parameters, and performing a second fault prediction process according to the optimized initial prediction parameters and the second correctable error data until it is determined that the initial prediction parameters do not meet the optimization condition.
Specifically, when the optimization condition is satisfied, the value of the initial prediction parameter is adjusted, and according to the adjusted initial prediction parameter and the second correctable error data, the second fault prediction processing is performed according to the foregoing manner until it is determined that the initial prediction parameter does not satisfy the optimization condition.
In step S8025, if the optimization condition is not satisfied, the initial prediction parameter when the optimization condition is not satisfied is determined as the reference prediction parameter.
When the optimization condition is not satisfied, the initial prediction parameter at that moment is the optimal prediction parameter, with which the prediction accuracy is highest. Fig. 9 shows the variation with the initial prediction parameter when the fault prediction model performs fault prediction based on the logistic regression, decision tree, support vector machine and random forest algorithms. It can be seen that, whichever prediction algorithm is adopted, the positive correlation rate of the second fault prediction result, that is, the accuracy of the fault prediction, is highest when the initial prediction parameter takes a suitable value.
Step S803, determining a target storage unit to be predicted among a plurality of storage units included in the memory to be tested according to the reference prediction parameter and the first correctable error data of the memory to be tested.
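The loop formed by steps S8022 to S8025 can be sketched in Python as follows. The text does not state how the value of the initial prediction parameter is adjusted, so the sketch simply walks through a caller-supplied list of candidate values; `predict` stands in for the second fault prediction processing and is a hypothetical callable, as is the toy predictor used in the example.

```python
def optimize_reference_parameter(initial_p, candidates, predict, actual_fault):
    """Adjust the prediction parameter until the prediction agrees with the real fault data."""
    for p in [initial_p, *candidates]:
        # Agreement means the optimization condition is no longer satisfied.
        if predict(p) == actual_fault:
            return p                      # used as the reference prediction parameter
    raise ValueError("no candidate value ended the optimization")

# Toy stand-in for the fault prediction model: predicts a fault only when p >= 8.
toy_predict = lambda p: p >= 8
reference_p = optimize_reference_parameter(5, range(6, 20), toy_predict, actual_fault=True)
```

In practice the comparison would be made over many records of second correctable error data rather than a single flag, for example by maximizing the positive correlation rate.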
In step S804, the target storage unit is divided into at least one storage unit set.
Step S805, determining a storage unit distribution characteristic of each storage unit set based on the reference prediction parameter.
Step S806, performing first fault prediction processing according to the distribution characteristics of the storage units to obtain a first fault prediction result.
The implementation manner of step S803 to step S806 may refer to the foregoing related description, and the repetition is not repeated here.
In the memory failure prediction method provided by this embodiment, the initial prediction parameters are optimized to obtain the reference prediction parameters, and the failure prediction processing is performed on the memory to be tested based on the reference prediction parameters, so that the accuracy of the first failure prediction result is ensured.
This embodiment also provides a memory failure prediction device, which is used for implementing the foregoing embodiments and preferred embodiments; what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. While the means described in the following embodiments are preferably implemented in software, implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
The present embodiment provides a memory failure prediction apparatus, as shown in fig. 10, including:
A first determining module 1001, configured to determine a target storage unit to be predicted from a plurality of storage units included in the memory to be tested according to the reference prediction parameter and the first correctable error data of the memory to be tested;
a dividing module 1002, configured to divide the target storage unit into at least one storage unit set;
A second determining module 1003, configured to determine a storage unit distribution characteristic of each storage unit set based on the reference prediction parameter;
And the prediction module 1004 is configured to perform a first failure prediction process according to the storage unit distribution characteristics, so as to obtain a first failure prediction result.
In some alternative embodiments, the first determining module 1001 includes:
The mapping unit is used for mapping the address information of each storage unit included in the memory to be tested from the three-dimensional space to the two-dimensional space;
A marking unit, configured to mark at least one first storage unit corresponding to the first correctable error data in a two-dimensional space;
The first determining unit is used for determining the target storage unit to be predicted according to the reference prediction parameters and the two-dimensional coordinates of each first storage unit in the two-dimensional space.
In some alternative embodiments, the mapping unit is specifically configured to:
Determining address information of each storage unit in a three-dimensional space corresponding to the memory to be tested;
Determining the two-dimensional coordinates of each storage unit according to the address information;
address information is mapped from the stereoscopic space to the two-dimensional space according to each two-dimensional coordinate.
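The mapping performed by the mapping unit can be sketched as follows. This is only a minimal illustration: it assumes that the address in the stereoscopic space is a (bank, row, column) triple and that banks are laid out side by side along the x axis. The names `CellAddress`, `to_2d` and `columns_per_bank` are hypothetical and do not come from the embodiment, which may use a different mapping rule.

```python
from typing import NamedTuple, Tuple


class CellAddress(NamedTuple):
    """Hypothetical address of a storage unit in the stereoscopic (3D) space."""
    bank: int
    row: int
    column: int


def to_2d(addr: CellAddress, columns_per_bank: int) -> Tuple[int, int]:
    """Flatten a 3D address into a 2D coordinate.

    Assumption: banks are placed next to each other along the x axis,
    so x = bank * columns_per_bank + column and y = row.
    """
    x = addr.bank * columns_per_bank + addr.column
    y = addr.row
    return (x, y)


# Usage: a unit in bank 2, row 5, column 3 of a memory with 1024 columns per bank.
print(to_2d(CellAddress(bank=2, row=5, column=3), columns_per_bank=1024))
```

Under this assumed layout, two storage units that are close in the same bank remain close in the two-dimensional space, which is what the later density calculations rely on.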
In some alternative embodiments, the marking unit is specifically configured to:
Acquiring target address information of at least one first storage unit generating a correctable error in a stereoscopic space from first correctable error data;
Determining target two-dimensional coordinates of each first storage unit according to the target address information;
and marking coordinate points corresponding to the two-dimensional coordinates of the target in the two-dimensional space.
In some alternative embodiments, the marking unit is specifically configured to:
Acquiring first correctable error data of the memory to be tested at intervals of a first preset time interval;
Determining whether an unlabeled candidate first storage unit exists in at least one first storage unit corresponding to the first correctable error data;
if yes, the candidate first storage unit is marked in the two-dimensional space.
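The periodic marking described above amounts to an incremental update of the set of marked coordinate points. A minimal sketch, assuming the marked points are kept in a Python set and that each collection of correctable error data has already been converted to two-dimensional coordinates (the function name `mark_new_cells` is hypothetical):

```python
from typing import Iterable, Set, Tuple

Point = Tuple[int, int]


def mark_new_cells(marked: Set[Point], reported: Iterable[Point]) -> Set[Point]:
    """Mark the storage units that reported a correctable error and are not yet marked.

    `marked` is updated in place; the newly marked candidates are returned.
    """
    candidates = set(reported) - marked  # unmarked candidate first storage units
    marked |= candidates                 # mark them in the two-dimensional space
    return candidates


# Usage: one unit is already marked, the new collection reports two units.
marked_points = {(1, 1)}
newly_marked = mark_new_cells(marked_points, [(1, 1), (2, 3)])
print(newly_marked, marked_points)
```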
In some alternative embodiments, the first determining unit includes:
the first determining subunit is used for determining the discrete information of each first storage unit according to the reference prediction parameters and the two-dimensional coordinates of each first storage unit in the two-dimensional space;
and the second determination subunit is used for determining the target storage unit to be predicted according to the discrete information.
In some alternative embodiments, the first determining subunit is specifically configured to:
According to the two-dimensional coordinates of each first storage unit in the two-dimensional space, determining first local distribution information of each first storage unit in a neighborhood corresponding to the reference prediction parameter;
and determining the discrete information of each first storage unit according to the first local distribution information.
In some alternative embodiments, the first determining subunit is further specifically configured to:
determining a current first coordinate point and a current second coordinate point; the second coordinate point is a coordinate point except the first coordinate point in the coordinate points corresponding to the two-dimensional coordinates;
determining a first distance between the first coordinate point and each second coordinate point;
Determining the neighborhood of the first coordinate point according to the first distance and the reference prediction parameter;
determining a first reachable distance from the first coordinate point to each second coordinate point in the neighborhood;
Determining a first local reachable density of the first coordinate point according to the first reachable distance;
the first local distribution information of the first storage unit corresponding to the first coordinate point includes a first local reachable density.
In some alternative embodiments, the first determining subunit is further specifically configured to:
Acquiring a target first distance from a first coordinate point to each second coordinate point in the neighborhood from the first distance;
determining, for each second coordinate point in the neighborhood, the adjacent distance corresponding to the reference prediction parameter;
comparing the target first distance of each second coordinate point with the adjacent distance of that second coordinate point to obtain a maximum distance;
the maximum distance is determined as the first reachable distance of the first coordinate point to the corresponding second coordinate point.
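The neighborhood, reachable distance and local reachable density steps above resemble the quantities used in the Local Outlier Factor technique. The following sketch rests on that reading and on two assumptions that the text does not state explicitly: the reference prediction parameter is a neighbor count `k`, and distances are Euclidean. All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def k_distance(points: Sequence[Point], i: int, k: int) -> float:
    """Adjacent distance: distance from points[i] to its k-th nearest other point."""
    dists = sorted(math.dist(points[i], q) for j, q in enumerate(points) if j != i)
    return dists[k - 1]  # assumes len(points) > k


def neighborhood(points: Sequence[Point], i: int, k: int) -> List[int]:
    """Indices of all other points that lie within the k-distance of points[i]."""
    kd = k_distance(points, i, k)
    return [j for j, q in enumerate(points)
            if j != i and math.dist(points[i], q) <= kd]


def reach_distance(points: Sequence[Point], p: int, o: int, k: int) -> float:
    """Reachable distance from p to o: the larger of d(p, o) and the k-distance of o."""
    return max(k_distance(points, o, k), math.dist(points[p], points[o]))


def local_reachable_density(points: Sequence[Point], p: int, k: int) -> float:
    """Inverse of the mean reachable distance from p to the points in its neighborhood."""
    nbrs = neighborhood(points, p, k)
    # Marked coordinate points are distinct, so the sum below is never zero.
    return len(nbrs) / sum(reach_distance(points, p, o, k) for o in nbrs)


# Usage: four clustered points and one isolated point.
pts = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10)]
print(local_reachable_density(pts, 0, k=2))  # point inside the cluster
print(local_reachable_density(pts, 4, k=2))  # isolated point, lower density
```

Taking the maximum in `reach_distance` smooths out very small distances inside dense regions, so that one close pair of points does not dominate the density estimate.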
In some alternative embodiments, the first determining subunit is further specifically configured to:
determining a first average local reachable density of a first local reachable density of each coordinate point in a neighborhood of the first coordinate point;
determining a first local relative density of the first coordinate point according to the first average local reachable density and the first local reachable density of the first coordinate point;
The discrete information of the first memory cell corresponding to the first coordinate point includes a first local relative density.
In some alternative embodiments, the second determining subunit is further specifically configured to:
Determining whether the first local relative density is greater than a density threshold;
If not, determining the first storage unit corresponding to the first local relative density as a target storage unit to be predicted.
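The selection of target storage units can be sketched as below. The text only says that the local relative density is determined from the average local reachable density of the neighborhood and the point's own local reachable density; this sketch assumes the ratio used by the Local Outlier Factor technique (neighborhood average divided by the point's own value), so that isolated points get a large value and are filtered out. The inputs `lrd` and `neighbors` are assumed to be precomputed, and all names are hypothetical.

```python
from typing import Dict, Hashable, List, Sequence


def local_relative_density(p: Hashable,
                           lrd: Dict[Hashable, float],
                           neighbors: Dict[Hashable, Sequence[Hashable]]) -> float:
    """Assumed ratio: average density of p's neighborhood divided by p's own density."""
    avg = sum(lrd[o] for o in neighbors[p]) / len(neighbors[p])
    return avg / lrd[p]


def select_targets(lrd: Dict[Hashable, float],
                   neighbors: Dict[Hashable, Sequence[Hashable]],
                   density_threshold: float) -> List[Hashable]:
    """Keep the units whose local relative density is not greater than the threshold."""
    return [p for p in lrd
            if local_relative_density(p, lrd, neighbors) <= density_threshold]


# Usage: units 0 and 1 sit in a dense region, unit 2 is isolated.
lrd_values = {0: 1.0, 1: 1.0, 2: 0.1}
nbrs = {0: [1], 1: [0], 2: [0, 1]}
print(select_targets(lrd_values, nbrs, density_threshold=1.5))
```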
In some alternative embodiments, the second determining module 1003 includes:
the second determining unit is used for determining second local distribution information of each target storage unit in the storage unit set according to the reference prediction parameters and the two-dimensional coordinates of each target storage unit in the storage unit set in the two-dimensional space;
and the third determining unit is used for determining the storage unit distribution characteristics of each storage unit set according to the second local distribution information.
In some alternative embodiments, the second local distribution information comprises a second local reachable density, and the storage unit distribution feature comprises a second average local reachable density of the second local reachable density; the prediction module 1004 includes:
a fourth determining unit, configured to determine the largest of the second average local reachable densities as a first target average local reachable density;
a fifth determining unit, configured to determine, among a plurality of preset density ranges, the target density range to which the first target average local reachable density belongs;
And a sixth determining unit for determining the first failure prediction result according to the target density range.
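The three determining units of the prediction module can be sketched as a lookup of the largest per-set average density in a list of preset ranges. The range boundaries and the result labels below are invented for illustration only; the text does not give concrete values.

```python
from bisect import bisect_right
from statistics import mean
from typing import List, Sequence


def predict_fault(set_densities: Sequence[Sequence[float]],
                  bounds: List[float],
                  results: List[str]) -> str:
    """Map the largest per-set average local reachable density to a prediction result.

    `bounds` are the ascending boundaries between the preset density ranges,
    so `results` must contain len(bounds) + 1 entries.
    """
    averages = [mean(d) for d in set_densities]  # distribution feature of each set
    target = max(averages)                       # first target average density
    return results[bisect_right(bounds, target)]  # range the target falls into


# Usage with hypothetical ranges: [0, 0.5), [0.5, 1.0), [1.0, inf).
labels = ["low risk", "medium risk", "high risk"]
print(predict_fault([[0.2, 0.4], [0.9, 1.3]], bounds=[0.5, 1.0], results=labels))
```

Mapping a higher density to a higher risk label follows the observation in this embodiment that a more concentrated distribution of storage units with correctable errors increases the risk that ECC correction fails.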
In some optional embodiments, the apparatus further comprises an acquisition module and an optimization module:
The acquisition module is used for acquiring initial prediction parameters to be optimized;
And the optimization module is used for optimizing the initial prediction parameters according to a preset optimization rule to obtain the reference prediction parameters.
In some alternative embodiments, the optimization module includes:
The acquisition unit is used for acquiring second correctable error data of the target memory and real fault data corresponding to the second correctable error data;
The prediction unit is used for performing second fault prediction processing by using the initial prediction parameters and the second correctable error data to obtain a second fault prediction result;
a seventh determining unit, configured to determine whether the initial prediction parameter meets an optimization condition according to the second failure prediction result and the real failure data;
The optimizing unit is used for optimizing the initial prediction parameters if the optimizing conditions are met, and performing second fault prediction processing according to the optimized initial prediction parameters and second correctable error data until the initial prediction parameters are determined to not meet the optimizing conditions; if the optimization condition is not satisfied, determining the initial prediction parameter when the optimization condition is not satisfied as a reference prediction parameter.
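The loop carried out by the units of the optimization module can be sketched as follows. The optimization condition and the update rule are not specified in this passage, so the sketch assumes that the condition is "prediction accuracy against the real fault data is below a required minimum" and that optimizing means increasing an integer parameter by one. `predict_fn`, `min_accuracy` and `max_k` are hypothetical.

```python
from typing import Callable, List, Sequence


def optimize_parameter(initial_k: int,
                       predict_fn: Callable[[int, Sequence[int]], List[bool]],
                       ce_data: Sequence[int],
                       real_faults: Sequence[bool],
                       min_accuracy: float,
                       max_k: int) -> int:
    """Adjust the prediction parameter until the optimization condition is no longer met."""
    k = initial_k
    while k < max_k:
        predicted = predict_fn(k, ce_data)  # second fault prediction processing
        correct = sum(p == r for p, r in zip(predicted, real_faults))
        accuracy = correct / len(real_faults)
        if accuracy >= min_accuracy:        # optimization condition not satisfied
            break
        k += 1                              # assumed update rule
    return k                                # used as the reference prediction parameter


# Usage with a toy predictor: a unit is predicted faulty when its error count exceeds 5 - k.
toy_predict = lambda k, data: [x > 5 - k for x in data]
print(optimize_parameter(1, toy_predict, [1, 2, 3, 4],
                         [False, False, True, True], min_accuracy=1.0, max_k=10))
```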
In some alternative embodiments, the prediction unit is specifically configured to:
Determining input data based on the initial prediction parameters and the second correctable error data;
And inputting the input data into a fault prediction model to be trained for prediction processing, and obtaining a second fault prediction result.
According to the memory fault prediction device in the embodiment, a target storage unit to be predicted among a plurality of storage units included in a memory to be tested is determined according to a reference prediction parameter and first correctable error data of the memory to be tested; the target storage unit is divided into at least one storage unit set, and storage unit distribution characteristics of each storage unit set are determined based on the reference prediction parameter; and first fault prediction processing is performed according to the storage unit distribution characteristics to obtain a first fault prediction result. The more concentrated the distribution of storage units that produce correctable errors, the higher the risk of ECC correction failure generally is. Therefore, by determining the storage unit distribution characteristics of each storage unit set, accurate prediction of memory faults can be realized based on those characteristics, thereby reducing risks such as data loss and service interruption caused by memory faults.
Further functional descriptions of the above respective modules and units are the same as those of the above corresponding embodiments, and are not repeated here.
The memory failure prediction apparatus in this embodiment is presented in the form of functional units, where a unit refers to an ASIC (Application Specific Integrated Circuit), a processor and a memory that execute one or more software or firmware programs, and/or other devices that can provide the above functions.
The embodiment of the invention also provides computer equipment, which is provided with the memory failure prediction device shown in the figure 10.
Referring to fig. 11, fig. 11 is a schematic structural diagram of a computer device according to an alternative embodiment of the present invention. As shown in fig. 11, the computer device includes: one or more processors 10, a memory 20, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components are communicatively coupled to each other using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executed within the computer device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output device, such as a display device coupled to the interface. In some alternative embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories. Also, multiple computer devices may be connected, each providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). One processor 10 is taken as an example in fig. 11.
The processor 10 may be a central processor, a network processor, or a combination thereof. The processor 10 may further include a hardware chip. The hardware chip may be an application specific integrated circuit, a programmable logic device, or a combination thereof. The programmable logic device may be a complex programmable logic device, a field programmable gate array, generic array logic, or any combination thereof.
The memory 20 stores instructions executable by the at least one processor 10 to cause the at least one processor 10 to perform the methods shown in the above embodiments.
The memory 20 may include a program storage area and a data storage area, wherein the program storage area may store an operating system and at least one application program required for functions; the data storage area may store data created according to the use of the computer device, etc. In addition, the memory 20 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some alternative embodiments, memory 20 may optionally include memory located remotely from processor 10, which may be connected to the computer device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Memory 20 may include volatile memory, such as random access memory; the memory may also include non-volatile memory, such as flash memory, hard disk, or solid state disk; the memory 20 may also comprise a combination of the above types of memories.
The computer device further comprises an input device 30 and an output device 40. The processor 10, memory 20, input device 30, and output device 40 may be connected by a bus or in other ways; connection by a bus is taken as an example in fig. 11.
The input device 30 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the computer device; examples include a touch screen, a keypad, a mouse, a trackpad, a touchpad, a pointer stick, one or more mouse buttons, a trackball, a joystick, and the like. The output device 40 may include a display device, auxiliary lighting devices (e.g., LEDs), tactile feedback devices (e.g., vibration motors), and the like. Such display devices include, but are not limited to, liquid crystal displays, light emitting diode displays, and plasma displays. In some alternative implementations, the display device may be a touch screen.
The embodiments of the present invention also provide a computer readable storage medium. The method according to the embodiments of the present invention described above may be implemented in hardware or firmware, or as computer code that can be recorded on a storage medium, or as computer code that is originally stored in a remote storage medium or a non-transitory machine readable storage medium, downloaded through a network, and stored in a local storage medium, so that the method described herein can be processed by such software stored on a storage medium using a general purpose computer, a special purpose processor, or programmable or special purpose hardware. The storage medium can be a magnetic disk, an optical disk, a read-only memory, a random access memory, a flash memory, a hard disk, a solid state disk or the like; further, the storage medium may also comprise a combination of memories of the kinds described above. It will be appreciated that a computer, processor, microprocessor controller or programmable hardware includes a storage element that can store or receive software or computer code that, when accessed and executed by the computer, processor or hardware, implements the methods illustrated by the above embodiments.
Portions of the present invention may be implemented as a computer program product, such as computer program instructions, which when executed by a computer, may invoke or provide methods and/or aspects in accordance with the present invention by way of operation of the computer. Those skilled in the art will appreciate that the form of computer program instructions present in a computer readable medium includes, but is not limited to, source files, executable files, installation package files, etc., and accordingly, the manner in which the computer program instructions are executed by a computer includes, but is not limited to: the computer directly executes the instruction, or the computer compiles the instruction and then executes the corresponding compiled program, or the computer reads and executes the instruction, or the computer reads and installs the instruction and then executes the corresponding installed program. Herein, a computer-readable medium may be any available computer-readable storage medium or communication medium that can be accessed by a computer.
Although embodiments of the present invention have been described in connection with the accompanying drawings, various modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope of the invention as defined by the appended claims.

Claims (20)

1. A memory failure prediction method, the method comprising:
Determining a target storage unit to be predicted in a plurality of storage units included in a memory to be detected according to a reference prediction parameter and first correctable error data of the memory to be detected; the reference prediction parameters characterize a prediction range of the storage unit;
dividing the target storage unit into at least one storage unit set;
determining storage unit distribution characteristics of each storage unit set based on the reference prediction parameters;
And performing first fault prediction processing according to the storage unit distribution characteristics to obtain a first fault prediction result.
2. The method of claim 1, wherein determining a target memory location to be predicted from a plurality of memory locations included in the memory to be predicted based on the reference prediction parameter and the first correctable error data of the memory to be predicted comprises:
mapping address information of each storage unit included in the memory to be tested from a three-dimensional space to a two-dimensional space;
Marking at least one first storage unit corresponding to the first correctable error data in the two-dimensional space;
and determining a target storage unit to be predicted according to the reference prediction parameters and the two-dimensional coordinates of each first storage unit in the two-dimensional space.
3. The method according to claim 2, wherein mapping address information of a memory cell included in the memory to be tested from a stereoscopic space to a two-dimensional space comprises:
determining address information of each storage unit in a stereoscopic space corresponding to the memory to be tested;
determining the two-dimensional coordinates of each storage unit according to the address information;
And mapping the address information from a stereoscopic space to a two-dimensional space according to each two-dimensional coordinate.
4. A method according to claim 3, wherein said marking at least one first memory location corresponding to said first correctable error data in said two-dimensional space comprises:
Acquiring target address information of at least one first storage unit generating a correctable error in the stereoscopic space from the first correctable error data;
Determining target two-dimensional coordinates of each first storage unit according to the target address information;
and marking coordinate points corresponding to the target two-dimensional coordinates in the two-dimensional space.
5. The method of claim 2, wherein marking at least one first memory location corresponding to the first correctable error data in the two-dimensional space comprises:
Acquiring first correctable error data of the memory to be tested at intervals of a first preset time interval;
determining whether an unlabeled candidate first storage unit exists in at least one first storage unit corresponding to the first correctable error data;
If yes, marking the candidate first storage units in the two-dimensional space.
6. The method according to claim 2, wherein determining the target storage unit to be predicted based on the reference prediction parameters and the two-dimensional coordinates of each of the first storage units in the two-dimensional space comprises:
Determining discrete information of each first storage unit according to the reference prediction parameters and the two-dimensional coordinates of each first storage unit in the two-dimensional space;
And determining a target storage unit to be predicted according to the discrete information.
7. The method of claim 6, wherein determining the discrete information for each of the first storage units based on the reference prediction parameters and the two-dimensional coordinates of each of the first storage units in the two-dimensional space comprises:
Determining first local distribution information of each first storage unit in a neighborhood corresponding to the reference prediction parameter according to the two-dimensional coordinates of each first storage unit in the two-dimensional space;
and determining discrete information of each first storage unit according to the first local distribution information.
8. The method of claim 7, wherein determining the first local distribution information of each of the first storage units in the neighborhood corresponding to the reference prediction parameter according to the two-dimensional coordinates of each of the first storage units in the two-dimensional space comprises:
determining a current first coordinate point and a current second coordinate point; the second coordinate point is a coordinate point except the first coordinate point in the coordinate points corresponding to the two-dimensional coordinates;
determining a first distance between the first coordinate point and each of the second coordinate points;
determining the neighborhood of the first coordinate point according to the first distance and the reference prediction parameter;
Determining a first reachable distance from the first coordinate point to each second coordinate point in the neighborhood;
determining a first local reachable density of the first coordinate point according to the first reachable distance;
the first local distribution information of the first storage unit corresponding to the first coordinate point includes the first local reachable density.
9. The method of claim 8, wherein the determining a first reachable distance of the first coordinate point to each second coordinate point within the neighborhood comprises:
acquiring a target first distance from the first coordinate point to each second coordinate point in the neighborhood from the first distance;
determining, for each second coordinate point in the neighborhood, the adjacent distance corresponding to the reference prediction parameter;
comparing the target first distance of each second coordinate point with the adjacent distance of that second coordinate point to obtain a maximum distance;
and determining the maximum distance as a first reachable distance from the first coordinate point to the corresponding second coordinate point.
10. The method of claim 8, wherein determining the discrete information of each of the first storage units according to the first local distribution information comprises:
determining a first average local reachable density of the first local reachable density of each coordinate point in the neighborhood of the first coordinate point;
Determining a first local relative density of the first coordinate point according to the first average local reachable density and the first local reachable density of the first coordinate point;
the discrete information of the first storage unit corresponding to the first coordinate point includes the first local relative density.
11. The method of claim 10, wherein said determining a target storage unit to be predicted from said discrete information comprises:
determining whether the first local relative density is greater than a density threshold;
If not, determining the first storage unit corresponding to the first local relative density as a target storage unit to be predicted.
12. The method of claim 2, wherein the determining a storage unit distribution characteristic for each of the storage unit sets based on the reference prediction parameters comprises:
Determining second local distribution information of each target storage unit in the storage unit set according to the reference prediction parameters and the two-dimensional coordinates of each target storage unit in the storage unit set in the two-dimensional space;
and determining the storage unit distribution characteristics of each storage unit set according to the second local distribution information.
13. The method of claim 12, wherein the second local distribution information comprises a second local reachable density, and the storage unit distribution feature comprises a second average local reachable density of the second local reachable density; and performing first fault prediction processing according to the storage unit distribution characteristics to obtain a first fault prediction result, wherein the first fault prediction result comprises:
determining the largest of the second average local reachable densities as a first target average local reachable density;
determining, among a plurality of preset density ranges, a target density range to which the first target average local reachable density belongs;
And determining a first fault prediction result according to the target density range.
14. The method of claim 1, wherein before determining the target memory location to be predicted from the plurality of memory locations included in the memory under test based on the reference prediction parameter and the first correctable error data of the memory under test, the method further comprises:
acquiring initial prediction parameters to be optimized;
and carrying out optimization processing on the initial prediction parameters according to a preset optimization rule to obtain the reference prediction parameters.
15. The method according to claim 14, wherein the optimizing the initial prediction parameters according to a preset optimization rule to obtain the reference prediction parameters includes:
acquiring second correctable error data of a target memory and real fault data corresponding to the second correctable error data;
Performing a second fault prediction process by using the initial prediction parameters and the second correctable error data to obtain a second fault prediction result;
Determining whether the initial prediction parameters meet an optimization condition according to the second fault prediction result and the real fault data;
if the optimization condition is met, optimizing the initial prediction parameters, and carrying out the second fault prediction processing according to the optimized initial prediction parameters and the second correctable error data until the initial prediction parameters are determined to not meet the optimization condition;
And if the optimization condition is not met, determining an initial prediction parameter when the optimization condition is not met as the reference prediction parameter.
16. The method of claim 15, wherein performing a second fault prediction process using the initial prediction parameters and the second correctable error data to obtain a second fault prediction result comprises:
determining input data based on the initial prediction parameters and the second correctable error data;
And inputting the input data into a fault prediction model to be trained for prediction processing, and obtaining a second fault prediction result.
17. A memory failure prediction apparatus, the apparatus comprising:
the first determining module is used for determining a target storage unit to be predicted in a plurality of storage units included in the memory to be detected according to the reference prediction parameter and first correctable error data of the memory to be detected; the reference prediction parameters characterize a prediction range of the storage unit;
the dividing module is used for dividing the target storage unit into at least one storage unit set;
a second determining module, configured to determine a storage unit distribution characteristic of each storage unit set based on the reference prediction parameter;
and the prediction module is used for performing first fault prediction processing according to the storage unit distribution characteristics to obtain a first fault prediction result.
18. A computer device, comprising:
A memory and a processor, the memory and the processor being communicatively coupled to each other, the memory having stored therein computer instructions, the processor executing the computer instructions to perform the memory failure prediction method of any of claims 1 to 16.
19. A computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the memory failure prediction method of any one of claims 1 to 16.
20. A computer program product comprising computer instructions for causing a computer to perform the memory failure prediction method of any one of claims 1 to 16.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410372877.7A CN117971547A (en) 2024-03-29 2024-03-29 Memory fault prediction method, device, equipment, storage medium and program product

Publications (1)

Publication Number Publication Date
CN117971547A true CN117971547A (en) 2024-05-03

Family

ID=90854994

Country Status (1)

Country Link
CN (1) CN117971547A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114726713A (en) * 2022-03-02 2022-07-08 阿里巴巴(中国)有限公司 Node fault model training method, node fault model detection equipment, node fault model medium and node fault model product
CN115640174A (en) * 2022-09-28 2023-01-24 超聚变数字技术有限公司 Memory fault prediction method and system, central processing unit and computing equipment
WO2024041093A1 (en) * 2022-08-25 2024-02-29 超聚变数字技术有限公司 Memory fault processing method and related device thereof
CN117668706A (en) * 2023-12-13 2024-03-08 中国建设银行股份有限公司 Method and device for isolating memory faults of server, storage medium and electronic equipment

Cavelan et al. Detection of silent data corruptions in smoothed particle hydrodynamics simulations

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination