CN117931499A - Memory fault prediction method and device, electronic equipment and storage medium

Info

Publication number: CN117931499A
Application number: CN202410109420.7A
Authority: CN
Inventors: 张静; 张宪波
Original assignee: Jingdong Technology Information Technology Co Ltd
Current assignee: Jingdong Technology Information Technology Co Ltd
Priority date: 2024-01-25
Filing date: 2024-01-25
Publication date: 2024-04-26

Abstract

The disclosure provides a memory failure prediction method, a memory failure prediction device, an electronic device and a computer-readable storage medium, and relates to the technical field of computers. The memory failure prediction method includes: extracting row fault information and column fault information from the fault information of a memory; predicting, based on the row fault information, whether an uncorrectable error will occur in the memory to obtain a first prediction result; predicting, based on the column fault information, whether an uncorrectable error will occur in the memory to obtain a second prediction result; and determining, according to the first prediction result and the second prediction result, a target prediction result indicating whether an uncorrectable error will occur in the memory. The present disclosure can improve the accuracy of predicting whether an uncorrectable error will occur in a device.

Description

Memory fault prediction method and device, electronic equipment and storage medium

Technical Field

The present disclosure relates to the field of computer technologies, and in particular, to a memory failure prediction method, a memory failure prediction apparatus, an electronic device, and a computer readable storage medium.

Background

Memory failures are the most common failures of hardware systems and greatly affect the reliability, availability and serviceability of a system. During operation, a device may suffer memory faults caused by circuit faults and the like. Memory faults fall into two types: UCE (Uncorrectable Error) and CE (Correctable Error). If the fault is a CE, the hardware can repair it using part of its resources; if the fault is a UCE, the system goes down and restarts, which can cause serious losses.

To avoid downtime, the current approach counts the number of CEs occurring in a device over a period of time in order to predict whether a UCE will occur in the future, so that intervention measures can be taken and larger memory loss avoided. However, this approach predicts UCEs with low accuracy, which affects the operation and maintenance of the device to a certain extent.

It should be noted that the information disclosed in the above Background section is only intended to enhance understanding of the background of the present disclosure, and thus may include information that does not constitute prior art already known to those of ordinary skill in the art.

Disclosure of Invention

The present disclosure aims to provide a memory failure prediction method, a memory failure prediction device, an electronic device, and a computer-readable storage medium, so as to improve prediction accuracy of whether uncorrectable errors occur in the device at least to some extent.

According to a first aspect of the present disclosure, there is provided a memory failure prediction method, including: extracting row fault information and column fault information from the fault information of the memory; predicting whether an uncorrectable error will occur in the memory based on the row fault information to obtain a first prediction result; predicting whether an uncorrectable error will occur in the memory based on the column fault information to obtain a second prediction result; and determining, according to the first prediction result and the second prediction result, a target prediction result indicating whether an uncorrectable error will occur in the memory.

In an exemplary embodiment, the row fault information includes a plurality of rows, and predicting whether the memory will have an uncorrectable error based on the row fault information to obtain a first prediction result includes: in a row dimension, acquiring the number of independent correctable errors in the row fault information of each row; determining a row prediction result corresponding to each row according to the number of the independent correctable errors, wherein the row prediction result is used for indicating whether uncorrectable errors occur in the corresponding row; and determining whether uncorrectable errors occur in the memory according to the row prediction results of the rows.

In an exemplary embodiment, the acquiring, in the row dimension, the number of independently correctable errors in the row fault information of each row includes: determining, for each row of the row fault information, whether the adjacent correctable errors are the independent correctable errors according to a physical distance between the adjacent correctable errors; the number of independently correctable errors in each row is counted to obtain the number of independently correctable errors for that row.

In an exemplary embodiment, the determining, for each row of the row fault information, whether the adjacent correctable errors are the independent correctable errors according to a physical distance between the adjacent correctable errors includes: if the physical distance between the adjacent correctable errors is greater than a preset distance threshold, determining that the adjacent correctable errors are all independent correctable errors; the preset distance threshold is output after training a model to be trained based on the historical fault information of the memory.

In an exemplary embodiment, the determining the row prediction result corresponding to each row according to the number of the independently correctable errors includes: if the number of the independent correctable errors is larger than a first number threshold, determining that uncorrectable errors occur in the corresponding rows; the first quantity threshold is output after training a model to be trained based on the historical fault information of the memory; the determining whether the memory has uncorrectable errors according to the row prediction result of each row includes: if the row prediction result of at least one row indicates that the row can generate uncorrectable errors, determining that the memory can generate uncorrectable errors.

In an exemplary embodiment, the column fault information includes a plurality of columns, and predicting whether the memory will have an uncorrectable error based on the column fault information to obtain a second prediction result includes: in a column dimension, acquiring the number of correctable errors in the column fault information of each column; determining a column prediction result corresponding to each column according to the number of the correctable errors, wherein the column prediction result is used for indicating whether uncorrectable errors will occur in the corresponding column; and determining whether uncorrectable errors will occur in the memory according to the column prediction results of the columns.

In an exemplary embodiment, the determining the column prediction result corresponding to each column according to the number of correctable errors includes: if the number of the correctable errors is greater than a second number threshold, determining that uncorrectable errors occur in the corresponding columns; the second quantity threshold value is output after training a model to be trained based on the historical fault information of the memory; the determining whether the memory will have uncorrectable errors according to the column prediction results of the columns includes: if there is at least one column for which the column prediction indicates that an uncorrectable error may occur, then it is determined that an uncorrectable error may occur in the memory.

In an exemplary embodiment, the extracting row fault information and column fault information from the fault information of the memory includes: obtaining error location information from the fault information of the memory; and extracting the row fault information and the column fault information from the error location information; wherein the error location information includes at least slot information.

In an exemplary embodiment, the method further comprises: in response to the target prediction result indicating that an uncorrectable error will occur in the memory, determining, from the first prediction result and/or the second prediction result, a root cause position that causes the uncorrectable error of the memory.

According to a second aspect of the present disclosure, there is provided a memory failure prediction apparatus, including: an information acquisition module, used for extracting row fault information and column fault information from the fault information of the memory; a row prediction module, used for predicting whether uncorrectable errors will occur in the memory based on the row fault information to obtain a first prediction result; a column prediction module, used for predicting whether uncorrectable errors will occur in the memory based on the column fault information to obtain a second prediction result; and a memory fault prediction module, used for determining, according to the first prediction result and the second prediction result, a target prediction result indicating whether an uncorrectable error will occur in the memory.

According to a third aspect of the present disclosure, there is provided an electronic device comprising: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to implement the above-described method via execution of the executable instructions.

According to a fourth aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method described above.

According to the memory fault prediction method provided by the embodiment of the disclosure, row fault information and column fault information are extracted from the fault information of the memory. Whether uncorrectable errors will occur in the memory is predicted once based on the row fault information and once based on the column fault information. The first prediction result and the second prediction result thus provide a double judgment on the fault location of the memory, in the row dimension and in the column dimension, from which it is determined whether uncorrectable errors will occur in the memory. This improves the prediction accuracy for uncorrectable errors of the memory, so that such errors can be intervened on in advance and heavy losses such as downtime can be avoided.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure. It will be apparent to those of ordinary skill in the art that the drawings in the following description are merely examples of the disclosure and that other drawings may be derived from them without undue effort.

Fig. 1 shows a schematic diagram of an application scenario in which embodiments of the present disclosure may be applied.

Fig. 2 schematically illustrates a flowchart of a memory failure prediction method in an exemplary embodiment of the present disclosure.

Fig. 3 schematically illustrates a flowchart of an implementation of predicting memory failure based on row failure information in an exemplary embodiment of the present disclosure.

Fig. 4 schematically illustrates a flowchart of one implementation of determining the number of independently correctable errors in a row dimension in an exemplary embodiment of the present disclosure.

Fig. 5 schematically illustrates a flowchart of an implementation of predicting memory failure based on column failure information in an exemplary embodiment of the present disclosure.

Fig. 6 schematically illustrates a flowchart for training a model to be trained in an exemplary embodiment of the present disclosure.

Fig. 7 schematically illustrates a flowchart of memory failure prediction for a server in an exemplary embodiment of the present disclosure.

Fig. 8 is a schematic diagram showing the constitution of a memory failure prediction device to which the exemplary embodiments of the present disclosure can be applied.

Fig. 9 shows a schematic diagram of the composition of an electronic device to which the exemplary embodiments of the present disclosure may be applied.

Detailed Description

Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.

Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software or in one or more hardware modules or integrated circuits or in different networks and/or processor devices and/or microcontroller devices.

Fig. 1 shows a schematic diagram of an application scenario in which embodiments of the present disclosure may be applied. As shown in Fig. 1, the scenario may include a first device 11 and at least one second device 12. The first device 11 is configured to predict memory failures of the second device 12, specifically, whether an uncorrectable error will occur in the second device 12. The second device 12 is the device to be predicted and may be any type of electronic device for which it is necessary to predict whether a memory failure, specifically an uncorrectable error, will occur. For example, the second device 12 may be a physical machine, that is, a physical server on which a virtualized computer system is deployed.

The second device 12 includes a memory controller, which may be integrated in the CPU of the second device 12. In response to a write command from the CPU, the memory controller may generate a corresponding error correction code (ECC) based on the data to be written to the memory, and write the data and the error correction code to the memory. Correspondingly, in response to a read command from the CPU, the memory controller may read the data and the error correction code from the memory. If a CE occurs, it is repaired according to the error correction code; if a UCE occurs, the hardware system sends a UCE signal and the second device 12 goes down.

In the related art, the number of CEs occurring in a device over a period of time is counted, and when the number of CEs per unit time reaches a threshold, an intervention measure (such as taking the affected page offline) is adopted to prevent a UCE from occurring and thereby avoid larger memory loss.

However, this simple counting approach is not accurate enough at predicting memory faults, and therefore at preventing machine downtime.

Accordingly, based on one or more of the above-mentioned problems, the embodiments of the present disclosure provide a memory failure prediction method, which determines whether an uncorrectable error will occur in a memory through finer granularity dimensions, i.e., a row dimension and a column dimension, respectively, so as to improve the accuracy of predicting a memory failure.

It should be noted that, in fig. 1, memory failure prediction is performed on a device to be predicted by other devices than the device to be predicted, and it is understood that, in some embodiments, memory failure prediction may also be performed by the device to be predicted itself.

Fig. 2 illustrates a flowchart of a memory failure prediction method according to an exemplary embodiment of the present disclosure, and an execution subject of an embodiment of the present disclosure may be a device to be predicted itself or may be another device other than the device to be predicted. As shown in fig. 2, the memory failure prediction method according to the embodiment of the present disclosure may include steps S210 to S240:

In step S210, row fault information and column fault information are extracted from the fault information of the memory.

In an exemplary embodiment of the present disclosure, a row fault is a CE or UCE fault occurring in a row (Row) at the physical granularity of the memory, and a column fault is a CE or UCE fault occurring in a column (Column) at the physical granularity of the memory. Row fault information corresponding to the row dimension and column fault information corresponding to the column dimension may be extracted from the fault information of the memory.

In some alternative embodiments, the memory raw data may be first extracted and the raw data may be preprocessed to extract row fault information and column fault information from the preprocessed raw data, respectively. Optionally, duplicate data exists in the original memory data, and the original data may be subjected to deduplication processing, for example, to remove memory failure information at the same time and at the same location. Alternatively, the original data may be subjected to format conversion, for example, the extracted row fault information and column fault information may be hexadecimal, and the hexadecimal information may be converted into decimal for convenience of subsequent processing. Of course, the method of preprocessing the original data in the embodiment of the disclosure is not limited thereto, and the original data may be preprocessed according to actual requirements.
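The preprocessing described above can be sketched as follows. This is a minimal illustration only: the record layout and the field names `timestamp`, `dimm`, `row` and `column` are assumptions made for the example and are not specified by the disclosure.

```python
from typing import Iterable


def preprocess(records: Iterable[dict]) -> list[dict]:
    """Deduplicate raw fault records and convert hex addresses to integers.

    Assumed (hypothetical) record layout: each record is a dict with a
    "timestamp", a "dimm" identifier, and hexadecimal "row" and "column"
    strings such as "0x1a" or "ff".
    """
    seen = set()
    cleaned = []
    for rec in records:
        # Format conversion: hexadecimal text -> decimal integer.
        row = int(rec["row"], 16)
        column = int(rec["column"], 16)
        # Records at the same time and the same location count as duplicates.
        key = (rec["timestamp"], rec["dimm"], row, column)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({**rec, "row": row, "column": column})
    return cleaned
```

Converting before deduplicating means that two records differing only in hexadecimal spelling (for example `0x1a` and `1A`) are treated as the same location.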

The fault information of the memory generally includes various types of features including, but not limited to, server IP, manufacturer, fault type, fault location information, etc. The fault location information may be obtained from the fault information of the memory, and then row fault information and column fault information may be extracted from the fault location information. The fault location information may include slot information, among other things.

Specifically, slot information may be obtained according to a slot index. The slot information includes, but is not limited to, the Channel, DIMM, Rank, Bank, Column and Row in the physical structure of a memory bank. The slot information is then parsed to obtain the row fault information and the column fault information.

It should be noted that, for details of Channel, DIMM, Rank, Bank, Column and Row, reference may be made to the related art; they are not described again here.
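One possible way to derive row fault information and column fault information from parsed slot information is sketched below. The dictionary keys (`channel`, `dimm`, `rank`, `bank`, `row`, `column`) and the grouping scheme are assumptions for illustration; the disclosure does not fix a data format.

```python
from collections import defaultdict


def split_row_column(records):
    """Group CE records into per-row and per-column fault information.

    Assumption: each record is a dict with integer "channel", "dimm",
    "rank", "bank", "row" and "column" fields parsed from slot information.
    Returns two dicts:
      rows: (channel, dimm, rank, bank, row)    -> list of column addresses
      cols: (channel, dimm, rank, bank, column) -> list of row addresses
    """
    rows = defaultdict(list)
    cols = defaultdict(list)
    for rec in records:
        bank_key = (rec["channel"], rec["dimm"], rec["rank"], rec["bank"])
        # Row dimension: where along this physical row did CEs occur?
        rows[bank_key + (rec["row"],)].append(rec["column"])
        # Column dimension: where along this physical column did CEs occur?
        cols[bank_key + (rec["column"],)].append(rec["row"])
    return dict(rows), dict(cols)
```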

In step S220, it is predicted whether the memory will have uncorrectable errors based on the row fault information, so as to obtain a first prediction result.

In the exemplary embodiments of the present disclosure, memory failures are generally rooted in circuit faults, and once a row fault or a column fault occurs it may affect multiple pages (memory banks). If the row and column distribution of CEs is ignored and CEs are simply counted, the CE occurrence rate is not an accurate predictor of whether a UCE will occur in the future. Based on this, the embodiment of the disclosure may predict whether the memory will have uncorrectable errors according to the row fault information, so as to obtain a first prediction result. The specific implementation of predicting whether the memory will have uncorrectable errors based on the CE distribution in the row dimension will be described later.

In step S230, it is predicted whether the memory will have uncorrectable errors based on the column fault information, and a second prediction result is obtained.

In an exemplary embodiment of the present disclosure, in addition to the prediction based on row fault information, whether an uncorrectable error will occur in the memory is also predicted based on column fault information. The prediction process for the column dimension and the prediction process for the row dimension are independent and do not affect each other, so that accurate predictions can be performed in different dimensions.

In some alternative embodiments, because CEs are distributed differently in different dimensions, CEs may be counted differently in the column-dimension and row-dimension prediction processes. In particular, CEs occurring at closely spaced locations in the row dimension have limited impact on the device as a whole, so the row-dimension prediction can be based on the independent correctable errors in each row; the CE distribution in the column dimension spans a much larger range, so the column-dimension prediction can be based on the CE count of each column.

Of course, in some alternative embodiments, prediction based on the CE count of each row (or each column) may be adopted in both the row dimension and the column dimension. Alternatively, each row (or each column) may be predicted in several different ways and the results combined. For example, in the row dimension, one prediction may be made by counting independent correctable errors and another by simply counting correctable errors, and the row prediction result of the row may be determined by combining the two.

In step S240, a target prediction result indicating whether an uncorrectable error will occur in the memory is determined based on the first prediction result and the second prediction result.

In an exemplary embodiment of the present disclosure, after the first prediction result and the second prediction result are obtained, whether an uncorrectable error occurs in the memory may be determined according to the first prediction result and the second prediction result, so as to obtain the target prediction result.

If at least one of the first prediction result and the second prediction result indicates that an uncorrectable error will occur in the memory, it can be predicted that an uncorrectable error will occur in the memory. That is, whether the uncorrectable error is predicted in the row dimension or in the column dimension, the prediction reflects that the memory will suffer an uncorrectable error in the future.
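The combination rule in step S240 amounts to a logical OR of the two per-dimension results. A minimal sketch, with the function name `target_prediction` chosen here for illustration:

```python
def target_prediction(first_result: bool, second_result: bool) -> bool:
    """Combine the row-dimension and column-dimension predictions.

    True means "an uncorrectable error is predicted to occur in the memory".
    The memory is flagged as soon as either dimension predicts a UCE.
    """
    return first_result or second_result
```

Only when both dimensions predict no uncorrectable error does the target prediction result report that none is expected.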

According to the memory fault prediction method provided by the embodiment of the disclosure, row fault information and column fault information are extracted from the fault information of the memory. Whether uncorrectable errors will occur in the memory is predicted once based on the row fault information and once based on the column fault information. The first prediction result and the second prediction result thus provide a double judgment on the fault location of the memory, in the row dimension and in the column dimension, from which it is determined whether uncorrectable errors will occur in the memory. This improves the prediction accuracy for uncorrectable errors of the memory, so that such errors can be intervened on in advance and heavy losses such as downtime can be avoided.

In an exemplary embodiment, an implementation of predicting memory failure based on row failure information is provided. As shown in fig. 3, predicting whether an uncorrectable error will occur in the memory based on the row fault information, and obtaining the first prediction result may include steps S310 to S330:

step S310: in the row dimension, obtaining the number of independent correctable errors in the row fault information of each row.

step S320: determining a row prediction result corresponding to each row according to the number of the independent correctable errors.

step S330: determining whether the memory will have uncorrectable errors according to the row prediction result of each row.

An independent correctable error (unique error) is defined as follows: CEs that occur far apart in the row dimension have a large and serious impact on device failure, so CEs that are far apart in the row dimension are determined to be independent correctable errors.

The row fault information includes a plurality of rows, and the embodiment of the disclosure analyzes the distribution situation of the independent correctable errors in the row fault information of each row in the row dimension, so as to count the number of the independent correctable errors in the row fault information of each row, namely, count the independent CEs in one row.

In an alternative embodiment, an implementation is provided that determines a number of independently correctable errors in a row dimension. As shown in fig. 4, in the row dimension, acquiring the number of independently correctable errors in the row fault information of each row may further include step S410 and step S420:

step S410: for each row of row fault information, determining whether the adjacent correctable errors are independent correctable errors based on the physical distance between the adjacent correctable errors.

The physical distance between adjacent correctable errors refers to the distance of the adjacent correctable errors on the physical structure of the memory bank. Embodiments of the present disclosure may determine, for each row, whether adjacent correctable errors are independent correctable errors based on the physical distance between adjacent correctable errors in the row.

If the physical distance between the adjacent correctable errors is greater than the preset distance threshold, determining that the adjacent correctable errors are all independent correctable errors. The preset distance threshold is output after training the model to be trained based on the historical fault information of the memory.

Specifically, the model to be trained can be trained by adopting the historical fault information, after model training is completed, output parameters are obtained, and a preset distance threshold value is determined from the output parameters. The historical fault information may be memory fault data of the device in the past several months, such as memory fault data of the past 2 months, and the time period for obtaining the historical fault information is not particularly limited in the embodiments of the present disclosure. The training process of the model will be described later.

Step S420: the number of independently correctable errors in each row is counted to obtain the number of independently correctable errors for that row.

After the number of independently correctable errors in each row is obtained, the number may be taken as the number of independently correctable errors for that row.
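Steps S410 and S420 can be sketched as follows. The disclosure does not define how the physical distance is measured; this sketch assumes, purely for illustration, that each CE in a row is identified by an integer position (for example its column address) and that the distance between adjacent CEs is the difference of those positions. The `distance_threshold` argument stands in for the preset distance threshold output by the trained model.

```python
def count_independent_ces(positions, distance_threshold):
    """Count independent correctable errors within one row.

    positions: integer positions of the CEs observed in the row
               (assumed here to be column addresses).
    distance_threshold: preset distance threshold (assumed to be given).

    Two adjacent CEs (adjacent after sorting by position) are both marked
    independent when the distance between them exceeds the threshold.
    """
    ordered = sorted(positions)
    independent = [False] * len(ordered)
    for i in range(len(ordered) - 1):
        if ordered[i + 1] - ordered[i] > distance_threshold:
            # Step S410: both members of the far-apart pair are independent.
            independent[i] = True
            independent[i + 1] = True
    # Step S420: the count of marked CEs is the row's independent-CE count.
    return sum(independent)
```

Under this reading a row containing a single CE has no adjacent pair and therefore contributes zero independent CEs; whether that matches the intended behavior is not stated in the text.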

According to the embodiment of the disclosure, CEs that occur close to one another are considered to have limited influence on the device as a whole, so in the row dimension only the independent CEs in each row are counted, according to the physical distance between CEs. CEs can therefore be counted across pages (memory banks), which avoids the situation in which no UCE is indicated for a certain page but a UCE nevertheless occurs in the future. Counting independent CEs based on the distribution characteristics of CEs in the row dimension improves how well the CE occurrence rate predicts future UCEs.

The row prediction result of the embodiment of the present disclosure is used to indicate whether an uncorrectable error may occur in the corresponding row. By counting the number of independently correctable errors per row, it can be predicted whether an uncorrectable error will occur for that row.

In an exemplary embodiment, determining the row prediction result corresponding to each row according to the number of independently correctable errors may include:

If the number of independently correctable errors is greater than a first number threshold, it is determined that an uncorrectable error will occur for the corresponding row. The first quantity threshold is output after training the model to be trained based on the historical fault information of the memory.

It should be noted that the model to be trained here and the model to be trained that outputs the preset distance threshold are the same model. That is, after the model to be trained has been trained with the historical fault information of the memory, the first number threshold may also be obtained from its output parameters. The specific training process will be described below.

Based on the foregoing embodiments, determining whether the memory will have uncorrectable errors according to the row prediction result of each row may include:

If the row prediction result of at least one row indicates that an uncorrectable error will occur in that row, it is determined that an uncorrectable error will occur in the memory. That is, since uncorrectable errors of the memory cannot be tolerated, as long as the row prediction result of any one row among all rows indicates that an uncorrectable error will occur in that row, the prediction result of the row dimension is that an uncorrectable error will occur in the memory.

In an alternative embodiment, after the obtained row fault information is sorted in order of occurrence, the row fault information of each row is used in turn to predict whether an uncorrectable error will occur in that row. Once the prediction result of a certain row is that an uncorrectable error will occur, the prediction result of the row dimension is determined to be that an uncorrectable error will occur in the memory, and the row-dimension prediction can then be exited. This avoids statistical analysis of the row fault information of all rows and improves the efficiency of predicting uncorrectable errors of the memory.

In an alternative embodiment, after the obtained row fault information is sorted in order of occurrence, the row fault information of each row is used in turn to predict whether an uncorrectable error will occur in that row, until the last row has been predicted. In this way a row prediction result is obtained for every row in the row dimension, which provides a row-position basis for subsequent precise fault localization.

In the embodiment of the disclosure, considering the distribution characteristics of correctable errors in the row dimension, independent correctable errors are used as the statistical granularity in the row dimension. Compared with simply counting the number of correctable errors, this fully mines, in the row dimension, the CEs that have a higher influence on the device as a whole; the statistical granularity is finer, and the fault prediction accuracy of the memory is further improved.
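The two alternatives above, exiting at the first flagged row or scanning every row, can be contrasted in a short sketch. The input is assumed to be a mapping from a row identifier to that row's independent-CE count, and `first_number_threshold` stands in for the first number threshold output by the trained model; both names are illustrative.

```python
def predict_rows_early_exit(independent_counts, first_number_threshold):
    """Return True as soon as one row exceeds the threshold.

    any() over a generator stops at the first True, so the remaining rows
    are not examined.
    """
    return any(
        count > first_number_threshold
        for count in independent_counts.values()
    )


def predict_rows_full(independent_counts, first_number_threshold):
    """Evaluate every row and keep the per-row results.

    Returns (memory_prediction, row_results), where row_results maps each
    row identifier to its row prediction result.
    """
    row_results = {
        row_id: count > first_number_threshold
        for row_id, count in independent_counts.items()
    }
    return any(row_results.values()), row_results
```

The early-exit variant is cheaper, while the full scan retains which rows were flagged, which is the information needed if a root cause position is to be determined later.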

In an exemplary embodiment, an implementation of predicting memory failure based on column fault information is provided. As shown in fig. 5, predicting whether an uncorrectable error will occur in the memory based on the column fault information, to obtain the second prediction result, may include steps S510 to S530:

Step S510: in the column dimension, obtaining the number of correctable errors in the column fault information of each column.

The column fault information in the embodiments of the present disclosure covers a plurality of columns. Because correctable errors are generally spread over a larger span in the column dimension, counting the independent correctable errors of each column has little effect. Therefore, based on the distribution characteristics of correctable errors in the column dimension, the total number of correctable errors in the column fault information of each column is obtained instead.

Step S520: determining a column prediction result corresponding to each column according to the number of correctable errors.

The column prediction results are used to indicate whether an uncorrectable error will occur for the corresponding column. Wherein determining a column prediction result corresponding to each column based on the number of correctable errors may include:

if the number of correctable errors is greater than the second number threshold, it is determined that uncorrectable errors will occur for the corresponding column. The second number threshold is output after training the model to be trained based on the historical fault information of the memory.

It should be noted that the model to be trained here, which is trained using the historical fault information of the memory, is the same model as the model to be trained mentioned above. That is, after the model to be trained has been trained, the second number threshold can be obtained from its output parameters. The specific training process is described below.

Embodiments of the present disclosure fully consider the distribution of correctable errors in the column dimension by comparing the number of correctable errors per column to a second number threshold to determine whether an uncorrectable error will occur for that column in the column dimension.

Based on the foregoing embodiment, step S530 determines whether the memory will have uncorrectable errors according to the column prediction results of the columns, which may include:

if there is at least one column for which the column prediction indicates that an uncorrectable error may occur, then it is determined that an uncorrectable error may occur in the memory. That is, since the uncorrectable errors of the memory are intolerable, in the column fault information of all columns, as long as there is a column prediction result of one column indicating that the uncorrectable errors occur in the column, the prediction result of the column dimension is determined that the uncorrectable errors occur in the memory.

In an alternative embodiment, after the obtained column fault information is sorted in order of occurrence, the column fault information of each device is used, column by column, to predict whether an uncorrectable error will occur in that column. Once the prediction result of any column is that an uncorrectable error will occur, the prediction result of the column dimension is determined to be that an uncorrectable error will occur in the memory, and the column-dimension prediction can then be exited. This avoids statistical analysis of the column fault information of all columns and improves the efficiency of predicting uncorrectable errors in the memory.

In an alternative embodiment, after the obtained column fault information is sorted in order of occurrence, the column fault information of each device is used, column by column, to predict whether an uncorrectable error will occur in that column, until the last column has been predicted. In this way, a column prediction result is obtained for every column in the column dimension, which provides a column-position basis for the accurate localization of subsequent faults.
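The column-dimension check in steps S510 to S530 can be sketched in a few lines. The function name, the input format (one column index per recorded CE), and the example threshold are assumptions made only for illustration.

```python
from collections import Counter
from typing import Dict, Iterable


def predict_column_dimension(ce_columns: Iterable[int], theta_col: int) -> Dict[int, bool]:
    """Map each column to True when its CE count exceeds theta_col."""
    counts = Counter(ce_columns)  # S510: number of CEs per column
    return {col: n > theta_col for col, n in counts.items()}  # S520


# Hypothetical data: four CEs in column 3, two CEs in column 8.
column_results = predict_column_dimension([3, 3, 3, 8, 8, 3], theta_col=3)
# S530: the memory is predicted to fail if any single column is positive.
second_result = any(column_results.values())
```

Keeping the full `column_results` mapping corresponds to the variant that predicts every column; an early exit would simply stop at the first column whose count exceeds the threshold.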

Based on the foregoing embodiments, after the first prediction result and the second prediction result are obtained, since the uncorrectable error of the memory is intolerable, whether the first prediction result indicates that the memory may have an uncorrectable error, the second prediction result indicates that the memory may have an uncorrectable error, or both the first prediction result and the second prediction result indicate that the memory may have an uncorrectable error, it is determined that the memory may have an uncorrectable error in the future. Based on the distribution characteristics of the correctable errors in the row dimension and the column dimension, the method analyzes the row dimension and the column dimension respectively, and mines the influence of the correctable errors in the row dimension and the column dimension on the uncorrectable errors in the memory with smaller statistical granularity.

Fig. 6 is a flowchart for training a model to be trained according to an exemplary embodiment of the present disclosure. The process of training the model to be trained to obtain the preset distance threshold L_r, the first number threshold θ_row, and the second number threshold θ_col is described below with reference to fig. 6, taking memory failure prediction for a server as an example.

Step S610: acquiring fault information of the server device.

Wherein fault log information for the server device may be obtained, the log information comprising a plurality of categories of features, such as server IP, vendor, fault type, error location information, etc.

Step S620: after the fault information is preprocessed, row fault information and column fault information are extracted from the preprocessed fault information.

For example, 2 months of log information may be extracted and deduplicated, and row fault information and column fault information are then extracted from the deduplicated result based on the slot information. In addition, the extracted row fault information and column fault information may also be converted between number bases, for example between hexadecimal and decimal.

Step S630: initial parameters of the model to be trained are determined.

Taking log information obtained over 2 months as an example, the data of all devices on which a UCE occurred, and the CE data of the devices on which a UCE occurred, are extracted.

First, the distance between adjacent CEs is counted in the row dimension, and the average of these distances is used as the initial value of L_r. Then, based on the initial value of L_r, the physical distance between adjacent CEs in each row is obtained and compared with the initial value of L_r; the number of independent correctable errors (unique errors) in each row is determined from the comparison result, and the average of these per-row numbers is used as the initial value of θ_row. Finally, in the column dimension, the number of CEs in each column is counted, and the average of the per-column CE numbers is used as the initial value of θ_col.
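The initialization described above can be sketched as follows. This is an illustrative reading, not the disclosed implementation: the function name is hypothetical, each CE record is assumed to be a `(row, column)` pair, and the physical distance between adjacent CEs in a row is assumed to be the difference of their column addresses.

```python
from collections import Counter, defaultdict
from statistics import mean
from typing import Iterable, Tuple


def initial_parameters(ce_records: Iterable[Tuple[int, int]]) -> Tuple[float, float, float]:
    """Estimate starting values (l_r, theta_row, theta_col) from historical CE records."""
    by_row = defaultdict(list)
    col_counts = Counter()
    for row, col in ce_records:
        by_row[row].append(col)
        col_counts[col] += 1

    # Gaps between adjacent CEs inside each row.
    gaps_per_row = []
    for cols in by_row.values():
        cols.sort()
        gaps_per_row.append([b - a for a, b in zip(cols, cols[1:])])

    all_gaps = [g for gaps in gaps_per_row for g in gaps]
    l_r = mean(all_gaps) if all_gaps else 0.0

    # Independent (unique) CEs per row, judged against the initial l_r.
    unique_counts = []
    for gaps in gaps_per_row:
        unique = set()
        for i, gap in enumerate(gaps):
            if gap > l_r:
                unique.update((i, i + 1))
        unique_counts.append(len(unique))

    theta_row = mean(unique_counts) if unique_counts else 0.0
    theta_col = mean(list(col_counts.values())) if col_counts else 0.0
    return l_r, theta_row, theta_col


# Hypothetical records from devices on which a UCE later occurred.
l_r0, theta_row0, theta_col0 = initial_parameters(
    [(1, 0), (1, 2), (1, 10), (2, 5), (2, 6)]
)
```

These values are only starting points; the subsequent parameter adjustment step is what selects the final thresholds.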

It should be noted that, in the embodiments of the present disclosure, the initial values of the preset distance threshold L_r, the first number threshold θ_row, and the second number threshold θ_col may also be determined in other ways. For example, the values from the previous iteration may be used as the initial values for the current training, or the initial values may be determined from the device data of the most recent UCE. The embodiments of the present disclosure do not specifically limit this.

Step S640: adjusting model parameters.

The determined initial values of the preset distance threshold L_r, the first number threshold θ_row, and the second number threshold θ_col are used as model parameters and verified on a dataset; different parameter combinations yield different prediction accuracy and recall rates.

The parameter L_r is the distance threshold for judging independent correctable errors; under otherwise identical conditions, the larger L_r is, the stricter the condition for judging a correctable error to be independent. The parameters θ_row and θ_col are the number thresholds for determining whether an uncorrectable error fault will occur in the future; the larger θ_row and θ_col are, the fewer servers are predicted to experience a future UCE. On this basis, multiple groups of experimental results are obtained by continuously adjusting the parameter values, and the group of parameter values with the best performance is selected as the target parameters of the model.

For example, the values of the three parameters may be adjusted according to the accuracy and recall of the parameters in the test dataset to gradually obtain a set of well-behaved parameters.

In the model training process, log information of equipment faults is taken as input, a memory fault prediction result is taken as output, a trained model is obtained by continuously adjusting parameters, and final values of the three parameters are obtained correspondingly.
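One way to read the parameter adjustment is as a search over candidate parameter combinations scored on a labelled dataset. The sketch below is an assumption-laden illustration: the names `precision_recall` and `tune` are hypothetical, the "accuracy" in the text is interpreted here as precision, and F1 is used as the selection criterion only because the text does not specify how accuracy and recall are traded off.

```python
from itertools import product
from typing import Callable, Iterable, Optional, Sequence, Tuple

Params = Tuple[int, int, int]  # (l_r, theta_row, theta_col)


def precision_recall(predicted: Sequence[bool], actual: Sequence[bool]) -> Tuple[float, float]:
    """Precision and recall of UCE predictions against known outcomes."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def tune(predict: Callable[[object, Params], bool], servers: Sequence[object],
         labels: Sequence[bool], grid: Iterable[Params]) -> Tuple[Optional[Params], float]:
    """Return the parameter combination with the highest F1 score (assumed criterion)."""
    best, best_f1 = None, -1.0
    for params in grid:
        predicted = [predict(s, params) for s in servers]
        p, r = precision_recall(predicted, labels)
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        if f1 > best_f1:
            best, best_f1 = params, f1
    return best, best_f1


# Toy setup: each "server" is just a CE count and only theta_col is varied.
servers = [1, 5, 9]
labels = [False, True, True]
grid = product([10], [2], [0, 3, 7])
best_params, best_f1 = tune(lambda n, p: n > p[2], servers, labels, grid)
```

In practice the `predict` callable would be the combined row and column prediction, and the grid would be centred on the initial values of the three parameters.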

Step S650: model verification.

Based on the determined final values of the three parameters, a verification dataset is selected to verify the prediction performance of the model. As shown in fig. 7, 1 month of server fault information may be selected, and row fault information and column fault information are extracted from it, so that fault prediction is performed in the row dimension and the column dimension respectively.

In step S710, for each row of failure information of the server, it is predicted whether an uncorrectable error failure will occur in the memory.

Specifically, for each row of fault information of the server, L_r is used as the preset distance threshold to judge whether each CE is a unique error, and the unique errors of the row are counted. If the number is greater than θ_row, the flag is set to 1 and the row-dimension prediction is exited.

In step S720, for each column of failure information of the server, it is predicted whether an uncorrectable error failure will occur in the memory.

The number of CEs can be counted for each column of fault information of the server; if the number of CEs is greater than θ_col, the flag is set to 1 and the column-dimension prediction is exited. The execution order of step S710 and step S720 is not particularly limited.

Step S730: determining whether an uncorrectable error fault will occur in the memory according to the first prediction result of the row dimension and the second prediction result of the column dimension.

After the row and column detection is completed, the union of the prediction results of the row dimension and the column dimension is taken. If the flag is 1, it is predicted that a UCE fault will occur on the server in the future; otherwise, it is predicted that no UCE fault will occur on the server in the future.
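Steps S710 to S730 can be put together in a single flag-based routine. The sketch below is illustrative only: `predict_uce` is a hypothetical name, each CE record is assumed to be a `(row, column)` pair, and the distance between adjacent CEs in a row is assumed to be the difference of their column addresses.

```python
from collections import defaultdict
from typing import Iterable, Tuple


def predict_uce(ce_records: Iterable[Tuple[int, int]], l_r: int,
                theta_row: int, theta_col: int) -> int:
    """Return flag 1 if either dimension predicts a future UCE, else 0."""
    by_row, by_col = defaultdict(list), defaultdict(int)
    for row, col in ce_records:
        by_row[row].append(col)
        by_col[col] += 1

    flag = 0
    # S710: row dimension, count unique errors per row and exit on the first hit.
    for cols in by_row.values():
        cols.sort()
        unique = set()
        for i in range(len(cols) - 1):
            if cols[i + 1] - cols[i] > l_r:
                unique.update((i, i + 1))
        if len(unique) > theta_row:
            flag = 1
            break
    # S720: column dimension, skipped here when the row dimension already set the flag.
    if flag == 0 and any(n > theta_col for n in by_col.values()):
        flag = 1
    # S730: the flag is the union of the two dimensions.
    return flag


# Hypothetical server with three widely spaced CEs in one row.
flag = predict_uce([(1, 0), (1, 50), (1, 100)], l_r=10, theta_row=2, theta_col=5)
```

Skipping the column check once the flag is set is a shortcut of this sketch; because the result is a union, running the two dimensions in either order gives the same flag.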

According to the embodiment of the disclosure, a higher server recall rate is realized through a one-time prediction process, so that a failed server can be found as early as possible, and the risks such as downtime and the like are avoided by intervention in advance.

It should be noted that, in the embodiments of the present disclosure, the preset distance threshold L_r, the first number threshold θ_row, and the second number threshold θ_col can be determined through the training process of the model to be trained. In actual prediction, the obtained fault information of the memory can be input into the model, which outputs the memory fault prediction result. That is, the steps of the above embodiments can be carried out by the trained model, so as to quickly obtain the target prediction result of whether an uncorrectable error failure will occur on the server.

It should be noted that, the details of the steps S610 to S650 and the steps S710 to S730 are described in the above exemplary embodiments, and are not repeated here.

In an exemplary embodiment, an implementation of fault localization is also provided. After the target prediction result of whether an uncorrectable error will occur in the memory is determined according to the first prediction result and the second prediction result, in response to the target prediction result indicating that an uncorrectable error will occur in the memory, the root cause location causing the uncorrectable error may be determined from the first prediction result and/or the second prediction result.

Specifically, after it is determined that an uncorrectable error will occur in the memory, if that result arises because the first prediction result indicates that an uncorrectable error will occur in the memory, the target row may be determined as the root cause location according to the target row prediction result, in the first prediction result, that indicates the uncorrectable error, so that the fault can be resolved.

If that result arises because the second prediction result indicates that an uncorrectable error will occur in the memory, the target column may be determined as the root cause location according to the target column prediction result, in the second prediction result, that indicates the uncorrectable error, so that the fault can be resolved.

Of course, if that result arises because both the first prediction result and the second prediction result indicate that an uncorrectable error will occur in the memory, the target row may be determined according to the target row prediction result in the first prediction result, and the target column may be determined according to the target column prediction result in the second prediction result, so that the root cause location is obtained and the fault can be resolved.

In summary, as long as the memory is predicted, in either the row dimension or the column dimension, to have an uncorrectable error in the future, the problem can be located quickly, measures can be taken in advance, and larger losses can be avoided.
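Root cause localization as described above amounts to collecting the rows and columns whose prediction results are positive. The sketch below assumes that per-row and per-column results were kept as mappings from an index to a boolean (the variant that predicts every row and column rather than exiting early); the function name and the output format are hypothetical.

```python
from typing import Dict, List


def locate_root_cause(row_results: Dict[int, bool],
                      column_results: Dict[int, bool]) -> Dict[str, List[int]]:
    """Collect the rows and columns whose prediction result is positive."""
    return {
        "rows": [r for r, hit in row_results.items() if hit],
        "columns": [c for c, hit in column_results.items() if hit],
    }


# Hypothetical results: row 9 and column 3 were predicted to develop a UCE.
root_cause = locate_root_cause({7: False, 9: True}, {3: True, 8: False})
```

An empty `rows` list with a non-empty `columns` list corresponds to the case where only the second prediction result is positive, and vice versa.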

As can be seen from the foregoing, the memory fault prediction method provided by the embodiments of the present disclosure extracts row fault information and column fault information from the fault information of the memory, predicts whether an uncorrectable error will occur in the memory based on the row fault information to obtain a first prediction result, predicts whether an uncorrectable error will occur in the memory based on the column fault information to obtain a second prediction result, and determines from the first prediction result and the second prediction result whether an uncorrectable error will occur in the memory. The fault location of the memory is thus judged in both the row dimension and the column dimension, which improves the accuracy of predicting uncorrectable errors in the memory, allows early intervention, and avoids serious losses such as downtime.

It is noted that the above-described figures are merely schematic illustrations of processes involved in a method according to exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily appreciated that the processes shown in the above figures do not indicate or limit the temporal order of these processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, for example, among a plurality of modules.

Further, referring to fig. 8, in an exemplary embodiment of the present disclosure, a memory failure prediction apparatus 800 is provided, including an information obtaining module 810, a row prediction module 820, a column prediction module 830, and a memory failure prediction module 840, where:

An information obtaining module 810, configured to extract row fault information and column fault information from the fault information of the memory; a row prediction module 820, configured to predict whether an uncorrectable error will occur in the memory based on the row fault information, to obtain a first prediction result; a column prediction module 830, configured to predict whether an uncorrectable error will occur in the memory based on the column fault information, to obtain a second prediction result; and a memory failure prediction module 840, configured to determine, according to the first prediction result and the second prediction result, a target prediction result of whether an uncorrectable error will occur in the memory.

In an exemplary embodiment, the row fault information includes a plurality of rows therein, and the row prediction module 820 is configured to perform: in a row dimension, acquiring the number of independent correctable errors in the row fault information of each row; determining a row prediction result corresponding to each row according to the number of the independent correctable errors, wherein the row prediction result is used for indicating whether uncorrectable errors occur in the corresponding row; and determining whether uncorrectable errors occur in the memory according to the row prediction results of the rows.

In an exemplary embodiment, the row prediction module 820 is configured to perform: determining, for each row of the row fault information, whether the adjacent correctable errors are the independent correctable errors according to a physical distance between the adjacent correctable errors; the number of independently correctable errors in each row is counted to obtain the number of independently correctable errors for that row.

In an exemplary embodiment, the row prediction module 820 is configured to perform: if the physical distance between the adjacent correctable errors is greater than a preset distance threshold, determining that the adjacent correctable errors are all independent correctable errors; the preset distance threshold is output after training a model to be trained based on the historical fault information of the memory.

In an exemplary embodiment, the row prediction module 820 is configured to perform: if the number of the independent correctable errors is larger than a first number threshold, determining that uncorrectable errors occur in the corresponding rows; the first quantity threshold is output after training a model to be trained based on the historical fault information of the memory; and the row prediction module 820 is configured to perform: if the row prediction result of at least one row indicates that the row can generate uncorrectable errors, determining that the memory can generate uncorrectable errors.

In an exemplary embodiment, the column fault information includes a plurality of columns therein, and the column prediction module 830 is configured to perform: in a column dimension, acquiring the number of correctable errors in the column fault information of each column; determining a column prediction result corresponding to each column according to the number of the correctable errors, wherein the column prediction result is used for indicating whether uncorrectable errors occur in the corresponding column; and determining whether uncorrectable errors will occur in the memory according to the column prediction results of each column.

In an exemplary embodiment, column prediction module 830 is configured to perform: if the number of the correctable errors is greater than a second number threshold, determining that uncorrectable errors occur in the corresponding columns; the second quantity threshold value is output after training a model to be trained based on the historical fault information of the memory; the determining whether the memory will have uncorrectable errors according to the column prediction results of the columns includes: if there is at least one column for which the column prediction indicates that an uncorrectable error may occur, then it is determined that an uncorrectable error may occur in the memory.

In an exemplary embodiment, the memory failure prediction module 840 is configured to perform: obtaining error position information from the fault information of the memory; extracting the row fault information and column fault information from the error position information; wherein the error location information includes at least slot information.

In an exemplary embodiment, the memory failure prediction module 840 is further configured to perform: and determining a root cause position which causes the uncorrectable error of the memory from the first prediction result and/or the second prediction result in response to the target prediction result indicating that the uncorrectable error of the memory can occur.

The specific details of each module in the above memory failure prediction apparatus are already described in the method portion of the embodiments, and details not disclosed herein may refer to the embodiment of the method portion, that is, the explanation and the beneficial effects of the memory failure prediction method of the above embodiment are also applicable to the memory failure prediction apparatus 800 of the embodiment of the disclosure, which is not further described herein in detail.

Those skilled in the art will appreciate that the various aspects of the present disclosure may be implemented as a system, method, or program product. Accordingly, various aspects of the disclosure may be embodied in the following forms, namely: an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.), or an embodiment combining hardware and software aspects, which may be referred to herein generally as a "circuit," "module," or "system."

An electronic device 900 according to such an embodiment of the present disclosure is described below with reference to fig. 9. The electronic device 900 shown in fig. 9 is merely an example and should not be construed to limit the functionality and scope of use of embodiments of the present disclosure in any way.

As shown in fig. 9, the electronic device 900 is embodied in the form of a general purpose computing device. Components of electronic device 900 may include, but are not limited to: the at least one processing unit 910, the at least one storage unit 920, a bus 930 connecting the different system components (including the storage unit 920 and the processing unit 910), and a display unit 940.

Wherein the storage unit stores program code that is executable by the processing unit 910 such that the processing unit 910 performs steps according to various exemplary embodiments of the present disclosure described in the above-described "exemplary methods" section of the present specification.

The storage unit 920 may include readable media in the form of volatile storage units, such as Random Access Memory (RAM) 9201 and/or cache memory 9202, and may further include Read Only Memory (ROM) 9203.

The storage unit 920 may also include a program/utility 9204 having a set (at least one) of program modules 9205, such program modules 9205 include, but are not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.

The bus 930 may be one or more of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.

The electronic device 900 may also communicate with one or more external devices 1000 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 900, and/or with any device (e.g., router, modem, etc.) that enables the electronic device 900 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 950. Also, electronic device 900 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet, through network adapter 960. As shown, the network adapter 960 communicates with other modules of the electronic device 900 over the bus 930. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with electronic device 900, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.

Furthermore, exemplary embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon a program product capable of implementing the above-described method of the present specification. In some possible implementations, various aspects of the disclosure may also be implemented in the form of a program product comprising program code for causing a terminal device to carry out the steps according to the various exemplary embodiments of the disclosure as described in the "exemplary methods" section of this specification, when the program product is run on the terminal device.

It should be noted that the computer readable medium shown in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.

Furthermore, the program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

Claims

1. The memory fault prediction method is characterized by comprising the following steps of:

Extracting row fault information and column fault information from the fault information of the memory;

Predicting whether an uncorrectable error occurs in the memory based on the row fault information to obtain a first prediction result;

Predicting whether uncorrectable errors occur in the memory based on the column fault information to obtain a second prediction result;

And determining, according to the first prediction result and the second prediction result, a target prediction result of whether an uncorrectable error will occur in the memory.

2. The method of claim 1, wherein the row fault information includes a plurality of rows, and the predicting whether the memory will have an uncorrectable error based on the row fault information, to obtain the first prediction result, includes:

in a row dimension, acquiring the number of independent correctable errors in the row fault information of each row;

determining a row prediction result corresponding to each row according to the number of the independent correctable errors, wherein the row prediction result is used for indicating whether uncorrectable errors occur in the corresponding row;

and determining whether uncorrectable errors occur in the memory according to the row prediction results of the rows.

3. The method of claim 2, wherein the obtaining the number of independently correctable errors in the row fault information for each row in the row dimension comprises:

determining, for each row of the row fault information, whether the adjacent correctable errors are the independent correctable errors according to a physical distance between the adjacent correctable errors;

the number of independently correctable errors in each row is counted to obtain the number of independently correctable errors for that row.

4. A method according to claim 3, wherein said determining, for each row of said row fault information, whether an adjacent correctable error is said independently correctable error based on a physical distance between said adjacent correctable errors comprises:

If the physical distance between the adjacent correctable errors is greater than a preset distance threshold, determining that the adjacent correctable errors are all independent correctable errors;

the preset distance threshold is output after training a model to be trained based on the historical fault information of the memory.

5. The method of claim 2, wherein the determining a row prediction result corresponding to each row according to the number of the independent correctable errors comprises:

if the number of the independent correctable errors is greater than a first number threshold, determining that uncorrectable errors will occur in the corresponding row, wherein the first number threshold is output by a model to be trained after the model is trained based on historical fault information of the memory; and

the determining whether uncorrectable errors occur in the memory according to the row prediction results of the rows comprises:

if the row prediction result of at least one row indicates that uncorrectable errors will occur in that row, determining that uncorrectable errors will occur in the memory.
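
As a minimal sketch of the thresholding and aggregation recited in claim 5, assuming the per-row counts of independent correctable errors are already available as a mapping from row identifier to count (the names `predict_rows` and `memory_will_have_uce` are hypothetical):

```python
from typing import Dict


def predict_rows(independent_ce_counts: Dict[int, int],
                 first_number_threshold: int) -> Dict[int, bool]:
    """Row prediction result: True if the row is predicted to have a UCE."""
    return {
        row: count > first_number_threshold
        for row, count in independent_ce_counts.items()
    }


def memory_will_have_uce(row_predictions: Dict[int, bool]) -> bool:
    """The memory is flagged as soon as at least one row is flagged."""
    return any(row_predictions.values())


# Example: row 7 has 5 independent CEs, row 9 has 1; the threshold is 3.
row_predictions = predict_rows({7: 5, 9: 1}, first_number_threshold=3)
first_prediction_result = memory_will_have_uce(row_predictions)
```

The comparison is strictly "greater than", matching the wording of the claim; a count equal to the threshold does not flag the row.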

6. The method of claim 1, wherein the column fault information includes a plurality of columns, and the predicting whether the memory will have an uncorrectable error based on the column fault information, to obtain a second prediction result includes:

in a column dimension, acquiring the number of correctable errors in the column fault information of each column;

determining a column prediction result corresponding to each column according to the number of the correctable errors, wherein the column prediction result is used for indicating whether uncorrectable errors occur in the corresponding column;

and determining whether the memory will have uncorrectable errors according to the column prediction results of the columns.

7. The method of claim 6, wherein determining a column prediction result corresponding to each column based on the number of correctable errors comprises:

if the number of the correctable errors is greater than a second number threshold, determining that uncorrectable errors will occur in the corresponding column, wherein the second number threshold is output by a model to be trained after the model is trained based on historical fault information of the memory; and

the determining whether the memory will have uncorrectable errors according to the column prediction results of the columns includes:

if there is at least one column for which the column prediction indicates that an uncorrectable error may occur, then it is determined that an uncorrectable error may occur in the memory.
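
The column-dimension prediction of claims 6 and 7 might be sketched as follows. Unlike the row dimension, the claims count all correctable errors in a column rather than only independent ones. The representation of each correctable error as a `(row, column)` pair and the function name `predict_from_columns` are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def predict_from_columns(
    ce_records: Iterable[Tuple[int, int]],
    second_number_threshold: int,
) -> Tuple[Dict[int, bool], bool]:
    """Return (column prediction results, second prediction result).

    Each record is a hypothetical (row, column) pair for one CE.
    """
    # Count the correctable errors that fall in each column.
    ce_per_column = Counter(column for _row, column in ce_records)
    column_predictions = {
        column: count > second_number_threshold
        for column, count in ce_per_column.items()
    }
    # The memory is flagged if at least one column is flagged.
    return column_predictions, any(column_predictions.values())


column_predictions, second_prediction_result = predict_from_columns(
    [(1, 5), (2, 5), (3, 5), (4, 9)], second_number_threshold=2
)
```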

8. The method according to any one of claims 1 to 7, wherein the extracting row fault information and column fault information from the fault information of the memory includes:

obtaining error position information from the fault information of the memory;

extracting the row fault information and the column fault information from the error position information;

wherein the error location information includes at least slot information.
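
One possible way to carry out the extraction of claim 8 is sketched below. The claim only requires that the error location information include slot information; the `row` and `column` fields, the `ErrorLocation` type, and the grouping keys are assumptions introduced for this sketch.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class ErrorLocation:
    """Hypothetical error location record for one correctable error."""
    slot: str    # memory slot, e.g. a DIMM identifier
    row: int     # assumed field
    column: int  # assumed field


def split_by_row_and_column(
    locations: Iterable[ErrorLocation],
) -> Tuple[Dict[Tuple[str, int], List[int]], Dict[Tuple[str, int], List[int]]]:
    """Group CE locations into row fault information and column fault information."""
    row_fault_info = defaultdict(list)     # (slot, row) -> columns hit
    column_fault_info = defaultdict(list)  # (slot, column) -> rows hit
    for loc in locations:
        row_fault_info[(loc.slot, loc.row)].append(loc.column)
        column_fault_info[(loc.slot, loc.column)].append(loc.row)
    return dict(row_fault_info), dict(column_fault_info)
```

Keying both groupings on the slot keeps errors from different memory modules apart, so that a row or column index is never compared across slots.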

9. The method according to any one of claims 1 to 7, further comprising:

in response to the target prediction result indicating that uncorrectable errors will occur in the memory, determining, from the first prediction result and/or the second prediction result, a root cause position that causes the uncorrectable errors in the memory.
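
The root cause determination of claim 9 could, for example, report the rows and columns whose individual prediction results are positive. This is only one hedged reading; the function name `root_cause_positions` and the `("row", id)` / `("column", id)` output format are hypothetical.

```python
from typing import Dict, List, Tuple


def root_cause_positions(
    target_indicates_uce: bool,
    row_predictions: Dict[int, bool],
    column_predictions: Dict[int, bool],
) -> List[Tuple[str, int]]:
    """List flagged rows and columns when the target prediction is positive."""
    if not target_indicates_uce:
        # No UCE is predicted, so there is no root cause to report.
        return []
    causes = [("row", row) for row, flagged in sorted(row_predictions.items()) if flagged]
    causes += [("column", col) for col, flagged in sorted(column_predictions.items()) if flagged]
    return causes
```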

10. A memory failure prediction apparatus, comprising:

an information acquisition module configured to extract row fault information and column fault information from fault information of a memory;

a row prediction module configured to predict, based on the row fault information, whether uncorrectable errors will occur in the memory, to obtain a first prediction result;

a column prediction module configured to predict, based on the column fault information, whether uncorrectable errors will occur in the memory, to obtain a second prediction result; and

a memory fault prediction module configured to determine, according to the first prediction result and the second prediction result, a target prediction result indicating whether uncorrectable errors will occur in the memory.
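
The module arrangement of claim 10 might be composed as in the following sketch. The class name `MemoryFaultPredictor`, the representation of fault information as `(row, column)` pairs, and in particular the rule that combines the two prediction results with a logical OR are assumptions; the claim does not state how the target prediction result is derived from the first and second prediction results.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

FaultInfo = List[Tuple[int, int]]   # hypothetical: (row, column) per CE
Grouped = Dict[int, List[int]]


@dataclass
class MemoryFaultPredictor:
    # Row and column prediction modules are supplied as callables.
    row_module: Callable[[Grouped], bool]
    column_module: Callable[[Grouped], bool]

    def acquire(self, fault_info: FaultInfo) -> Tuple[Grouped, Grouped]:
        """Information acquisition module: split CEs by row and by column."""
        rows: Grouped = {}
        cols: Grouped = {}
        for row, col in fault_info:
            rows.setdefault(row, []).append(col)
            cols.setdefault(col, []).append(row)
        return rows, cols

    def predict(self, fault_info: FaultInfo) -> bool:
        """Memory fault prediction module: derive the target prediction result."""
        rows, cols = self.acquire(fault_info)
        first = self.row_module(rows)
        second = self.column_module(cols)
        return first or second  # assumed combination rule


predictor = MemoryFaultPredictor(
    row_module=lambda rows: any(len(c) > 2 for c in rows.values()),
    column_module=lambda cols: any(len(r) > 2 for r in cols.values()),
)
```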

11. An electronic device, comprising:

a processor; and

a memory for storing executable instructions of the processor;

wherein the processor is configured to perform the method of any one of claims 1 to 9 via execution of the executable instructions.

12. A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any one of claims 1 to 9.