CN118245348A

CN118245348A - Memory fault prediction maintenance method, device, equipment and medium

Info

Publication number: CN118245348A
Application number: CN202410446800.XA
Authority: CN
Inventors: 李鉴洋
Original assignee: Inspur Computer Technology Co Ltd
Current assignee: Inspur Computer Technology Co Ltd
Priority date: 2024-04-12
Filing date: 2024-04-12
Publication date: 2024-06-25

Abstract

The invention discloses a memory fault prediction maintenance method, device, equipment and medium, and relates to the technical field of server memories. The scheme monitors the operation parameters of a target memory and inputs them into a memory life prediction model to obtain the predicted life of the target memory, which is then compared with a dynamic alarm threshold. When the predicted life is smaller than the dynamic alarm threshold, the target memory is confirmed to have a fault. Because the dynamic alarm threshold corresponds to the current theoretical life stage of the target memory, and different theoretical life stages correspond to different dynamic alarm thresholds, the scheme improves the accuracy of memory fault alarms. In addition, after the fault is confirmed, the target memory is maintained according to the fault type, which prolongs the service life of the memory, avoids further damage caused by the fault, and improves the reliability and durability of the server equipment.

Description

Memory fault prediction maintenance method, device, equipment and medium

Technical Field

The present invention relates to the field of server memory technologies, and in particular, to a method, an apparatus, a device, and a medium for predicting and maintaining a memory failure.

Background

With the continued advancement of machine learning and artificial intelligence algorithms, the detection capability of server memory failures is gradually increasing, and real-time detection and immediate response capabilities are becoming more and more important.

In order to better cope with the memory faults and improve the reliability of the server where the memory is located, the faults need to be predicted in advance before the memory faults occur, and corresponding measures are taken to avoid or reduce the influence of the faults on the normal operation of the system. That is, intelligent prediction of memory failure will not only focus on the prediction of failure, but will also perform failure mode analysis and optimization more deeply, and maintenance of memory failure, thereby improving reliability and stability of the server.

In view of the above, how to predict a server memory fault in advance and maintain memory that is about to fail is a problem to be solved by those skilled in the art.

Disclosure of Invention

The invention aims to provide a memory fault prediction maintenance method, device, equipment and medium, so as to predict a server memory fault in advance and maintain memory that is about to fail.

In order to solve the technical problems, the invention provides a memory fault prediction maintenance method which is applied to a server; the method comprises the following steps:

acquiring operation parameters of a target memory, and determining the current theoretical life stage of the target memory;

Inputting the operation parameters of the target memory into a memory life prediction model to obtain the predicted life of the target memory; the memory life prediction model is a model which is constructed in advance based on the relation between the memory life and the corresponding running state;

determining a corresponding dynamic alarm threshold according to the theoretical life stage of the target memory;

Judging whether the predicted life is smaller than the dynamic alarm threshold value or not;

if yes, confirming that the target memory has faults, and determining the fault type of the target memory;

and maintaining the target memory according to the fault type.

In one aspect, the process for constructing the memory life prediction model includes:

Collecting operation parameters of a plurality of memories, and determining the service life of the corresponding memory under each operation parameter; wherein the operation parameters at least comprise a temperature value, a voltage value, a power consumption value and a use condition;

Respectively corresponding the service life of each memory to the operation parameters of each memory to generate a plurality of data processing groups, and dividing each data processing group into a training set and a testing set according to a preset proportion;

Determining a life average value of each data processing group in the training set;

According to the data in each data processing group in the training set and the life average values, respectively determining the sum of squares of deviations of the temperature values between the data processing groups, the sum of squares of deviations of the voltage values between the data processing groups, the sum of squares of deviations of the power consumption values between the data processing groups and the sum of squares of deviations of the usage between the data processing groups;

Respectively determining the error sum of squares for the temperature value, the voltage value, the power consumption value and the usage according to the data in each data processing group in the training set and the life average values;

Performing model training through wrapper feature selection and recursive feature elimination based on the sums of squares of deviations and the error sums of squares, so as to obtain an initial model;

Verifying the initial model based on the data in the test set, and judging whether the accuracy of the initial model is greater than a first threshold;

If yes, outputting the initial model to serve as the memory life prediction model.

In another aspect, after obtaining the initial model, the method further comprises:

Triggering an error checking and correction (ECC) alarm of the memory through an intelligent platform management tool, to verify the accuracy of the initial model's memory life prediction under different memory usage;

Triggering continuous error injection and/or one-time error injection on the memory through an asymmetric encryption algorithm tool, to verify the accuracy of the initial model's memory life prediction under different memory usage and/or power consumption;

verifying the prediction accuracy of the initial model on the service life of the memory under the fluctuation of the voltage of the memory by carrying out contact short circuit on the memory;

and verifying the prediction accuracy of the initial model on the service life of the memory at different temperatures of the memory by adjusting the temperature value of the memory.

In another aspect, the theoretical life stage comprises an alarm stage and an emergency stage, where the memory life in the alarm stage is longer than the memory life in the emergency stage. Correspondingly, determining the corresponding dynamic alarm threshold according to the current theoretical life stage of the target memory includes:

when the theoretical life stage is the alarm stage, acquiring a first average value and a first standard deviation of the life of a plurality of memories in the alarm stage;

Obtaining a first product of the first standard deviation and a first preset coefficient, and adding the first product to the first average value to obtain the dynamic alarm threshold corresponding to the alarm stage;

when the theoretical life stage is the emergency stage, acquiring a second average value and a second standard deviation of the life of a plurality of memories in the emergency stage;

And obtaining a second product of the second standard deviation and a second preset coefficient, and adding the second product and the second average value to obtain the dynamic alarm threshold corresponding to the emergency phase.
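
The threshold rule above (stage average plus a preset coefficient times the stage standard deviation) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name, the sample lifetimes, the coefficients and the choice of the population standard deviation are all assumptions, since the text does not specify them.

```python
from statistics import mean, pstdev


def dynamic_alarm_threshold(stage_lifetimes, coefficient):
    """Return average + coefficient * standard deviation for one life stage."""
    if not stage_lifetimes:
        raise ValueError("at least one lifetime sample is required")
    # Assumption: population standard deviation; the source does not say
    # whether the population or the sample form is intended.
    return mean(stage_lifetimes) + coefficient * pstdev(stage_lifetimes)


# Hypothetical lifetime samples and preset coefficients, for illustration only.
STAGE_SAMPLES = {
    "alarm": ([42.0, 44.0, 44.0, 44.0, 45.0, 45.0, 47.0, 49.0], 1.5),
    "emergency": ([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0], 1.0),
}

# One threshold per theoretical life stage.
thresholds = {
    stage: dynamic_alarm_threshold(samples, k)
    for stage, (samples, k) in STAGE_SAMPLES.items()
}
```

Because each stage uses its own average, standard deviation and coefficient, the threshold follows the stage the target memory is currently in rather than being one fixed value.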

In another aspect, after confirming that the target memory has a fault, the method further comprises:

Outputting memory fault alarm information through a control panel and a management system of the server;

generating a memory fault alarm log based on the management system of the server;

The memory fault alarm log comprises fault grade, fault type, fault position and fault processing opinion of the target memory.
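
As a rough sketch, a log entry carrying the four listed fields could be modelled as below. The class name, field names, JSON serialization and example values are assumptions made for illustration; the text only names the four kinds of information the log contains.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class MemoryFaultAlarm:
    """One alarm log entry with the four fields named in the text."""
    fault_level: str
    fault_type: str
    fault_location: str
    handling_suggestion: str


def to_log_line(alarm):
    """Serialize an alarm as one JSON line (the format is an assumption)."""
    return json.dumps(asdict(alarm), sort_keys=True)


# Hypothetical example values.
example = MemoryFaultAlarm(
    fault_level="emergency",
    fault_type="heat dissipation fault",
    fault_location="DIMM slot A1",
    handling_suggestion="reduce the ambient temperature of the memory",
)
log_line = to_log_line(example)
```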

In another aspect, the maintaining the target memory according to the fault type includes:

When the fault type of the target memory is a heat dissipation fault, reducing the temperature value of the environment where the target memory is located;

when the fault type of the target memory is a voltage fluctuation fault, regulating the voltage value of the target memory;

When the fault type of the target memory is a use frequency fault, closing an application program running in the server;

and when the fault type of the target memory is a memory life fault, confirming that the life of the target memory is limited, and replacing the target memory.
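
The mapping from fault type to maintenance measure listed above can be sketched as a simple lookup. The enum and function names are hypothetical, and the actions are returned as descriptive strings only; how a server actually lowers the temperature or regulates the voltage is not described in the text.

```python
from enum import Enum


class FaultType(Enum):
    HEAT_DISSIPATION = "heat dissipation fault"
    VOLTAGE_FLUCTUATION = "voltage fluctuation fault"
    USE_FREQUENCY = "use frequency fault"
    MEMORY_LIFE = "memory life fault"


# One maintenance measure per fault type, as listed in the text.
MAINTENANCE_ACTIONS = {
    FaultType.HEAT_DISSIPATION: "reduce the temperature of the environment of the target memory",
    FaultType.VOLTAGE_FLUCTUATION: "regulate the voltage of the target memory",
    FaultType.USE_FREQUENCY: "close an application program running in the server",
    FaultType.MEMORY_LIFE: "replace the target memory",
}


def maintenance_action(fault_type):
    """Look up the maintenance measure for a confirmed fault type."""
    return MAINTENANCE_ACTIONS[fault_type]
```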

In another aspect, the method further comprises:

Monitoring the accuracy of the memory life prediction model;

Judging whether the accuracy of the memory life prediction model is smaller than a second threshold value or not;

If the accuracy of the memory life prediction model is smaller than the second threshold, the operation parameters of the memories are collected again, so that the memory life prediction model is retrained according to the operation parameters of the memories;

Wherein the first threshold is greater than the second threshold.
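
The relationship between the two thresholds can be sketched as follows: the first threshold gates acceptance of a newly trained model, and the lower second threshold triggers retraining of a deployed one. The function names and the numeric values are hypothetical; the text leaves both thresholds to the specific implementation.

```python
def check_thresholds(first_threshold, second_threshold):
    """Enforce the stated constraint that the first threshold is greater."""
    if not first_threshold > second_threshold:
        raise ValueError("first threshold must be greater than second threshold")


def needs_retraining(current_accuracy, second_threshold):
    """Retrain when the monitored accuracy drops below the second threshold."""
    return current_accuracy < second_threshold


# Hypothetical values, for illustration only.
FIRST_THRESHOLD = 0.95   # acceptance bar for the initial model
SECOND_THRESHOLD = 0.90  # retraining bar for the deployed model
check_thresholds(FIRST_THRESHOLD, SECOND_THRESHOLD)
```

Keeping the second threshold below the first leaves a margin, so a model that was just accepted is not immediately sent back for retraining.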

In order to solve the technical problems, the invention also provides a memory failure prediction maintenance device which is applied to a server; the device comprises:

The acquisition module is used for acquiring the operation parameters of the target memory and determining the current theoretical life stage of the target memory;

the prediction module is used for inputting the operation parameters of the target memory into a memory life prediction model so as to obtain the predicted life of the target memory; the memory life prediction model is a model which is constructed in advance based on the relation between the memory life and the corresponding running state;

The first determining module is used for determining a corresponding dynamic alarm threshold according to the theoretical life stage of the target memory;

The judging module is used for judging whether the predicted service life is smaller than the dynamic alarm threshold value; if yes, triggering a second determining module;

The second determining module is used for determining that the target memory has faults and determining the fault type of the target memory;

and the maintenance module is used for maintaining the target memory according to the fault type.

In order to solve the above technical problem, the present invention further provides memory failure prediction maintenance equipment, including:

A memory for storing a computer program;

and the processor is used for realizing the steps of the memory failure prediction maintenance method when executing the computer program.

In order to solve the above technical problem, the present invention further provides a computer readable storage medium, where a computer program is stored, where the computer program implements the steps of the memory failure prediction maintenance method when executed by a processor.

The memory fault prediction maintenance method provided by the invention is applied to a server; the method comprises the steps of specifically obtaining operation parameters of a target memory, and determining the current theoretical life stage of the target memory; inputting the operation parameters of the target memory into a memory life prediction model to obtain the predicted life of the target memory; the memory life prediction model is a model which is constructed in advance based on the relation between the memory life and the corresponding running state; determining a corresponding dynamic alarm threshold according to the current theoretical life stage of the target memory; judging whether the predicted life is smaller than a dynamic alarm threshold value or not; if yes, confirming that the target memory has faults, and determining the fault type of the target memory; and maintaining the target memory according to the fault type.

The method has the advantages that the operation parameters of the target memory are monitored, the operation parameters are input into the memory life prediction model, the predicted life of the target memory can be accurately obtained, and the predicted life is compared with the dynamic alarm threshold; when the predicted life is smaller than the dynamic alarm threshold value, confirming that the target memory has faults; because the dynamic alarm threshold corresponds to the current theoretical life stage of the target memory, and different theoretical life stages correspond to different dynamic alarm thresholds, the scheme can effectively improve the accuracy of memory fault alarm; in addition, after confirming the fault of the target memory, the target memory is further maintained according to the fault type, so that the service life of the memory is prolonged, further damage of the memory caused by the fault can be effectively avoided, and the reliability and durability of the server equipment can be improved to the greatest extent.

In addition, the invention also provides a memory fault prediction maintenance device, equipment and medium, and the effects are the same as above.

Drawings

To describe the embodiments of the present invention more clearly, the drawings required by the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art may obtain other drawings from them without inventive effort.

Fig. 1 is a flowchart of a memory failure prediction maintenance method according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of a memory failure prediction maintenance device according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of memory failure prediction maintenance equipment according to an embodiment of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by a person of ordinary skill in the art without making any inventive effort are within the scope of the present invention.

The invention provides a memory fault prediction maintenance method, device, equipment and medium, which are used for predicting a memory fault of a server in advance and maintaining memory that is about to fail.

In order to better understand the aspects of the present invention, the present invention will be described in further detail with reference to the accompanying drawings and detailed description.

To better cope with memory faults and improve the reliability of the server in which the memory is located, faults need to be predicted before they occur, and corresponding measures need to be taken to avoid or reduce their influence on the normal operation of the system. That is, intelligent prediction of memory faults covers not only the prediction itself but also deeper fault mode analysis and optimization, together with maintenance of the faulty memory, thereby improving the reliability and stability of the server. On this basis, the invention provides a memory failure prediction maintenance method to predict a memory failure of the server in advance and maintain memory that is about to fail.

Fig. 1 is a flowchart of a memory failure prediction maintenance method according to an embodiment of the present invention. The method is applied to a server; the method comprises the following steps:

S10: Acquire the operation parameters of the target memory, and determine the current theoretical life stage of the target memory.

In a specific implementation, the operation parameters of the target memory are collected first. The operation parameters of the target memory include at least a temperature value, a voltage value, a power consumption value, and a usage (i.e., the memory usage rate). During collection, the temperature value of the target memory can be collected through a temperature sensor, the voltage value through a voltage sensor, and the power consumption value through a power consumption sensor, while the usage of the target memory can be obtained through the top command on a Linux system.

Next, the current theoretical life stage of the target memory is determined. It is understood that the memory produced by each existing manufacturer has a theoretical life. The theoretical life is divided according to a preset standard to obtain a plurality of theoretical life stages. For example, if the theoretical life of the memory is 10, splitting it at the midpoint of 5 yields two stages, the first half and the second half. The life of the memory differs between theoretical life stages; under this division, the memory life in the first-half stage is longer than the memory life in the second-half stage. Therefore, the current theoretical life stage of the target memory must be taken into account when predicting a memory fault.
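
The midpoint division in the example above can be sketched as a small helper. The function name, the use of elapsed service time to locate the current stage, and the assignment of the boundary value itself to the second half are assumptions; the text only states that the theoretical life is divided by a preset standard.

```python
def theoretical_life_stage(elapsed_life, theoretical_life, split_fraction=0.5):
    """Return the stage a memory is in, given how much of its life has elapsed."""
    if theoretical_life <= 0:
        raise ValueError("theoretical life must be positive")
    boundary = theoretical_life * split_fraction
    # Assumption: a memory exactly at the boundary counts as second half.
    return "first half" if elapsed_life < boundary else "second half"


# With a theoretical life of 10, the boundary falls at 5, as in the example.
stage_early = theoretical_life_stage(3, 10)
stage_late = theoretical_life_stage(7, 10)
```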

S11: Input the operation parameters of the target memory into a memory life prediction model to obtain the predicted life of the target memory.

Further, the operation parameters of the target memory are input into a memory life prediction model to obtain the predicted life of the target memory. It can be understood that the memory life prediction model is a model previously constructed based on the relationship between the memory life and the corresponding running state, and the generation process of the memory life prediction model is not limited in this embodiment, and depends on the specific implementation situation.

S12: Determine a corresponding dynamic alarm threshold according to the current theoretical life stage of the target memory.

As described above, the life of the memory differs between theoretical life stages. Therefore, to judge accurately whether the target memory has a fault, a corresponding dynamic alarm threshold is determined according to the current theoretical life stage of the target memory, so that it can be compared with the predicted life obtained from the memory life prediction model. Different theoretical life stages correspond to different dynamic alarm thresholds. In this embodiment, the manner of determining the dynamic alarm threshold is not limited and depends on the specific implementation.

S13: Judge whether the predicted life is smaller than the dynamic alarm threshold; if yes, proceed to step S14.

S14: Confirm that the target memory has a fault, and determine the fault type of the target memory.

S15: Maintain the target memory according to the fault type.

After the dynamic alarm threshold and the predicted lifetime are obtained, judging whether the predicted lifetime is smaller than the dynamic alarm threshold. If the predicted life is not less than the dynamic alarm threshold, confirming that the target memory has no fault, and ending the process. If the predicted life is smaller than the dynamic alarm threshold, confirming that the target memory has faults, and determining the fault type of the target memory.
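
Steps S11 to S15 can be sketched end to end as below. This is a minimal illustration under stated assumptions: the function name, the callable parameters, the toy prediction and classification rules and all numeric values are hypothetical stand-ins, since the embodiment leaves the model, the thresholds and the maintenance actions to the specific implementation.

```python
from typing import Callable, Mapping, Optional


def predict_and_maintain(
    params: Mapping[str, float],
    life_stage: str,
    predict_life: Callable[[Mapping[str, float]], float],
    stage_thresholds: Mapping[str, float],
    classify_fault: Callable[[Mapping[str, float]], str],
    maintain: Callable[[str], None],
) -> Optional[str]:
    """Return the fault type if a fault is confirmed, otherwise None."""
    predicted_life = predict_life(params)      # S11: predicted life
    threshold = stage_thresholds[life_stage]   # S12: stage-specific threshold
    if predicted_life >= threshold:            # S13: no fault, process ends
        return None
    fault_type = classify_fault(params)        # S14: confirm fault and its type
    maintain(fault_type)                       # S15: targeted maintenance
    return fault_type


# Toy stand-ins, all hypothetical.
def toy_predict_life(params):
    return 100.0 - params["temperature"]


def toy_classify_fault(params):
    if params["temperature"] > 80:
        return "heat dissipation fault"
    return "memory life fault"


performed = []
toy_thresholds = {"first half": 40.0, "second half": 20.0}
faulty = predict_and_maintain(
    {"temperature": 85.0}, "second half", toy_predict_life,
    toy_thresholds, toy_classify_fault, performed.append,
)
healthy = predict_and_maintain(
    {"temperature": 40.0}, "second half", toy_predict_life,
    toy_thresholds, toy_classify_fault, performed.append,
)
```

Passing the model, classifier and maintenance routine in as callables keeps the control flow of S11 to S15 separate from the parts the embodiment deliberately leaves open.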

It should be noted that a memory fault can have various causes, such as excessively high temperature, excessively low temperature, excessively high utilization, excessively high power consumption, or excessive voltage fluctuation. The fault type of the target memory needs to be determined so that the target memory can be maintained in a targeted manner. In this embodiment, the fault types of the target memory include a heat dissipation fault, a voltage fluctuation fault, a use frequency fault and a memory life fault, and the fault type can be determined after the memory fault is confirmed. In this embodiment, the specific process of maintaining the target memory according to the fault type is not limited and depends on the specific implementation.

In this embodiment, the current theoretical life stage of the target memory is determined by acquiring the operation parameters of the target memory; inputting the operation parameters of the target memory into a memory life prediction model to obtain the predicted life of the target memory; the memory life prediction model is a model which is constructed in advance based on the relation between the memory life and the corresponding running state; determining a corresponding dynamic alarm threshold according to the current theoretical life stage of the target memory; judging whether the predicted life is smaller than a dynamic alarm threshold value or not; if yes, confirming that the target memory has faults, and determining the fault type of the target memory; and maintaining the target memory according to the fault type. The method has the advantages that the operation parameters of the target memory are monitored, the operation parameters are input into the memory life prediction model, the predicted life of the target memory can be accurately obtained, and the predicted life is compared with the dynamic alarm threshold; when the predicted life is smaller than the dynamic alarm threshold value, confirming that the target memory has faults; because the dynamic alarm threshold corresponds to the current theoretical life stage of the target memory, and different theoretical life stages correspond to different dynamic alarm thresholds, the scheme can effectively improve the accuracy of memory fault alarm; in addition, after confirming the fault of the target memory, the target memory is further maintained according to the fault type, so that the service life of the memory is prolonged, further damage of the memory caused by the fault can be effectively avoided, and the reliability and durability of the server equipment can be improved to the greatest extent.

Based on the foregoing embodiments, in some embodiments, the process of constructing the memory life prediction model includes:

S111: Collect the operation parameters of a plurality of memories, and determine the life of the corresponding memory under each operation parameter.

The operation parameters at least comprise a temperature value, a voltage value, a power consumption value and a usage.

S112: Pair the life of each memory with the operation parameters of that memory to generate a plurality of data processing groups, and divide each data processing group into a training set and a test set according to a preset proportion.

S113: Determine the life average of each data processing group in the training set.

S114: According to the data in each data processing group in the training set and the life average values, respectively determine the sum of squares of deviations of the temperature values, of the voltage values, of the power consumption values and of the usage between the data processing groups.

S115: According to the data in each data processing group in the training set and the life average values, respectively determine the error sum of squares for the temperature value, the voltage value, the power consumption value and the usage.

S116: Perform model training through wrapper feature selection and recursive feature elimination, based on the sums of squares of deviations and the error sums of squares, to obtain an initial model.

S117: Verify the initial model based on the data in the test set, and judge whether the accuracy of the initial model is greater than a first threshold; if yes, proceed to step S118.

S118: Output the initial model as the memory life prediction model.

To generate the memory life prediction model, the operation parameters of a plurality of memories are collected first, and the life of the corresponding memory under each operation parameter is determined. The operation parameters at least comprise a temperature value, a voltage value, a power consumption value and a usage. Specifically, the temperature value of each memory is collected through a temperature sensor, the voltage value through a voltage sensor, and the power consumption value through a power consumption sensor, while the usage is obtained through the top command on a Linux system.

Table 1 memory life and operating parameter comparison table

In the voltage parameters of Table 1, 1 denotes a stable voltage and 2 denotes a fluctuating voltage; in the power consumption parameters, 100 W represents high power consumption and 50 W represents low power consumption. As shown in Table 1, the life of each memory is paired with the operation parameters of that memory to generate a plurality of data processing groups, i.e., memory groups. The data in each data processing group are used for training and validating the memory life prediction model. Each data processing group is divided into a training set and a test set according to a preset proportion to support the subsequent model training and verification. In this embodiment, the preset proportion is not limited and depends on the specific implementation.

The life average of each data processing group in the training set is then determined. Taking memory group 1 as an example, the life average Y1 of memory group 1 is 48. According to the data in each data processing group in the training set and the life average values, the sum of squares of deviations of the temperature values, of the voltage values, of the power consumption values and of the usage between the data processing groups is determined respectively, as follows:
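
Dividing the records of one data processing group by a preset proportion can be sketched as below. The function name, the shuffling step, the fixed seed and the 0.8 proportion are assumptions; the embodiment does not fix the proportion or say whether the records are shuffled.

```python
import random


def split_by_proportion(records, train_ratio, seed=0):
    """Split records into (training set, test set) by a preset proportion."""
    if not 0.0 < train_ratio < 1.0:
        raise ValueError("train_ratio must be between 0 and 1")
    shuffled = list(records)
    # Assumption: shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical records of one group: (temperature, voltage code, power in W, usage, life).
group_records = [(30 + i, 1, 50, 0.4, 48 - i) for i in range(10)]
train_set, test_set = split_by_proportion(group_records, train_ratio=0.8)
```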

$$SSA_1 = \sum_{i} n_{\text{temperature},i}\left(\bar{Y}_i - \bar{Y}\right)^2$$

Where $SSA_1$ is the sum of squares of deviations of the temperature values between the data processing groups, $n_{\text{temperature},i}$ is the temperature value of the i-th data processing group, $\bar{Y}_i$ is the life average of the i-th data processing group, $\bar{Y}$ is the overall life average across the data processing groups, and i is a positive integer.

$$SSA_2 = \sum_{i} n_{\text{voltage},i}\left(\bar{Y}_i - \bar{Y}\right)^2$$

Where $SSA_2$ is the sum of squares of deviations of the voltage values between the data processing groups, $n_{\text{voltage},i}$ is the voltage value of the i-th data processing group, $\bar{Y}_i$ is the life average of the i-th data processing group, and $\bar{Y}$ is the overall life average across the data processing groups.

$$SSA_3 = \sum_{i} n_{\text{power},i}\left(\bar{Y}_i - \bar{Y}\right)^2$$

Where $SSA_3$ is the sum of squares of deviations of the power consumption values between the data processing groups, $n_{\text{power},i}$ is the power consumption value of the i-th data processing group, $\bar{Y}_i$ is the life average of the i-th data processing group, and $\bar{Y}$ is the overall life average across the data processing groups.

$$SSA_4 = \sum_{i} n_{\text{usage},i}\left(\bar{Y}_i - \bar{Y}\right)^2$$

Where $SSA_4$ is the sum of squares of deviations of the usage between the data processing groups, $n_{\text{usage},i}$ is the usage of the i-th data processing group, $\bar{Y}_i$ is the life average of the i-th data processing group, and $\bar{Y}$ is the overall life average across the data processing groups.

By recording the sums of squares of deviations of the operation parameters of the data processing groups, the influence of different features on memory life can be analyzed.

Further, according to the data in each data processing group in the training set and the life average values, the error sum of squares is determined for the temperature value, the voltage value, the power consumption value and the usage respectively, as follows:

$$SSE = \sum_{i}\sum_{j}\left(Y_{ij} - \bar{Y}_i\right)^2$$

Where $SSE$ is the error sum of squares, $Y_{ij}$ is the j-th observation (i.e., temperature value, voltage value, power consumption value or usage) of the i-th data processing group, and $\bar{Y}_i$ is the life average of the i-th data processing group. Recording the error sum of squares makes it possible to analyze how different conditions of the same feature influence memory life.
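
The two quantities can be sketched numerically as below. Since the formulas themselves are not legible in this text, the sketch assumes the standard analysis-of-variance forms implied by the symbol definitions: a weighted between-group sum, weight times the squared gap between a group's life average and the overall life average, and a within-group sum of squared gaps between each observation and its group average. The function names, the sample data, and the choice of the overall average as the mean over all observations are assumptions.

```python
from statistics import mean


def between_group_sum_of_squares(weights, group_lifetimes):
    """Assumed form: sum over groups of weight_i * (group mean - overall mean)^2."""
    group_means = [mean(group) for group in group_lifetimes]
    # Assumption: the overall average is taken over all observations.
    overall = mean([value for group in group_lifetimes for value in group])
    return sum(w * (m - overall) ** 2 for w, m in zip(weights, group_means))


def error_sum_of_squares(group_observations):
    """Assumed form: sum over groups and observations of (Y_ij - group mean)^2."""
    total = 0.0
    for group in group_observations:
        group_mean = mean(group)
        total += sum((y - group_mean) ** 2 for y in group)
    return total


# Hypothetical data: two memory groups; group 1 has a life average of 48.
lifetimes = [[46, 48, 50], [58, 60, 62]]
temperature_weights = [2, 3]  # hypothetical per-group parameter values n_i
ssa_temperature = between_group_sum_of_squares(temperature_weights, lifetimes)
sse = error_sum_of_squares(lifetimes)
```

With these numbers the group averages are 48 and 60 and the overall average is 54, so each group contributes its weight times 36 to the between-group sum.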

Model training is then performed through wrapper feature selection and recursive feature elimination, based on the sums of squares of deviations and the error sums of squares, to obtain an initial model. Wrapper feature selection (Wrapper Method) operates together with model training: the collected data are repeatedly fed into machine learning model training, and the best feature subset is selected by repeatedly training the model and evaluating the performance of each candidate subset. At the same time, the trained model is repeatedly tested with recursive feature elimination (Recursive Feature Elimination, RFE) to find the optimal data range of each feature that affects memory life. For example, let F = (X1, X2, ..., Xn), where F is a test result, each X is a feature, and Xn is a feature that showed no effect in the test. Here the features are the operation parameters of the memory, such as the temperature value, the power consumption value and the voltage value. A feature Xn that shows no effect is resampled and tested repeatedly to check whether it can influence memory life. If repeated tests show that a feature does not influence memory life, that feature is eliminated, the remaining features are selected as the prediction basis, and training yields the initial model.
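
The elimination loop can be illustrated with a small, self-contained sketch of recursive feature elimination. It assumes an ordinary least-squares linear model as the underlying learner and ranks features by the absolute value of their fitted weights, which presumes the features are scaled comparably; the text does not name the model, so the model choice, the function names and the toy data are all assumptions.

```python
def fit_linear(rows, targets):
    """Least squares via normal equations; returns [intercept, w1, ..., wk]."""
    design = [[1.0] + list(row) for row in rows]
    k = len(design[0])
    # Augmented normal-equation matrix [X^T X | X^T y].
    a = [
        [sum(d[i] * d[j] for d in design) for j in range(k)]
        + [sum(d[i] * t for d, t in zip(design, targets))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("features are collinear; cannot fit")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


def recursive_feature_elimination(names, rows, targets, n_keep):
    """Refit repeatedly, dropping the weakest feature until n_keep remain."""
    names = list(names)
    rows = [list(row) for row in rows]
    while len(names) > n_keep:
        weights = fit_linear(rows, targets)[1:]  # skip the intercept
        weakest = min(range(len(names)), key=lambda i: abs(weights[i]))
        del names[weakest]
        for row in rows:
            del row[weakest]
    return names


# Toy data (hypothetical): life depends on temperature and voltage only.
feature_names = ["temperature", "voltage", "noise"]
samples = [(0, 0, 0), (1, 0, 1), (0, 1, 1), (1, 1, 0), (0.5, 0.5, 1), (0.5, 0, 0)]
life = [60 - 20 * t - 10 * v for t, v, _ in samples]
kept = recursive_feature_elimination(feature_names, samples, life, n_keep=2)
```

In the toy data the third feature has no influence on life, so its fitted weight is close to zero and it is the first to be eliminated.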

After the initial model is generated, verifying the initial model based on the data in the test set, and judging whether the accuracy of the initial model is greater than a first threshold. It should be noted that the accuracy refers to the ratio of the number of correctly classified samples to the total number of samples, and the specific formula is as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN);

Where TP represents a true positive example (True Positive), TN represents a true negative example (True Negative), FP represents a false positive example (False Positive), and FN represents a false negative example (False Negative). In the initial model, the positive category indicates that an alarm is triggered and the negative category indicates that no alarm is triggered. A true positive means that the model correctly predicts a memory that actually triggers an alarm as triggering an alarm; a true negative means that the model correctly predicts a memory that actually does not trigger an alarm as not triggering an alarm; a false positive means that the model predicts a memory that actually does not trigger an alarm as triggering an alarm; and a false negative means that the model predicts a memory that actually triggers an alarm as not triggering an alarm.
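
The accuracy formula translates directly into code. The function name and the counts in the example are hypothetical.

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one sample is required")
    return (tp + tn) / total


# Hypothetical test-set counts: 40 correct alarms, 50 correct non-alarms,
# 5 false alarms and 5 missed alarms.
example_accuracy = accuracy(tp=40, tn=50, fp=5, fn=5)
```

For these counts, 90 of the 100 samples are classified correctly, giving an accuracy of 0.9, which is the value compared against the first threshold.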

If the accuracy rate of the initial model is not greater than the first threshold, the accuracy rate of the initial model is considered to be low, and data are required to be collected again for retraining, so that the occurrence times of false positive examples and false negative examples are reduced; and if the accuracy rate of the initial model is confirmed to be larger than the first threshold value, the accuracy rate of the initial model is considered to be qualified, and the initial model is output to serve as a memory life prediction model. In addition, the first threshold is not limited in this embodiment, and depends on the specific implementation.
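The accuracy check described above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the function names and the representation of labels as booleans (True meaning that an alarm is triggered) are assumptions made for the example, and the value of the first threshold is left to the caller because the embodiment does not limit it.

```python
def confusion_counts(actual, predicted):
    """Count TP, TN, FP, FN, where True means 'alarm triggered'."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a and p:
            tp += 1          # alarm actually triggered and predicted
        elif not a and not p:
            tn += 1          # no alarm, and none predicted
        elif not a and p:
            fp += 1          # alarm predicted but not actually triggered
        else:
            fn += 1          # alarm actually triggered but not predicted
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no samples to evaluate")
    return (tp + tn) / total


def accept_initial_model(actual, predicted, first_threshold):
    """Return True only if the accuracy is strictly greater than the first threshold."""
    return accuracy(*confusion_counts(actual, predicted)) > first_threshold
```

When the function returns False, the flow in the text goes back to data collection and retraining rather than outputting the model.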

In conclusion, the generation of the memory life prediction model is realized, and the predicted life of the target memory can be accurately predicted through the memory life prediction model, so that the comparison with a dynamic alarm threshold value can be conveniently carried out later, and whether the target memory fails or not can be determined.

In the above embodiment, when the memory life prediction model is constructed, the accuracy of the model is mainly verified through the test set data. In a specific implementation, the accuracy of the model can also be verified by manually triggering memory faults. Thus, after the initial model is obtained, the method further comprises:

S16: Triggering an error checking and correction alarm of the memory through an intelligent platform management tool to verify the prediction accuracy of the initial model for the memory life under different use conditions of the memory.

S17: Triggering continuous error injection errors and/or one-time error injection errors of the memory through an asymmetric encryption algorithm tool to verify the prediction accuracy of the initial model for the memory life under different use conditions and/or power consumption of the memory.

S18: Verifying the prediction accuracy of the initial model for the memory life under memory voltage fluctuation by short-circuiting the contacts of the memory.

S19: Verifying the prediction accuracy of the initial model for the memory life at different memory temperatures by adjusting the temperature value of the memory.

Specifically, an error checking and correction (ECC) alarm is triggered through the intelligent platform management tool (ipmitool) to verify the prediction accuracy of the initial model for the memory life under different use conditions of the memory. A correctable ECC alarm can be triggered to simulate short-time use of the target memory:

05h - Correctable ECC / other correctable memory error logging limit reached;

ipmitool -I lanplus -U root -P root -H 100.2.76.32 raw 0x0A 0x44 0x00 0x01 0x02 0x0 0x0 0x0 0x0 0x21 0x00 0x04 0x0C 0xf9 0x6f 0x05 0x00 0x00;

In addition, an uncorrectable ECC alarm can be triggered to simulate long-time use of the target memory:

01h - Uncorrectable ECC / other uncorrectable memory error;

ipmitool -I lanplus -U root -P root -H 100.2.76.32 raw 0x0A 0x44 0x00 0x01 0x02 0x0 0x0 0x0 0x0 0x21 0x00 0x04 0x0C 0xf9 0x6f 0x01 0x00 0x00;
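The two commands above differ only in one byte. The sketch below rebuilds the `raw` arguments from a single event offset so that the difference is explicit. The field names in the comments reflect one reading of the 16-byte system event record used by the IPMI Add SEL Entry request (network function 0x0A, command 0x44) and should be treated as an assumption rather than as part of the embodiment; the function name and constants are hypothetical. The sketch only builds the argument list and does not invoke ipmitool.

```python
# Event offsets quoted in the text for the memory sensor type.
CORRECTABLE_ECC_LOGGING_LIMIT = 0x05   # "05h" in the text
UNCORRECTABLE_ECC = 0x01               # "01h" in the text


def add_sel_entry_args(event_offset: int) -> list[str]:
    """Build the arguments that follow the connection options of ipmitool."""
    record = [
        0x00, 0x01,              # record ID (assumed field name)
        0x02,                    # record type: system event (assumed)
        0x00, 0x00, 0x00, 0x00,  # timestamp (assumed)
        0x21, 0x00,              # generator ID (assumed)
        0x04,                    # event message format revision (assumed)
        0x0C,                    # sensor type: memory (assumed)
        0xF9,                    # sensor number (assumed)
        0x6F,                    # event type: sensor-specific (assumed)
        event_offset,            # event data 1: the ECC event offset
        0x00, 0x00,              # event data 2 and 3
    ]
    return ["raw", "0x0A", "0x44"] + [f"0x{byte:02x}" for byte in record]
```

The connection options shown in the text (`-I lanplus -U root -P root -H 100.2.76.32`) would precede these arguments when the command is actually issued.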

Meanwhile, continuous error injection errors (persistent error injection) and/or one-time error injection errors (one-shot error injection) of the memory can be triggered through the asymmetric encryption algorithm (RAS) tool to verify the prediction accuracy of the initial model for the memory life under different use conditions and/or power consumption of the memory. Persistent error injection allows the diagnostic firmware to inject errors into future dynamic random access memory (DRAM) writes, which triggers conditions of high memory access and high power consumption; one-shot error injection is used to trigger conditions of low memory access rate and low power consumption.

Further, the prediction accuracy of the initial model for the memory life under memory voltage fluctuation is verified by short-circuiting the contacts of the memory, and the prediction accuracy of the initial model for the memory life at different memory temperatures is verified by adjusting the temperature value of the memory. In addition, training errors of the memory life prediction model can be triggered by damaging the gold fingers (edge contacts) of the memory.

Therefore, the accuracy of the memory life prediction model can be verified in an auxiliary mode through the mode of manually triggering the memory faults, so that references are provided for training of the memory life prediction model.

Based on the above embodiments, in some embodiments, the theoretical life stage includes an alarm stage (warning) and an emergency stage (critical). The stages are divided at the median (midpoint) of the memory life: assuming a memory life of 10, the first 5 is the warning stage and the last 5 is the critical stage. That is, the memory life in the alarm stage is greater than the memory life in the emergency stage.
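The division into stages can be illustrated with a short sketch. It assumes that the stage is decided by comparing the elapsed use time with half of the theoretical life, and that the exact midpoint belongs to the critical stage; neither detail is fixed by the embodiment, and the function name is hypothetical.

```python
def theoretical_life_stage(elapsed_use: float, theoretical_life: float) -> str:
    """Return 'warning' for the first half of the life and 'critical' for the second half."""
    if theoretical_life <= 0:
        raise ValueError("theoretical_life must be positive")
    if elapsed_use < 0:
        raise ValueError("elapsed_use must not be negative")
    midpoint = theoretical_life / 2   # e.g. 5 for a memory life of 10
    # Assumption: the midpoint itself is counted as the critical stage.
    return "warning" if elapsed_use < midpoint else "critical"
```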

It should be noted that, as the memory usage time increases, the dynamic alarm threshold in each stage decreases, so that the memory status can be monitored more easily after the memory usage time becomes longer, and the accuracy is improved. Therefore, determining the corresponding dynamic alarm threshold according to the theoretical life stage of the target memory at present includes:

S121: When the theoretical life stage is the alarm stage, acquiring a first average value and a first standard deviation of the lives of a plurality of memories in the alarm stage.

S122: Obtaining a first product of the first standard deviation and a first preset coefficient, and adding the first product to the first average value to obtain the dynamic alarm threshold corresponding to the alarm stage.

S123: When the theoretical life stage is the emergency stage, acquiring a second average value and a second standard deviation of the lives of a plurality of memories in the emergency stage.

S124: Obtaining a second product of the second standard deviation and a second preset coefficient, and adding the second product to the second average value to obtain the dynamic alarm threshold corresponding to the emergency stage.

In a specific implementation, in order to determine a dynamic alarm threshold corresponding to an alarm phase, a first average value and a first standard deviation of life of a plurality of memories in the alarm phase are specifically obtained. It will be appreciated that the first average value represents a baseline level of memory life in the alert phase and the first standard deviation represents a degree of dispersion of memory life in the alert phase. Further obtaining a first product of the first standard deviation and a first preset coefficient, and adding the first product and the first average value to obtain a dynamic alarm threshold corresponding to the alarm stage.

In order to determine the dynamic alarm threshold corresponding to the emergency phase, a second average value and a second standard deviation of the service lives of the memories in the emergency phase are specifically obtained. It will be appreciated that the second average value represents a baseline level of memory life during the emergency phase and the second standard deviation represents a degree of dispersion of memory life during the emergency phase. Further obtaining a second product of the second standard deviation and a second preset coefficient, and adding the second product and the second average value to obtain a dynamic alarm threshold corresponding to the emergency phase.

It can be understood that the first preset coefficient and the second preset coefficient are used for controlling the offset degree of the dynamic alarm threshold value relative to the corresponding average value, and can be adjusted according to the actual situation, which is not limited in this embodiment.
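The threshold calculation of steps S121 to S124 amounts to threshold = average + coefficient × standard deviation, evaluated separately for each stage. A minimal sketch follows; the use of the population standard deviation, the sample values and the coefficient values are assumptions chosen for illustration, since the embodiment leaves them to the implementation.

```python
from statistics import mean, pstdev


def dynamic_alarm_threshold(stage_lives, preset_coefficient):
    """Average of the lives in one stage plus coefficient times their standard deviation."""
    if not stage_lives:
        raise ValueError("at least one memory life is required")
    # Assumption: population standard deviation; a sample deviation would also fit the text.
    return mean(stage_lives) + preset_coefficient * pstdev(stage_lives)


def has_fault(predicted_life, threshold):
    """A fault is confirmed when the predicted life is smaller than the threshold."""
    return predicted_life < threshold
```

For example, lives of 2, 4, 4, 4, 5, 5, 7 and 9 have an average of 5 and a population standard deviation of 2, so a coefficient of -0.5 gives a threshold of 4 and a coefficient of 1 gives a threshold of 7. The coefficient therefore controls how far the threshold is offset from the stage average.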

In this embodiment, by determining the dynamic alarm threshold corresponding to each theoretical life stage and comparing the dynamic alarm threshold with the predicted life of the target memory, the fault of the target memory can be more accurately judged according to the current theoretical life of the target memory, the judging process adapts to different workloads and environmental conditions, and the reliability of judging the fault of the memory is improved.

In order to better record the memory failure information, in some embodiments, after confirming that the target memory has a failure, the method further includes:

S20: Outputting memory fault alarm information through a control panel and a management system of the server.

S21: Generating a memory fault alarm log based on the management system of the server.

The memory fault alarm log comprises a fault grade, a fault type, a fault position and a fault handling opinion of the target memory.

Specifically, after it is determined that the target memory has a fault, the control panel of the server outputs memory fault alarm information, and the user can see the alarm indicated by the indicator lamp on the front control panel of the server chassis. At the same time, the memory fault alarm information is output through the management system of the server, where the user can view an alarm prompt about the memory fault.

Further, the management system based on the server generates a memory fault alarm log, and the user can find the memory fault alarm log in an alarm log interface of the management system of the server, wherein the log specifically comprises a fault grade, a fault type, a fault position and a fault processing opinion of the target memory. And the user can maintain the fault target memory according to the fault handling opinion in the memory fault alarm log. Therefore, through outputting the memory fault alarm information and generating the memory fault alarm log, a user can timely acquire the occurred memory fault and check specific fault content.
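One possible shape for a log entry with the four fields named above is sketched below. The field names, the JSON serialization and the example values (including the slot label) are hypothetical and are not specified by the embodiment.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class MemoryFaultAlarmLog:
    fault_grade: str             # e.g. the life stage in which the alarm occurred
    fault_type: str              # e.g. "voltage fluctuation fault"
    fault_position: str          # e.g. a slot label (hypothetical format)
    fault_handling_opinion: str  # maintenance suggestion shown to the user


def format_alarm_log(entry: MemoryFaultAlarmLog) -> str:
    """Serialize one log entry as a single JSON line (assumed log format)."""
    return json.dumps(asdict(entry), ensure_ascii=False)
```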

In order to maintain the failed target memory in a targeted manner, in some embodiments, the maintaining the target memory according to the failure type includes:

S151: When the fault type of the target memory is a heat dissipation fault, reducing the temperature value of the environment where the target memory is located.

S152: When the fault type of the target memory is a voltage fluctuation fault, adjusting the voltage value of the target memory.

S153: When the fault type of the target memory is a use frequency fault, closing the application program running in the server.

S154: When the fault type of the target memory is a memory life fault, confirming that the life of the target memory has reached its limit, and replacing the target memory.

Specifically, when the fault type of the target memory is a heat dissipation fault, maintenance personnel need to check the temperature of the environment where the target memory is located and reduce that temperature, thereby eliminating the heat dissipation fault of the target memory. When the fault type of the target memory is a voltage fluctuation fault, maintenance personnel need to measure the voltage value of the target memory in the server with a meter (for example, a multimeter) and adjust the voltage value of the target memory, thereby eliminating the voltage fluctuation fault of the target memory. When the fault type of the target memory is a use frequency fault, maintenance personnel need to check whether a program that has not been closed keeps accessing the target memory in the server system, and close the application program running in the server. When the fault type of the target memory is a memory life fault, it is confirmed that the life of the target memory has reached its limit, and the memory needs to be replaced in time, so that shutdown events are avoided, production interruptions of a production line or equipment caused by the fault are reduced, and production efficiency is improved.

In addition, the theoretical life stage of the target memory is taken into account. When the target memory works in the warning stage, the memory life is in its first half, and external environmental factors (for example, a change in the temperature of the memory environment or an accidental touch of the memory) should be considered first when an alarm occurs; after the parameters of the working environment are adjusted back to normal in the manner described above for maintaining the target memory according to the fault type, the memory can continue to work. When the target memory works in the critical stage, the memory life is in its second half, and external environmental factors are likewise checked first when an alarm occurs; if an alarm still appears some time after the working environment parameters have been adjusted back to normal in the manner described above, the memory needs to be replaced with a new one.

Therefore, the target memory is maintained in a targeted manner according to the fault type of the target memory, and the operation reliability of the server is improved.
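The mapping from fault type to maintenance action in steps S151 to S154, together with the stage-dependent replacement rule, can be sketched as a lookup table. The enum, the table and the function names are hypothetical; the action strings paraphrase the steps of the embodiment.

```python
from enum import Enum


class FaultType(Enum):
    HEAT_DISSIPATION = "heat dissipation fault"
    VOLTAGE_FLUCTUATION = "voltage fluctuation fault"
    USE_FREQUENCY = "use frequency fault"
    MEMORY_LIFE = "memory life fault"


MAINTENANCE_ACTIONS = {
    FaultType.HEAT_DISSIPATION: "reduce the temperature of the environment of the target memory",
    FaultType.VOLTAGE_FLUCTUATION: "adjust the voltage value of the target memory",
    FaultType.USE_FREQUENCY: "close the application program running in the server",
    FaultType.MEMORY_LIFE: "replace the target memory",
}


def maintenance_action(fault_type: FaultType) -> str:
    """Look up the maintenance action for a confirmed fault type."""
    return MAINTENANCE_ACTIONS[fault_type]


def needs_replacement(stage: str, fault_type: FaultType, alarm_persists: bool) -> bool:
    """Replace on a life fault, or in the critical stage when the alarm returns
    after the working environment has been adjusted back to normal."""
    if fault_type is FaultType.MEMORY_LIFE:
        return True
    return stage == "critical" and alarm_persists
```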

On the basis of the above embodiments, in some embodiments, the method further includes:

S22: Monitoring the accuracy of the memory life prediction model.

S23: Judging whether the accuracy of the memory life prediction model is smaller than a second threshold; if yes, proceeding to step S24.

S24: Re-collecting the operation parameters of the plurality of memories to retrain the memory life prediction model according to the operation parameters of the plurality of memories.

Wherein the first threshold is greater than the second threshold.

In a specific implementation, the memory life prediction model is specifically obtained by training according to historical operation parameters of the memory. As the server operates, the life and operating parameters of the memory may change, which may result in inaccurate prediction of the memory life prediction model. Therefore, in order to ensure the accuracy of the memory life prediction model, the accuracy of the memory life prediction model needs to be continuously monitored after the memory life prediction model is obtained.

And judging whether the accuracy of the memory life prediction model is smaller than a second threshold value. And if the accuracy of the memory life prediction model is not less than the second threshold, the accuracy of the current memory life prediction model is qualified. If the accuracy of the memory life prediction model is smaller than the second threshold, the accuracy of the current memory life prediction model is considered to be unqualified, and the model prediction is inaccurate. The operating parameters of the plurality of memories need to be re-collected to re-train the memory life prediction model according to the operating parameters of the plurality of memories. It should be noted that, in this embodiment, the first threshold is greater than the second threshold, which can provide a certain redundancy range for the accuracy monitoring of the memory life prediction model.
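The relation between the two thresholds can be sketched as follows. The numeric values used in the usage note are illustrative only, and the function names are hypothetical; the embodiment requires only that the first threshold be greater than the second.

```python
def validate_thresholds(first_threshold: float, second_threshold: float) -> None:
    """The embodiment requires the first threshold to be greater than the second."""
    if not first_threshold > second_threshold:
        raise ValueError("first_threshold must be greater than second_threshold")


def needs_retraining(monitored_accuracy: float,
                     first_threshold: float,
                     second_threshold: float) -> bool:
    """Retrain only when the monitored accuracy drops below the second threshold."""
    validate_thresholds(first_threshold, second_threshold)
    return monitored_accuracy < second_threshold
```

With a first threshold of 0.90 and a second threshold of 0.85, a model accepted at 0.91 is not retrained when its monitored accuracy drifts to 0.87, because that value still lies inside the redundancy range between the two thresholds; retraining starts only once the accuracy falls below 0.85.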

In the above embodiments, the detailed description is given to the memory failure prediction maintenance method, and the invention also provides a corresponding embodiment of the memory failure prediction maintenance device.

Fig. 2 is a schematic diagram of a memory failure prediction maintenance device according to an embodiment of the present invention. The device is applied to the server; as shown in fig. 2, the memory failure prediction apparatus includes:

The acquiring module 10 is configured to acquire the operation parameters of the target memory and determine the current theoretical life stage of the target memory.

The prediction module 11 is configured to input an operation parameter of the target memory into the memory life prediction model to obtain a predicted life of the target memory; the memory life prediction model is a model which is constructed in advance based on the relation between the memory life and the corresponding running state.

The first determining module 12 is configured to determine the corresponding dynamic alarm threshold according to the current theoretical life stage of the target memory.

The judging module 13 is configured to judge whether the predicted life is less than the dynamic alarm threshold; if yes, the second determining module is triggered.

The second determining module 14 is configured to confirm that the target memory has a fault, and determine the fault type of the target memory.

The maintenance module 15 is configured to maintain the target memory according to the fault type.
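The way the modules hand results to one another can be sketched as a single function that receives each module as a callable. This is an illustrative wiring under the assumption that the prediction model, the fault classifier and the maintenance routine are supplied by the caller; the function and parameter names are hypothetical.

```python
from typing import Callable, Mapping, Optional


def run_once(
    operation_params: Mapping[str, float],               # output of acquiring module 10
    life_stage: str,                                     # also from acquiring module 10
    predict_life: Callable[[Mapping[str, float]], float],
    thresholds: Mapping[str, float],
    classify_fault: Callable[[Mapping[str, float]], str],
    maintain: Callable[[str], None],
) -> Optional[str]:
    """Return the fault type if a fault is confirmed, otherwise None."""
    predicted_life = predict_life(operation_params)      # prediction module 11
    threshold = thresholds[life_stage]                   # first determining module 12
    if predicted_life < threshold:                       # judging module 13
        fault_type = classify_fault(operation_params)    # second determining module 14
        maintain(fault_type)                             # maintenance module 15
        return fault_type
    return None
```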

In some embodiments, the process of constructing the memory life prediction model specifically includes:

Operation parameters of a plurality of memories are collected, and the life of the corresponding memory under each operation parameter is determined, wherein the operation parameters at least comprise a temperature value, a voltage value, a power consumption value and a use condition. The life of each memory is matched with the operation parameters of that memory to generate a plurality of data processing groups, and the data processing groups are divided into a training set and a test set according to a preset proportion. The life average value of each data processing group in the training set is determined. According to the data and the life average value of each data processing group in the training set, the sum of squares of deviations of the temperature values between the data processing groups, the sum of squares of deviations of the voltage values between the data processing groups, the sum of squares of deviations of the power consumption values between the data processing groups and the sum of squares of deviations of the use conditions between the data processing groups are respectively determined. The sums of squares of errors of the temperature value, the voltage value, the power consumption value and the use condition are respectively determined according to the data and the life average value of each data processing group in the training set. Model training is performed through wrapper feature selection and recursive feature elimination based on each sum of squares of deviations and each sum of squares of errors to obtain an initial model. The initial model is verified based on the data in the test set, and whether the accuracy of the initial model is greater than a first threshold is judged; if yes, the initial model is output as the memory life prediction model.

In some embodiments, further comprising:

The first triggering submodule is configured to trigger an error checking and correction alarm of the memory through the intelligent platform management tool, so as to verify the prediction accuracy of the initial model for the memory life under different use conditions of the memory.

The second triggering submodule is configured to trigger continuous error injection errors and/or one-time error injection errors of the memory through the asymmetric encryption algorithm tool, so as to verify the prediction accuracy of the initial model for the memory life under different use conditions and/or power consumption of the memory.

The third triggering submodule is configured to verify the prediction accuracy of the initial model for the memory life under memory voltage fluctuation by short-circuiting the contacts of the memory.

The fourth triggering submodule is configured to verify the prediction accuracy of the initial model for the memory life at different memory temperatures by adjusting the temperature value of the memory.

In some embodiments, the theoretical life stage includes an alert stage and an emergency stage; wherein, the memory life in the alarm stage is longer than the memory life in the emergency stage; correspondingly, the first determining module 12 comprises:

The first acquisition submodule is configured to acquire a first average value and a first standard deviation of the lives of a plurality of memories in the alarm stage when the theoretical life stage is the alarm stage.

The first summation submodule is configured to obtain a first product of the first standard deviation and a first preset coefficient, and add the first product to the first average value to obtain the dynamic alarm threshold corresponding to the alarm stage.

The second acquisition submodule is configured to acquire a second average value and a second standard deviation of the lives of a plurality of memories in the emergency stage when the theoretical life stage is the emergency stage.

The second summation submodule is configured to obtain a second product of the second standard deviation and a second preset coefficient, and add the second product to the second average value to obtain the dynamic alarm threshold corresponding to the emergency stage.

In some embodiments, further comprising:

The first output submodule is configured to output memory fault alarm information through the control panel and the management system of the server.

The first generation submodule is configured to generate a memory fault alarm log based on the management system of the server.

In some embodiments, maintenance module 15 includes:

The first maintenance submodule is configured to reduce the temperature value of the environment where the target memory is located when the fault type of the target memory is a heat dissipation fault.

The second maintenance submodule is configured to adjust the voltage value of the target memory when the fault type of the target memory is a voltage fluctuation fault.

The third maintenance submodule is configured to close the application program running in the server when the fault type of the target memory is a use frequency fault.

The fourth maintenance submodule is configured to confirm that the life of the target memory has reached its limit and replace the target memory when the fault type of the target memory is a memory life fault.

In some embodiments, further comprising:

The first monitoring submodule is configured to monitor the accuracy of the memory life prediction model.

The first judging submodule is configured to judge whether the accuracy of the memory life prediction model is smaller than a second threshold; if yes, the training submodule is triggered.

The training submodule is configured to re-collect the operation parameters of the plurality of memories, so as to retrain the memory life prediction model according to the operation parameters of the plurality of memories.

Wherein the first threshold is greater than the second threshold.

In this embodiment, the memory failure prediction maintenance device includes an acquisition module, a prediction module, a first determination module, a judgment module, a second determination module, and a maintenance module. The memory failure prediction maintenance device can realize all the steps of the memory failure prediction maintenance method when in operation. The method comprises the steps of obtaining operation parameters of a target memory and determining the current theoretical life stage of the target memory; inputting the operation parameters of the target memory into a memory life prediction model to obtain the predicted life of the target memory; the memory life prediction model is a model which is constructed in advance based on the relation between the memory life and the corresponding running state; determining a corresponding dynamic alarm threshold according to the current theoretical life stage of the target memory; judging whether the predicted life is smaller than a dynamic alarm threshold value or not; if yes, confirming that the target memory has faults, and determining the fault type of the target memory; and maintaining the target memory according to the fault type. 
The method has the advantages that the operation parameters of the target memory are monitored, the operation parameters are input into the memory life prediction model, the predicted life of the target memory can be accurately obtained, and the predicted life is compared with the dynamic alarm threshold; when the predicted life is smaller than the dynamic alarm threshold value, confirming that the target memory has faults; because the dynamic alarm threshold corresponds to the current theoretical life stage of the target memory, and different theoretical life stages correspond to different dynamic alarm thresholds, the scheme can effectively improve the accuracy of memory fault alarm; in addition, after confirming the fault of the target memory, the target memory is further maintained according to the fault type, so that the service life of the memory is prolonged, further damage of the memory caused by the fault can be effectively avoided, and the reliability and durability of the server equipment can be improved to the greatest extent.

Fig. 3 is a schematic diagram of a memory failure prediction maintenance device according to an embodiment of the present invention. As shown in fig. 3, the memory failure prediction maintenance apparatus includes:

A memory 20 for storing a computer program.

A processor 21 for implementing the steps of the memory failure prediction maintenance method as mentioned in the above embodiments when executing a computer program.

The memory failure prediction maintenance device provided in this embodiment may include, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, or the like.

The processor 21 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 21 may be implemented in at least one hardware form of a digital signal processor (DSP), a field-programmable gate array (FPGA), or a programmable logic array (PLA). The processor 21 may also include a main processor and a coprocessor: the main processor, also referred to as a central processing unit (CPU), processes data in the awake state, and the coprocessor is a low-power processor that processes data in the standby state. In some embodiments, the processor 21 may integrate a graphics processing unit (GPU), which is responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 21 may also include an artificial intelligence (AI) processor for processing computing operations related to machine learning.

The memory 20 may include one or more computer-readable storage media, which may be non-transitory. The memory 20 may also include high-speed random access memory as well as non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In this embodiment, the memory 20 is at least used for storing a computer program 201 which, after being loaded and executed by the processor 21, can implement the relevant steps of the memory failure prediction maintenance method disclosed in any of the foregoing embodiments. In addition, the resources stored in the memory 20 may further include an operating system 202, data 203, and the like, and the storage may be transient or permanent. The operating system 202 may include Windows, Unix, Linux, and the like. The data 203 may include, but is not limited to, data related to the memory failure prediction maintenance method.

In some embodiments, the memory failure prediction maintenance device may further include a display 22, an input/output interface 23, a communication interface 24, a power supply 25, and a communication bus 26.

Those skilled in the art will appreciate that the structure shown in fig. 3 does not limit the memory failure prediction maintenance device, which may include more or fewer components than illustrated.

In this embodiment, the memory failure prediction maintenance device includes a memory and a processor. The memory is used for storing a computer program, and the processor implements the steps of the memory failure prediction maintenance method mentioned in the above embodiments when executing the computer program. The device therefore has the same beneficial effects as the memory failure prediction maintenance method and apparatus described above, which are not repeated here.

Finally, the invention also provides a corresponding embodiment of the computer readable storage medium. The computer-readable storage medium has stored thereon a computer program which, when executed by a processor, performs the steps as described in the method embodiments above.

It will be appreciated that the methods of the above embodiments, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored on a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied, in essence or in whole or in part, in the form of a software product that is stored in a storage medium and performs all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.

In this embodiment, a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps described in the above method embodiments are implemented. The method comprises: obtaining the operation parameters of a target memory and determining the current theoretical life stage of the target memory; inputting the operation parameters into a memory life prediction model to obtain the predicted life of the target memory, the memory life prediction model being constructed in advance based on the relation between memory life and the corresponding running state; determining the corresponding dynamic alarm threshold according to the current theoretical life stage of the target memory; judging whether the predicted life is smaller than the dynamic alarm threshold; if so, confirming that the target memory has a fault and determining the fault type of the target memory; and maintaining the target memory according to the fault type.
The advantages of this embodiment are as follows. The operation parameters of the target memory are monitored and input into the memory life prediction model to obtain the predicted life of the target memory, and the predicted life is compared with the dynamic alarm threshold; when the predicted life is smaller than the dynamic alarm threshold, the target memory is confirmed to have a fault. Because the dynamic alarm threshold corresponds to the current theoretical life stage of the target memory, and different theoretical life stages correspond to different dynamic alarm thresholds, the scheme can improve the accuracy of memory fault alarms. In addition, after the fault is confirmed, the target memory is maintained according to the fault type, which prolongs the service life of the memory, avoids further damage caused by the fault, and improves the reliability and durability of the server equipment.
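The flow of steps described above can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the implementation of the invention: the function and field names, the toy life formula, the fault types, the maintenance actions, and the numeric thresholds are all hypothetical. Only the stage names "alarm" and "emergency" are taken from the claims.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Optional

Params = Mapping[str, float]


@dataclass
class Verdict:
    predicted_life: float          # output of the life prediction model
    threshold: float               # dynamic alarm threshold of the current stage
    faulty: bool
    fault_type: Optional[str] = None
    action: Optional[str] = None   # maintenance selected for the fault type


def check_memory(
    params: Params,
    stage: str,
    predict_life: Callable[[Params], float],
    thresholds: Mapping[str, float],
    classify_fault: Callable[[Params], str],
    maintenance: Mapping[str, str],
) -> Verdict:
    """Run one prediction-and-maintenance pass for a target memory."""
    predicted = predict_life(params)      # predicted life from the model
    threshold = thresholds[stage]         # threshold depends on the life stage
    if predicted >= threshold:
        return Verdict(predicted, threshold, faulty=False)
    fault_type = classify_fault(params)   # fault confirmed: determine its type
    return Verdict(predicted, threshold, True, fault_type, maintenance[fault_type])


# Hypothetical stand-ins; the patent text does not specify any of these.
def toy_predict_life(p: Params) -> float:
    # Assumed linear penalty for correctable errors and for heat above 60 C.
    overheat = max(0.0, p["temperature_c"] - 60.0)
    return 10000.0 - 50.0 * p["correctable_errors"] - 20.0 * overheat


def toy_classify_fault(p: Params) -> str:
    return "overheating" if p["temperature_c"] > 85.0 else "error_burst"


THRESHOLDS = {"alarm": 5000.0, "emergency": 2000.0}   # assumed values
MAINTENANCE = {
    "overheating": "increase cooling",
    "error_burst": "isolate affected region",
}

sample = {"correctable_errors": 100.0, "temperature_c": 70.0}
in_alarm = check_memory(sample, "alarm", toy_predict_life,
                        THRESHOLDS, toy_classify_fault, MAINTENANCE)
in_emergency = check_memory(sample, "emergency", toy_predict_life,
                            THRESHOLDS, toy_classify_fault, MAINTENANCE)
print(in_alarm)
print(in_emergency)
```

With these assumed values, the same operation parameters trigger a fault in one stage and not in the other, which is the effect of making the alarm threshold depend on the theoretical life stage.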

The memory fault prediction maintenance method, device, equipment and medium provided by the present invention have been described in detail above. The embodiments in this description are presented progressively: each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments may be referred to one another. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for the relevant points, refer to the description of the method. It should be noted that those skilled in the art may modify and practice the present invention without departing from its spirit.

It should also be noted that, in this specification, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.

Claims

1. A memory fault prediction maintenance method, characterized by being applied to a server, the method comprising the following steps:

and maintaining the target memory according to the fault type.

2. The memory failure prediction maintenance method according to claim 1, wherein the process of constructing the memory life prediction model includes:

3. The memory failure prediction maintenance method according to claim 2, further comprising, after obtaining the initial model:

4. The memory failure prediction maintenance method according to claim 1, wherein the theoretical life stage includes an alarm stage and an emergency stage; wherein, the memory life in the alarm stage is longer than the memory life in the emergency stage; correspondingly, the determining the corresponding dynamic alarm threshold according to the theoretical life stage of the target memory at present includes:

5. The memory failure prediction maintenance method according to claim 1, further comprising, after the confirming that the target memory has failed:

6. The memory failure prediction maintenance method according to any one of claims 1 to 5, wherein the maintaining the target memory according to the failure type includes:

7. The memory failure prediction maintenance method according to claim 2, further comprising:

monitoring the accuracy of the memory life prediction model;

wherein the first threshold is greater than the second threshold.

8. A memory fault prediction maintenance device, characterized by being applied to a server, the device comprising:

9. A memory failure prediction maintenance device, comprising:

a memory for storing a computer program;

a processor for implementing the steps of the memory failure prediction maintenance method according to any one of claims 1 to 7 when executing the computer program.

10. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the memory failure prediction maintenance method according to any of claims 1 to 7.