CN116680112B - Memory state detection method, device, communication equipment and storage medium - Google Patents

Memory state detection method, device, communication equipment and storage medium

Publication number
CN116680112B
Authority
CN
China
Prior art keywords
memory
data
health
error
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310935420.8A
Other languages
Chinese (zh)
Other versions
CN116680112A (en
Inventor
李盛新
李道童
贾帅帅
陈衍东
韩红瑞
艾山彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd filed Critical Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202310935420.8A priority Critical patent/CN116680112B/en
Publication of CN116680112A publication Critical patent/CN116680112A/en
Application granted granted Critical
Publication of CN116680112B publication Critical patent/CN116680112B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/073 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a memory management context, e.g. virtual memory or cache management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/008 Reliability or availability analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The embodiment of the application provides a memory state detection method, a device, communication equipment and a storage medium, comprising the following steps: acquiring memory data; dividing the memory data into first memory data and second memory data according to the type of the memory data; determining a preliminary memory health score according to a preset memory health evaluation model and first memory data, and processing second memory data through an input/output model to determine health influence factors, wherein the preset memory health evaluation model is determined based on first historical memory data, and the input/output model is generated by training a preset initial model based on second historical memory data; and determining the memory state according to the preliminary memory health score and the health influence factor. The memory data is divided into two types according to the influence of the memory data on the memory, and the memory health degree score is adjusted through the health degree influence factor, so that the memory health condition can be effectively and accurately detected.

Description

Memory state detection method, device, communication equipment and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method and apparatus for detecting a memory state, a communication device, and a storage medium.
Background
In modern large data centers, there are typically millions of servers working in concert to provide high-performance computing and large-scale data storage services. Because of the large number of tasks running on these servers, hardware failures can have a tremendous impact on server reliability, availability and serviceability (RAS). In a server system, the memory, also called internal memory, temporarily stores the operation data of the CPU and the data exchanged with external storage such as a hard disk. All programs in a computer run in memory, so a memory abnormality has a great influence on the computer. At the same time, memory failure is also one of the most common hardware threats. To prevent memory corruption, servers typically equip the memory with advanced error correction code (ECC) mechanisms, such as SEC-DED and ChipKill. However, relying solely on ECC to guarantee memory reliability is far from adequate. In modern data centers, memory failures have proven to be a major cause of server downtime and system failures, and the risk of memory failure rises as computation density and memory capacity increase.
In the related art, technicians attempt to build a model with an artificial intelligence algorithm, for example one based on machine learning or deep learning, and to calculate the memory health from factors such as the number of detected memory correctable errors (Correctable Error, CE), their frequency of occurrence, the operating temperature, the number of insertion and removal cycles, and the power consumption. However, in such schemes all memory-related information is input to the artificial intelligence algorithm as the same dimension, which increases the computation required for model training and makes model building unstable, so the current memory state cannot be accurately fed back.
Disclosure of Invention
The embodiment of the application aims to provide a memory state detection method, a device, communication equipment and a storage medium, which are used for solving the technical problem that the current memory state cannot be fed back accurately in the prior art. The specific technical scheme is as follows:
In a first aspect of the present application, there is provided a memory state detection method, including:
acquiring memory data;
dividing the memory data into first memory data and second memory data according to the type of the memory data;
determining a preliminary memory health score according to a preset memory health evaluation model and the first memory data, processing the second memory data through an input/output model, and determining a health influence factor, wherein the preset memory health evaluation model is determined based on first historical memory data, and the input/output model is generated by training a preset initial model based on second historical memory data;
and determining the current memory state according to the preliminary memory health score and the health influence factor.
Optionally, the second memory data includes actual input data and actual output data, where the actual input data and the actual output data are in one-to-one correspondence; processing the second memory data through an input/output model, wherein determining the health degree influence factor comprises:
inputting actual input data into an input-output model to obtain predicted output data corresponding to the actual input data;
comparing the predicted output data with the actual output data to obtain an error value;
and determining a health degree influence factor according to the error value.
Optionally, the actual input data includes at least one of average voltage, average frequency during running and average erasing speed data, the actual output data includes memory average temperature data, and before the step of inputting the actual input data into the input-output model to obtain predicted output data corresponding to the actual input data, the method includes:
preprocessing actual input data acquired in a plurality of preset time periods to generate a data set corresponding to each preset time period;
screening, from the data sets corresponding to the preset time periods, target data sets for which the memory state was normal in three preset time periods corresponding to the current moment;
and carrying out normalization processing on the target data set to obtain a training sample.
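The normalization step can be illustrated with a minimal sketch. The source does not state which normalization is used; min-max scaling to [0, 1] is assumed here, and the function name `min_max_normalize` and the sample values are hypothetical and made up for illustration.

```python
def min_max_normalize(rows):
    """Scale each column of a list of equal-length rows to the range [0, 1].

    Min-max scaling is an assumption; the source only says "normalization".
    """
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    normalized = []
    for row in rows:
        normalized.append([
            (v - lo) / (hi - lo) if hi != lo else 0.0  # constant column maps to 0.0
            for v, lo, hi in zip(row, lows, highs)
        ])
    return normalized


# Made-up rows; columns are average voltage, average running frequency,
# average erasing speed, and average temperature.
target_data_set = [
    [1.20, 3200.0, 10.0, 40.0],
    [1.25, 3000.0, 12.0, 45.0],
    [1.30, 2800.0, 14.0, 50.0],
]
training_samples = min_max_normalize(target_data_set)
```

Each row of `training_samples` keeps the three inputs and one output in the same column order, so it could be fed to whatever model-training routine is chosen.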
Optionally, after the step of normalizing the target data set to obtain a training sample, the method includes:
and training a preset initial model according to the training sample to obtain an input and output model.
Optionally, the determining the health-impact factor according to the error value includes:
setting the health-degree influence factor to 1 in the case that the error value is detected to be smaller than a target error threshold value;
and under the condition that the error value is detected to be larger than the target error threshold value, acquiring a preset memory health degree strategy, and determining the health degree influence factor according to the memory health degree strategy.
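The threshold rule above can be sketched as follows. This is an illustration under assumptions, not the claimed implementation: the error value is taken as the absolute difference between predicted and actual output, the case where the error equals the threshold is routed to the policy, and `example_policy` is a purely hypothetical placeholder because the source does not disclose the preset memory health strategy.

```python
def health_influence_factor(predicted, actual, target_error_threshold, policy):
    """Return 1.0 for small errors, otherwise defer to a caller-supplied policy."""
    error_value = abs(predicted - actual)  # assumed error definition
    if error_value < target_error_threshold:
        return 1.0
    # Error at or above the threshold: the preset strategy decides the factor.
    return policy(error_value, target_error_threshold)


def example_policy(error_value, target_error_threshold):
    """Hypothetical placeholder policy: shrink the factor as the error grows.

    Assumes a positive threshold, so error_value is never zero here.
    """
    return target_error_threshold / error_value


normal_factor = health_influence_factor(45.0, 44.5, 2.0, example_policy)
abnormal_factor = health_influence_factor(50.0, 46.0, 2.0, example_policy)
```

Passing the policy as a callable keeps the threshold check separate from whatever strategy a deployment actually configures.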
Optionally, the target error threshold is determined according to a root mean square error corresponding to a target mean square error vector between the predicted output data and the actual output data.
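As a rough illustration, the root mean square error between predicted and actual output data can be computed as below. Using the RMSE directly as the target error threshold is an assumption; the source says only that the threshold is determined according to it. The function name and the temperature values are hypothetical.

```python
import math


def root_mean_square_error(predicted, actual):
    """RMSE between two equal-length sequences of numbers."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("sequences must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up average temperatures for illustration only.
predicted_temps = [40.0, 42.0, 44.0, 46.0]
actual_temps = [41.0, 41.0, 45.0, 45.0]
target_error_threshold = root_mean_square_error(predicted_temps, actual_temps)
```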
Optionally, the determining the current memory state according to the preliminary memory health score and the health impact factor includes:
and determining a target memory health degree score corresponding to a memory at the current moment according to the preliminary memory health degree score, the health degree influence factor at the current moment and the health degree influence factor in the last preset time period corresponding to the current moment.
Optionally, the target memory health score is generated by the following formula:
[Formula not reproduced in this text]
wherein, in the above formula, the symbols respectively represent: the target memory health score; the preliminary memory health score; the health influence factor at the current moment; the health influence factor in the last preset time period corresponding to the current moment; the mean of the error values between the plurality of predicted output data and the actual output data; and the target error threshold.
Optionally, after the step of determining the target memory health score corresponding to the memory at the current time according to the preliminary memory health score, the health influence factor at the current time, and the health influence factor in the last preset time period corresponding to the current time, the method includes:
sending early warning information to a user when the target memory health score is detected to be smaller than a target health threshold or the health influence factor is not equal to 1, so that the user can check the memory state at the current moment through a preset display interface.
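The early warning condition reduces to a simple predicate, sketched here under assumptions: the function name `should_warn` and the threshold value 60 are hypothetical, and how the warning is delivered is left out.

```python
def should_warn(target_score, health_influence_factor, target_health_threshold):
    """True when the score is below the threshold or the factor deviates from 1."""
    return target_score < target_health_threshold or health_influence_factor != 1


healthy = should_warn(85, 1.0, 60)         # score fine, factor exactly 1
low_score = should_warn(55, 1.0, 60)       # score below threshold
abnormal_factor = should_warn(85, 0.5, 60) # factor deviates from 1
```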
Optionally, the first memory data is obtained through a register log, and the first memory data includes at least one of the following: memory hard failure, number of memory errors, memory error type, and enabling error repair operations.
Optionally, the preset memory health evaluation model includes:
deducting a first preset score under the condition that a preset number of memory hard faults are detected;
deducting a second preset score when the number of memory errors in the preset time is detected to be larger than a preset threshold value;
determining a third preset score corresponding to the memory error type according to the memory error type;
and determining a fourth preset score corresponding to the enabling error repair operation according to the enabling error repair operation.
Optionally, the memory error type includes at least one of: memory hard errors, memory soft errors, SRAO errors, UCNA errors, SRAR errors, and burst fatal errors.
Optionally, the determining, according to the memory error type, a third preset score corresponding to the memory error type includes:
deducting the third preset score under the condition that the memory error type is detected to be one of a memory hard error, a memory soft error, an SRAO error, a UCNA error and an SRAR error;
and under the condition that the memory error type is detected to be a sudden fatal error, the preliminary memory health score is 0.
Optionally, the enabling error repair operation includes at least one of: consuming PCLS, enabling a Bank level ADDC function, enabling a Rank level ADDC function, and enabling a storm suppression function.
Optionally, the score order corresponding to the enabling error repair operation is from small to large: consuming PCLS, enabling a Bank level ADDC function, enabling a Rank level ADDC function, and enabling a storm suppression function.
Optionally, the data set corresponding to the input-output model includes four columns of vector data, and the input-output model includes three input data corresponding to one output data.
Optionally, the three input data include average voltage, average running frequency and average erasing speed data corresponding to the memory, and the output data include average temperature data corresponding to the memory.
In still another aspect of the present application, there is further provided a memory state detection apparatus, including:
the acquisition module is used for acquiring the memory data;
the dividing module is used for dividing the memory data into first memory data and second memory data according to the type of the memory data;
the first determining module is used for determining a preliminary memory health score according to a preset memory health evaluation model and the first memory data, processing the second memory data through an input/output model and determining health influence factors, wherein the input/output model is generated by training a preset initial model based on second historical memory data;
and the second determining module is used for determining the current memory state according to the preliminary memory health score and the health influence factor.
In yet another aspect of the present application, there is also provided a communication device including a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory complete communication with each other through the communication bus;
a memory for storing a computer program;
and the processor is used for realizing any one of the memory state detection methods when executing the programs stored in the memory.
In yet another aspect of the present application, there is also provided a computer readable storage medium having instructions stored therein, which when executed on a computer, cause the computer to perform any of the above-described memory state detection methods.
In yet another aspect of the present application, there is also provided a computer program product containing instructions that, when run on a computer, cause the computer to perform any of the memory state detection methods described above.
The memory state detection method provided by the embodiment of the application obtains the memory data; dividing the memory data into first memory data and second memory data according to the type of the memory data; determining a preliminary memory health score according to a preset memory health evaluation model and the first memory data, processing the second memory data through an input/output model, and determining a health influence factor, wherein the preset memory health evaluation model is determined based on first historical memory data, and the input/output model is generated by training a preset initial model based on second historical memory data; and determining the current memory state according to the preliminary memory health score and the health influence factor. The memory data corresponding to different influencing factors are divided into two types, namely first memory data and second memory data by analyzing various factors influencing the healthy operation of the memory of the server, wherein first historical memory data corresponding to first memory data can generate a preset memory health evaluation model, and the first memory data can determine a preliminary memory health score corresponding to a memory according to the preset memory health evaluation model; secondly, the second historical memory data can be processed to train a preset initial model so as to generate an input/output model, the health degree influence factor of the memory can be determined after the second memory data is input into the trained input/output model, and then the current memory state can be determined according to the preliminary memory health degree score and the health degree influence factor.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.
Fig. 1 is a flowchart illustrating steps of a memory state detection method according to an embodiment of the present application;
Fig. 2 is a flowchart showing a second step of the memory state detection method according to the embodiment of the present application;
Fig. 3 shows a flowchart of a step of a memory state detection method according to an embodiment of the present application;
Fig. 4 is a flowchart showing a step of a memory state detection method according to an embodiment of the present application;
Fig. 5 shows a device block diagram of a memory state detection device according to an embodiment of the present application;
Fig. 6 is a block diagram of a communication device according to an embodiment of the present application;
Fig. 7 is a schematic diagram of an input/output model training method according to an embodiment of the present application;
Fig. 8 shows a schematic diagram of a three-input single-output ANFIS structure provided by an embodiment of the present application;
Fig. 9 is a schematic diagram illustrating an initial FIS network structure generation method according to an embodiment of the present application;
Fig. 10 shows a schematic diagram of a preset display interface according to an embodiment of the present application;
Fig. 11 is a schematic diagram of another preset display interface according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings in the embodiments of the present application.
To make the objects, technical solutions and advantages of the embodiments of the present application clearer, the embodiments are described in detail below with reference to the accompanying drawings. Those of ordinary skill in the art will understand that numerous technical details are set forth in the various embodiments to provide a better understanding of the present application; however, the claimed application may be practiced without these specific details and with various changes and modifications based on the following embodiments. The following embodiments are divided for convenience of description and should not be construed as limiting the specific implementation of the present application; the embodiments may be combined with and refer to one another where there is no contradiction.
Referring to fig. 1, a first step flowchart of a memory state detection method provided by an embodiment of the present application is shown, where the method may include:
step 101, obtaining memory data;
In the embodiment of the present application, the root cause of a memory error is referred to as a fault, irreversible damage in the hardware is referred to as a hard fault, and the observed symptom of a fault is referred to as an error. Errors caused by hard faults, known as hard errors, are persistent and recurring in nature and cannot be resolved over time or by a system reset or restart. Correspondingly, soft faults and soft errors are random and transient, such as bit flips caused by particle strikes. The number of memory hard faults refers to the number of cells with hard faults in the same memory module, more specifically the number of cells in which errors have occurred two or more times. The error frequency refers to the ratio of the number of all memory errors detected in a period of time to the length of that period, where all memory errors include soft errors and repeated hard errors. Error types refer to the different types of memory errors, such as errors for which handling is optional (SW Recoverable Action Optional, SRAO), errors that need no handling (Uncorrected No Action, UCNA), and errors that must be handled (SW Recoverable Action Required, SRAR). Repair operations refer to the enabling of certain memory RAS techniques, such as the Partial Cache-line Sparing (PCLS) technique and the adaptive dual DRAM device correction (Adaptive Double DRAM Device Correction, ADDC) technique.
Therefore, the memory data in the embodiment of the present application may include all relevant information of the memory that may affect the memory state.
Step 102, dividing the memory data into first memory data and second memory data according to the type of the memory data;
It should be noted that, after the memory data is acquired in step 101, the various factors that affect the healthy operation of the server memory are analyzed and classified into two types.
The first type consists of direct factors that intuitively reflect the health of the memory. The memory data corresponding to these factors is classified as first memory data, such as the number of hard faults, the error frequency, the error type, and the error repair operation; the first memory data can be obtained through a register log.
The second type consists of indirect factors that reflect memory abnormality indirectly. The memory data corresponding to these factors is classified as second memory data, for example the average temperature, average voltage, average running frequency and average erasing speed of the memory measured in a certain time period. An abnormality in an indirect factor does not necessarily affect memory health directly, that is, it does not necessarily cause memory errors directly, but it can give early warning of an abnormal memory state to a certain extent. The second memory data can be obtained through sensor measurement.
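The division in step 102 can be pictured as a partition of the collected fields by type. The following is only a minimal sketch; the field names and the `split_memory_data` helper are hypothetical, since the source names the categories but not any identifiers.

```python
# Hypothetical field names for the direct factors (from the register log)
# and the indirect factors (from sensor measurements).
DIRECT_FIELDS = {"hard_fault_count", "error_frequency", "error_type", "repair_operation"}
INDIRECT_FIELDS = {"avg_temperature", "avg_voltage", "avg_frequency", "avg_erase_speed"}


def split_memory_data(memory_data):
    """Partition raw memory data into (first_memory_data, second_memory_data)."""
    first = {k: v for k, v in memory_data.items() if k in DIRECT_FIELDS}
    second = {k: v for k, v in memory_data.items() if k in INDIRECT_FIELDS}
    return first, second


# Made-up sample record for illustration.
sample = {
    "hard_fault_count": 12,
    "error_type": "SRAO",
    "avg_temperature": 41.5,
    "avg_voltage": 1.2,
}
first_data, second_data = split_memory_data(sample)
```

Fields that belong to neither set are simply dropped in this sketch; a real collector might instead log or reject them.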
Step 103, determining a preliminary memory health score according to a preset memory health evaluation model and the first memory data, and processing the second memory data through an input/output model to determine health influence factors, wherein the preset memory health evaluation model is determined based on the first historical memory data, and the input/output model is generated by training a preset initial model based on the second historical memory data.
It should be noted that, in the present application, after the memory data is divided, different processing may be performed for different types of memory data.
Specifically, for the first memory data, a preset memory health evaluation model may be generated according to the characteristics corresponding to the first memory data, that is, the direct factor information affecting the memory state.
The preset memory health evaluation model determines, according to the importance of each item of first memory data, the score to be deducted when a given situation occurs. That is, the model adopts a deduction strategy: when a memory error is detected, points are deducted according to the corresponding rule, and the score range is set to 0 to 100. A higher score indicates that the memory module is healthier, and a lower score indicates that the memory module is at greater risk of failure.
Specifically, as shown in the following table 1, table 1 is an exemplary memory state framework corresponding to different memory health scores.
Table 1 an exemplary memory state framework corresponding to different memory health scores
It should be noted that, the first memory data is obtained through a register log, and the first memory data includes at least one of the following: memory hard failure, number of memory errors, memory error type, and enabling error repair operations.
The preset memory health evaluation model comprises the following steps: deducting a first preset score under the condition that a preset number of memory hard faults are detected; deducting a second preset score when the number of memory errors in the preset time is detected to be larger than a preset threshold value; determining a third preset score corresponding to the memory error type according to the memory error type; and determining a fourth preset score corresponding to the enabling error repair operation according to the enabling error repair operation.
Further, the memory error type includes at least one of: memory hard errors, memory soft errors, SRAO errors, UCNA errors, SRAR errors, and burst fatal errors.
The determining a third preset score corresponding to the memory error type according to the memory error type includes: deducting the third preset score under the condition that the memory error type is detected to be one of a memory hard error, a memory soft error, an SRAO error, a UCNA error and an SRAR error; and under the condition that the memory error type is detected to be a sudden fatal error, the preliminary memory health score is 0.
Further, the enabling error repair operation includes at least one of: consuming PCLS, enabling a Bank level ADDC function, enabling a Rank level ADDC function, and enabling a storm suppression function.
The score order corresponding to the error repair enabling operation is from small to large: consuming PCLS, enabling a Bank level ADDC function, enabling a Rank level ADDC function, and enabling a storm suppression function.
It should be noted that, in the embodiment of the present application, the preset memory health evaluation model corresponds to different deduction strategies according to different first memory data.
Specifically, the first is the deduction rule for the number of memory hard faults. Memory hard errors are persistent, repeated errors caused by memory hard faults and are one of the main causes of memory failure. Typically, the server RAS policy does not allow such errors to repeat indefinitely, and cells with hard faults are handled by enabling isolation or repair policies. Memory soft faults and soft errors are usually transient; in general, cells with soft errors are not handled immediately, and the errors do not recur. Therefore, this rule considers only the influence of hard faults on the physical hardware of the memory module. In the embodiment of the present application, for example, the deduction rule is set as follows: 1 point is deducted for every 10 hard faults detected, and fewer than 10 hard faults are counted as 1 point.
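One possible reading of this example rule is a ceiling division by 10, so that any partial group of fewer than 10 hard faults still costs 1 point. That reading is an assumption, as is the function name `hard_fault_deduction`; the sketch below is illustrative only.

```python
def hard_fault_deduction(hard_fault_count):
    """Points deducted for hard faults: 1 point per started group of 10."""
    if hard_fault_count <= 0:
        return 0
    # Integer ceiling division: 1..10 -> 1, 11..20 -> 2, and so on.
    return (hard_fault_count + 9) // 10


no_faults = hard_fault_deduction(0)
few_faults = hard_fault_deduction(5)
many_faults = hard_fault_deduction(25)
```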
The second is the deduction rule for the memory error frequency. When the number of errors on a memory module is excessive within a certain time period, on the one hand the large number of memory errors triggers frequent interrupt handling, which interferes with the normal operation of the system; on the other hand, a high memory error frequency also reflects, to some extent, an abnormality of the memory module. It should be noted that the deduction rule set here mainly considers how memory errors occurring frequently within a short time reflect a memory abnormality, and the threshold set should not exceed the memory error storm suppression threshold. In the embodiment of the present application, for example, the deduction rule is set as follows: 1 point is deducted whenever the number of memory errors detected within 60 seconds exceeds 500.
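The frequency rule can be sketched by counting errors in consecutive 60-second windows and deducting 1 point for each window with more than 500 errors. Fixed, non-overlapping windows are an assumption (the source does not say whether the window slides), and the function name and timestamps are hypothetical.

```python
from collections import Counter


def frequency_deduction(error_timestamps, window_seconds=60, threshold=500):
    """Deduct 1 point per fixed window whose error count exceeds the threshold.

    error_timestamps: error times in seconds since the start of monitoring.
    """
    per_window = Counter(int(t // window_seconds) for t in error_timestamps)
    return sum(1 for count in per_window.values() if count > threshold)


# Made-up data: 600 errors in the first minute, 30 in the second.
burst = [i / 10 for i in range(600)]
quiet = [60 + i for i in range(30)]
points_deducted = frequency_deduction(burst + quiet)
```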
Third, the memory error type is a rule for deduction, and the memory error can be classified into a correctable error and an uncorrectable error. Further, correctable errors and uncorrectable errors are subdivided. The memory correctable errors can be classified into patrol correctable errors, read-write correctable errors, mirror image write-back failure errors and the like, and in the application, only the correctable errors are classified into memory hard errors and soft errors to be respectively processed. The uncorrectable errors can be classified into (1) uncorrectable errors of the selection process-SRAO errors, the error codes of the SRAO errors are: error_type UCE; MSCODE 0x0010; (2) uncorrectable errors that do not require processing-UCNA errors, the error codes of UCNA errors are: error_type UCE; MSCODE 0x0101; (3) uncorrectable errors that must be handled-SRAR errors, the error codes of SRAR errors are: error_type UCE; MSCODE 0x0010. (4) Because the physical hardware error of the memory bank causes uncorrectable error of system downtime-burst fatal error, the error code of the burst fatal error is: error_type: UCE. In the embodiment of the present application, for example, the deduction rule is set as follows: UCNA, SRAO, SRAR errors occur once for 5 minutes, and the memory health score is reduced to 0 when a sudden fatal error occurs.
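The error-type rule can be sketched as follows, reading the example as a deduction of 5 points per UCNA, SRAO or SRAR occurrence and an immediate score of 0 on a burst fatal error. That reading of the example values, the event labels such as `"FATAL"`, and the function name are assumptions made for illustration.

```python
RECOVERABLE_UCE_TYPES = {"UCNA", "SRAO", "SRAR"}
POINTS_PER_UCE = 5  # assumed reading of the example rule


def apply_error_type_rule(score, error_events):
    """Apply the error-type deductions to a score and clamp it at 0."""
    for event in error_events:
        if event == "FATAL":  # hypothetical label for a burst fatal error
            return 0
        if event in RECOVERABLE_UCE_TYPES:
            score -= POINTS_PER_UCE
    return max(score, 0)


after_two_errors = apply_error_type_rule(100, ["UCNA", "SRAR"])
after_fatal = apply_error_type_rule(100, ["SRAO", "FATAL"])
```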
It should be noted that, besides the above UCNA, SRAO, SRAR error types, the memory error types also include some specific errors, such as a row error, a column error, a Bank error, etc. according to the memory structure division, these error types are defined according to the code rules set by the RAS technician of the relevant memory, and the defined results of different code rules are different, so that the set deduction rules are also different.
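As an illustration of the error-type rule, the sketch below applies the example deductions to a score. It rests on two assumptions that this section does not confirm: the starting score is taken to be 100, and the "5" in the example rule is read as a 5-point deduction per occurrence. The label `FATAL` and the function name are hypothetical.

```python
FULL_SCORE = 100       # assumption: starting score is not stated in this section
UCE_DEDUCTION = 5      # assumption: "5" in the example rule means 5 points
UCE_TYPES = {"UCNA", "SRAO", "SRAR"}
FATAL = "FATAL"        # hypothetical label for a sudden fatal error


def apply_error_type_rule(score, error_types):
    """Apply the example error-type deductions to `score`."""
    for kind in error_types:
        if kind == FATAL:
            # A sudden fatal error drops the score straight to 0.
            return 0
        if kind in UCE_TYPES:
            score = max(0, score - UCE_DEDUCTION)
        # Other types (row, column, Bank errors) depend on vendor code
        # rules and are deliberately left unscored here.
    return score
```

For example, `apply_error_type_rule(FULL_SCORE, ["UCNA", "SRAR"])` deducts twice, while any sequence containing `FATAL` ends at 0.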
Fourth, in memory RAS technology, when a hard memory fault is detected, the OS layer is generally enabled to perform memory page isolation to isolate the memory page where the current error is located, and no read or write operation is performed on the isolated memory page. When some memory pages cannot be isolated for some reason, memory repair technology is enabled to repair the hard errors. Repair technology refers to a fault redundancy replacement mechanism of the processor or the memory, such as PCLS and ADDC. Repair resources are typically limited, and some repair mechanisms affect memory performance. PCLS can be invoked at most 16 times per memory channel and at most once per cache line. ADDC is an advanced memory RAS function: each memory channel can support two groups of VLS registers at the same time, and the enabling level is divided into Bank level and Rank level; when Rank-level ADDC is enabled, the memory performance loss is about 25% to 30%. The consumption of repair resources reflects, to some extent, the accumulated number of memory errors; when the remaining repair resources of the memory RAS technology are used up, a repair resource early warning is issued to the user.
When a large number of errors that cannot be handled by memory isolation and repair appear in the memory in an instant and their number reaches a storm threshold (set differently by different server manufacturers according to their RAS policies), the SMI storm suppression function is enabled. That is, the BIOS layer is prevented from reporting a large amount of system management interrupt (System Management Interrupt, SMI) information to the OS layer and affecting the normal operation of the system, and the memory page isolation function is disabled; this indicates that the memory module is no longer healthy and needs to be replaced. In the embodiment of the present application, for example, the deduction rule is set as follows: each consumption of PCLS deducts 1 point, enabling the Bank-level ADDC function deducts 10 points, enabling the Rank-level ADDC function deducts 20 points, and enabling the storm suppression function deducts 50 points.
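The repair-resource rule can be sketched as below, reading the example values as deductions of 1, 10, 20 and 50 points. The `RepairUsage` record and both function names are hypothetical, and treating the usage as per memory channel is an assumption made for the illustration.

```python
from dataclasses import dataclass

PCLS_LIMIT_PER_CHANNEL = 16  # maximum PCLS invocations per channel, from the text


@dataclass
class RepairUsage:
    """Hypothetical per-channel record of consumed repair resources."""
    pcls_uses: int = 0
    bank_addc_enabled: bool = False
    rank_addc_enabled: bool = False
    storm_suppression_enabled: bool = False


def repair_deduction(usage):
    """Return the points deducted under the example repair-resource rule."""
    points = 1 * usage.pcls_uses          # 1 point per PCLS consumption
    if usage.bank_addc_enabled:
        points += 10                      # Bank-level ADDC enabled
    if usage.rank_addc_enabled:
        points += 20                      # Rank-level ADDC enabled
    if usage.storm_suppression_enabled:
        points += 50                      # SMI storm suppression enabled
    return points


def pcls_remaining(usage):
    """Remaining PCLS invocations, usable for a repair-resource warning."""
    return max(0, PCLS_LIMIT_PER_CHANNEL - usage.pcls_uses)
```

A caller could issue the repair-resource early warning when `pcls_remaining` reaches 0.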
Specifically, the second memory data is indirect factor information, that is, information that influences the memory state only indirectly. Therefore, in the embodiment of the application, an input/output model is obtained by training on the second memory data, and an influence factor, namely the health influence factor, can be obtained from the input/output model.
The health influence factor may be one of the conditions used to finally determine the memory state.
In addition, the training process of the input/output model refers to the following content.
Step 104, determining the current memory state according to the preliminary memory health score and the health influence factor.
It should be noted that, after determining the preliminary memory health score and the health impact factor obtained through the input/output model in step 103, the current memory state may be determined according to the above data.
Specifically, as described above, the factors that reflect memory health are classified into direct factors and indirect factors. The direct factors are used to establish the preliminary health scoring policy, and the indirect factors serve as one of the means of evaluating memory abnormality. An input/output model is built from the indirect factors, and a health influence factor is set on that basis to further correct the memory health score. After the output obtained from the input/output model is compared with the measured data, if the comparison error is smaller than the set threshold, the indirect factors indicate that the memory has no abnormality; in this case, the indirect factors have no effect on the preliminary health scoring policy established from the direct factors. If the indirect factors indicate a memory abnormality, their effect on the memory health score is necessarily negative.
Thus, the final memory health score may be determined by a combination of the preliminary memory health score and the health impact factor to determine the current memory state based on, for example, the contents of table 1.
The memory state detection method provided by the embodiment of the application obtains the memory data; dividing the memory data into first memory data and second memory data according to the type of the memory data; determining a preliminary memory health score according to a preset memory health evaluation model and the first memory data, processing the second memory data through an input/output model, and determining a health influence factor, wherein the preset memory health evaluation model is determined based on first historical memory data, and the input/output model is generated by training a preset initial model based on second historical memory data; and determining the current memory state according to the preliminary memory health score and the health influence factor. The memory data corresponding to different influencing factors are divided into two types, namely first memory data and second memory data by analyzing various factors influencing the healthy operation of the memory of the server, wherein first historical memory data corresponding to first memory data can generate a preset memory health evaluation model, and the first memory data can determine a preliminary memory health score corresponding to a memory according to the preset memory health evaluation model; secondly, the second historical memory data can be processed to train a preset initial model so as to generate an input/output model, the health degree influence factor of the memory can be determined after the second memory data is input into the trained input/output model, and then the current memory state can be determined according to the preliminary memory health degree score and the health degree influence factor.
Referring to fig. 2, a second step flowchart of a memory state detection method provided by an embodiment of the present application is shown, where the method may include:
step 201, obtaining memory data;
step 202, dividing the memory data into first memory data and second memory data according to the type of the memory data;
it should be noted that, the steps 201 to 202 are discussed with reference to the foregoing, and are not repeated herein.
Step 203, determining a preliminary memory health score according to a preset memory health evaluation model and the first memory data, inputting actual input data into an input/output model to obtain predicted output data corresponding to the actual input data, comparing the predicted output data with the actual output data to obtain an error value, and determining a health influence factor according to the error value, wherein the preset memory health evaluation model is determined based on first historical memory data, the input/output model is generated by training a preset initial model based on second historical memory data, the second memory data comprises actual input data and actual output data, and the actual input data and the actual output data are in one-to-one correspondence;
Further, the input/output model is obtained by training the preset initial model. Therefore, before step 203, referring to fig. 4, the method may further include the following steps:
step 001, preprocessing actual input data acquired in a plurality of preset time periods to generate data sets corresponding to the plurality of preset time periods;
step 002, screening, from the data sets corresponding to the plurality of preset time periods, the target data sets of the three preset time periods corresponding to the current time in which the memory state is normal;
step 003, normalizing the target data sets to obtain training samples;
step 004, training the preset initial model according to the training samples to obtain the input/output model.
It should be noted that, in the embodiment of the present application, the processing portion of the second memory data is mainly described, and the characteristics of the second memory data are indirect factor information that has an indirect influence on the memory state, so in the embodiment of the present application, an input/output model is obtained by training the second memory data, and an influence factor, that is, a health influence factor, can be obtained according to the input/output model.
FIG. 7 shows how the training samples of the input/output model are acquired and the specific training flow. In the first collection time period, the measured memory temperature, voltage, running frequency and erasing speed are actively collected every 1 s, and an average is calculated over every ten samples to obtain indirect factor information data set_0. At this point the total indirect factor data set consists of data set_0 only. The collection time period may be set to, for example, 1 hour, 5 hours or 24 hours.
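The ten-sample averaging described above might look like the following sketch. The field order and the function name are hypothetical, and dropping an incomplete trailing block is an assumption, since the text does not say how leftover samples are handled.

```python
from statistics import fmean

# Hypothetical field order of each one-second sample.
FIELDS = ("voltage", "frequency", "erase_speed", "temperature")


def block_average(samples, block_size=10):
    """Average one-second samples in consecutive blocks of `block_size`.

    `samples` is a list of 4-tuples in FIELDS order. An incomplete
    trailing block is dropped (assumption).
    """
    rows = []
    for start in range(0, len(samples) - block_size + 1, block_size):
        block = samples[start:start + block_size]
        # zip(*block) regroups the block column by column.
        rows.append(tuple(fmean(column) for column in zip(*block)))
    return rows
```

With a 1-hour collection period this reduces 3600 raw samples to 360 averaged rows.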
Further, the data set corresponding to the input/output model includes four columns of vector data, and the input/output model maps three inputs to one output.
Further, the three inputs are the average voltage, average running frequency and average erasing speed data corresponding to the memory, and the output is the average memory temperature data.
That is, the total indirect factor data set includes four columns of vector data: the first three columns are the average voltage data, average running frequency data and average erasing speed data corresponding to the memory, and the last column is the average temperature data corresponding to the memory. Each data vector is normalized to the interval 0 to 1 to obtain the training samples. The first three columns of the training samples are used as the input information for training the input/output model, and the last column is used as the output information.
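A minimal sketch of the normalization and the input/output split follows. Min-max scaling is assumed as the way to reach the 0 to 1 interval, because the text does not name the normalization method; the function names are hypothetical.

```python
def min_max_normalize(rows):
    """Scale every column of `rows` into [0, 1] by min-max scaling."""
    scaled_columns = []
    for column in zip(*rows):
        low, high = min(column), max(column)
        span = high - low
        # A constant column has zero span; map it to 0.0 (assumption).
        scaled_columns.append(
            [(value - low) / span if span else 0.0 for value in column]
        )
    return [tuple(row) for row in zip(*scaled_columns)]


def split_inputs_output(rows):
    """First three columns are model inputs, the last column is the output."""
    inputs = [row[:3] for row in rows]
    outputs = [row[3] for row in rows]
    return inputs, outputs
```

The same minimum and maximum would have to be reused when new measurements are fed to the trained model, otherwise inputs and training data would be on different scales.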
Therefore, after the input/output model is trained, the acquired second memory data is input into the input/output model, wherein the second memory data comprises actual input data and actual output data, the actual input data and the actual output data are in one-to-one correspondence, and the actual input data comprises at least one of the following average voltage, running average frequency and average erasing speed data, and the actual output data comprises memory average temperature data.
Specifically, the actual input data is fed to the input/output model to obtain the predicted output data corresponding to the actual input data; the predicted output data is compared with the actual output data to obtain an error value, and the health influence factor can be determined from the error value.
At the end of the first collection time period, a Takagi-Sugeno type fuzzy inference system (FIS) structure can be generated from the training samples by grid partitioning, and the initial values of the membership function parameters are determined. Specifically, refer to fig. 9, which is a schematic diagram of the initial FIS network structure generated by grid partitioning.
It should be noted that, in the embodiment of the present application, the preset initial model may be an adaptive neuro-fuzzy inference system (ANFIS). ANFIS is a multiple-input single-output system; in the present application, a three-input single-output ANFIS structure is used. Specifically, as shown in fig. 8, the ANFIS structure consists of five layers which, from input to output, are the fuzzification layer, the rule layer, the normalization layer, the defuzzification layer and the output layer.
Specifically, in ANFIS, the parameters of the membership functions are determined by training through a sample data set, and the manner in which the membership functions are combined or interacted with each other is called a rule, and the rule in if-then form is described as follows:
Rule 1: if $x$ is $A_1$, $y$ is $B_1$ and $z$ is $C_1$, then $f_1 = p_1 x + q_1 y + r_1 z + s_1$
Rule 2: if $x$ is $A_2$, $y$ is $B_2$ and $z$ is $C_2$, then $f_2 = p_2 x + q_2 y + r_2 z + s_2$
Rule 3: if $x$ is $A_3$, $y$ is $B_3$ and $z$ is $C_3$, then $f_3 = p_3 x + q_3 y + r_3 z + s_3$
In the above if-then rules, $x$, $y$ and $z$ are the inputs of the nodes; $f_i$ is the output; $A_i$, $B_i$ and $C_i$ are the fuzzy sets related to $x$, $y$ and $z$ respectively, $i = 1, 2, 3$; $p_i$, $q_i$, $r_i$ and $s_i$ are result parameters, commonly referred to as consequent parameters.
First layer: the node function of this layer is a membership function.
$$O_{1,i} = \mu_{A_i}(x),\qquad O_{1,i+3} = \mu_{B_i}(y),\qquad O_{1,i+6} = \mu_{C_i}(z),\qquad i = 1, 2, 3$$ (equation 1)
In the above equation 1, $O_{1,i}$ is the output value of this layer; $i$ is the index of the input signal; $\mu_{A_i}$, $\mu_{B_i}$ and $\mu_{C_i}$ are generalized bell membership functions (gbellmf), defined as
$$\mu(x) = \frac{1}{1 + \left|\dfrac{x - c}{a}\right|^{2b}}$$ (equation 2)
In the above equation 2, $a$, $b$ and $c$ are condition parameters, commonly referred to as premise parameters; changing their values changes the shape of the gbellmf.
Second layer: the nodes of this layer are marked $\Pi$. The output value $w_i$ is calculated by multiplying all the input membership functions.
$$w_i = \mu_{A_i}(x)\,\mu_{B_i}(y)\,\mu_{C_i}(z),\qquad i = 1, 2, 3$$ (equation 3)
In the above equation 3, $w_i$ indicates the firing strength of the $i$-th rule.
Third layer: the nodes of this layer are marked $N$. The output of the previous layer is normalized, and the output value $\bar{w}_i$ is the normalized firing strength.
$$\bar{w}_i = \frac{w_i}{w_1 + w_2 + w_3},\qquad i = 1, 2, 3$$ (equation 4)
In the above equation 4, $\bar{w}_i$ is the output value of the third layer.
Fourth layer: this layer creates an adaptive relation between the normalized firing strength and the result function; $O_{4,i}$ is the product of the third-layer output and the first-order result function $f_i$.
$$O_{4,i} = \bar{w}_i f_i = \bar{w}_i\,(p_i x + q_i y + r_i z + s_i),\qquad i = 1, 2, 3$$ (equation 5)
In the above equation 5, $p_i$, $q_i$, $r_i$ and $s_i$ are result parameters, commonly referred to as consequent parameters.
Fifth layer: the node of this layer is marked $\Sigma$. It calculates the total output as the sum of all the input signals.
$$O_5 = \sum_{i=1}^{3} \bar{w}_i f_i$$ (equation 6)
In the above equation 6, $O_5$ is the sum of all the input signals.
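The five layers can be illustrated with a forward pass of a standard first-order Takagi-Sugeno ANFIS with three inputs and three rules. This is a sketch under that standard formulation, not the patent's own code; the function names and the layout of `premise` and `consequent` are hypothetical, and training of the parameters is not shown.

```python
def gbellmf(x, a, b, c):
    """Generalized bell membership function with parameters a, b, c (a != 0)."""
    return 1.0 / (1.0 + abs((x - c) / a) ** (2 * b))


def anfis_forward(x, y, z, premise, consequent):
    """Forward pass of a three-input, single-output first-order Sugeno ANFIS.

    premise[i]    = ((a, b, c) for A_i, (a, b, c) for B_i, (a, b, c) for C_i)
    consequent[i] = (p_i, q_i, r_i, s_i)
    """
    # Layers 1 and 2: membership degrees and their product (firing strength).
    firing = [
        gbellmf(x, *mf_a) * gbellmf(y, *mf_b) * gbellmf(z, *mf_c)
        for mf_a, mf_b, mf_c in premise
    ]
    # Layer 3: normalize the firing strengths.
    total = sum(firing)
    normalized = [w / total for w in firing]
    # Layer 4: weight each rule's first-order output.
    rule_outputs = [
        w_bar * (p * x + q * y + r * z + s)
        for w_bar, (p, q, r, s) in zip(normalized, consequent)
    ]
    # Layer 5: sum all rule outputs.
    return sum(rule_outputs)
```

Because the normalized firing strengths sum to 1, the output is a weighted average of the rule outputs; if every rule shares the same consequent, the result equals that consequent regardless of the premise parameters.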
It should be noted that the fuzzy membership function parameters in ANFIS (including the premise parameters and the consequent parameters) are obtained by generating an initial fuzzy model from a large amount of known data and then training it. Through an iterative adaptive learning process, ANFIS is trained so that the premise and consequent parameters of the model are optimally adjusted, and the membership function parameter values that fit the training data set are finally determined. In each training iteration, the error between the actual output and the expected output is reduced, and training stops when a predetermined number of iterations or a predetermined error rate is reached.
In the second collection time period, the measured memory temperature, voltage, running frequency and erasing speed are actively collected every 1 s, and an average is calculated over every ten samples to obtain indirect factor information data set_1. Meanwhile, within this period, the average voltage, average running frequency and average erasing speed data are used as inputs to the ANFIS model obtained in the previous period to obtain the corresponding average temperature output, which is compared with the measured average memory temperature data. If the comparison error is smaller than the set threshold, the indirect factor information indicates that the memory has no abnormality in this time period.
At the end of the second collection time period, data set_1 is added to the total indirect factor data set, a new training sample is obtained after normalization, and training is performed again starting from the ANFIS model rules obtained in the previous period, so that the ANFIS model rules can be updated quickly.
Similarly, in each subsequent time period, the operations described above for the second collection time period and for its end are performed, which gives the acquisition of training samples and the specific training flow of the ANFIS model.
It should be noted that, to prevent the data set from growing too large, a sliding time window may be set so that the total indirect factor data set includes at most the data collected in the last 3 time periods in which the feedback indicated no memory abnormality. The acquisition of training samples and the training of the ANFIS model may be performed periodically at intervals of one collection time period, or may be triggered actively in a non-periodic manner; the description here uses only the periodic manner as an example.
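The sliding time window can be sketched with a bounded queue that keeps only the three most recent periods judged normal. The class and method names are hypothetical; the text only fixes the limit of 3 periods and the exclusion of abnormal periods.

```python
from collections import deque

MAX_PERIODS = 3  # at most the last 3 normal periods are kept, per the text


class IndirectFactorDataset:
    """Sliding window over per-period data sets (hypothetical helper)."""

    def __init__(self):
        # deque with maxlen silently drops the oldest period when full.
        self._periods = deque(maxlen=MAX_PERIODS)

    def add_period(self, rows, is_normal):
        """Store a period's rows only if no abnormality was fed back."""
        if is_normal:
            self._periods.append(list(rows))

    def training_rows(self):
        """Concatenate the retained periods, oldest first."""
        return [row for period in self._periods for row in period]
```

After each period, `training_rows()` would be normalized and used to retrain the model from its previous rules.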
Further, in step 203, determining a health-affecting factor according to the error value includes: setting the health-degree influence factor to 1 in the case that the error value is detected to be smaller than a target error threshold value; and under the condition that the error value is detected to be larger than the target error threshold value, acquiring a preset memory health degree strategy, and determining the health degree influence factor according to the memory health degree strategy.
It should be noted that, in the embodiment of the present application, when the feedback result obtained by the indirect factor is the memory exception, the influence of the feedback result on the memory health score is necessarily negative, so that the embodiment of the present application may set the health influence factor rule.
Further, the target error threshold is determined based on a root mean square error corresponding to a target mean square error vector between the predicted output data and the actual output data.
When the input/output model is trained on the sample data, the minimum mean square error vector of the training data can be obtained, and the root mean square error of this vector can be calculated. According to a preset empirical rule, the threshold can be set to 3 times this root mean square error; it serves as the threshold until the next update of the input/output model rules and is used to label the feedback result obtained through the indirect factors.
If the comparison error between the output obtained from the input/output model and the measured data is less than 3 times the root mean square error, the memory is not abnormal, that is, the indirect factors do not affect the memory health policy, and the health influence factor is set to 1; if the comparison error is greater than 3 times the root mean square error, the indirect factors affect the memory health policy, and the value of the health influence factor changes with the error.
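The threshold and the factor selection can be sketched as follows. The memory health policy that maps an over-threshold error to a factor is not given in this section, so it is passed in as a caller-supplied function; the function names are hypothetical, and treating an error exactly equal to the threshold as abnormal is an assumption.

```python
import math


def rmse(residuals):
    """Root mean square of the training residuals."""
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))


def error_threshold(training_residuals, multiplier=3.0):
    """Threshold set to 3 times the training root mean square error."""
    return multiplier * rmse(training_residuals)


def health_factor(error, threshold, policy):
    """Return 1.0 when the error is below the threshold, else apply `policy`.

    `policy(error, threshold)` stands in for the preset memory health
    policy, which this section does not specify.
    """
    if error < threshold:
        return 1.0
    # Assumption: an error equal to the threshold is handled by the policy.
    return policy(error, threshold)
```

For instance, a purely illustrative policy such as `lambda e, t: t / e` yields a factor that shrinks as the error grows; it is not the policy of the application.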
Step 204, determining the current memory state according to the preliminary memory health score and the health influence factor.
It should be noted that, the step 204 is discussed with reference to the foregoing, and is not repeated here.
According to the embodiment of the application, through analyzing various factors influencing the healthy operation of the memory of the server, memory data corresponding to different influencing factors are divided into two types, namely first memory data and second memory data, wherein first historical memory data corresponding to first memory data can generate a preset memory health evaluation model, and the first memory data can determine a preliminary memory health score corresponding to the memory according to the preset memory health evaluation model; secondly, the second historical memory data can be processed to train a preset initial model so as to generate an input/output model, the health degree influence factor of the memory can be determined after the second memory data is input into the trained input/output model, and then the current memory state can be determined according to the preliminary memory health degree score and the health degree influence factor.
In addition, in the embodiment of the application, based on the input/output model, the mapping relation among the measured temperature, voltage, operating frequency and erasing speed information of the memory is obtained. The method for establishing and updating the input/output model rule is designed, the input/output model can be updated on line according to actual measurement data, and the memory risk can be pre-warned to a certain extent according to the prediction result of the input/output model.
In addition, the embodiment of the application provides a memory health degree influence factor concept, a detailed acquisition method of the memory health degree influence factor and a calculation mode of health scores obtained by the health degree influence factor and a preliminary memory health evaluation strategy.
Referring to fig. 3, a flowchart illustrating a step of a memory state detection method according to an embodiment of the present application is shown, where the method may include:
step 301, obtaining memory data;
step 302, dividing the memory data into first memory data and second memory data according to the type of the memory data;
step 303, determining a preliminary memory health score according to a preset memory health evaluation model and the first memory data, and processing the second memory data through an input/output model to determine a health influence factor, wherein the preset memory health evaluation model is determined based on the first historical memory data, and the input/output model is generated by training a preset initial model based on the second historical memory data;
It should be noted that, the steps 301-303 are discussed with reference to the foregoing, and are not repeated herein.
Step 304, determining a target memory health score corresponding to the memory at the current time according to the preliminary memory health score, the health influence factor at the current time and the health influence factor in the last preset time period corresponding to the current time.
It should be noted that, in the embodiment of the present application, the target memory health score corresponding to the memory at the current time is determined by the preliminary memory health score, the health influence factor at the current time, and the health influence factor in the last preset time period corresponding to the current time.
Here, the health influence factor in the last preset time period corresponding to the current moment refers to the health influence factor of the time period immediately preceding the current moment, where a time period may be the collection time period described above.
Specifically, the target memory health score is generated by the following formula:
(equation 7)
The quantities in the above equation 7 are: the target memory health score; the preliminary memory health score; the health influence factor at the current moment; the health influence factor in the last preset time period corresponding to the current moment; the mean of the error values between a plurality of the predicted output data and the actual output data; and the target error threshold.
Step 305, determining the state corresponding to the memory at the current moment according to the target memory health score.
Further, step 305 may include: sending early warning information to the user when it is detected that the target memory health score is smaller than a target health threshold or that the health influence factor is not equal to 1, so that the user can check the memory state at the current moment through a preset display interface.
In the embodiment of the present application, the display may be performed through a Web interface. The user can enter the login address of the server to query the health evaluation result of each memory module configured in the server, as shown in the display interface schematic of fig. 10.
In addition, the preset display interface is the front-end Web interface, through which the user can also view the detailed information of each memory module, as shown in fig. 11. If the health score of a memory module is less than 60 points or its health influence factor is not 1, an early warning is actively issued to the user.
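The warning condition can be sketched as a simple predicate over each memory module. The value 60 is read here as a score in points, and the module names and the dictionary layout are hypothetical.

```python
TARGET_HEALTH_THRESHOLD = 60  # example threshold from the text, read as points


def needs_warning(score, factor, threshold=TARGET_HEALTH_THRESHOLD):
    """Warn when the score is below the threshold or the factor is not 1."""
    return score < threshold or factor != 1


def modules_to_warn(modules):
    """Return the names of modules that should trigger an early warning.

    `modules` maps a module name to a (score, factor) pair.
    """
    return [
        name
        for name, (score, factor) in modules.items()
        if needs_warning(score, factor)
    ]
```

The exact comparison `factor != 1` is intentional: the factor is set to exactly 1 whenever the indirect factors report no abnormality.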
According to the embodiment of the application, through analyzing various factors influencing the healthy operation of the memory of the server, memory data corresponding to different influencing factors are divided into two types, namely first memory data and second memory data, wherein first historical memory data corresponding to first memory data can generate a preset memory health evaluation model, and the first memory data can determine a preliminary memory health score corresponding to the memory according to the preset memory health evaluation model; secondly, the second historical memory data can be processed to train a preset initial model so as to generate an input/output model, the health degree influence factor of the memory can be determined after the second memory data is input into the trained input/output model, and then the current memory state can be determined according to the preliminary memory health degree score and the health degree influence factor.
In addition, the embodiment of the application can effectively quantify memory health as a scalar score. Meanwhile, memory abnormalities can be warned of in advance according to the memory health score and the health influence factor value, and when the memory health score falls to the abnormality threshold, the user is prompted to replace the module with healthy hardware.
Referring to fig. 5, fig. 5 shows a memory state detection apparatus provided by an embodiment of the present application, where the apparatus may include:
an obtaining module 501, configured to obtain memory data;
the dividing module 502 is configured to divide the memory data into first memory data and second memory data according to the type of the memory data;
a first determining module 503, configured to determine a preliminary memory health score according to a preset memory health evaluation model and the first memory data, and process the second memory data through an input/output model to determine a health impact factor, where the preset memory health evaluation model is determined based on first historical memory data, and the input/output model is generated by training a preset initial model based on second historical memory data;
a second determining module 504, configured to determine a current memory state according to the preliminary memory health score and the health impact factor.
According to the embodiment of the application, through analyzing various factors influencing the healthy operation of the memory of the server, memory data corresponding to different influencing factors are divided into two types, namely first memory data and second memory data, wherein first historical memory data corresponding to first memory data can generate a preset memory health evaluation model, and the first memory data can determine a preliminary memory health score corresponding to the memory according to the preset memory health evaluation model; secondly, the second historical memory data can be processed to train a preset initial model so as to generate an input/output model, the health degree influence factor of the memory can be determined after the second memory data is input into the trained input/output model, and then the current memory state can be determined according to the preliminary memory health degree score and the health degree influence factor.
The embodiment of the application also provides a communication device, as shown in fig. 6, comprising a processor 601, a communication interface 602, a memory 603 and a communication bus 604, wherein the processor 601, the communication interface 602 and the memory 603 complete communication with each other through the communication bus 604,
a memory 603 for storing a computer program;
the processor 601, when executing the program stored in the memory 603, may implement the following steps:
acquiring memory data;
dividing the memory data into first memory data and second memory data according to the type of the memory data;
determining a preliminary memory health score according to a preset memory health evaluation model and the first memory data, processing the second memory data through an input/output model, and determining a health influence factor, wherein the preset memory health evaluation model is determined based on first historical memory data, and the input/output model is generated by training a preset initial model based on second historical memory data;
and determining the current memory state according to the preliminary memory health score and the health influence factor.
The communication bus mentioned by the above terminal may be a peripheral component interconnect standard (Peripheral Component Interconnect, abbreviated as PCI) bus or an extended industry standard architecture (Extended Industry Standard Architecture, abbreviated as EISA) bus, etc. The communication bus may be classified as an address bus, a data bus, a control bus, or the like. For ease of illustration, the bus is represented by only one bold line in the figure, but this does not mean that there is only one bus or only one type of bus.
The communication interface is used for communication between the terminal and other devices.
The memory may include random access memory (Random Access Memory, RAM) or non-volatile memory (non-volatile memory), such as at least one disk memory. Optionally, the memory may also be at least one memory device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU for short), a network processor (Network Processor, NP for short), etc.; it may also be a digital signal processor (Digital Signal Processor, DSP for short), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC for short), a field-programmable gate array (Field-Programmable Gate Array, FPGA for short) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In yet another embodiment of the present application, a computer readable storage medium is provided, in which instructions are stored, which when run on a computer, cause the computer to perform the memory state detection described in any of the above embodiments.
In yet another embodiment of the present application, a computer program product comprising instructions that, when executed on a computer, cause the computer to perform the memory state detection of any of the embodiments described above is also provided.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or third database to another website, computer, server, or third database by wired means (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or third database, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)), etc.
It is noted that relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
In this specification, the embodiments are described in a related manner. For identical and similar parts, the embodiments may be referred to one another, and each embodiment mainly describes its differences from the other embodiments. In particular, because the system embodiments are substantially similar to the method embodiments, their description is relatively brief; for relevant details, refer to the corresponding parts of the description of the method embodiments.
The foregoing description is only of the preferred embodiments of the present application and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application are included in the protection scope of the present application.

Claims (19)

1. A memory state detection method, characterized by comprising the following steps:
acquiring memory data;
dividing the memory data into first memory data and second memory data according to the type of the memory data;
determining a preliminary memory health score according to a preset memory health evaluation model and the first memory data, processing the second memory data through an input/output model, and determining a health influence factor, wherein the preset memory health evaluation model is determined based on first historical memory data, and the input/output model is generated by training a preset initial model based on second historical memory data;
determining a current memory state according to the preliminary memory health score and the health influence factor;
wherein the second memory data comprises actual input data and actual output data, and the actual input data and the actual output data are in one-to-one correspondence;
wherein processing the second memory data through the input-output model and determining the health influence factor comprises:
inputting the actual input data into the input-output model to obtain predicted output data corresponding to the actual input data;
comparing the predicted output data with the actual output data to obtain an error value;
and determining the health influence factor according to the error value.
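As a non-limiting illustration of the comparing step of claim 1, the following Python sketch computes an error value from predicted and actual output data. The mean absolute difference used here is an assumption, since the claim only requires that the two outputs be compared; `error_value` and `toy_model` are hypothetical names, and `toy_model` is a stand-in for the trained input-output model rather than the model of the application.

```python
from typing import Callable, Sequence


def error_value(
    model: Callable[[Sequence[float]], float],
    actual_inputs: Sequence[Sequence[float]],
    actual_outputs: Sequence[float],
) -> float:
    """Return the mean absolute difference between predicted and actual outputs.

    The mean absolute difference is an assumed metric; the claim only
    requires that the two outputs be compared to obtain an error value.
    """
    if len(actual_inputs) != len(actual_outputs):
        raise ValueError("inputs and outputs must correspond one to one")
    if not actual_inputs:
        raise ValueError("at least one sample is required")
    # Predict one output per actual input, then compare pairwise.
    diffs = [abs(model(x) - y) for x, y in zip(actual_inputs, actual_outputs)]
    return sum(diffs) / len(diffs)


def toy_model(x: Sequence[float]) -> float:
    """Hypothetical stand-in for the trained input-output model."""
    return 2.0 * x[0] + 1.0 * x[1] + 0.5 * x[2]
```

For example, `error_value(toy_model, [[1.0, 3.0, 4.0], [2.0, 1.0, 2.0]], [8.0, 3.0])` compares predictions of 7.0 and 6.0 with actual outputs of 8.0 and 3.0.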
2. The method according to claim 1, wherein before the step of inputting the actual input data into the input-output model to obtain the predicted output data corresponding to the actual input data, the method further comprises:
preprocessing actual input data acquired in a plurality of preset time periods to generate a data set corresponding to each preset time period;
screening, from the data sets corresponding to the preset time periods, target data sets for which the memory state is normal in the three preset time periods corresponding to the current moment;
and normalizing the target data sets to obtain training samples.
3. The method of claim 2, wherein after the step of normalizing the target data sets to obtain training samples, the method further comprises:
training a preset initial model according to the training samples to obtain the input-output model.
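As a non-limiting illustration of the screening and normalization steps of claims 2 and 3, the following Python sketch builds training samples. Several points are assumptions: the screening is read as keeping only the data sets of the three most recent periods whose memory state was normal, min-max scaling is chosen as the normalization, and the `state` and `rows` keys and both function names are hypothetical.

```python
def min_max_normalize(rows):
    """Scale every column to [0, 1]; a constant column maps to 0.0."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [0.0 if hi == lo else (v - lo) / (hi - lo)
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def build_training_samples(period_sets, last_n=3):
    """Keep rows from the last `last_n` periods whose state is normal,
    then normalize them (assumed reading of the screening step)."""
    recent = period_sets[-last_n:]
    rows = [row for p in recent if p["state"] == "normal" for row in p["rows"]]
    return min_max_normalize(rows) if rows else []


periods = [
    {"state": "normal", "rows": [[0.0, 0.0]]},        # older than three periods
    {"state": "normal", "rows": [[1.0, 10.0]]},
    {"state": "abnormal", "rows": [[100.0, 100.0]]},  # screened out
    {"state": "normal", "rows": [[3.0, 10.0]]},
]
samples = build_training_samples(periods)
```

The resulting samples would then be passed to whatever preset initial model is chosen; the claims do not fix the model type.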
4. The method of claim 1, wherein determining the health influence factor according to the error value comprises:
setting the health influence factor to 1 in the case that the error value is detected to be smaller than a target error threshold;
and in the case that the error value is detected to be larger than the target error threshold, acquiring a preset memory health policy and determining the health influence factor according to the memory health policy.
5. The method of claim 4, wherein the target error threshold is determined based on a root mean square error corresponding to a target mean square error vector between the predicted output data and the actual output data.
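As a non-limiting illustration of claims 4 and 5, the following Python sketch derives a target error threshold as a root mean square error and maps an error value to a health influence factor. The preset memory health policy is not specified in the claims, so `example_policy` is purely hypothetical, as are the function names; treating an error value equal to the threshold as exceeding it is also an assumption.

```python
import math


def rmse_threshold(predicted, actual):
    """Root mean square error between predicted and actual outputs."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def example_policy(error, threshold):
    """Hypothetical policy: shrink the factor in proportion to the overshoot."""
    if threshold <= 0:
        return 0.0
    return max(0.0, 1.0 - (error - threshold) / threshold)


def health_influence_factor(error, threshold, policy=example_policy):
    """Return 1 below the threshold; otherwise defer to the policy."""
    if error < threshold:
        return 1.0
    # Equality is routed to the policy here, which is an assumption.
    return policy(error, threshold)


threshold = rmse_threshold([0.0, 0.0, 0.0, 0.0], [2.0, 2.0, 2.0, 2.0])
```

Passing the policy as a parameter keeps the threshold test of claim 4 separate from the policy itself, which a deployment would replace with its own rule.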
6. The method of claim 1, wherein determining the current memory state according to the preliminary memory health score and the health influence factor comprises:
determining a target memory health score corresponding to the memory at the current moment according to the preliminary memory health score, the health influence factor at the current moment, and the health influence factor in the last preset time period corresponding to the current moment;
and determining the state corresponding to the memory at the current moment according to the target memory health score.
7. The method of claim 6, wherein the target memory health score is generated by the following formula:
[Formula not recoverable from this text.]
wherein the symbols in the above formula respectively represent: the target memory health score; the preliminary memory health score; the health influence factor at the current moment; the health influence factor in the last preset time period corresponding to the current moment; the mean of the error values between a plurality of predicted output data and the corresponding actual output data; and the target error threshold.
8. The method of claim 6, wherein determining the state of the memory corresponding to the current time based on the target memory health score comprises:
sending early warning information to a user in the case that the target memory health score is detected to be smaller than a target health threshold, or that the health influence factor is smaller than or larger than 1, so that the user can check the memory state at the current moment through a preset display interface.
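As a non-limiting illustration of the warning condition of claim 8, the following Python sketch returns whether early warning information should be sent. The function name `needs_warning` and the tolerance `tol`, used to compare a floating-point factor with 1, are assumptions not found in the claim.

```python
def needs_warning(target_score, health_threshold, factor, tol=1e-9):
    """Warn when the score is below the threshold or the factor differs from 1."""
    score_too_low = target_score < health_threshold
    factor_not_one = abs(factor - 1.0) > tol
    return score_too_low or factor_not_one
```

For instance, a score of 80 against a threshold of 60 with a factor of exactly 1 raises no warning, whereas the same score with a factor of 0.7 or 1.2 does.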
9. The memory state detection method of claim 1, wherein the first memory data is obtained from a register log and comprises at least one of: memory hard failures, a number of memory errors, a memory error type, and an enabled error repair operation.
10. The memory state detection method according to claim 9, wherein the preset memory health evaluation model includes:
deducting a first preset score under the condition that a preset number of memory hard faults are detected;
deducting a second preset score when the number of memory errors in the preset time is detected to be larger than a preset threshold value;
determining a third preset score corresponding to the memory error type according to the memory error type;
and determining a fourth preset score corresponding to the enabling error repair operation according to the enabling error repair operation.
11. The memory state detection method of claim 10, wherein the memory error type comprises at least one of: memory hard errors, memory soft errors, SRAO errors (action optional), UCNA errors (no action required), SRAR errors (action required), and sudden fatal errors.
12. The memory state detection method according to claim 11, wherein the determining, according to the memory error type, a third preset score corresponding to the memory error type includes:
deducting the third preset score under the condition that the memory error type is detected to be one of a memory hard error, a memory soft error, an SRAO error, a UCNA error and an SRAR error;
and setting the preliminary memory health score to 0 in the case that the memory error type is detected to be a sudden fatal error.
13. The memory state detection method of claim 10, wherein the enabled error repair operation comprises at least one of: consuming the technology of replacing a failed row with a redundant row within a memory chip, enabling a Bank-level adaptive dual DRAM device correction (ADDDC) function, enabling a Rank-level adaptive dual DRAM device correction (ADDDC) function, and enabling a storm suppression function.
14. The memory state detection method according to claim 13, wherein the scores corresponding to the enabled error repair operations are ordered from small to large as follows: consuming PCLS, enabling the Bank-level ADDDC function, enabling the Rank-level ADDDC function, and enabling the storm suppression function.
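As a non-limiting illustration of the rule-based evaluation in claims 10 to 14, the following Python sketch computes a preliminary memory health score by deduction. The starting score of 100, every deduction value, both limits, and all identifiers are hypothetical. Applying one deduction per distinct error type, and reading the ordering of claim 14 as increasing deduction sizes, are likewise assumptions.

```python
from dataclasses import dataclass, field

FATAL = "fatal"
DEDUCTIBLE_TYPES = {"hard", "soft", "SRAO", "UCNA", "SRAR"}
# Hypothetical deductions, increasing in the order listed in claim 14.
REPAIR_DEDUCTION = {
    "PCLS": 2,
    "BANK_ADDDC": 4,
    "RANK_ADDDC": 6,
    "STORM_SUPPRESSION": 8,
}


@dataclass
class FirstMemoryData:
    hard_failures: int = 0
    error_count: int = 0
    error_types: list = field(default_factory=list)
    repairs: list = field(default_factory=list)


def preliminary_score(data, hard_failure_limit=1, error_limit=100):
    """Start from 100 (assumed) and apply the four kinds of deduction."""
    if FATAL in data.error_types:
        return 0  # a sudden fatal error sets the score to 0
    score = 100
    if data.hard_failures >= hard_failure_limit:
        score -= 20  # first preset score (hypothetical value)
    if data.error_count > error_limit:
        score -= 10  # second preset score (hypothetical value)
    # Third preset score: 5 points per distinct deductible error type.
    score -= 5 * len(set(data.error_types) & DEDUCTIBLE_TYPES)
    # Fourth preset score: depends on which repair operations are enabled.
    score -= sum(REPAIR_DEDUCTION.get(r, 0) for r in set(data.repairs))
    return max(score, 0)


sample = FirstMemoryData(
    hard_failures=1,
    error_count=150,
    error_types=["soft", "SRAO"],
    repairs=["PCLS", "RANK_ADDDC"],
)
```

Keeping the deduction values in a table rather than in code makes it straightforward to re-derive them from historical memory data, as the preset model of claim 1 is said to be.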
15. The memory state detection method according to claim 1, wherein the data set corresponding to the input-output model includes four columns of vector data, and the input-output model includes three input data corresponding to one output data.
16. The method of claim 15, wherein the three input data includes average voltage, average frequency during operation, and average erase speed data corresponding to the memory, and the output data includes average temperature data corresponding to the memory.
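As a non-limiting illustration of the four-column data set in claims 15 and 16, the following Python sketch splits each row into three inputs and one output. The column order, the field names, and the function name `split_samples` are assumptions; the claims do not state units, so none are implied here.

```python
from typing import NamedTuple


class MemorySample(NamedTuple):
    avg_voltage: float
    avg_frequency: float
    avg_erase_speed: float
    avg_temperature: float  # the single output column


def split_samples(rows):
    """Split four-column rows into (three-input tuples, outputs)."""
    inputs, outputs = [], []
    for row in rows:
        if len(row) != 4:
            raise ValueError("each row must have exactly four columns")
        s = MemorySample(*row)
        inputs.append((s.avg_voltage, s.avg_frequency, s.avg_erase_speed))
        outputs.append(s.avg_temperature)
    return inputs, outputs
```

The two returned lists stay index-aligned, which preserves the one-to-one correspondence between actual input data and actual output data.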
17. A memory state detection apparatus, the apparatus comprising:
the acquisition module is used for acquiring the memory data;
the dividing module is used for dividing the memory data into first memory data and second memory data according to the type of the memory data;
the first determining module is used for determining a preliminary memory health score according to a preset memory health evaluation model and the first memory data, processing the second memory data through an input/output model and determining health influence factors, wherein the preset memory health evaluation model is determined based on first historical memory data, and the input/output model is generated by training a preset initial model based on second historical memory data;
the second determining module is used for determining the current memory state according to the preliminary memory health score and the health influence factor;
wherein the second memory data comprises actual input data and actual output data, and the actual input data and the actual output data are in one-to-one correspondence;
wherein processing the second memory data through the input-output model and determining the health influence factor comprises:
inputting actual input data into an input-output model to obtain predicted output data corresponding to the actual input data;
comparing the predicted output data with the actual output data to obtain an error value;
and determining a health degree influence factor according to the error value.
18. A communication device, comprising: a transceiver, a memory, a processor, and a program stored on the memory and executable on the processor;
the processor is configured to read the program in the memory to implement the memory state detection method according to any one of claims 1 to 16.
19. A readable storage medium storing a program, wherein the program, when executed by a processor, implements the memory state detection method according to any one of claims 1 to 16.
CN202310935420.8A 2023-07-28 2023-07-28 Memory state detection method, device, communication equipment and storage medium Active CN116680112B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310935420.8A CN116680112B (en) 2023-07-28 2023-07-28 Memory state detection method, device, communication equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310935420.8A CN116680112B (en) 2023-07-28 2023-07-28 Memory state detection method, device, communication equipment and storage medium

Publications (2)

Publication Number Publication Date
CN116680112A CN116680112A (en) 2023-09-01
CN116680112B true CN116680112B (en) 2023-11-03

Family

ID=87785814

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310935420.8A Active CN116680112B (en) 2023-07-28 2023-07-28 Memory state detection method, device, communication equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116680112B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109522175A (en) * 2017-09-18 2019-03-26 华为技术有限公司 A kind of method and device of memory assessment
CN114883013A (en) * 2022-06-08 2022-08-09 中国工商银行股份有限公司 Health state evaluation method and device and computer equipment
CN115186924A (en) * 2022-07-28 2022-10-14 网思科技股份有限公司 Equipment health state evaluation method and device based on artificial intelligence
CN115793990A (en) * 2023-02-06 2023-03-14 天翼云科技有限公司 Memory health state determination method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN116680112A (en) 2023-09-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant