WO2023142429A1 - Method for predicting uncorrectable errors of a volatile storage medium and related device - Google Patents

Method for predicting uncorrectable errors of a volatile storage medium and related device

Info

Publication number
WO2023142429A1
WO2023142429A1 PCT/CN2022/111694 CN2022111694W WO2023142429A1 WO 2023142429 A1 WO2023142429 A1 WO 2023142429A1 CN 2022111694 W CN2022111694 W CN 2022111694W WO 2023142429 A1 WO2023142429 A1 WO 2023142429A1
Authority
WO
WIPO (PCT)
Prior art keywords
storage medium
volatile storage
error
work information
failure
Prior art date
Application number
PCT/CN2022/111694
Other languages
English (en)
French (fr)
Inventor
董伟
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2023142429A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08 Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F11/1008 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices
    • G06F11/1044 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices with specific ECC/EDC distribution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08 Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C29/00 Checking stores for correct operation; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/04 Detection or location of defective memory elements, e.g. cell construction details, timing of test signals

Definitions

  • the embodiments of the present application relate to the field of memory, and mainly relate to a method for predicting uncorrectable errors of volatile storage media, a computing device, computing equipment, a chip system, and a computer-readable storage medium.
  • the problem of errors occurring in the volatile storage medium becomes more and more prominent. Errors that occur on volatile storage media can be classified into correctable errors and uncorrectable errors.
  • when a correctable error occurs, the computing device can correct the error in time, so the correctable error has little impact on the storage device or the computing device, and the health status of the volatile storage medium is good at this time.
  • the storage device may be in the computing device, or the storage device may be connected to the computing device.
  • when an uncorrectable error occurs, the computing device cannot correct the error, which interrupts the work of the storage device or the computing device, or even causes downtime of the computing device. At this time, the health status of the volatile storage medium is poor.
  • Embodiments of the present application provide a method for predicting uncorrectable errors of volatile storage media, computing devices, computing equipment, chip systems, and computer-readable storage media, which can predict uncorrectable errors of a volatile storage medium and thereby judge the health status of the volatile storage medium.
  • a method for predicting uncorrectable errors of a volatile storage medium includes: obtaining a work information set of the volatile storage medium in a storage device; and determining, according to the work information set and a prediction model, a risk assessment result of an uncorrectable error occurring in the volatile storage medium.
  • the storage device may be in or connected to the computing device.
  • the storage device may be a storage medium, such as memory or cache.
  • the storage device may also include a non-volatile storage medium, such as a solid-state hard disk, and the volatile storage medium may be a cache memory (cache) in the solid-state hard disk.
  • the set of work information includes information on correctable errors that occur in volatile storage media
  • the information on correctable errors includes any one or more of the following: the time when the correctable error occurred, the address in the volatile storage medium of the erroneous data of the correctable error, or the erroneous data of the correctable error.
  • the risk assessment result of an uncorrectable error occurring in the volatile storage medium may be directly determined according to the work information set and the prediction model of the volatile storage medium.
  • the failure cause of the volatile storage medium may be determined according to the work information set of the volatile storage medium and the first prediction model in the prediction model, thereby determining the risk assessment result of an uncorrectable error occurring in the volatile storage medium .
  • the risk assessment result of an uncorrectable error occurring in a volatile storage medium includes any of the following: high risk, medium risk, or low risk. If the risk assessment result of the uncorrectable error occurring in the volatile storage medium is high risk, it means that the health status of the volatile storage medium is poor and needs to be replaced. If the risk assessment result of an uncorrectable error in a volatile storage medium is low risk, it means that the volatile storage medium is in good health and does not need to be replaced.
  • the computing device can determine the risk assessment result of an uncorrectable error occurring in the volatile storage medium based on the correctable error information and the prediction model of the volatile storage medium in the storage device, thereby judging the health status of the volatile storage medium.
  • the computing device can guide the user to replace the volatile storage medium according to the health status of the volatile storage medium, so as to avoid affecting the normal operation of the storage device or the volatile storage medium.
  • the prediction model includes a first prediction model and a second prediction model; the failure cause is determined according to the work information set and the first prediction model, and the risk assessment result is determined according to the failure cause and the second prediction model.
  • the failure cause of the volatile storage medium may be directly determined according to the work information set of the volatile storage medium and the first prediction model.
  • the error feature set of the volatile storage medium may be determined according to the work information set of the volatile storage medium, so as to determine the cause of the failure of the volatile storage medium.
  • the computing device may determine the specific cause of the failure of the volatile storage medium according to the correctable error information of the volatile storage medium and the first prediction model. And the computing device can determine the risk assessment result of the uncorrectable error occurring in the volatile storage medium according to the failure cause of the volatile storage medium and the second prediction model. The computing device can judge the health status of the volatile storage medium according to the risk assessment result of uncorrectable errors in the volatile storage medium, so as to guide the user to replace it, so as to avoid affecting the normal operation of the storage device or the volatile storage medium.
  • when each piece of work information in the work information set includes the address in the volatile storage medium of the erroneous data of a correctable error, and the work information set also includes the total number of accesses to the volatile storage medium, the number of correctable errors is determined according to the number of pieces of work information included in the work information set; the error feature set of the volatile storage medium is then determined; and the failure cause is determined according to the error feature set and the first prediction model.
  • the error feature set includes any one or more of the following: the error rate of the volatile storage medium, the number of correctable errors that occur per unit time, or the distribution of correctable errors across the storage units of the volatile storage medium.
  • the error rate is the ratio of the number of occurrences of correctable errors to the total number of accesses to the volatile storage medium.
  • the number of correctable errors per unit time is the ratio of the number of correctable errors to the length of the statistical period.
  • the storage unit may include any one or more of the following: a storage matrix (bank), a storage row (row), a storage column (column), a storage block (rank), or a bidirectional data bus (data queue, DQ). That is to say, the distribution may indicate whether any one or more of the following are the same for the addresses of the correctable errors: the identifier of the storage matrix, the identifier of the storage row, the identifier of the storage column, the identifier of the storage block, or the identifier of the DQ to which each address belongs.
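  • the three error features described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `WorkInfo` record, its field names, and the `error_features` helper are assumptions introduced here, and the sample records are made-up values.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkInfo:
    """One correctable-error record (hypothetical field names)."""
    timestamp: float  # time the correctable error occurred, in seconds
    rank: int
    bank: int
    row: int
    column: int
    dq: int


def error_features(records, total_accesses, period_seconds):
    """Derive error rate, errors per unit time, and per-unit distribution."""
    n_errors = len(records)  # one record per correctable error
    features = {
        # ratio of correctable errors to total accesses
        "error_rate": n_errors / total_accesses if total_accesses else 0.0,
        # ratio of correctable errors to the length of the statistical period
        "errors_per_second": n_errors / period_seconds if period_seconds else 0.0,
    }
    # Distribution: how many errors share the same rank, bank, row, column, or DQ.
    for unit in ("rank", "bank", "row", "column", "dq"):
        features[f"by_{unit}"] = Counter(getattr(r, unit) for r in records)
    return features


records = [
    WorkInfo(10.0, rank=0, bank=2, row=17, column=5, dq=3),
    WorkInfo(20.0, rank=0, bank=2, row=17, column=9, dq=3),
    WorkInfo(30.0, rank=0, bank=1, row=40, column=5, dq=6),
]
features = error_features(records, total_accesses=1_000_000, period_seconds=60.0)
print(features["error_rate"])             # 3e-06
print(features["errors_per_second"])      # 0.05
print(features["by_row"].most_common(1))  # [(17, 2)]
```

  • a `Counter` per storage unit keeps the distribution easy to inspect; for instance, two of the three sample errors fall in row 17.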
  • the computing device can determine the error feature set of the volatile storage medium according to the correctable error information of the volatile storage medium, so as to determine the specific cause of the failure of the volatile storage medium .
  • the computing device may also determine a risk assessment result of an uncorrectable error occurring in the volatile storage medium according to the failure cause of the volatile storage medium and the second prediction model.
  • the failure cause of the volatile storage medium includes any one or more of the following: capacitor leakage, word line failure, sub-word line driver failure, main word line driver failure, bit line failure, sense amplifier failure, memory matrix control circuit failure, poor contact, or insufficient signal margin.
  • the computing device can determine the specific type of failure cause of the volatile storage medium according to the work information set of the volatile storage medium and the first prediction model, so as to determine the risk assessment result of an uncorrectable error occurring in the volatile storage medium.
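  • the two-stage flow (error features to failure cause, then failure cause to risk level) might be wired together as in the sketch below. The text does not say how either prediction model works, so both models here are placeholder rules invented purely for illustration: the 0.8 concentration threshold, the row/column heuristics, and the cause-to-risk table are all assumptions.

```python
from collections import Counter

# Placeholder table: the mapping from failure cause to risk level is NOT
# given in the text; these values only make the pipeline runnable.
PLACEHOLDER_RISK = {
    "word line failure": "high",
    "bit line failure": "medium",
    "capacitor leakage": "low",
}


def first_model(rows, columns, threshold=0.8):
    """Stand-in for the first prediction model: error addresses -> failure cause."""
    if not rows:
        return None
    top_row_share = Counter(rows).most_common(1)[0][1] / len(rows)
    top_col_share = Counter(columns).most_common(1)[0][1] / len(columns)
    if top_row_share >= threshold:  # errors concentrated in one row
        return "word line failure"
    if top_col_share >= threshold:  # errors concentrated in one column
        return "bit line failure"
    return "capacitor leakage"      # scattered errors (placeholder default)


def second_model(cause):
    """Stand-in for the second prediction model: failure cause -> risk level."""
    return PLACEHOLDER_RISK.get(cause, "low")


def predict(rows, columns):
    cause = first_model(rows, columns)
    return cause, second_model(cause)


print(predict(rows=[17] * 5 + [40], columns=[1, 5, 9, 12, 3, 7]))
# ('word line failure', 'high')
```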
  • when each piece of work information in the work information set includes the erroneous data of a correctable error, a logical operation is performed on the erroneous data included in each piece of work information and the correct data corresponding to that erroneous data, to obtain the operation result corresponding to each piece of work information; the risk assessment result is determined according to the uncorrectable error model, the operation result corresponding to each piece of work information, and the prediction model.
  • the logical operation may be any one of logical operations such as an exclusive OR operation, an AND operation, or an OR operation.
  • the uncorrectable error model is data determined according to the error correction algorithm of the volatile storage medium.
  • the computing device may obtain the operation result of the erroneous data and the correct data according to the erroneous data of a correctable error that occurred in the volatile storage medium and the corresponding correct data.
  • the computing device can also determine the risk assessment result of an uncorrectable error occurring in the volatile storage medium according to the uncorrectable error model, the calculation result and the prediction model.
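  • taking exclusive OR as the logical operation, the operation result marks exactly the bit positions where the erroneous data and the correct data differ. A minimal sketch, with hypothetical helper names and made-up 8-bit sample values:

```python
def xor_result(error_data: int, correct_data: int) -> int:
    """XOR of erroneous and correct data: set bits mark the flipped positions."""
    return error_data ^ correct_data


def flipped_bits(result: int) -> list[int]:
    """Bit positions (bit 0 = least significant) that are set in the result."""
    return [i for i in range(result.bit_length()) if (result >> i) & 1]


result = xor_result(0b10010111, 0b10010110)
print(f"{result:08b}")       # 00000001
print(flipped_bits(result))  # [0]
```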
  • the computing device can judge the health status of the volatile storage medium according to the risk assessment result of uncorrectable errors in the volatile storage medium, so as to guide the user to replace it, so as to avoid affecting the normal operation of the storage device or the volatile storage medium.
  • the uncorrectable error model is compared with the operation result corresponding to each piece of work information to obtain the similarity corresponding to each piece of work information; the risk assessment result corresponding to each piece of work information is determined according to the corresponding similarity and the prediction model; and the highest-level risk assessment result is used as the risk assessment result of an uncorrectable error occurring in the volatile storage medium.
  • the similarity corresponding to each piece of work information is the similarity between the calculation result corresponding to each piece of work information and the uncorrectable error model.
  • if the similarity corresponding to a piece of work information is high, it can mean that the probability that the correctable error in this piece of work information cannot be corrected by the error correction algorithm is high, that is, it can be determined that the risk assessment result corresponding to this piece of work information is high risk. If the similarity corresponding to a piece of work information is low, it can mean that the probability that the correctable error in this piece of work information cannot be corrected by the error correction algorithm is small, that is, it can be determined that the risk assessment result corresponding to this piece of work information is low risk.
  • the computing device can obtain the similarity corresponding to each piece of work information, determine the risk assessment result corresponding to each piece of work information according to that similarity, and use the highest-level risk assessment result as the risk assessment result of an uncorrectable error occurring in the volatile storage medium.
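  • the per-record similarity and the "highest level wins" rule could look like the sketch below. The text does not define the similarity measure or the thresholds, so everything specific here is an assumption: similarity is taken as the fraction of the pattern's bits that are also set in the operation result, the pattern value is made up, and the 0.75 and 0.4 cut-offs are arbitrary.

```python
RISK_ORDER = {"low": 0, "medium": 1, "high": 2}


def similarity(result: int, pattern: int) -> float:
    """Assumed metric: share of the pattern's set bits also set in the result."""
    pattern_bits = bin(pattern).count("1")
    if pattern_bits == 0:
        return 0.0
    return bin(result & pattern).count("1") / pattern_bits


def risk_for(sim: float, high: float = 0.75, medium: float = 0.4) -> str:
    """Map a similarity to a risk level using illustrative thresholds."""
    if sim >= high:
        return "high"
    if sim >= medium:
        return "medium"
    return "low"


def overall_risk(results, pattern) -> str:
    """Assess each record, then keep the highest risk level."""
    per_record = [risk_for(similarity(r, pattern)) for r in results]
    return max(per_record, key=RISK_ORDER.__getitem__, default="low")


pattern = 0b00001111                            # hypothetical pattern
results = [0b00000001, 0b00000111, 0b00010000]  # XOR results, one per record
print([similarity(r, pattern) for r in results])  # [0.25, 0.75, 0.0]
print(overall_risk(results, pattern))             # high
```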
  • the computing device can judge the health status of the volatile storage medium according to the risk assessment result of uncorrectable errors in the volatile storage medium, so as to guide the user to replace it, so as to avoid affecting the normal operation of the storage device or the volatile storage medium.
  • in a second aspect, a computing device is provided, which includes modules for implementing the first aspect or any possible implementation of the first aspect.
  • in a third aspect, a computing device is provided, which includes a processor; the processor is coupled with a memory, and reads and executes instructions and/or program code in the memory to implement the first aspect or any possible implementation of the first aspect.
  • in a fourth aspect, a chip system is provided, which includes a logic circuit; the logic circuit is coupled with an input/output interface and transmits data through the input/output interface, so as to perform the first aspect or any possible implementation of the first aspect.
  • a computer-readable storage medium is provided, which stores program code; when the program code is run on a computer, the computer executes the first aspect or any possible implementation of the first aspect.
  • an embodiment of the present application provides a computer program product, the computer program product comprising computer program code; when the computer program code is run on a computer, the computer is made to execute the first aspect or any possible implementation of the first aspect.
  • FIG. 1 is a schematic system architecture diagram of a computing device.
  • Fig. 2 is a schematic flowchart of a method for predicting uncorrectable errors of a volatile storage medium according to an embodiment of the present application.
  • Fig. 3 is a schematic flowchart of a method for predicting uncorrectable errors of a volatile storage medium according to another embodiment of the present application.
  • Fig. 4 is a schematic flowchart of a method for predicting uncorrectable errors of a volatile storage medium according to another embodiment of the present application.
  • Fig. 5 is a schematic flowchart of a method for predicting uncorrectable errors of a volatile storage medium according to another embodiment of the present application.
  • Fig. 6 is a schematic structural diagram of a computing device according to an embodiment of the present application.
  • the storage device in the embodiment of the present application can be a volatile memory, such as a memory, a cache, a random access memory (RAM), a static random access memory (SRAM), a dynamic random access memory (DRAM), a synchronous dynamic random access memory (SDRAM), a dual in-line memory module (DIMM), an unbuffered dual in-line memory module (unbuffered DIMM, UDIMM), a registered dual in-line memory module (registered DIMM, RDIMM), a load-reduced dual in-line memory module (load reduced DIMM, LRDIMM), a double data rate synchronous dynamic random access memory (DDR SDRAM), a graphics double data rate synchronous dynamic random access memory (GDDR SDRAM), a low-power double data rate synchronous dynamic random access memory (LPDDR SDRAM), or a high bandwidth memory (HBM).
  • the storage device in the embodiment of the present application may also be a memory including a volatile storage medium and a nonvolatile storage medium, such as a solid state hard disk.
  • the volatile storage medium in the storage device may be a high-speed cache (cache) in the solid-state disk.
  • the storage device in this embodiment of the present application may be a cache outside the core of a processor or a system on chip (SOC).
  • the processor can be a central processing unit (CPU) or a graphics processing unit (GPU), etc.
  • the storage device can be a first-level cache (level 1 cache, L1 cache), a second-level cache (level 2 cache, L2 cache), etc.; the embodiments of the present application are not limited thereto.
  • FIG. 1 is a schematic system architecture diagram of a computing device 100 .
  • the computing device 100 may include a processor 110, a control circuit 111, an arithmetic circuit 112, a cache controller 113, a cache 114, a memory controller 120, a memory 121, an external memory interface 130, a speaker 140, a display screen 150, and the like.
  • the structure illustrated in the embodiment of the present application does not constitute a specific limitation on the computing device 100 .
  • the computing device 100 may include more or fewer components than shown, or combine certain components, or separate certain components, or arrange different components.
  • the illustrated components can be realized in hardware, software or a combination of software and hardware.
  • the processor 110 includes a control circuit 111 , an operation circuit 112 , a cache controller 113 and a cache 114 .
  • computing device 100 may also include one or more processors 110 .
  • the processor 110 may be a CPU or a GPU or the like.
  • the processor 110 may obtain the working information set of the volatile memory from the controller of the volatile memory, so as to determine the risk assessment result of an uncorrectable error occurring in the volatile memory, so as to judge the health status of the volatile memory.
  • the processor 110 may obtain the working information set of the cache 114 from the cache controller 113 , so as to determine the risk assessment result of an uncorrectable error occurring in the cache 114 .
  • the processor 110 may obtain a set of working information of the memory 121 from a hardware register in the memory controller 120, so as to determine a risk assessment result of an uncorrectable error occurring in the memory 121, so as to judge the health status of the memory.
  • the processor 110 may also obtain an uncorrectable error pattern (pattern), which is data determined according to an error correction algorithm of the volatile memory.
  • the processor 110 may determine a risk assessment result of an uncorrectable error occurring in the volatile memory according to the uncorrectable error pattern and the working information set of the volatile memory. When the probability of an uncorrectable error occurring in the volatile memory is low, the health status of the volatile memory is better, and the volatile memory does not need to be replaced at this time.
  • the volatile memory may be cache 114 or memory 121 .
  • the volatile memory may be a volatile memory connected to the processor 110 through the external memory interface 130 or a non-volatile memory including a volatile storage medium. The embodiment of this application is not limited.
  • the control circuit 111 may include an instruction register, an instruction decoder, and an operation controller.
  • the control circuit 111 can obtain one or more instructions from the cache 114 or the memory 121 according to a preset program.
  • the control circuit 111 can also determine the operations to be performed according to the obtained instructions, and send micro-operation control signals to corresponding components.
  • the arithmetic circuit 112 can obtain data from the cache memory 114 according to the control instruction from the control circuit 111 and perform arithmetic or logic operations.
  • the cache 114 can store instructions or data that have just been used or are used repeatedly by the control circuit 111. If the control circuit 111 needs to use the instruction or data again, it can be fetched directly from the cache 114. In this way, repeated access is avoided and the waiting time of the control circuit 111 is reduced, thereby improving the efficiency of the computing device 100 in processing data or executing instructions.
  • the cache controller 113 may detect whether an error occurs in the cache, and the error may be a correctable error or an uncorrectable error. The cache controller 113 may also collect working information of the cache 114 when detecting that a correctable error occurs in the cache, so that the processor 110 may obtain a set of working information of the cache 114 through the cache controller 113 .
  • the working information set of the cache 114 includes information on the correctable errors that occurred in the cache 114, and the correctable error information may include any one or more of the following: the time when the correctable error occurred, the address in the cache of the erroneous data of the correctable error, or the erroneous data of the correctable error.
  • the working information set of the cache 114 may also include the total access times of the cache 114 or the correct data corresponding to the correctable error data.
  • the cache controller 113 may detect whether an error occurs in the cache by using an error correction code (error correction code, ECC) algorithm.
  • the specific detection method is: when data is written into the cache, the ECC algorithm can generate a first error check code according to the data and add it to the extra data bits of the data, and the data and the first error check code are saved in the cache.
  • when the data is read from the cache, the ECC algorithm can generate a second error check code according to the read data, and compare the first error check code with the second error check code to detect whether an error has occurred in the cache.
  • if the first error check code is the same as the second error check code, it means that no error has occurred in the cache 114; if the first error check code and the second error check code are different, it means that an error has occurred in the cache 114. If the error that occurred in the cache 114 is a correctable error, the specific erroneous data bit can be determined by using the first error check code and the second error check code, so as to obtain the correct data. If the error that occurred in the cache is an uncorrectable error, the correct data cannot be obtained from the first error check code and the second error check code. That is to say, when an uncorrectable error occurs in the cache, the data read from the cache is erroneous data, and the erroneous data may affect the entire computing device.
  • for example, bits 0 to 7 of the written data are 0, 1, 1, 0, 1, 0, 0, and 1 respectively, that is, the written data is 10010110.
  • according to the ECC algorithm, an XOR operation is performed on bits 0, 2, 4, and 6 of the written data, and the check bit for bits 0, 2, 4, and 6 of the written data is 0.
  • similarly, the check bit for bits 0, 1, 4, and 5 of the written data is 0, the check bit for bits 0, 1, 2, and 3 of the written data is 0, and the check bit for bits 4, 5, 6, and 7 of the written data is 0.
  • according to the ECC algorithm, an XOR operation is performed on bits 0 to 7 of the written data, and the row parity bit of the written data is 0. That is to say, according to the written data 10010110, it can be determined that the first error check code of the written data is 00000.
  • digits 0 to 4 of the first error check code are respectively the check bit for bits 0, 2, 4, and 6 of the written data, the check bit for bits 0, 1, 4, and 5, the check bit for bits 0, 1, 2, and 3, the check bit for bits 4, 5, 6, and 7, and the row parity bit of the written data.
  • for example, if the read data is 10010111, the second error check code of the read data is 10111 according to the ECC algorithm. Since the second error check code is different from the first error check code, it can be determined that an error has occurred. Since the check bit for bits 4, 5, 6, and 7 of the read data is 0 and the remaining check bits are 1, it can be assumed that a one-bit data error has occurred in the read data. At the same time, since the check bit for bits 4, 5, 6, and 7 of the read data is the same as the check bit for bits 4, 5, 6, and 7 of the written data, it can be determined that bits 4, 5, 6, and 7 of the read data contain no error.
  • since the check bit for bits 0, 2, 4, and 6 of the read data is 1, the check bit for bits 0, 1, 4, and 5 of the read data is 1, and the check bit for bits 0, 1, 2, and 3 of the read data is 1, it can be determined that the erroneous data bit is bit 0. Repairing bit 0 of the read data yields 10010110.
  • the third error check code obtained for the repaired data is 00000, which is the same as the first error check code. Therefore, according to the ECC algorithm, the read data can be repaired to 10010110. Since the repaired data is consistent with the written data, the occurrence of a correctable error will not affect the computing device 100.
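  • the single-bit example above can be reproduced with a short sketch. The five check groups are taken from the text; the read value 10010111 is inferred from the repaired value, and the function names and the digit ordering (digit 0 written last) are assumptions chosen so that the printed codes match the ones quoted above.

```python
# Check groups as described in the text (bit 0 = least significant bit).
CHECK_GROUPS = [
    (0, 2, 4, 6),     # check digit 0
    (0, 1, 4, 5),     # check digit 1
    (0, 1, 2, 3),     # check digit 2
    (4, 5, 6, 7),     # check digit 3
    tuple(range(8)),  # check digit 4: parity over all eight bits
]


def parity(data: int, positions) -> int:
    """XOR of the selected bits of data."""
    p = 0
    for pos in positions:
        p ^= (data >> pos) & 1
    return p


def check_code(data: int) -> list[int]:
    return [parity(data, group) for group in CHECK_GROUPS]


def code_str(code) -> str:
    """Write digit 4 first and digit 0 last, as the codes appear in the text."""
    return "".join(str(bit) for bit in reversed(code))


def locate_single_error(first, second):
    """Return the flipped bit position for a single-bit error, else None."""
    diff = [a ^ b for a, b in zip(first, second)]
    if diff[4] == 0:
        return None  # overall parity unchanged: not a single-bit error
    candidates = set(range(8))
    for group, changed in zip(CHECK_GROUPS[:4], diff[:4]):
        members = set(group)
        # A changed check bit means the error is inside its group; otherwise outside.
        candidates &= members if changed else (set(range(8)) - members)
    return candidates.pop() if len(candidates) == 1 else None


written, read = 0b10010110, 0b10010111
first, second = check_code(written), check_code(read)
print(code_str(first), code_str(second))  # 00000 10111
bad_bit = locate_single_error(first, second)
repaired = read ^ (1 << bad_bit)
print(bad_bit, f"{repaired:08b}")         # 0 10010110
```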
  • in another example, the second error check code of the read data is 00001 according to the ECC algorithm. Since the second error check code is different from the first error check code, it can be determined that an error has occurred. Since the check bit for bits 0, 2, 4, and 6 of the read data is 1 and the remaining check bits are 0, it can be assumed that two data bit errors have occurred in the read data.
  • the check bit for bits 0, 2, 4, and 6 of the read data is 1, the check bit for bits 0, 1, 4, and 5 of the read data is 0, the check bit for bits 0, 1, 2, and 3 of the read data is 0, and the check bit for bits 4, 5, 6, and 7 of the read data is 0, then the 0, 2, and 4th bits of the read data can be determined 1.
  • the third error check code that can be obtained from the repaired data is 00000, which is the same as the first error check code. Therefore, according to the ECC algorithm, the read data can be repaired as 01011001. Since the repaired data is inconsistent with the written data, an uncorrectable error occurs, which may affect the computing device 100 .
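  • how the check code separates the single-bit case from the two-bit case can be sketched as below. Only the classification step is covered, not the repair attempt described above. The text does not give the read value for this example, so 10010101 (bits 0 and 1 flipped) is a hypothetical value chosen because it reproduces the quoted check code 00001; the function names and labels are also assumptions.

```python
CHECK_GROUPS = [(0, 2, 4, 6), (0, 1, 4, 5), (0, 1, 2, 3), (4, 5, 6, 7), tuple(range(8))]


def check_code(data: int) -> list[int]:
    """Five check digits: four group parities plus parity over all eight bits."""
    return [sum((data >> pos) & 1 for pos in group) % 2 for group in CHECK_GROUPS]


def code_str(code) -> str:
    return "".join(str(bit) for bit in reversed(code))  # digit 0 written last


def classify(first, second) -> str:
    """Classify by comparing the check code at write time and at read time."""
    diff = [a ^ b for a, b in zip(first, second)]
    if not any(diff):
        return "none"    # codes match: no error detected
    if diff[4] == 1:
        return "single"  # overall parity flipped: treated as a one-bit error
    return "double"      # codes differ but overall parity matches: two-bit error assumed


written = 0b10010110
read_two_flips = 0b10010101  # hypothetical read value, bits 0 and 1 flipped
print(code_str(check_code(read_two_flips)))                       # 00001
print(classify(check_code(written), check_code(read_two_flips)))  # double
print(classify(check_code(written), check_code(0b10010111)))      # single
```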
  • the memory controller 120 can control the memory 121, and can be responsible for data exchange between the memory 121 and the processor 110.
  • the memory controller 120 may also detect whether an error occurs in the memory 121, and the error may include a correctable error or an uncorrectable error.
  • the memory controller 120 may collect working information of the memory 121 when a correctable error occurs in the memory, so that the processor 110 may obtain a set of working information of the memory 121 from the memory controller 120 .
  • the work information set of the memory 121 includes information on the correctable errors that occurred in the memory 121, and each piece of work information in the work information set can include any one or more of the following: the time when the correctable error occurred, the address in the memory of the erroneous data of the correctable error, or the erroneous data of the correctable error.
  • the working information set of the memory 121 may also include the total access times of the memory 121 or the correct data corresponding to the correctable error data.
  • the memory controller 120 may detect whether an error occurs in the service memory through an ECC algorithm.
  • the service memory is the memory that is exchanging data with the processor 110 or an external memory.
  • the memory controller 120 may detect whether an error occurs in the service memory through a background hardware engine in the memory controller 120 .
  • the specific implementation method is: the hardware engine reads the data in the service memory in the background without affecting normal reading and writing. If the second error check code calculated from the read data is different from the first error check code, it means that an error has occurred in the service memory.
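  • the background check amounts to re-reading stored data, recomputing its check code, and comparing that with the code stored at write time. A minimal sketch follows; the dictionary layout, the function names, and the single even-parity bit used as the check code are simplifying assumptions, not the ECC scheme of a real memory controller.

```python
def check_code(data: int) -> int:
    """Stand-in check code: one even-parity bit (a real ECC code is richer)."""
    return bin(data).count("1") & 1


def scrub(memory: dict) -> list:
    """Return addresses whose recomputed check code differs from the stored one."""
    return [
        addr
        for addr, (data, stored_code) in sorted(memory.items())
        if check_code(data) != stored_code
    ]


# address -> (data as currently read, check code stored at write time)
memory = {
    0x00: (0b10010110, 0),  # unchanged since it was written
    0x08: (0b10010111, 0),  # one bit flipped after the write
}
print(scrub(memory))  # [8]
```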
  • the memory controller 120 may detect whether an error occurs in the free memory through a memory management module in the memory controller 120 .
  • the specific implementation method is: the memory management module writes the data into the free memory, and then reads the data from the free memory, and compares the data at the time of writing with the data at the time of reading. If the data at the time of writing is the same as the data at the time of reading, it means that there is no error in the free memory. If the data at the time of writing is inconsistent with the data at the time of reading, it means that an error has occurred in the free memory.
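  • the write-then-read-back check of free memory can be sketched with a simulated memory. The `SimulatedMemory` class, its stuck-bit fault model, the `check_free_region` helper, and the 0xA5 test pattern are all assumptions made for illustration.

```python
class SimulatedMemory:
    """Byte-addressable memory with optional stuck-at-zero bits (for illustration)."""

    def __init__(self, size: int, stuck_at_zero=None):
        self._cells = bytearray(size)
        self._stuck = stuck_at_zero or {}  # address -> mask of bits forced to 0

    def write(self, addr: int, value: int) -> None:
        self._cells[addr] = value & ~self._stuck.get(addr, 0) & 0xFF

    def read(self, addr: int) -> int:
        return self._cells[addr]


def check_free_region(mem, start: int, length: int, pattern: int = 0xA5) -> list:
    """Write a pattern to each free address, read it back, and report mismatches."""
    bad = []
    for addr in range(start, start + length):
        mem.write(addr, pattern)
        if mem.read(addr) != pattern:  # data read differs from data written
            bad.append(addr)
    return bad


mem = SimulatedMemory(16, stuck_at_zero={5: 0x01})  # bit 0 of address 5 is stuck
print(check_free_region(mem, 0, 16))  # [5]
```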
  • the external memory interface 130 can be used to connect an external memory, such as a volatile memory or a non-volatile memory, etc., so as to expand the storage capacity of the computing device 100 .
  • the external memory communicates with the processor 110 through the external memory interface 130 to implement a data storage function.
  • the computing device 100 can implement audio functions through the speaker 140, such as playing music and the like.
  • the display screen 150 is used to display text, images, videos and the like.
  • the display screen 150 includes a display panel.
  • the display panel can adopt a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini LED, a Micro LED, a Micro OLED, quantum dot light-emitting diodes (QLED), etc.
  • the computing device 100 realizes the display function through the display screen 150 .
  • computing device 100 may include one or more display screens 150 .
  • the computing device 100 in FIG. 1 can send prompt information to the user through the speaker 140 or the display screen 150 .
  • the prompt information may be used to indicate that an uncorrectable error has occurred in the volatile storage medium in the computing device 100, or to indicate the risk assessment result for an uncorrectable error occurring in that volatile storage medium.
  • the prompt information may be used to indicate the identification information of the volatile storage medium in the computing device 100 where an uncorrectable error occurs.
  • the identification information may include information such as a product number or a specific location of the volatile storage medium where the uncorrectable error occurred.
  • the computing device 100 in FIG. 1 can predict uncorrectable errors of the volatile storage medium and thereby judge its health status, so that the user can be guided to replace the medium before the normal operation of the computing device or the volatile storage medium is affected.
  • Fig. 2 is a schematic flowchart of a method for predicting uncorrectable errors of a volatile storage medium, and the method in Fig. 2 includes the following steps.
  • the computing device may obtain the work information set of the volatile storage medium in the storage device, and the storage device may be in the computing device, or the storage device may be connected to the computing device.
  • the computing device may obtain the work information set of the volatile storage medium continuously in real time, or periodically obtain the work information set of the volatile storage medium.
  • the computing device may also obtain the work information set of the volatile storage medium after the nth correctable error occurs in the volatile storage medium, where n is a preset threshold.
  • the computing device may obtain the work information set of the volatile storage medium after receiving the obtaining instruction, which is not limited in this embodiment of the present application.
  • the work information set may include information on correctable errors that occur in the volatile storage medium, and the information on a correctable error may include any one or more of the following: the time when the correctable error occurred, the address of the erroneous data of the correctable error in the volatile storage medium, or the erroneous data of the correctable error.
  • the information on any one correctable error in the volatile storage medium may be one piece of work information. That is to say, the work information set may include at least one piece of work information, and each piece of work information is information on a correctable error that occurred in the volatile storage medium. Each piece of work information may include any one or more of the following: the time when the correctable error occurred, the address of the erroneous data in the volatile storage medium, or the erroneous data.
  • the address of the erroneous data in the volatile storage medium may include any one or more of the following: the identifier of the storage matrix (bank) to which the erroneous data belongs, the identifier of the storage row (row) in that storage matrix, or the identifier of the storage column (column) in that storage matrix.
  • the address of the erroneous data in the volatile storage medium may also include the identifier of the DQ to which the erroneous data belongs, or the identifier of the storage block (rank) to which the erroneous data belongs in the volatile storage medium.
  • the work information set of the volatile storage medium may also include the total number of accesses to the volatile storage medium, or the correct data corresponding to the erroneous data of a correctable error.
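  • One possible in-memory layout for the work information set is sketched below. The class and field names are hypothetical; every field is optional because the text says each piece of work information may carry "any one or more" of the items, and attaching the correct data to each record (rather than to the set) is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class WorkInfo:
    """Information on one correctable error."""
    timestamp: Optional[float] = None     # time when the error occurred
    rank: Optional[int] = None            # storage block (rank) identifier
    dq: Optional[int] = None              # DQ identifier
    bank: Optional[int] = None            # storage matrix (bank) identifier
    row: Optional[int] = None             # storage row identifier
    column: Optional[int] = None          # storage column identifier
    error_data: Optional[int] = None      # erroneous data as read
    correct_data: Optional[int] = None    # corresponding correct data


@dataclass
class WorkInfoSet:
    records: List[WorkInfo] = field(default_factory=list)
    total_accesses: Optional[int] = None  # total number of accesses to the medium

    def correctable_error_count(self) -> int:
        # The number of correctable errors equals the number of pieces of work information.
        return len(self.records)


example = WorkInfoSet(
    records=[WorkInfo(bank=1, row=3, column=7), WorkInfo(bank=1, row=3, column=7)],
    total_accesses=900,
)
```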
  • the computing device can evaluate the risk of an uncorrectable error occurring in the volatile storage medium according to the work information set of the volatile storage medium and a prediction model, so as to obtain the risk assessment result.
  • the computing device may directly evaluate the risk of an uncorrectable error occurring in the volatile storage medium according to the work information set of the volatile storage medium.
  • the computing device may determine the risk assessment result of an uncorrectable error occurring in the volatile storage medium according to any one or more pieces of information included in each piece of work information in the work information set.
  • If the addresses of the erroneous data of the correctable errors all belong to the same storage matrix, this may indicate that the probability of an uncorrectable error occurring in the volatile storage medium is low, that is, the risk of an uncorrectable error is low.
  • If the addresses of the erroneous data of the correctable errors all belong to the same storage row, this may indicate that the probability of an uncorrectable error occurring in the volatile storage medium is low, that is, the risk of an uncorrectable error is low.
  • If the addresses of the erroneous data of the correctable errors all belong to the same storage column, this may indicate that the probability of an uncorrectable error occurring in the volatile storage medium is low, that is, the risk of an uncorrectable error is low.
  • If the addresses of the erroneous data of the correctable errors all belong to the same storage matrix, and each piece of erroneous data belongs to the same storage row in that storage matrix, this may indicate that the probability of an uncorrectable error occurring in the volatile storage medium is low, that is, the risk of an uncorrectable error is low.
  • If the addresses of the erroneous data of the correctable errors all belong to the same storage matrix, and each piece of erroneous data belongs to the same storage column in that storage matrix, this may indicate that the probability of an uncorrectable error occurring in the volatile storage medium is low, that is, the risk of an uncorrectable error is low.
  • the volatile storage medium has a low probability of uncorrectable errors, that is, the volatile storage medium has a low risk of uncorrectable errors.
  • In case one, each piece of erroneous data belongs to the same DQ, to the same storage matrix in that DQ, and to the same storage column and the same storage row in that storage matrix. That is to say, the risk of an uncorrectable error occurring in the volatile storage medium is then relatively low.
  • each erroneous data belongs to a different DQ
  • each erroneous data belongs to a different storage matrix in the corresponding DQ
  • each erroneous data belongs to a different storage column or a different storage row in the corresponding storage matrix. That is to say, at this time, the risk of an uncorrectable error occurring on the volatile storage medium is relatively high.
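  • The address-locality test described above can be sketched as follows. This is only an illustration: the `Address` tuple and both function names are hypothetical, and the sketch merely reports whether all erroneous data share one location and which address fields vary. How the varying fields map to a risk level is left to the rules in the text.

```python
from typing import Iterable, List, NamedTuple


class Address(NamedTuple):
    dq: int
    bank: int
    row: int
    column: int


def all_same_location(addresses: Iterable[Address]) -> bool:
    """True when every error shares the same DQ, bank, row and column."""
    return len(set(addresses)) == 1


def varying_fields(addresses: Iterable[Address]) -> List[str]:
    """Names of the address fields that take more than one value."""
    items = list(addresses)
    return [name for i, name in enumerate(Address._fields)
            if len({a[i] for a in items}) > 1]


clustered = [Address(0, 1, 2, 3)] * 3
scattered = [Address(0, 1, 2, 3), Address(1, 1, 2, 3), Address(0, 2, 2, 3)]
```

  • For `clustered`, all errors share one location; for `scattered`, the DQ and bank identifiers vary while the row and column do not.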
  • If the time when a correctable error occurs in the volatile storage medium exceeds the preset time range, this may indicate that the probability of an uncorrectable error occurring in the volatile storage medium is low, that is, the risk of an uncorrectable error is low.
  • If the time when a correctable error occurs in the volatile storage medium is within the preset time range, this may indicate that the probability of an uncorrectable error occurring in the volatile storage medium is high, that is, the risk of an uncorrectable error is high.
  • the computing device may determine the number of correctable errors in the volatile storage medium according to the number of pieces of work information included in the work information set.
  • the first preset threshold may be a positive integer greater than or equal to 10 and less than or equal to 40.
  • the first preset threshold may be 20, 25, 30 and so on.
  • the second preset threshold may be a positive integer greater than 70 and less than or equal to 100.
  • the second preset threshold may be 80, 85, 90 and so on.
  • If the first preset threshold or the second preset threshold is set too large, an uncorrectable error may already have occurred in the volatile storage medium before the risk assessment result is determined, which lowers the accuracy of the risk assessment result. If the first preset threshold or the second preset threshold is set too small, the risk assessment result may be determined as medium risk or high risk while the probability of an uncorrectable error is still low, which likewise lowers the accuracy of the risk assessment result.
  • If the number of correctable errors occurring in the volatile storage medium is higher than the second preset threshold, this may indicate that the probability of an uncorrectable error occurring in the volatile storage medium is relatively high. That is to say, the risk of an uncorrectable error occurring in the volatile storage medium is then relatively high.
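  • A threshold check on the number of correctable errors might look like the sketch below. The default values 20 and 80 are taken from the example values given for the first and second preset thresholds; treating counts between the two thresholds as "medium" is an assumption, and the function name is hypothetical.

```python
def risk_from_error_count(count: int,
                          first_threshold: int = 20,
                          second_threshold: int = 80) -> str:
    """Classify the risk of an uncorrectable error from the correctable-error count."""
    if first_threshold >= second_threshold:
        raise ValueError("first_threshold must be smaller than second_threshold")
    if count < first_threshold:
        return "low"
    if count > second_threshold:
        return "high"
    # Assumption: counts between the two thresholds are treated as medium risk.
    return "medium"
```

  • Choosing the thresholds is the trade-off discussed above: values that are too large delay the warning, while values that are too small raise it too early.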
  • the third preset threshold may be a positive integer greater than 700 and less than or equal to 1000.
  • the third preset threshold may be 800, 850, 900 and so on.
  • the fourth preset threshold may be a positive integer greater than 100 and less than or equal to 400.
  • the fourth preset threshold may be 200, 250, 300 and so on.
  • If the third preset threshold or the fourth preset threshold is set too large, an uncorrectable error may already have occurred in the volatile storage medium before the risk assessment result is determined, which lowers the accuracy of the risk assessment result. If the third preset threshold or the fourth preset threshold is set too small, the risk assessment result may be determined as medium risk or high risk while the probability of an uncorrectable error is still low, which likewise lowers the accuracy of the risk assessment result.
  • If the number of correctable errors in the volatile storage medium is lower than the first preset threshold, and the time when the correctable errors occur exceeds the preset time range, this may indicate that the probability of an uncorrectable error occurring in the volatile storage medium is low. That is to say, the risk of an uncorrectable error is then relatively low.
  • If the number of correctable errors in the volatile storage medium is higher than the second preset threshold, and the time when the correctable errors occur is within the preset time range, this may indicate that the probability of an uncorrectable error occurring in the volatile storage medium is high. That is to say, the risk of an uncorrectable error is then relatively high.
  • If the number of correctable errors in the volatile storage medium is lower than the first preset threshold, and the addresses of the erroneous data belong to the same storage matrix, this may indicate that the probability of an uncorrectable error occurring in the volatile storage medium is low. That is to say, the risk of an uncorrectable error is then relatively low.
  • If the number of correctable errors in the volatile storage medium is higher than the second preset threshold, and the addresses of the erroneous data belong to different storage matrices, this may indicate that the probability of an uncorrectable error occurring in the volatile storage medium is high. That is to say, the risk of an uncorrectable error is then relatively high.
  • If the number of correctable errors in the volatile storage medium is lower than the first preset threshold, and the address of the erroneous data in the volatile storage medium is case one, this may indicate that the probability of an uncorrectable error occurring in the volatile storage medium is low. That is to say, the risk of an uncorrectable error is then relatively low.
  • the volatile storage medium has a higher probability of uncorrectable errors. That is to say, the risk of an uncorrectable error occurring in the volatile storage medium is then relatively high.
  • If the number of correctable errors in the volatile storage medium is lower than the first preset threshold, the total number of accesses to the volatile storage medium is higher than the third preset threshold, and the address of the erroneous data in the volatile storage medium is case one, this may indicate that the probability of an uncorrectable error occurring in the volatile storage medium is relatively low. That is to say, the risk of an uncorrectable error is then relatively low.
  • If the number of correctable errors in the volatile storage medium is higher than the second preset threshold, the total number of accesses to the volatile storage medium is lower than the fourth preset threshold, and the address of the erroneous data in the volatile storage medium is case two, this may indicate that the probability of an uncorrectable error occurring in the volatile storage medium is relatively high. That is to say, the risk of an uncorrectable error is then relatively high.
  • If the time when the correctable errors occur in the volatile storage medium exceeds the preset time range, and the address of the erroneous data in the volatile storage medium is case one, this may indicate that the probability of an uncorrectable error occurring in the volatile storage medium is relatively low. That is to say, the risk of an uncorrectable error is then relatively low.
  • If the time when the correctable errors occur in the volatile storage medium is within the preset time range, and the address of the erroneous data in the volatile storage medium is case two, this may indicate that the probability of an uncorrectable error occurring in the volatile storage medium is relatively high. That is to say, the risk of an uncorrectable error is then relatively high.
  • If the number of correctable errors in the volatile storage medium is lower than the first preset threshold, the total number of accesses to the volatile storage medium is higher than the third preset threshold, and the time when the correctable errors occur exceeds the preset time range, this may indicate that the probability of an uncorrectable error occurring in the volatile storage medium is low. That is to say, the risk of an uncorrectable error is then relatively low.
  • If the number of correctable errors in the volatile storage medium is higher than the second preset threshold, the total number of accesses to the volatile storage medium is lower than the fourth preset threshold, and the time when the correctable errors occur is within the preset time range, this may indicate that the probability of an uncorrectable error occurring in the volatile storage medium is relatively high. That is to say, the risk of an uncorrectable error is then relatively high.
  • If the number of correctable errors in the volatile storage medium is lower than the first preset threshold, the total number of accesses to the volatile storage medium is higher than the third preset threshold, the time when the correctable errors occur exceeds the preset time range, and the address of the erroneous data in the volatile storage medium is case one, this may indicate that the probability of an uncorrectable error occurring in the volatile storage medium is low. That is to say, the risk of an uncorrectable error is then relatively low.
  • If the number of correctable errors in the volatile storage medium is higher than the second preset threshold, the total number of accesses to the volatile storage medium is lower than the fourth preset threshold, the time when the correctable errors occur is within the preset time range, and the address of the erroneous data in the volatile storage medium is case two, this may indicate that the probability of an uncorrectable error occurring in the volatile storage medium is relatively high. That is to say, the risk of an uncorrectable error is then relatively high.
  • the actual value or value range of any one or more of the first preset threshold, the second preset threshold, the third preset threshold, or the fourth preset threshold depends on the specific volatile storage medium.
  • the actual value or value range of each preset threshold may be the same or different, which is not limited by this embodiment of the present application.
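  • The four-signal rules above (error count, total accesses, error timing, and address case) can be combined as in the sketch below. The threshold defaults reuse the example values from the text; the names, the integer encoding of the address case, and the "undetermined" result for combinations the text does not cover are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    first: int = 20     # error count below this points to low risk
    second: int = 80    # error count above this points to high risk
    third: int = 800    # total accesses above this points to low risk
    fourth: int = 200   # total accesses below this points to high risk


def combined_risk(error_count: int,
                  total_accesses: int,
                  within_time_range: bool,
                  address_case: int,
                  t: Thresholds = Thresholds()) -> str:
    """Apply the all-conditions-agree rules; other combinations are left open."""
    low = (error_count < t.first and total_accesses > t.third
           and not within_time_range and address_case == 1)
    high = (error_count > t.second and total_accesses < t.fourth
            and within_time_range and address_case == 2)
    if low:
        return "low"
    if high:
        return "high"
    return "undetermined"  # assumption: the text gives no rule for mixed signals
```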
  • the prediction model may be a mapping relationship between the work information set of the volatile storage medium and the risk assessment result.
  • the prediction model may be a model obtained through machine learning training according to the training data set.
  • the training data set may include a work information set of a volatile storage medium, a risk assessment result, and a mapping relationship between the work information set and the risk assessment result.
  • the training data set may also include the cause of the failure, the mapping relationship between the work information set and the cause of the failure, and the mapping relationship between the cause of the failure and the risk assessment result.
  • the computing device may obtain a trained prediction model.
  • the computing device may obtain a training data set, and train the model according to the training data set, so as to obtain a trained prediction model.
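  • As a toy illustration of a prediction model that is a mapping learned from a training data set, the sketch below builds a lookup table by majority vote. It is a stand-in chosen for brevity, not the training method of the source, which does not specify one; the function name and the feature labels are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Iterable, Tuple


def train_lookup_model(
    training_set: Iterable[Tuple[Hashable, str]]
) -> Dict[Hashable, str]:
    """Map each observed feature key to its most frequent risk assessment result."""
    votes = defaultdict(Counter)
    for features, result in training_set:
        votes[features][result] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in votes.items()}


# Hypothetical training pairs: (summarized work information, risk assessment result).
samples = [
    (("few_errors", "same_location"), "low"),
    (("few_errors", "same_location"), "low"),
    (("few_errors", "same_location"), "medium"),
    (("many_errors", "scattered"), "high"),
]
model = train_lookup_model(samples)
```

  • A practical model would generalize to unseen inputs; this table only returns results for feature keys present in the training data.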
  • the prediction model may include a first prediction model and a second prediction model.
  • the computing device can determine the cause of the failure of the volatile storage medium according to the work information set and the first prediction model.
  • the computing device may also determine a risk assessment result of an uncorrectable error occurring in the volatile storage medium according to the failure cause and the second prediction model. For a specific manner, refer to the description of FIG. 3 .
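  • The two-stage arrangement, in which a first prediction model yields the failure cause and a second prediction model yields the risk assessment result, can be sketched as a simple composition. Both toy models below are hypothetical placeholders used only to make the example runnable.

```python
from typing import Callable, Set, TypeVar

W = TypeVar("W")


def predict_risk(work_info: W,
                 first_model: Callable[[W], Set[str]],
                 second_model: Callable[[Set[str]], str]) -> str:
    """Stage 1: work information -> failure causes. Stage 2: causes -> risk."""
    causes = first_model(work_info)
    return second_model(causes)


# Toy stand-ins; real models would be trained as described in the text.
def toy_first_model(error_count: int) -> Set[str]:
    return {"SA failure"} if error_count > 80 else {"capacitor leakage"}


def toy_second_model(causes: Set[str]) -> str:
    return "high" if "SA failure" in causes else "low"
```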
  • the computing device may determine the error characteristic set of the volatile storage medium according to the working information set of the volatile storage medium, the number of correctable errors occurred, and the statistical period of the working information set.
  • the error feature set of the volatile storage medium may include any one or more of the following information: the error rate, the number of correctable errors occurring per unit time, or the distribution of correctable errors across storage units in the volatile storage medium.
  • the computing device can determine the cause of the failure of the volatile storage medium according to the first predictive model and the set of error features of the volatile storage medium.
  • the computing device may also determine a risk assessment result of an uncorrectable error occurring in the volatile storage medium according to the failure cause of the volatile storage medium and the second prediction model. For a specific manner, refer to the description of FIG. 4 .
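  • Deriving an error feature set from the work information set might look like the sketch below. The source does not define the features precisely, so three assumptions are made here: the error rate is the error count divided by the total number of accesses, the unit of time is one second, and the distribution is counted per storage matrix (bank).

```python
from collections import Counter
from typing import Dict, Sequence


def error_features(banks: Sequence[int],
                   total_accesses: int,
                   period_seconds: float) -> Dict[str, object]:
    """Compute illustrative error features from one statistical period."""
    if total_accesses <= 0 or period_seconds <= 0:
        raise ValueError("total_accesses and period_seconds must be positive")
    count = len(banks)  # one entry per correctable error
    return {
        "error_rate": count / total_accesses,
        "errors_per_second": count / period_seconds,
        "distribution": dict(Counter(banks)),
    }


features = error_features(banks=[0, 0, 1], total_accesses=300, period_seconds=60.0)
```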
  • the computing device may perform a logic operation on correctable error data included in each piece of work information and correct data corresponding to the error data, to obtain an operation result corresponding to each piece of work information.
  • the computing device can determine the risk assessment result based on the uncorrectable error model, the calculation result and the prediction model corresponding to each piece of work information. For a specific manner, reference may be made to the description of FIG. 5 .
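  • The logic operation on the erroneous data and the corresponding correct data is not specified in this passage. One natural choice, assumed here, is a bitwise XOR, whose result marks exactly the bit positions that differ. The function names are hypothetical.

```python
from typing import List


def flipped_bits(error_data: int, correct_data: int) -> int:
    """XOR of erroneous and correct data: a 1 marks each differing bit."""
    return error_data ^ correct_data


def flipped_positions(mask: int) -> List[int]:
    """Bit indices (least significant bit = 0) that are set in the mask."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


mask = flipped_bits(0b10110010, 0b10100010)
positions = flipped_positions(mask)
```

  • In this example a single bit differs between the two values, so `mask` has one bit set and `positions` contains only index 4.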
  • the computing device can determine the risk assessment result of an uncorrectable error occurring in the volatile storage medium according to the work information set and the prediction model of the volatile storage medium, thereby judging the health status of the volatile storage medium.
  • the computing device can guide the user to replace it according to the health state of the volatile storage medium, so as to avoid affecting the normal operation of the computing device or the volatile storage medium.
  • Fig. 3 is a schematic flowchart of a method for predicting uncorrectable errors of a volatile storage medium, and the method in Fig. 3 includes the following steps.
  • the computing device may determine the cause of the failure of the volatile storage medium according to the work information set of the volatile storage medium obtained in S210 and the first prediction model.
  • the computing device may directly determine the cause of the failure of the volatile storage medium according to the work information set of the volatile storage medium and the first prediction model.
  • the computing device may determine the cause of the failure of the volatile storage medium according to any one or more pieces of information included in each piece of work information in the work information set.
  • the cause of the failure of the volatile storage medium may include any one or more of the following: capacitor leakage, word line (WL) failure, sub-word line driver (SWD) failure, main word line driver (MWD) failure, bit line (BL) failure, sense amplifier (SA) failure, storage matrix (bank) control circuit failure, poor contact, insufficient signal margin (margin), and the like.
  • the failure causes of the volatile storage medium include SWD failure, SA failure, MWD failure, WL failure, BL failure, or capacitor leakage.
  • the cause of the failure of the volatile storage medium includes bank control circuit failure, poor contact, or insufficient margin.
  • each erroneous data belongs to the same DQ, and each erroneous data belongs to the same storage matrix in the corresponding DQ, and belongs to the same storage column and the same storage row in the corresponding storage matrix.
  • each erroneous data belongs to a different DQ
  • each erroneous data belongs to a different storage matrix in the corresponding DQ
  • each erroneous data belongs to a different storage column or a different storage row in the corresponding storage matrix. That is to say, at this time, the risk of an uncorrectable error occurring on the volatile storage medium is relatively high.
  • If the time when the correctable error occurs in the volatile storage medium exceeds the preset time range, it can be determined that the cause of the failure of the volatile storage medium includes WL failure, BL failure, capacitor leakage, poor contact, or insufficient margin.
  • the cause of the failure of the volatile storage medium includes SWD failure, SA failure, MWD failure, or bank control circuit failure.
  • the cause of the failure of the volatile storage medium at this time includes WL failure, BL failure, capacitor leakage, poor contact, or insufficient margin.
  • the cause of the failure of the volatile storage medium at this time includes SWD failure, SA failure, MWD failure, bank control circuit failure, poor contact, or insufficient margin.
  • the failure causes of the volatile storage medium include WL failure, BL failure, capacitor leakage, or poor contact.
  • For example, if the number of correctable errors that occur in the volatile storage medium is higher than the second preset threshold, and the total number of accesses to the volatile storage medium is lower than the fourth preset threshold, it can be determined that the cause of the failure of the volatile storage medium includes SWD failure, SA failure, MWD failure, or bank control circuit failure.
  • the failure causes of the volatile storage medium include capacitor leakage, poor contact, or insufficient margin.
  • the failure causes include SWD failure, SA failure, MWD failure, or bank control circuit failure.
  • For example, if the number of correctable errors that occur in the volatile storage medium is lower than the first preset threshold, and the address of the erroneous data in the volatile storage medium is case one, it can be determined that the cause of the failure of the volatile storage medium includes capacitor leakage or poor contact.
  • For example, if the number of correctable errors in the volatile storage medium is higher than the second preset threshold, and the address of the erroneous data in the volatile storage medium is case two, it can be determined that the cause of the failure of the volatile storage medium includes SA failure, MWD failure, or bank control circuit failure.
  • If the number of correctable errors in the volatile storage medium is lower than the first preset threshold, the total number of accesses to the volatile storage medium is higher than the third preset threshold, and the address of the erroneous data in the volatile storage medium is case one, it can be determined that the cause of the failure of the volatile storage medium includes capacitor leakage or poor contact.
  • If the number of correctable errors in the volatile storage medium is higher than the second preset threshold, the total number of accesses to the volatile storage medium is lower than the fourth preset threshold, and the address of the erroneous data in the volatile storage medium is case two, it can be determined that the cause of the failure of the volatile storage medium includes SA failure, MWD failure, or bank control circuit failure.
  • If the time when the correctable errors occur in the volatile storage medium exceeds the preset time range, and the address of the erroneous data in the volatile storage medium is case one, it can be determined that the cause of the failure of the volatile storage medium includes capacitor leakage or poor contact.
  • If the time when the correctable errors occur in the volatile storage medium is within the preset time range, and the address of the erroneous data in the volatile storage medium is case two, it can be determined that the cause of the failure of the volatile storage medium includes SA failure, MWD failure, or bank control circuit failure.
  • For example, if the number of correctable errors in the volatile storage medium is lower than the first preset threshold, the total number of accesses to the volatile storage medium is higher than the third preset threshold, and the time when the correctable errors occur exceeds the preset time range, it can be determined that the cause of the failure of the volatile storage medium includes capacitor leakage.
  • the volatile storage medium has correctable errors. If the error time is within the preset time range, it can be determined that the cause of the failure of the volatile storage medium includes SWD failure, SA failure, MWD failure, or bank control circuit failure.
  • the cause of the failure of the volatile storage medium includes capacitor leakage.
  • the first prediction model may be a mapping relationship between work information sets of volatile storage media and failure causes.
  • the first prediction model may be a model obtained through machine learning training according to the training data set.
  • the training data set may include a work information set of the volatile storage medium, a fault cause, and a mapping relationship between the work information set and the fault cause.
  • the computing device may obtain the trained first prediction model.
  • the computing device may obtain a training data set, and train the model according to the training data set, so as to obtain the trained first prediction model.
  • the computing device may determine the error characteristic set of the volatile storage medium according to the working information set of the volatile storage medium, the number of correctable errors occurred, and the statistical period of the working information set.
  • the set of error characteristics may include any one or more of the following information: error rate, number of correctable errors occurring per unit time, or distribution of correctable errors in storage units in the volatile storage medium.
  • the computing device may also determine the cause of the failure of the volatile storage medium according to the first prediction model and the set of error characteristics of the volatile storage medium. For a specific manner, refer to the description of FIG. 4 .
  • the computing device can judge the severity of the failure of the volatile storage medium according to the cause of the failure of the volatile storage medium and the second prediction model, so as to determine the risk assessment result of the uncorrectable error occurring in the volatile storage medium.
  • the second prediction model may be a mapping relationship between failure causes and risk assessment results.
  • the second predictive model may be a model obtained through machine learning training according to the training data set.
  • the training data set may include failure causes, risk assessment results, and a mapping relationship between failure causes and risk assessment results.
  • the computing device may obtain a trained second prediction model.
  • the computing device may obtain a training data set, and train the model according to the training data set, so as to obtain a trained second prediction model.
  • If the cause of the failure of the volatile storage medium includes capacitor leakage, this may indicate that the current failure of the volatile storage medium is relatively minor, and the probability of an uncorrectable error occurring in the volatile storage medium is relatively low. That is to say, the risk of an uncorrectable error is then relatively low.
  • If the cause of the failure of the volatile storage medium includes any one or more of WL failure, BL failure, poor contact, or insufficient margin, this may indicate that the severity of the current failure of the volatile storage medium is moderate, and the probability of an uncorrectable error occurring in the volatile storage medium is moderate. That is to say, the risk of an uncorrectable error is then moderate.
  • If the cause of the failure of the volatile storage medium includes any one or more of SWD failure, SA failure, MWD failure, or bank control circuit failure, this may indicate that the current failure of the volatile storage medium is relatively serious, and the probability of an uncorrectable error occurring in the volatile storage medium is relatively high. That is to say, the risk of an uncorrectable error is then relatively high.
  • the computing device may determine the risk assessment result of an uncorrectable error occurring in the volatile storage medium according to the failure cause of the volatile storage medium and the risk assessment table.
  • the risk assessment table is shown in Table 1.
  • Table 1 is used to indicate the correspondence between each fault cause and the risk assessment result.
  • If the computing device determines that there are multiple failure causes of the volatile storage medium, it may determine the risk assessment result of an uncorrectable error occurring in the volatile storage medium according to the highest-level risk assessment result among the risk assessment results corresponding to the failure causes.
  • For example, if the failure causes of the volatile storage medium include capacitor leakage, poor contact, and bank control circuit failure, the highest-level risk assessment result among those corresponding to the failure causes is high risk, so the probability of an uncorrectable error occurring in the volatile storage medium is high. That is, the risk of an uncorrectable error occurring on the volatile storage medium is relatively high.
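  • The highest-level rule above can be sketched as follows. This is a minimal illustration, not the patented implementation: Table 1 is not reproduced in this text, so the mapping below is transcribed from the severity grouping described earlier, and the names `Risk`, `RISK_BY_CAUSE`, and `assess_risk` are hypothetical.

```python
from enum import IntEnum


class Risk(IntEnum):
    """Risk levels ordered so that max() picks the most severe one."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Assumed stand-in for Table 1, based on the severity grouping in the text.
RISK_BY_CAUSE = {
    "capacitor leakage": Risk.LOW,
    "WL failure": Risk.MEDIUM,
    "BL failure": Risk.MEDIUM,
    "poor contact": Risk.MEDIUM,
    "insufficient margin": Risk.MEDIUM,
    "SWD failure": Risk.HIGH,
    "SA failure": Risk.HIGH,
    "MWD failure": Risk.HIGH,
    "bank control circuit failure": Risk.HIGH,
}


def assess_risk(causes):
    """Return the highest risk level among the given failure causes."""
    if not causes:
        raise ValueError("at least one failure cause is required")
    return max(RISK_BY_CAUSE[cause] for cause in causes)


result = assess_risk(
    ["capacitor leakage", "poor contact", "bank control circuit failure"]
)
print(result.name)
```

  • Using an ordered enumeration keeps the comparison between levels explicit; a trained second prediction model could replace the lookup table without changing the caller.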
  • the risk assessment result of an uncorrectable error occurring in the volatile storage medium may be determined according to the occurrence probability of the failure cause.
  • If the failure cause of the volatile storage medium includes capacitor leakage or poor contact, and the probability of capacitor leakage is high, it can be determined that the probability of an uncorrectable error occurring in the volatile storage medium is low. That is, the risk of an uncorrectable error occurring on the volatile storage medium is relatively low.
  • the volatile storage medium can be determined to have a moderate probability of an uncorrectable error. That is, the risk of an uncorrectable error occurring on the volatile storage medium is moderate.
  • If the causes of the failure of the volatile storage medium include SWD failure, SA failure, MWD failure, bank control circuit failure, poor contact, and insufficient margin, and the probability of occurrence of the more serious failures is relatively high, it can be determined that the volatile storage medium has a higher probability of an uncorrectable error. That is, the risk of an uncorrectable error occurring on the volatile storage medium is relatively high.
  • When the computing device determines that there are multiple failure causes of the volatile storage medium, and the risk assessment results corresponding to the failure causes are the same, the risk assessment result of an uncorrectable error occurring in the volatile storage medium may be determined to be a higher-level risk assessment result.
  • For example, if the computing device determines that the cause of the failure of the volatile storage medium includes a WL failure and a BL failure, it may determine that the risk assessment result of an uncorrectable error occurring in the volatile storage medium is high risk.
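  • The escalation rule for several failure causes with the same risk level can be sketched as follows. The function name `combine_same_level` and the string labels are hypothetical. Two behaviours are assumptions not stated in the text: a result that is already "high" stays "high", and causes with differing levels fall back to the highest level.

```python
LEVELS = ["low", "medium", "high"]


def combine_same_level(results):
    """Combine per-cause risk results into one overall result.

    If there are several causes and all share one level, escalate by one
    level (capped at "high", an assumption). Otherwise return the highest
    level present (also an assumption, borrowed from the other embodiment).
    """
    if not results:
        raise ValueError("at least one risk result is required")
    if len(results) > 1 and len(set(results)) == 1:
        index = LEVELS.index(results[0])
        return LEVELS[min(index + 1, len(LEVELS) - 1)]
    return max(results, key=LEVELS.index)


# WL failure and BL failure are both described as medium severity.
print(combine_same_level(["medium", "medium"]))
```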
  • If the risk of an uncorrectable error occurring in the volatile storage medium is high, the health status of the volatile storage medium is poor and the medium needs to be replaced.
  • If the risk of an uncorrectable error occurring on the volatile storage medium is low, the volatile storage medium is in good health and does not need to be replaced.
  • the computing device can determine the cause of the failure of the volatile storage medium according to the work information set of the volatile storage medium and the first prediction model. And the computing device can determine the risk assessment result of the uncorrectable error occurring in the volatile storage medium according to the failure cause of the volatile storage medium and the second prediction model. The computing device can judge the health status of the volatile storage medium according to the risk assessment result of uncorrectable errors in the volatile storage medium, so as to guide the user to replace it, so as to avoid affecting the normal operation of the computing device or the volatile storage medium.
  • Fig. 4 is a schematic flowchart of a method for predicting uncorrectable errors of a volatile storage medium, and the method in Fig. 4 includes the following steps.
  • S410 Determine an error signature set of the volatile storage medium according to the work information set, the number of correctable errors that occur, and the duration of the statistics period.
  • the computing device may determine the number of correctable errors that occur in the volatile storage medium according to the number of pieces of work information included in the work information set of the volatile storage medium obtained in S210.
  • each piece of work information in the work information set of the volatile storage medium includes the address, in the volatile storage medium, of the error data of a correctable error, and the work information set further includes the total
  • the computing device can determine the set of error characteristics of the volatile storage medium according to the set of work information, the number of correctable errors and the length of the statistical period.
  • the statistical period is the statistical period of the work information collection.
  • the set of error characteristics may include any one or more of the following information: error rate, number of correctable errors occurring per unit time, or distribution of correctable errors in storage units in the volatile storage medium.
  • the computing device may obtain the error feature set of the volatile storage medium continuously in real time according to the work information set of the volatile storage medium, or may obtain the error feature set of the volatile storage medium periodically.
  • the computing device may obtain the error characteristic set of the volatile storage medium after the nth correctable error occurs in the volatile storage medium, where n is a preset threshold.
  • the computing device may obtain the error feature set of the volatile storage medium after receiving the obtaining instruction, which is not limited in this embodiment of the present application.
  • the computing device may determine the distribution, among the storage units in the volatile storage medium, of the correctable errors that occurred within the statistical period according to the address of the error data in the volatile storage medium included in each piece of work information.
  • the storage unit may include any one or more of the following: a storage matrix, a storage row, a storage column, a storage block, or DQ. That is, the distribution may indicate whether any one or more of the following are the same across the addresses of the correctable-error data: the identifier of the storage matrix, the identifier of the storage row, the identifier of the storage column, the identifier of the storage block, or the identifier of the DQ to which each address belongs.
  • If the distribution of correctable errors in the volatile storage medium is distribution 1 in Table 2, it may indicate that only one error has occurred in the volatile storage medium.
  • Alternatively, distribution 1 may indicate that multiple correctable errors have occurred in the volatile storage medium, that the error data of the multiple correctable errors is distributed in only one DQ, and that the identifier of the rank to which each error data belongs, the identifier of the bank to which each error data belongs in the rank, and the identifiers of the row and the column to which each error data belongs in that bank are the same.
  • If the distribution of correctable errors in the volatile storage medium is distribution 10 in Table 2, it may indicate that multiple correctable errors have occurred in the volatile storage medium.
  • the error data of the multiple correctable errors is distributed in multiple DQs, the identifier of the rank to which each error data belongs is the same, the identifier of the bank to which each error data belongs in the rank is different, and the identifier of the row or the identifier of the column to which each error data belongs in its bank is different.
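  • The distribution check described above amounts to asking, for each kind of storage unit, whether all error addresses share the same identifier. The sketch below illustrates that idea only. The `ErrorAddress` layout and the function `shared_identifiers` are hypothetical, and because Table 2 is not reproduced in this text, no mapping to numbered distributions is attempted.

```python
from typing import NamedTuple


class ErrorAddress(NamedTuple):
    """Decoded location of one correctable error (assumed field layout)."""
    rank: int
    bank: int
    row: int
    column: int
    dq: int


def shared_identifiers(addresses):
    """Report, per storage unit, whether all addresses share one identifier."""
    if not addresses:
        raise ValueError("at least one error address is required")
    return {
        field: len({getattr(address, field) for address in addresses}) == 1
        for field in ErrorAddress._fields
    }


# Two errors in the same rank but in different banks, rows, columns and DQs.
errors = [
    ErrorAddress(rank=0, bank=1, row=10, column=4, dq=2),
    ErrorAddress(rank=0, bank=3, row=77, column=9, dq=5),
]
print(shared_identifiers(errors))
```

  • The resulting flags could then be matched against the rows of a distribution table to select a distribution number.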
  • the computing device may determine the error rate of the volatile storage medium according to the total number of accesses to the volatile storage medium and the number of correctable errors that occur in the volatile storage medium.
  • the error rate of the volatile storage medium may include the error rate of one or more storage matrices.
  • the error rate of each storage matrix may be the ratio of the number of correctable errors occurring in each storage matrix to the total number of access times of each storage matrix.
  • the computing device may determine the number of correctable errors that occur on the volatile storage medium per unit time according to the number of correctable errors that occur on the volatile storage medium and the duration of the statistics period of the collection of work information.
  • the computing device may obtain the number of correctable errors that occur on the volatile storage medium within the first time range, so as to determine the number of correctable errors that occur on the volatile storage medium per unit time.
  • the first time range may be the difference between the time at which correctable errors begin to occur in the volatile storage medium and the time at which correctable errors stop occurring in the volatile storage medium.
  • the first time range may be a statistical period.
  • the computing device may obtain the number of correctable errors that occur on the volatile storage medium within the second time range, so as to determine the number of correctable errors that occur on the volatile storage medium per unit time.
  • the second time range may be a difference between the time when the first error occurs and the time when the second error occurs in the volatile storage medium.
  • the first error and the second error are any two correctable errors, among the correctable errors that occur on the volatile storage medium, that do not occur at the same time, and the time at which the first error occurs on the volatile storage medium is earlier than the time at which the second error occurs on the volatile storage medium.
  • the number of correctable errors that occur on the volatile storage medium per unit time may include the number of correctable errors that occur on one or more storage matrices per unit time.
  • the number of correctable errors that occur in each storage matrix per unit time can be the ratio of the number of correctable errors that occur in each storage matrix to the time range.
  • the time range may be the first time range or the second time range, which is not limited in this embodiment of the present application.
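  • The two numeric error characteristics described above are simple ratios. The sketch below shows them under stated assumptions: the function names are hypothetical, and the time unit (seconds here) is an assumption, since the text does not fix one.

```python
def error_rate(correctable_errors, total_accesses):
    """Ratio of correctable errors to the total number of accesses."""
    if total_accesses <= 0:
        raise ValueError("total_accesses must be positive")
    return correctable_errors / total_accesses


def errors_per_unit_time(correctable_errors, time_range):
    """Correctable errors divided by the time range.

    time_range may be the statistical period or the interval between two
    chosen errors; the unit is whatever unit the caller uses.
    """
    if time_range <= 0:
        raise ValueError("time_range must be positive")
    return correctable_errors / time_range


# Whole-medium values.
print(error_rate(5, 1000))             # 5 errors in 1000 accesses
print(errors_per_unit_time(30, 60.0))  # 30 errors in a 60-second period

# Per-storage-matrix values: matrix id -> (error count, access count).
per_matrix = {0: (2, 400), 1: (3, 600)}
per_matrix_rate = {
    matrix: error_rate(errors, accesses)
    for matrix, (errors, accesses) in per_matrix.items()
}
print(per_matrix_rate)
```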
  • the computing device can determine the cause of the failure of the volatile storage medium according to the error feature set and the first prediction model of the volatile storage medium obtained in step S410.
  • the computing device may directly determine the failure cause of the volatile storage medium according to the error feature set of the volatile storage medium and the first prediction model.
  • the set of error characteristics may include any one or more of the following information: error rate, number of correctable errors occurring per unit time, or distribution of correctable errors in storage units in the volatile storage medium.
  • If the error rate of the volatile storage medium is lower than the fifth preset threshold, it may indicate that the error rate of the volatile storage medium is relatively low. If the error rate of the volatile storage medium is higher than the fifth preset threshold and lower than the sixth preset threshold, it may indicate that the error rate of the volatile storage medium is moderate. If the error rate of the volatile storage medium is higher than the sixth preset threshold, it may indicate that the error rate of the volatile storage medium is relatively high.
  • the fifth preset threshold may be a positive number greater than or equal to 0 and less than 0.2. For example, the fifth preset threshold may be 0.01, 0.1, 0.15 and so on.
  • the sixth preset threshold may be a positive number greater than or equal to 0.4 and less than or equal to 1. For example, the sixth preset threshold may be 0.5, 0.6, 0.7 and so on.
  • If the fifth preset threshold or the sixth preset threshold is set too large, an uncorrectable error may already have occurred on the volatile storage medium before the risk assessment result of an uncorrectable error occurring on the volatile storage medium is determined, which lowers the accuracy of the risk assessment result. If the fifth preset threshold or the sixth preset threshold is set too small, the risk assessment result may be determined to be medium risk or high risk even though the probability of an uncorrectable error occurring in the volatile storage medium is low, which likewise lowers the accuracy of the risk assessment result.
  • If the number of correctable errors that occur on the volatile storage medium per unit time is lower than the seventh preset threshold, it may indicate that the number of correctable errors that occur on the volatile storage medium per unit time is relatively low. If the number of correctable errors that occur on the volatile storage medium per unit time is higher than the seventh preset threshold and lower than the eighth preset threshold, it may indicate that the number of correctable errors that occur on the volatile storage medium per unit time is moderate. If the number of correctable errors that occur on the volatile storage medium per unit time is higher than the eighth preset threshold, it may indicate that the number of correctable errors that occur on the volatile storage medium per unit time is relatively high.
  • the seventh preset threshold may be a positive integer greater than 10 and less than or equal to 40.
  • the seventh preset threshold may be 15, 20, 25 and so on.
  • the eighth preset threshold may be a positive integer greater than 70 and less than or equal to 100.
  • For example, the eighth preset threshold may be 75, 80, 85 and so on. If the seventh preset threshold or the eighth preset threshold is set too large, an uncorrectable error may already have occurred on the volatile storage medium before the risk assessment result of an uncorrectable error occurring on the volatile storage medium is determined, which lowers the accuracy of the risk assessment result.
  • If the seventh preset threshold or the eighth preset threshold is set too small, the risk assessment result may be determined to be medium risk or high risk even though the probability of an uncorrectable error occurring in the volatile storage medium is low, which likewise lowers the accuracy of the risk assessment result.
  • the actual value or value range of any one or more of the fifth preset threshold, the sixth preset threshold, the seventh preset threshold, or the eighth preset threshold depends on the specific volatile storage medium.
  • the actual value or value range of each preset threshold may be the same or different, which is not limited by this embodiment of the present application.
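  • The two-threshold grading described above can be sketched as a single helper that serves both the error rate (fifth and sixth preset thresholds) and the number of correctable errors per unit time (seventh and eighth preset thresholds). The function name `grade` is hypothetical. The text does not say how a value exactly equal to a threshold is graded, so the boundary handling below is an assumption.

```python
def grade(value, lower_threshold, upper_threshold):
    """Grade a metric as "low", "moderate" or "high" using two thresholds.

    Assumption: a value equal to lower_threshold counts as "moderate" and
    a value equal to upper_threshold counts as "high".
    """
    if lower_threshold >= upper_threshold:
        raise ValueError("lower_threshold must be below upper_threshold")
    if value < lower_threshold:
        return "low"
    if value < upper_threshold:
        return "moderate"
    return "high"


# Error rate with example thresholds taken from the text (0.1 and 0.5).
print(grade(0.05, 0.1, 0.5))
# Errors per unit time with example thresholds from the text (20 and 80).
print(grade(50, 20, 80))
```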
  • If the error rate of the volatile storage medium is low, it can be directly determined that the failure causes of the volatile storage medium include WL failure, BL failure, capacitor leakage, or insufficient margin.
  • If the error rate of the volatile storage medium is high, it can be directly determined that the failure causes of the volatile storage medium include SWD failure, SA failure, MWD failure, bank control circuit failure, or poor contact.
  • If the number of correctable errors that occur in the volatile storage medium per unit time is low, it can be directly determined that the failure causes of the volatile storage medium include WL failure, BL failure, capacitor leakage, poor contact, or insufficient margin.
  • the cause of the failure of the volatile storage medium includes SWD failure, SA failure, MWD failure, or bank control circuit failure.
  • If the distribution of correctable errors in the volatile storage medium is distribution 1 in Table 2, it can be directly determined that the failure causes of the volatile storage medium include WL failure or capacitor leakage.
  • If the distribution of correctable errors in the volatile storage medium is distribution 10 in Table 2, it can be directly determined that the failure causes of the volatile storage medium include SA failure, MWD failure, or bank control circuit failure.
  • it can be directly determined that the failure causes of the volatile storage medium include WL failure or capacitor leakage.
  • it can be directly determined that the failure causes of the volatile storage medium include SA failure, MWD failure, or bank control circuit failure.
  • the cause of failure of the volatile storage medium includes WL failure or capacitor leakage.
  • the cause of failure of the volatile storage medium includes SA failure, MWD failure, or bank control circuit failure.
  • If the error rate of the volatile storage medium is low, and the number of correctable errors that occur in the volatile storage medium per unit time is low, it can be directly determined that the failure causes of the volatile storage medium include capacitor leakage.
  • the cause of the failure of the volatile storage medium includes SWD failure, SA failure, MWD failure, or bank control circuit failure.
  • the computing device may determine the failure cause of the volatile storage medium from the failure cause table according to any one or more of the error rate of the volatile storage medium, the number of correctable errors that occur per unit time, or the distribution of correctable errors in the storage units in the volatile storage medium.
  • the fault cause of the volatile storage medium is capacitor leakage.
  • the fault cause of the volatile storage medium is SWD fault.
  • S430 Determine a risk assessment result according to the cause of the failure and the second prediction model.
  • the specific implementation manner of step S430 is similar to that of step S320 and will not be repeated here.
  • the computing device can determine the error feature set of the volatile storage medium according to the work information set of the volatile storage medium.
  • the computing device can determine the cause of the failure of the volatile storage medium according to the set of error characteristics of the volatile storage medium.
  • the computing device may also determine a risk assessment result of an uncorrectable error occurring in the volatile storage medium according to the failure cause of the volatile storage medium and the second prediction model. Therefore, the computing device can determine the health status of the volatile storage medium, thereby instructing the user to replace it, so as to avoid affecting the normal operation of the computing device or the volatile storage medium.
  • Fig. 5 is a schematic flowchart of a method for predicting uncorrectable errors of a volatile storage medium, and the method in Fig. 5 includes the following steps.
  • the computing device can perform logic operations on the correctable error data included in each piece of work information in the work information set and the correct data corresponding to the error data, and obtain the operation result of the error data and the correct data.
  • the logical operation may be any one of logical operations such as an exclusive OR operation, an AND operation, or an OR operation.
  • the computing device may obtain the work information set of the volatile storage medium.
  • Each piece of work information in the set of work information may include correctable error data.
  • the computing device can obtain correct data corresponding to the erroneous data according to the error correction algorithm of the volatile storage medium and the erroneous data.
  • each piece of work information in the set of work information may include correctable error data and correct data, and the error data corresponds to the correct data.
  • each error correction algorithm may have certain limitations, that is, for each error correction algorithm, there may be one or more data that cannot be corrected by the error correction algorithm.
  • data that cannot be corrected by the error correction algorithm can be used as an uncorrectable error model.
  • the error correction principle of each error correction algorithm is to perform an operation on the correct data and the erroneous data according to an operation rule, thereby realizing the error correction function. Therefore, the correct data and the erroneous data of a correctable error can be operated on according to a similar operation rule to obtain an operation result, and by comparing the similarity between the operation result and the uncorrectable error model, the risk assessment result of an uncorrectable error occurring in the volatile storage medium can be determined.
  • the computing device may obtain an uncorrectable error model of the volatile storage medium.
  • the uncorrectable error model is the data determined according to the error correction principle of the error correction algorithm of the volatile storage medium.
  • the computing device may compare the uncorrectable error model with the calculation result corresponding to each piece of work information to obtain the similarity corresponding to each piece of work information.
  • the computing device can also determine the risk assessment result according to the similarity and prediction model corresponding to each piece of work information.
  • the computing device can compare the operation result corresponding to each piece of work information with the uncorrectable error model bit by bit to obtain the number of data bits whose data is the same, and use that number of data bits as the similarity corresponding to each piece of work information.
  • the computing device may use the number of data bits in which the operation result corresponding to each piece of work information and the uncorrectable error model are both 1 as the similarity corresponding to each piece of work information.
  • the computing device may use the number of data bits in which the operation result corresponding to each piece of work information and the uncorrectable error model are both 0 as the similarity corresponding to each piece of work information.
  • For example, the error correction algorithm of the volatile storage medium is ECC, the uncorrectable error model of the ECC is 1101101111010000, and the operation rule of the error correction principle of ECC is the exclusive OR operation. The correctable error data, the correct data corresponding to the error data, the XOR operation result, and the similarity for any three pieces of work information in the work information set of the volatile storage medium are shown in Table 4.
  • If the work information set of the volatile storage medium includes M pieces of work information, and the correctable error data, the correct data corresponding to the error data, and the XOR operation result of the m-th piece of work information are as shown in the first row of Table 4, it can be determined that the XOR operation result has a high similarity with the uncorrectable error model, that is, the probability that the correctable error cannot be corrected by the error correction algorithm is high. Here m = 1, ..., M, and M is a positive integer greater than or equal to 1.
  • If the correctable error data, the correct data corresponding to the error data, and the XOR operation result of the m-th piece of work information of the volatile storage medium are as shown in Table 4, the XOR operation result has a low similarity with the uncorrectable error model, that is, the probability that the correctable error cannot be corrected by the error correction algorithm is low.
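  • The similarity computation described above can be sketched as follows, using bit strings. The 16-bit model value is the one given in the ECC example. The function names and the sample operation result are hypothetical, since Table 4 is not reproduced in this text.

```python
UNCORRECTABLE_MODEL = "1101101111010000"  # value from the ECC example above


def xor_bits(error_data, correct_data):
    """Bitwise XOR of two equal-length bit strings."""
    if len(error_data) != len(correct_data):
        raise ValueError("bit strings must have the same length")
    return "".join("1" if a != b else "0" for a, b in zip(error_data, correct_data))


def similarity_equal_bits(result, model):
    """Number of bit positions where result and model hold the same value."""
    return sum(a == b for a, b in zip(result, model))


def similarity_both_one(result, model):
    """Number of bit positions where result and model are both 1."""
    return sum(a == b == "1" for a, b in zip(result, model))


def similarity_both_zero(result, model):
    """Number of bit positions where result and model are both 0."""
    return sum(a == b == "0" for a, b in zip(result, model))


# Hypothetical operation result that differs from the model in its last two bits.
sample_result = "1101101111010011"
print(similarity_equal_bits(sample_result, UNCORRECTABLE_MODEL))
print(similarity_both_one(sample_result, UNCORRECTABLE_MODEL))
print(similarity_both_zero(sample_result, UNCORRECTABLE_MODEL))
```

  • The three functions correspond to the three alternative similarity definitions; which one is used changes the scale of the similarity, so any threshold applied to it would need to be chosen accordingly.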
  • the prediction model may be a mapping relationship between the similarity corresponding to each piece of work information and the risk assessment result.
  • the prediction model may be a model obtained through machine learning training according to the training data set.
  • the training data set may include a similarity degree corresponding to each piece of work information, a risk assessment result, and a mapping relationship between the similarity degree corresponding to each piece of work information and the risk assessment result.
  • the computing device may obtain a trained prediction model.
  • the computing device may obtain a training data set, and train the model according to the training data set, so as to obtain a trained prediction model.
  • If the similarity corresponding to the m-th piece of work information is high, it may indicate that the correctable error corresponding to the m-th piece of work information has a high probability of not being correctable by the error correction algorithm, that is, the risk of an uncorrectable error occurring on the volatile storage medium is high.
  • If the similarity corresponding to the m-th piece of work information is low, it may indicate that the correctable error corresponding to the m-th piece of work information has a low probability of not being correctable by the error correction algorithm, that is, the risk of an uncorrectable error occurring on the volatile storage medium is low.
  • For example, if the correctable error data, the correct data corresponding to the error data, and the XOR operation result of the m-th piece of work information of the volatile storage medium are as shown in the first row of Table 4, it may indicate that the correctable error corresponding to the m-th piece of work information has a high probability of not being correctable by the error correction algorithm. That is, it can be determined that the risk of an uncorrectable error occurring on the volatile storage medium is relatively high.
  • If the correctable error data, the correct data corresponding to the error data, and the XOR operation result of the m-th piece of work information of the volatile storage medium are as shown in the third row of Table 4, it may indicate that the correctable error corresponding to the m-th piece of work information has a low probability of not being correctable by the error correction algorithm. That is, it can be determined that the risk of an uncorrectable error occurring on the volatile storage medium is low.
  • the computing device may compare the similarity corresponding to each piece of work information with a ninth preset threshold, so as to determine a risk assessment result of an uncorrectable error occurring in the volatile storage medium.
  • the ninth preset threshold may be a positive integer greater than or equal to 10 and less than or equal to 16.
  • For example, the ninth preset threshold may be 11, 12, 13 and so on. If the ninth preset threshold is set too large, an uncorrectable error may already have occurred in the volatile storage medium before the risk assessment result of an uncorrectable error occurring in the volatile storage medium is determined, which lowers the accuracy of the risk assessment result.
  • If the ninth preset threshold is set too small, the risk assessment result may be determined to be high risk even though the probability of an uncorrectable error occurring in the volatile storage medium is low, which likewise lowers the accuracy of the risk assessment result.
  • the actual value or value range of the ninth preset threshold depends on any one or more of the following: a volatile storage medium, an error correction algorithm, or the number of data bits for reading and writing data.
  • For different volatile storage media or error correction algorithms, the actual value or value range of the ninth preset threshold may be the same or different, which is not limited by this embodiment of the present application.
  • If the similarity corresponding to the m-th piece of work information is less than the ninth preset threshold, it may indicate that the correctable error corresponding to the m-th piece of work information has a low probability of not being correctable by the error correction algorithm, that is, the risk of an uncorrectable error occurring on the volatile storage medium is low.
  • If the similarity corresponding to the m-th piece of work information is greater than the ninth preset threshold, it may indicate that the correctable error corresponding to the m-th piece of work information has a high probability of not being correctable by the error correction algorithm, that is, the risk of an uncorrectable error occurring on the volatile storage medium is high.
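  • The comparison against the ninth preset threshold can be sketched as follows. The function name `risk_from_similarity` and the default threshold of 12 (one of the example values above) are illustrative choices. The text does not say how a similarity exactly equal to the threshold is treated; grading it as low risk here is an assumption.

```python
def risk_from_similarity(similarity, ninth_threshold=12):
    """Map a per-work-information similarity to a risk label.

    Assumption: a similarity equal to the threshold is graded "low".
    """
    return "high" if similarity > ninth_threshold else "low"


print(risk_from_similarity(14))
print(risk_from_similarity(9))
```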
  • the computing device may determine the risk assessment result corresponding to each piece of work information according to the similarity and prediction model corresponding to each piece of work information.
  • the computing device may also use the risk assessment result with the highest level as the risk assessment result of an uncorrectable error occurring in the volatile storage medium.
  • the work information set of the volatile storage medium includes 10 pieces of work information. If the risk assessment result corresponding to 1 piece of work information among the 10 pieces of work information is high risk, it may be determined that the risk assessment result with the highest level among the risk assessment results corresponding to the 10 pieces of work information is high risk. That is to say, the risk assessment result of uncorrectable errors occurring in the volatile storage medium is a high risk.
  • alternatively, the computing device may determine the risk assessment result corresponding to each piece of work information according to the similarity corresponding to that piece of work information and the prediction model.
  • the computing device may then use the most frequent risk assessment result as the risk assessment result of an uncorrectable error occurring in the volatile storage medium.
  • for example, the work information set of the volatile storage medium includes 10 pieces of work information. If the risk assessment results corresponding to 8 of the 10 pieces of work information are low risk and those corresponding to the other 2 pieces are medium risk, the most frequent result among the 10 risk assessment results is low risk, that is, the risk assessment result of an uncorrectable error occurring in the volatile storage medium is low risk.
  • likewise, if the risk assessment results corresponding to 8 of the 10 pieces of work information are high risk and those corresponding to the other 2 pieces are medium risk, the most frequent result among the 10 risk assessment results is high risk, that is, the risk assessment result of an uncorrectable error occurring in the volatile storage medium is high risk.
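  • as an illustration only, the "most frequent result wins" rule can be sketched as follows. The function name and the string labels are hypothetical. The text does not say what happens when two levels are equally frequent, so the tie-break toward the higher level is an assumption of this sketch.

```python
from collections import Counter

LEVELS = ("low", "medium", "high")  # ascending order of severity


def most_frequent_level(per_record_risks):
    """Overall result = the level that occurs most often among the records."""
    if not per_record_risks:
        raise ValueError("at least one per-record result is required")
    counts = Counter(per_record_risks)
    # Assumption (not from the text): on a tie, prefer the higher level.
    return max(LEVELS, key=lambda level: (counts[level], LEVELS.index(level)))


print(most_frequent_level(["low"] * 8 + ["medium"] * 2))   # low
print(most_frequent_level(["high"] * 8 + ["medium"] * 2))  # high
```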
  • the computing device can obtain, according to the work information set of the volatile storage medium, the operation result of the erroneous data and the correct data when a correctable error occurs in the volatile storage medium.
  • the computing device can also obtain an uncorrectable error model, and determine the risk assessment result of an uncorrectable error occurring in the volatile storage medium according to the uncorrectable error model, the operation result, and the prediction model. The computing device can therefore judge the health status of the volatile storage medium and guide the user to replace it, so as to avoid affecting the normal operation of the computing device or the volatile storage medium.
  • Fig. 6 is a schematic structural diagram of a computing apparatus according to an embodiment of the present application.
  • the computing apparatus 600 includes an acquisition module 610 and a processing module 620.
  • the acquisition module 610 is configured to acquire the work information set of the volatile storage medium in the storage device.
  • the acquisition module 610 may execute step S210 in the method of FIG. 2.
  • the processing module 620 is configured to determine the risk assessment result of an uncorrectable error occurring in the volatile storage medium according to the work information set and the prediction model.
  • the processing module 620 may execute some or all of the following steps: step S220 in the method of FIG. 2, steps S310 and S320 in the method of FIG. 3, steps S410 to S430 in the method of FIG. 4, and steps S510 and S520 in the method of FIG. 5.
  • an embodiment of the present application also provides a computing device. The computing device includes a processor configured to be coupled with a memory and to read and execute instructions and/or program code in the memory, so as to execute each step shown in FIG. 2 to FIG. 5.
  • an embodiment of the present application also provides a chip system. The chip system includes a logic circuit configured to be coupled with an input/output interface and to transmit data through the input/output interface, so as to execute each step shown in FIG. 2 to FIG. 5.
  • the present application also provides a computer program product. The computer program product includes computer program code which, when run on a computer, causes the computer to execute each step shown in FIG. 2 to FIG. 5.
  • the present application also provides a computer-readable medium. The computer-readable medium stores program code which, when run on a computer, causes the computer to execute each step shown in FIG. 2 to FIG. 5.
  • the disclosed systems, devices and methods may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a division by logical function. In actual implementation there may be other ways of dividing them.
  • for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • if the functions described above are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium.
  • based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present application.
  • the aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Abstract

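Embodiments of the present application provide a method for predicting uncorrectable errors of a volatile storage medium, and a related device. The method includes: acquiring a work information set of a volatile storage medium in a storage device, where the work information set includes information about correctable errors that have occurred in the volatile storage medium; and determining, according to the work information set and a prediction model, a risk assessment result of an uncorrectable error occurring in the volatile storage medium. Using the information about correctable errors that have occurred in the volatile storage medium together with the prediction model, the method determines the risk assessment result of an uncorrectable error occurring in that medium and thereby judges its health status.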

Description

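Method for predicting uncorrectable errors of a volatile storage medium, and related device
Technical Field
Embodiments of the present application relate to the field of memory, and mainly to a method for predicting uncorrectable errors of a volatile storage medium, a computing apparatus, a computing device, a chip system, and a computer-readable storage medium.
Background
As the operating frequency and capacity of volatile storage media in storage devices increase, errors in volatile storage media are becoming an increasingly prominent problem. Errors occurring in a volatile storage medium can be divided into correctable errors and uncorrectable errors. When a correctable error occurs, the computing apparatus can correct it in time, so a correctable error has little impact on the storage device or the computing apparatus, and the health status of the volatile storage medium is good. The storage device may be located in the computing apparatus or connected to it. When an uncorrectable error occurs, the computing apparatus cannot correct it, which interrupts the operation of the storage device or the computing apparatus and may even cause the computing apparatus to crash. In this case the health status of the volatile storage medium is poor.
Therefore, how to predict uncorrectable errors of a volatile storage medium, and thereby judge its health status, has become an urgent problem to be solved.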
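Summary
Embodiments of the present application provide a method for predicting uncorrectable errors of a volatile storage medium, a computing apparatus, a computing device, a chip system, and a computer-readable storage medium, which can predict uncorrectable errors of a volatile storage medium and thereby judge its health status.
According to a first aspect, a method for predicting uncorrectable errors of a volatile storage medium is provided. The method includes: acquiring a work information set of a volatile storage medium in a storage device; and determining, according to the work information set and a prediction model, a risk assessment result of an uncorrectable error occurring in the volatile storage medium.
It should be understood that the storage device may be located in a computing device or connected to a computing device. The storage device may be a storage medium, for example a memory or a cache. Alternatively, the storage device may further include a non-volatile storage medium, for example a solid-state drive, in which case the volatile storage medium may be the cache of the solid-state drive.
It should also be understood that the work information set includes information about correctable errors that have occurred in the volatile storage medium. This information includes any one or more of the following: the time at which a correctable error occurred, the address in the volatile storage medium of the erroneous data of the correctable error, or the erroneous data of the correctable error.
It should also be understood that the risk assessment result of an uncorrectable error occurring in the volatile storage medium may be determined directly according to the work information set and the prediction model. Alternatively, the fault cause of the volatile storage medium may first be determined according to the work information set and a first prediction model in the prediction model, and the risk assessment result is then determined from it.
It should also be understood that the risk assessment result of an uncorrectable error occurring in the volatile storage medium is any one of the following: high risk, medium risk, or low risk. If the result is high risk, the health status of the volatile storage medium is poor and the medium needs to be replaced. If the result is low risk, the health status of the volatile storage medium is good and the medium does not need to be replaced.
In the embodiments of the present application, the computing device can determine the risk assessment result of an uncorrectable error occurring in the volatile storage medium according to the information about correctable errors that have occurred in the volatile storage medium and the prediction model, and thereby judge the health status of the volatile storage medium. Based on that health status, the computing device can guide the user to replace the medium, so as to avoid affecting the normal operation of the storage device or the volatile storage medium.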
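With reference to the first aspect, in some implementations of the first aspect, the prediction model includes a first prediction model and a second prediction model. A fault cause is determined according to the work information set and the first prediction model, and the risk assessment result is determined according to the fault cause and the second prediction model.
It should be understood that the fault cause of the volatile storage medium may be determined directly according to the work information set and the first prediction model. Alternatively, an error feature set of the volatile storage medium may be determined according to the work information set, and the fault cause is then determined from the error feature set.
In the embodiments of the present application, the computing device can determine the specific fault cause of the volatile storage medium according to the information about correctable errors and the first prediction model, and can determine the risk assessment result of an uncorrectable error occurring in the volatile storage medium according to the fault cause and the second prediction model. Based on the risk assessment result, the computing device can judge the health status of the volatile storage medium and guide the user to replace it, so as to avoid affecting the normal operation of the storage device or the volatile storage medium.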
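In some implementations, when each piece of work information in the work information set includes the address in the volatile storage medium of the erroneous data of a correctable error, and the work information set further includes the total number of accesses to the volatile storage medium, the number of correctable errors is determined according to the number of pieces of work information in the set; an error feature set of the volatile storage medium is determined according to the work information set, the number of correctable errors, and the length of the statistical period of the work information set; and the fault cause is determined according to the error feature set and the first prediction model.
It should be understood that the error feature set includes any one or more of the following: the error rate of the volatile storage medium, the number of correctable errors occurring per unit time, or the distribution of correctable errors among the storage units of the volatile storage medium.
It should also be understood that the error rate is the ratio of the number of correctable errors to the total number of accesses to the volatile storage medium, and the number of correctable errors per unit time is the ratio of the number of correctable errors to the length of the statistical period.
It should also be understood that a storage unit may include any one or more of the following: a storage matrix (bank), a storage row (row), a storage column (column), a storage block (rank), or a bidirectional data bus (data queue, DQ). That is, the distribution may include whether any one or more of the following are the same across the addresses of the correctable errors: the identifier of the bank, the identifier of the row, the identifier of the column, the identifier of the rank, or the identifier of the DQ.
In the embodiments of the present application, the computing device can determine the error feature set of the volatile storage medium according to the information about correctable errors, and can thus determine the specific fault cause of the volatile storage medium. The computing device can further determine the risk assessment result of an uncorrectable error occurring in the volatile storage medium according to the fault cause and the second prediction model.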
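With reference to the first aspect, in some implementations of the first aspect, the fault cause of the volatile storage medium includes any one or more of the following: capacitor leakage, a word line fault, a sub-word-line driver fault, a main word-line driver fault, a bit line fault, a sense amplifier fault, a bank control circuit fault, poor contact, or insufficient signal margin.
In the embodiments of the present application, the computing device can determine, according to the work information set and the first prediction model, which specific types of fault cause are present in the volatile storage medium, and thereby determine the risk assessment result of an uncorrectable error occurring in the volatile storage medium.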
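With reference to the first aspect, in some implementations of the first aspect, when each piece of work information in the work information set includes the erroneous data of a correctable error, a logical operation is performed on the erroneous data in each piece of work information and the correct data corresponding to that erroneous data, to obtain an operation result for each piece of work information; and the risk assessment result is determined according to an uncorrectable error model, the operation result for each piece of work information, and the prediction model.
It should be understood that the logical operation may be any one of XOR, XNOR, AND, OR, or another logical operation. The uncorrectable error model is data determined according to the error correction algorithm of the volatile storage medium.
In the embodiments of the present application, the computing device can obtain the operation result of the erroneous data and the corresponding correct data of a correctable error, and can determine the risk assessment result of an uncorrectable error occurring in the volatile storage medium according to the uncorrectable error model, the operation result, and the prediction model. Based on the risk assessment result, the computing device can judge the health status of the volatile storage medium and guide the user to replace it, so as to avoid affecting the normal operation of the storage device or the volatile storage medium.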
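With reference to the first aspect, in some implementations of the first aspect, the uncorrectable error model is compared with the operation result for each piece of work information to obtain a similarity for each piece of work information; a risk assessment result for each piece of work information is determined according to its similarity and the prediction model; and the highest-level risk assessment result is used as the risk assessment result of an uncorrectable error occurring in the volatile storage medium.
It should be understood that the similarity for a piece of work information is the similarity between its operation result and the uncorrectable error model.
It should also be understood that if the similarity for a piece of work information is high, the correctable error in that piece of work information has a high probability of being uncorrectable by the error correction algorithm, so the risk assessment result for that piece of work information can be determined to be high risk. If the similarity is low, that probability is low, so the risk assessment result for that piece of work information can be determined to be low risk.
In the embodiments of the present application, the computing device can obtain the similarity for each piece of work information, determine the risk assessment result for each piece of work information according to its similarity, and use the highest-level result as the risk assessment result of an uncorrectable error occurring in the volatile storage medium. Based on the risk assessment result, the computing device can judge the health status of the volatile storage medium and guide the user to replace it, so as to avoid affecting the normal operation of the storage device or the volatile storage medium.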
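According to a second aspect, a computing apparatus is provided. The computing apparatus includes modules for implementing the first aspect or any possible implementation of the first aspect.
According to a third aspect, a computing device is provided. The computing device includes a processor configured to be coupled with a memory and to read and execute instructions and/or program code in the memory, so as to perform the first aspect or any possible implementation of the first aspect.
According to a fourth aspect, a chip system is provided. The chip system includes a logic circuit configured to be coupled with an input/output interface and to transmit data through the input/output interface, so as to perform the first aspect or any possible implementation of the first aspect.
According to a fifth aspect, a computer-readable storage medium is provided. The computer-readable storage medium stores program code which, when run on a computer, causes the computer to perform the first aspect or any possible implementation of the first aspect.
According to a sixth aspect, an embodiment of the present application provides a computer program product. The computer program product includes computer program code which, when run on a computer, causes the computer to perform the first aspect or any possible implementation of the first aspect.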
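Brief Description of the Drawings
FIG. 1 is a schematic system architecture diagram of a computing device.
FIG. 2 is a schematic flowchart of a method for predicting uncorrectable errors of a volatile storage medium according to an embodiment of the present application.
FIG. 3 is a schematic flowchart of a method for predicting uncorrectable errors of a volatile storage medium according to another embodiment of the present application.
FIG. 4 is a schematic flowchart of a method for predicting uncorrectable errors of a volatile storage medium according to another embodiment of the present application.
FIG. 5 is a schematic flowchart of a method for predicting uncorrectable errors of a volatile storage medium according to another embodiment of the present application.
FIG. 6 is a schematic structural diagram of a computing apparatus according to an embodiment of the present application.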
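Detailed Description
The technical solutions in the embodiments of the present application are described below with reference to the accompanying drawings.
The technical solutions of the embodiments of the present application can be applied to various computer systems, for example 32-bit computer systems, 64-bit computer systems, and Advanced RISC Machines (ARM) computer systems, which is not limited in the embodiments of the present application.
The storage device in the embodiments of the present application may be a volatile memory, for example a main memory, a cache, a random access memory (RAM), a static random access memory (SRAM), a dynamic random access memory (DRAM), a synchronous dynamic random access memory (SDRAM), a dual in-line memory module (DIMM), an unbuffered DIMM (UDIMM), a registered DIMM (RDIMM), a load-reduced DIMM (LRDIMM), a double data rate SDRAM (DDR SDRAM), a graphics double data rate SDRAM (GDDR SDRAM), a low-power double data rate SDRAM (LPDDR SDRAM), or a high bandwidth memory (HBM). Alternatively, the storage device may be a memory that includes both a volatile storage medium and a non-volatile storage medium, for example a solid-state drive, in which case the volatile storage medium may be the cache of the solid-state drive. Alternatively, the storage device may be a cache outside the cores of a processor or a system on chip (SOC). The processor may be a central processing unit (CPU), a graphics processing unit (GPU), or the like, and the storage device may be a level 1 cache (L1 cache), a level 2 cache (L2 cache), or the like, which is not limited in the embodiments of the present application.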
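FIG. 1 is a schematic system architecture diagram of a computing device 100. The computing device 100 may include a processor 110, a control circuit 111, an arithmetic circuit 112, a cache controller 113, a cache 114, a memory controller 120, a memory 121, an external memory interface 130, a speaker 140, a display 150, and the like.
It can be understood that the structure illustrated in the embodiments of the present application does not constitute a specific limitation on the computing device 100. In other embodiments of the present application, the computing device 100 may include more or fewer components than shown, combine some components, split some components, or arrange the components differently. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The processor 110 includes the control circuit 111, the arithmetic circuit 112, the cache controller 113, and the cache 114. In some embodiments, the computing device 100 may include one or more processors 110. The processor 110 may be a CPU, a GPU, or the like. The processor 110 may obtain the work information set of a volatile memory from the controller of that volatile memory and determine the risk assessment result of an uncorrectable error occurring in the volatile memory, so as to judge its health status. For example, the processor 110 may obtain the work information set of the cache 114 from the cache controller 113 and determine the risk assessment result of an uncorrectable error occurring in the cache 114. Alternatively, the processor 110 may obtain the work information set of the memory 121 from a hardware register in the memory controller 120 and determine the risk assessment result of an uncorrectable error occurring in the memory 121, so as to judge the health status of the memory. The processor 110 may also obtain an uncorrectable error model (pattern), which is data determined according to the error correction algorithm of the volatile memory. The processor 110 may determine the risk assessment result of an uncorrectable error occurring in the volatile memory according to the uncorrectable error pattern and the work information set of the volatile memory. When the probability of an uncorrectable error is low, the health status of the volatile memory is good and it does not need to be replaced. When the probability of an uncorrectable error is high, the health status of the volatile memory is poor and it needs to be replaced. The volatile memory may be the cache 114 or the memory 121. Alternatively, it may be a volatile memory, or a non-volatile memory that includes a volatile storage medium, connected to the processor 110 through the external memory interface 130. This is not limited in the embodiments of the present application.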
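The control circuit 111 may include an instruction register, an instruction decoder, and an operation controller. The control circuit 111 may fetch one or more instructions from the cache 114 or the memory 121 according to a preset program. The control circuit 111 may also determine the operation to be performed according to the fetched instruction and send micro-operation control signals to the corresponding components.
The arithmetic circuit 112 may obtain data from the cache 114 according to control instructions from the control circuit 111 and perform arithmetic or logical operations.
The cache 114 may hold instructions or data that the control circuit 111 has just used or uses cyclically. If the control circuit 111 needs the instruction or data again, it can be fetched directly from the cache 114. This avoids repeated accesses and reduces the waiting time of the control circuit 111, thereby improving the efficiency with which the computing device 100 processes data or executes instructions. The cache controller 113 may detect whether an error has occurred in the cache; the error may be a correctable error or an uncorrectable error. When the cache controller 113 detects a correctable error in the cache, it may also collect work information of the cache 114, so that the processor 110 can obtain the work information set of the cache 114 through the cache controller 113. The work information set of the cache 114 contains information about correctable errors that have occurred in the cache 114, which may include any one or more of the following: the time at which a correctable error occurred, the address in the cache of the erroneous data of the correctable error, or the erroneous data of the correctable error. The work information set of the cache 114 may further include the total number of accesses to the cache 114 or the correct data corresponding to the erroneous data of a correctable error.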
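Optionally, the cache controller 113 may detect whether an error has occurred in the cache by using an error correction code (ECC) algorithm. Specifically, when data is written into the cache, the ECC algorithm generates a first error check code from the data and adds it to the extra data bits of the data; the data and the first error check code are stored in the cache. When the data is read out, the ECC algorithm generates a second error check code from the data that was read and compares the first error check code with the second error check code to detect whether an error has occurred. If the two check codes are the same, no error has occurred in the cache 114; if they differ, an error has occurred in the cache 114. If the error is a correctable error, the first and second error check codes can be used to locate the specific erroneous data bit and thus obtain the correct data. If the error is an uncorrectable error, the correct data cannot be obtained from the first and second error check codes. In other words, when an uncorrectable error occurs in the cache, the data read from the cache is erroneous data, which may affect the entire computing device.
For example, if the written data is 10010110, bits 0 to 7 of the written data are 0, 1, 1, 0, 1, 0, 0, 1 respectively. According to the ECC algorithm, XORing bits 0, 2, 4, and 6 of the written data gives a check bit of 0 for bits 0, 2, 4, 6. Similarly, the check bit for bits 0, 1, 4, 5 is 0, the check bit for bits 0, 1, 2, 3 is 0, and the check bit for bits 4, 5, 6, 7 is 0. XORing bits 0 to 7 of the written data gives a row check bit of 0. That is, for the written data 10010110, the first error check code is 00000. Bits 0 to 4 of the first error check code are, respectively, the check bit for bits 0, 2, 4, 6, the check bit for bits 0, 1, 4, 5, the check bit for bits 0, 1, 2, 3, the check bit for bits 4, 5, 6, 7, and the row check bit of the written data.
For example, if the data read out is 10010111, the ECC algorithm gives a second error check code of 10111. Because the second error check code differs from the first error check code, an error has occurred. Because the check bit for bits 4, 5, 6, 7 of the read data is 0 and the remaining check bits are 1, it can be assumed that a single data bit of the read data is in error. Because the check bit for bits 4, 5, 6, 7 of the read data is the same as that of the written data, bits 4, 5, 6, 7 of the read data are not in error. Because the check bit for bits 0, 2, 4, 6 is 1, the check bit for bits 0, 1, 4, 5 is 1, and the check bit for bits 0, 1, 2, 3 is 1, the erroneous data bit is bit 0. Repairing bit 0 of the read data gives 10010110. According to the ECC algorithm, the third error check code obtained from the repaired data is 00000, the same as the first error check code. The ECC algorithm can therefore repair the read data to 10010110. Because the repaired data is consistent with the written data, the error is a correctable error and does not affect the computing device 100.
For example, if the data read out is 01011001, the ECC algorithm gives a second error check code of 00001. Because the second error check code differs from the first error check code, an error has occurred. Because the check bit for bits 0, 2, 4, 6 of the read data is 1 and the remaining check bits are 0, it can be assumed that two data bits of the read data are in error. Because the check bit for bits 0, 2, 4, 6 is 1, the check bit for bits 0, 1, 4, 5 is 0, the check bit for bits 0, 1, 2, 3 is 0, and the check bit for bits 4, 5, 6, 7 is 0, it can be determined that one of bits 0, 2, 4, 6 of the read data is in error and that one of the remaining data bits is also in error. If bits 4 and 5 are assumed to be in error, repairing bits 4 and 5 of the read data gives 01101001. According to the ECC algorithm, the third error check code obtained from the repaired data is 00000, the same as the first error check code. The ECC algorithm would therefore repair the read data to 01101001. Because the repaired data is inconsistent with the written data, the error is an uncorrectable error and may affect the computing device 100.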
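The check codes in the worked example above can be reproduced with a short script. This is a minimal sketch: the bit groups are taken from the example, while the function name `check_code` and the string layout (check bit 4 printed first) are choices made for this illustration rather than part of the application.

```python
# Each check bit is the XOR of one group of data bits (bit 0 = least significant).
# Groups 0-3 are the partial check bits from the example; group 4 is the row check bit.
GROUPS = [(0, 2, 4, 6), (0, 1, 4, 5), (0, 1, 2, 3), (4, 5, 6, 7), tuple(range(8))]


def check_code(data: int) -> str:
    """Return the 5-bit check code as text, check bit 4 first, check bit 0 last."""
    bits = []
    for group in GROUPS:
        parity = 0
        for position in group:
            parity ^= (data >> position) & 1
        bits.append(parity)
    return "".join(str(bit) for bit in reversed(bits))


print(check_code(0b10010110))  # written data            -> 00000
print(check_code(0b10010111))  # single-bit error        -> 10111
print(check_code(0b01011001))  # multi-bit error         -> 00001
print(check_code(0b01101001))  # wrong "repair" of above -> 00000
```

The last line shows why the third case is uncorrectable: the mis-repaired word 01101001 produces the same check code as the written word 10010110, so the check code alone cannot tell them apart.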
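The memory controller 120 may control the memory 121 and may be responsible for data exchange between the memory 121 and the processor 110. The memory controller 120 may also detect whether an error has occurred in the memory 121; the error may be a correctable error or an uncorrectable error. When the memory controller 120 detects a correctable error in the memory, it may collect work information of the memory 121, so that the processor 110 can obtain the work information set of the memory 121 from the memory controller 120. The work information set of the memory 121 contains information about correctable errors that have occurred in the memory 121, and each piece of work information in the set may include any one or more of the following: the time at which a correctable error occurred, the address in the memory of the erroneous data of the correctable error, or the erroneous data of the correctable error. The work information set of the memory 121 may further include the total number of accesses to the memory 121 or the correct data corresponding to the erroneous data of a correctable error.
Optionally, the memory controller 120 may detect whether an error has occurred in the in-service memory by using the ECC algorithm. In-service memory is memory that is currently exchanging data with the processor 110 or an external memory.
Optionally, the memory controller 120 may detect in the background, through a hardware engine in the memory controller 120, whether an error has occurred in the in-service memory. Specifically, the hardware engine reads data from the in-service memory in the background without affecting normal reads and writes. If the second error check code computed from the data that was read differs from the first error check code in the extra data bits of that data, an error has occurred in the in-service memory.
Optionally, the memory controller 120 may detect whether an error has occurred in free memory through a memory management module in the memory controller 120. Specifically, the memory management module writes data into the free memory, reads the data back from the free memory, and compares the data as written with the data as read. If they are consistent, no error has occurred in the free memory. If they are inconsistent, an error has occurred in the free memory.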
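The external memory interface 130 may be used to connect an external memory, for example a volatile memory or a non-volatile memory, to extend the storage capacity of the computing device 100. The external memory communicates with the processor 110 through the external memory interface 130 to implement a data storage function.
The computing device 100 may implement audio functions, for example playing music, through the speaker 140.
The display 150 is used to display text, images, video, and the like. The display 150 includes a display panel. The display panel may be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flex light-emitting diode (FLED), Miniled, MicroLed, Micro-oLed, quantum dot light-emitting diodes (QLED), or the like. The computing device 100 implements a display function through the display 150. In some embodiments, the computing device 100 may include one or more displays 150.
The computing device 100 in FIG. 1 may send prompt information to the user through the speaker 140 or the display 150. The prompt information may indicate that an uncorrectable error has occurred in a volatile storage medium in the computing device 100, or may indicate the risk assessment result of an uncorrectable error occurring in a volatile storage medium in the computing device 100. Alternatively, the prompt information may indicate identification information of the volatile storage medium in the computing device 100 in which the uncorrectable error occurred. The identification information may include the product number, the specific location, or similar information of that volatile storage medium.
The computing device 100 in FIG. 1 can predict uncorrectable errors of a volatile storage medium and thereby judge its health status, so as to guide the user to replace it and avoid affecting the normal operation of the computing device or the volatile storage medium.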
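FIG. 2 is a schematic flowchart of a method for predicting uncorrectable errors of a volatile storage medium. The method in FIG. 2 includes the following steps.
S210: Acquire a work information set of a volatile storage medium in a storage device.
The computing device may obtain the work information set of the volatile storage medium in the storage device. The storage device may be located in the computing device or connected to the computing device.
Optionally, the computing device may obtain the work information set of the volatile storage medium continuously in real time, or periodically. The computing device may also obtain the work information set after the n-th correctable error occurs in the volatile storage medium, where n is a preset threshold. Alternatively, the computing device may obtain the work information set after receiving an acquisition instruction. This is not limited in the embodiments of the present application.
Optionally, the work information set may include information about correctable errors that have occurred in the volatile storage medium, which may include any one or more of the following: the time at which a correctable error occurred, the address in the volatile storage medium of the erroneous data of the correctable error, or the erroneous data of the correctable error.
Optionally, the information about any one correctable error that has occurred in the volatile storage medium may form one piece of work information. That is, the work information set may include at least one piece of work information, each of which is the information about one correctable error that has occurred in the volatile storage medium. Each piece of work information may include any one or more of the following: the time at which the correctable error occurred, the address in the volatile storage medium of the erroneous data of the correctable error, or the erroneous data of the correctable error.
Optionally, the address in the volatile storage medium of the erroneous data of a correctable error may include any one or more of the following: the identifier of the storage matrix (bank) to which the erroneous data belongs in the volatile storage medium, and the identifier of the storage row (row) or of the storage column (column) to which the erroneous data belongs within that bank.
Optionally, the address in the volatile storage medium of the erroneous data of a correctable error may further include the identifier of the DQ to which the erroneous data belongs in the volatile storage medium, or the identifier of the storage block (rank) to which the erroneous data belongs in the volatile storage medium.
Optionally, the work information set of the volatile storage medium may further include the total number of accesses to the volatile storage medium or the correct data corresponding to the erroneous data of a correctable error.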
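One possible in-memory representation of the work information set described in step S210 is sketched below. All class and field names are hypothetical and chosen for this illustration; every per-error field is optional because the text says a piece of work information may carry any one or more of these items.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class WorkInfo:
    """Information about one correctable error (one piece of work information)."""
    time: Optional[float] = None        # when the correctable error occurred
    dq: Optional[int] = None            # identifier of the DQ
    rank: Optional[int] = None          # identifier of the storage block (rank)
    bank: Optional[int] = None          # identifier of the storage matrix (bank)
    row: Optional[int] = None           # identifier of the storage row
    column: Optional[int] = None        # identifier of the storage column
    error_data: Optional[int] = None    # erroneous data that was read
    correct_data: Optional[int] = None  # corresponding correct data, if recorded


@dataclass
class WorkInfoSet:
    """Work information set for one volatile storage medium."""
    records: List[WorkInfo] = field(default_factory=list)
    total_accesses: Optional[int] = None

    @property
    def correctable_error_count(self) -> int:
        # One record per correctable error, so the count is the number of records.
        return len(self.records)


info_set = WorkInfoSet(
    records=[WorkInfo(bank=1, row=2, column=7), WorkInfo(bank=1, row=3, column=7)],
    total_accesses=1000,
)
print(info_set.correctable_error_count)  # 2
```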
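S220: Determine, according to the work information set and a prediction model, a risk assessment result of an uncorrectable error occurring in the volatile storage medium.
The computing device may assess the risk of an uncorrectable error occurring in the volatile storage medium according to the work information set and the prediction model, and thereby obtain the risk assessment result.
Optionally, the computing device may assess the risk of an uncorrectable error occurring in the volatile storage medium directly according to the work information set.
Optionally, the computing device may determine the risk assessment result of an uncorrectable error occurring in the volatile storage medium according to any one or more items of information included in each piece of work information in the work information set.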
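For example, if the addresses of the erroneous data of the correctable errors in the volatile storage medium belong to the same bank, the probability of an uncorrectable error occurring in the volatile storage medium may be regarded as low, that is, the risk of an uncorrectable error is low.
For example, if the addresses of the erroneous data of the correctable errors in the volatile storage medium belong to the same row, the probability of an uncorrectable error occurring in the volatile storage medium may be regarded as low, that is, the risk of an uncorrectable error is low.
For example, if the addresses of the erroneous data of the correctable errors in the volatile storage medium belong to the same column, the probability of an uncorrectable error occurring in the volatile storage medium may be regarded as low, that is, the risk of an uncorrectable error is low.
For example, if the addresses of the erroneous data of the correctable errors in the volatile storage medium belong to the same bank, and every piece of erroneous data belongs to the same row within that bank, the probability of an uncorrectable error occurring in the volatile storage medium may be regarded as low, that is, the risk of an uncorrectable error is low.
For example, if the addresses of the erroneous data of the correctable errors in the volatile storage medium belong to the same bank, and every piece of erroneous data belongs to the same column within that bank, the probability of an uncorrectable error occurring in the volatile storage medium may be regarded as low, that is, the risk of an uncorrectable error is low.
For example, if the addresses of the erroneous data of the correctable errors in the volatile storage medium belong to the same bank, and every piece of erroneous data belongs to the same row and the same column within that bank, the probability of an uncorrectable error occurring in the volatile storage medium may be regarded as low, that is, the risk of an uncorrectable error is low.
For example, if the addresses of the erroneous data of the correctable errors in the volatile storage medium match Case 1, the probability of an uncorrectable error occurring in the volatile storage medium may be regarded as low. Case 1 is: every piece of erroneous data belongs to the same DQ, and within that DQ every piece of erroneous data belongs to the same bank, and within that bank to the same column and the same row. That is, the risk of an uncorrectable error occurring in the volatile storage medium is low in this case.
For example, if the addresses of the erroneous data of the correctable errors in the volatile storage medium match Case 2, the probability of an uncorrectable error occurring in the volatile storage medium may be regarded as high. Case 2 is: the pieces of erroneous data belong to different DQs, within their DQs they belong to different banks, and within their banks they belong to different columns or different rows. That is, the risk of an uncorrectable error occurring in the volatile storage medium is high in this case.
For example, if the time at which the correctable errors occurred in the volatile storage medium falls outside a preset time range, the probability of an uncorrectable error occurring in the volatile storage medium may be regarded as low, that is, the risk of an uncorrectable error is low.
For example, if the time at which the correctable errors occurred in the volatile storage medium falls within the preset time range, the probability of an uncorrectable error occurring in the volatile storage medium may be regarded as high, that is, the risk of an uncorrectable error is high.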
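Optionally, the computing device may determine the number of correctable errors that have occurred in the volatile storage medium according to the number of pieces of work information in the work information set.
For example, if the number of correctable errors that have occurred in the volatile storage medium is lower than a first preset threshold, the probability of an uncorrectable error occurring in the volatile storage medium may be regarded as low. That is, the risk of an uncorrectable error is low in this case. The first preset threshold may be a positive integer greater than or equal to 10 and less than or equal to 40, for example 20, 25, or 30.
For example, if the number of correctable errors that have occurred in the volatile storage medium is higher than the first preset threshold and lower than a second preset threshold, the probability of an uncorrectable error occurring in the volatile storage medium may be regarded as medium. That is, the risk assessment result is medium risk in this case. The second preset threshold may be a positive integer greater than 70 and less than or equal to 100, for example 80, 85, or 90. The larger the first or second preset threshold is set, the more likely it is that an uncorrectable error has already occurred before the risk assessment result is determined, so the accuracy of the risk assessment result is lower. The smaller the first or second preset threshold is set, the more likely it is that the result is determined to be medium or high risk while the probability of an uncorrectable error is actually low, so the accuracy of the risk assessment result is also lower.
For example, if the number of correctable errors that have occurred in the volatile storage medium is higher than the second preset threshold, the probability of an uncorrectable error occurring in the volatile storage medium may be regarded as high. That is, the risk of an uncorrectable error is high in this case.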
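The count-based rule above amounts to a two-threshold classification. The sketch below uses 20 and 80 as defaults because the text lists them as example values for the first and second preset thresholds; the function name is hypothetical, and treating a count exactly equal to a threshold as medium risk is an assumption, since the text only covers the strictly lower and strictly higher cases.

```python
def risk_from_error_count(count, first_threshold=20, second_threshold=80):
    """Map the number of correctable errors to a risk level."""
    if first_threshold >= second_threshold:
        raise ValueError("first_threshold must be below second_threshold")
    if count < first_threshold:
        return "low"
    if count > second_threshold:
        return "high"
    # Between the thresholds; boundary values land here by assumption.
    return "medium"


for count in (5, 50, 95):
    print(count, risk_from_error_count(count))
# 5 low
# 50 medium
# 95 high
```

As the text notes, the thresholds trade missed predictions against false alarms, so in practice they would be tuned per storage medium rather than fixed.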
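For example, if the number of correctable errors that have occurred in the volatile storage medium is lower than the first preset threshold and the total number of accesses to the volatile storage medium is higher than a third preset threshold, the probability of an uncorrectable error occurring in the volatile storage medium may be regarded as low. That is, the risk of an uncorrectable error is low in this case. The third preset threshold may be a positive integer greater than 700 and less than or equal to 1000, for example 800, 850, or 900.
For example, if the number of correctable errors that have occurred in the volatile storage medium is higher than the second preset threshold and the total number of accesses to the volatile storage medium is lower than a fourth preset threshold, the probability of an uncorrectable error occurring in the volatile storage medium may be regarded as high. That is, the risk of an uncorrectable error is high in this case. The fourth preset threshold may be a positive integer greater than 100 and less than or equal to 400, for example 200, 250, or 300. The larger the third or fourth preset threshold is set, the more likely it is that an uncorrectable error has already occurred before the risk assessment result is determined, so the accuracy of the risk assessment result is lower. The smaller the third or fourth preset threshold is set, the more likely it is that the result is determined to be medium or high risk while the probability of an uncorrectable error is actually low, so the accuracy of the risk assessment result is also lower.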
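For example, if the number of correctable errors is lower than the first preset threshold and the time at which the correctable errors occurred falls outside the preset time range, the probability of an uncorrectable error occurring in the volatile storage medium may be regarded as low. That is, the risk of an uncorrectable error is low in this case.
For example, if the number of correctable errors is higher than the second preset threshold and the time at which the correctable errors occurred falls within the preset time range, the probability of an uncorrectable error occurring in the volatile storage medium may be regarded as high. That is, the risk of an uncorrectable error is high in this case.
For example, if the number of correctable errors is lower than the first preset threshold and the addresses of the erroneous data of the correctable errors belong to the same bank, the probability of an uncorrectable error occurring in the volatile storage medium may be regarded as low. That is, the risk of an uncorrectable error is low in this case.
For example, if the number of correctable errors is higher than the second preset threshold and the addresses of the erroneous data of the correctable errors belong to different banks, the probability of an uncorrectable error occurring in the volatile storage medium may be regarded as high. That is, the risk of an uncorrectable error is high in this case.
For example, if the number of correctable errors is lower than the first preset threshold and the addresses of the erroneous data of the correctable errors match Case 1, the probability of an uncorrectable error occurring in the volatile storage medium may be regarded as low. That is, the risk of an uncorrectable error is low in this case.
For example, if the number of correctable errors is higher than the second preset threshold and the addresses of the erroneous data of the correctable errors match Case 2, the probability of an uncorrectable error occurring in the volatile storage medium may be regarded as high. That is, the risk of an uncorrectable error is high in this case.
For example, if the number of correctable errors is lower than the first preset threshold, the total number of accesses is higher than the third preset threshold, and the addresses of the erroneous data of the correctable errors match Case 1, the probability of an uncorrectable error occurring in the volatile storage medium may be regarded as low. That is, the risk of an uncorrectable error is low in this case.
For example, if the number of correctable errors is higher than the second preset threshold, the total number of accesses is lower than the fourth preset threshold, and the addresses of the erroneous data of the correctable errors match Case 2, the probability of an uncorrectable error occurring in the volatile storage medium may be regarded as high. That is, the risk of an uncorrectable error is high in this case.
For example, if the number of correctable errors is lower than the first preset threshold, the time at which the correctable errors occurred falls outside the preset time range, and the addresses of the erroneous data of the correctable errors match Case 1, the probability of an uncorrectable error occurring in the volatile storage medium may be regarded as low. That is, the risk of an uncorrectable error is low in this case.
For example, if the number of correctable errors is higher than the second preset threshold, the time at which the correctable errors occurred falls within the preset time range, and the addresses of the erroneous data of the correctable errors match Case 2, the probability of an uncorrectable error occurring in the volatile storage medium may be regarded as high. That is, the risk of an uncorrectable error is high in this case.
For example, if the number of correctable errors is lower than the first preset threshold, the total number of accesses is higher than the third preset threshold, and the time at which the correctable errors occurred falls outside the preset time range, the probability of an uncorrectable error occurring in the volatile storage medium may be regarded as low. That is, the risk of an uncorrectable error is low in this case.
For example, if the number of correctable errors is higher than the second preset threshold, the total number of accesses is lower than the fourth preset threshold, and the time at which the correctable errors occurred falls within the preset time range, the probability of an uncorrectable error occurring in the volatile storage medium may be regarded as high. That is, the risk of an uncorrectable error is high in this case.
For example, if the number of correctable errors is lower than the first preset threshold, the total number of accesses is higher than the third preset threshold, the time at which the correctable errors occurred falls outside the preset time range, and the addresses of the erroneous data of the correctable errors match Case 1, the probability of an uncorrectable error occurring in the volatile storage medium may be regarded as low. That is, the risk of an uncorrectable error is low in this case.
For example, if the number of correctable errors is higher than the second preset threshold, the total number of accesses is lower than the fourth preset threshold, the time at which the correctable errors occurred falls within the preset time range, and the addresses of the erroneous data of the correctable errors match Case 2, the probability of an uncorrectable error occurring in the volatile storage medium may be regarded as high. That is, the risk of an uncorrectable error is high in this case.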
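Optionally, the actual value or value range of any one or more of the first, second, third, or fourth preset thresholds depends on the specific volatile storage medium. For different volatile storage media, the actual value or value range of each preset threshold may be the same or different, which is not limited in the embodiments of the present application.
Optionally, the prediction model may be a mapping between the work information set of the volatile storage medium and the risk assessment result.
Optionally, the prediction model may be a model obtained by machine learning training on a training data set. The training data set may include work information sets of volatile storage media, risk assessment results, and the mapping between work information sets and risk assessment results. Alternatively, the training data set may further include fault causes, the mapping between work information sets and fault causes, and the mapping between fault causes and risk assessment results.
Optionally, before step S220, the computing device may obtain an already trained prediction model. Alternatively, before step S220, the computing device may obtain a training data set and train a model on it to obtain the trained prediction model.
Optionally, the prediction model may include a first prediction model and a second prediction model. The computing device may determine the fault cause of the volatile storage medium according to the work information set and the first prediction model, and may determine the risk assessment result of an uncorrectable error occurring in the volatile storage medium according to the fault cause and the second prediction model. For details, see the description of FIG. 3.
Optionally, the computing device may determine the error feature set of the volatile storage medium according to the work information set, the number of correctable errors, and the length of the statistical period of the work information set. The error feature set may include any one or more of the following: the error rate, the number of correctable errors occurring per unit time, or the distribution of correctable errors among the storage units of the volatile storage medium. The computing device may determine the fault cause of the volatile storage medium according to the first prediction model and the error feature set, and may determine the risk assessment result of an uncorrectable error occurring in the volatile storage medium according to the fault cause and the second prediction model. For details, see the description of FIG. 4.
Optionally, the computing device may perform a logical operation on the erroneous data of the correctable error in each piece of work information and the correct data corresponding to that erroneous data, to obtain an operation result for each piece of work information. The computing device may determine the risk assessment result according to the uncorrectable error model, the operation result for each piece of work information, and the prediction model. For details, see the description of FIG. 5.
The computing device can determine the risk assessment result of an uncorrectable error occurring in the volatile storage medium according to the work information set and the prediction model, and thereby judge the health status of the volatile storage medium. Based on that health status, the computing device can guide the user to replace the medium, so as to avoid affecting the normal operation of the computing device or the volatile storage medium.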
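FIG. 3 is a schematic flowchart of a method for predicting uncorrectable errors of a volatile storage medium. The method in FIG. 3 includes the following steps.
S310: Determine a fault cause according to the work information set and the first prediction model.
The computing device may determine the fault cause of the volatile storage medium according to the work information set obtained in S210 and the first prediction model.
Optionally, the computing device may determine the fault cause of the volatile storage medium directly according to the work information set and the first prediction model.
Optionally, the computing device may determine the fault cause of the volatile storage medium according to any one or more items of information included in each piece of work information in the work information set.
Optionally, the fault cause of the volatile storage medium may include any one or more of the following: capacitor leakage, a word line (WL) fault, a sub-word-line driver (sub-word driver, SWD) fault, a main word-line driver (main-word driver, MWD) fault, a bit line (BL) fault, a sense amplifier (SA) fault, a storage matrix (bank) control circuit fault, poor contact, or insufficient signal margin.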
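For example, if the addresses of the erroneous data of the correctable errors in the volatile storage medium belong to the same bank, the fault cause of the volatile storage medium may be determined to include an SWD fault, an SA fault, an MWD fault, a WL fault, a BL fault, or capacitor leakage.
For example, if the addresses of the erroneous data of the correctable errors in the volatile storage medium belong to different banks, the fault cause may be determined to include a bank control circuit fault, poor contact, or insufficient margin.
For example, if the addresses of the erroneous data of the correctable errors in the volatile storage medium match Case 1, the fault cause may be determined to include capacitor leakage or poor contact. Case 1 is: every piece of erroneous data belongs to the same DQ, and within that DQ every piece of erroneous data belongs to the same bank, and within that bank to the same column and the same row.
For example, if the addresses of the erroneous data of the correctable errors in the volatile storage medium match Case 2, the fault cause may be determined to include a bank control circuit fault, an MWD fault, or an SA fault. Case 2 is: the pieces of erroneous data belong to different DQs, within their DQs they belong to different banks, and within their banks they belong to different columns or different rows. That is, the risk of an uncorrectable error occurring in the volatile storage medium is high in this case.
For example, if the time at which the correctable errors occurred falls outside the preset time range, the fault cause may be determined to include a WL fault, a BL fault, capacitor leakage, poor contact, or insufficient margin.
For example, if the time at which the correctable errors occurred falls within the preset time range, the fault cause may be determined to include an SWD fault, an SA fault, an MWD fault, or a bank control circuit fault.
For example, if the number of correctable errors is lower than the first preset threshold, the fault cause may be determined to include a WL fault, a BL fault, capacitor leakage, poor contact, or insufficient margin.
For example, if the number of correctable errors is higher than the second preset threshold, the fault cause may be determined to include an SWD fault, an SA fault, an MWD fault, a bank control circuit fault, poor contact, or insufficient margin.
For example, if the number of correctable errors is lower than the first preset threshold and the total number of accesses is higher than the third preset threshold, the fault cause may be determined to include a WL fault, a BL fault, capacitor leakage, or poor contact.
For example, if the number of correctable errors is higher than the second preset threshold and the total number of accesses is lower than the fourth preset threshold, the fault cause may be determined to include an SWD fault, an SA fault, an MWD fault, or a bank control circuit fault.
For example, if the number of correctable errors is lower than the first preset threshold and the time at which the correctable errors occurred falls outside the preset time range, the fault cause may be determined to include capacitor leakage, poor contact, or insufficient margin.
For example, if the number of correctable errors is higher than the second preset threshold and the time at which the correctable errors occurred falls within the preset time range, the fault cause may be determined to include an SWD fault, an SA fault, an MWD fault, or a bank control circuit fault.
For example, if the number of correctable errors is lower than the first preset threshold and the addresses of the erroneous data match Case 1, the fault cause may be determined to include capacitor leakage or poor contact.
For example, if the number of correctable errors is higher than the second preset threshold and the addresses of the erroneous data match Case 2, the fault cause may be determined to include an SA fault, an MWD fault, or a bank control circuit fault.
For example, if the number of correctable errors is lower than the first preset threshold, the total number of accesses is higher than the third preset threshold, and the addresses of the erroneous data match Case 1, the fault cause may be determined to include capacitor leakage or poor contact.
For example, if the number of correctable errors is higher than the second preset threshold, the total number of accesses is lower than the fourth preset threshold, and the addresses of the erroneous data match Case 2, the fault cause may be determined to include an SA fault, an MWD fault, or a bank control circuit fault.
For example, if the number of correctable errors is lower than the first preset threshold, the time at which the correctable errors occurred falls outside the preset time range, and the addresses of the erroneous data match Case 1, the fault cause may be determined to include capacitor leakage or poor contact.
For example, if the number of correctable errors is higher than the second preset threshold, the time at which the correctable errors occurred falls within the preset time range, and the addresses of the erroneous data match Case 2, the fault cause may be determined to include an SA fault, an MWD fault, or a bank control circuit fault.
For example, if the number of correctable errors is lower than the first preset threshold, the total number of accesses is higher than the third preset threshold, and the time at which the correctable errors occurred falls outside the preset time range, the fault cause may be determined to include capacitor leakage.
For example, if the number of correctable errors is higher than the second preset threshold, the total number of accesses is lower than the fourth preset threshold, and the time at which the correctable errors occurred falls within the preset time range, the fault cause may be determined to include an SWD fault, an SA fault, an MWD fault, or a bank control circuit fault.
For example, if the number of correctable errors is lower than the first preset threshold, the total number of accesses is higher than the third preset threshold, the time at which the correctable errors occurred falls outside the preset time range, and the addresses of the erroneous data match Case 1, the fault cause may be determined to include capacitor leakage.
For example, if the number of correctable errors is higher than the second preset threshold, the total number of accesses is lower than the fourth preset threshold, the time at which the correctable errors occurred falls within the preset time range, and the addresses of the erroneous data match Case 2, the fault cause may be determined to include a bank control circuit fault.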
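Optionally, the first prediction model may be a mapping between the work information set of the volatile storage medium and the fault cause.
Optionally, the first prediction model may be a model obtained by machine learning training on a training data set. The training data set may include work information sets of volatile storage media, fault causes, and the mapping between work information sets and fault causes.
Optionally, before step S310, the computing device may obtain an already trained first prediction model. Alternatively, before step S310, the computing device may obtain a training data set and train a model on it to obtain the trained first prediction model.
Optionally, the computing device may determine the error feature set of the volatile storage medium according to the work information set, the number of correctable errors, and the length of the statistical period of the work information set. The error feature set may include any one or more of the following: the error rate, the number of correctable errors occurring per unit time, or the distribution of correctable errors among the storage units of the volatile storage medium. The computing device may then determine the fault cause of the volatile storage medium according to the first prediction model and the error feature set. For details, see the description of FIG. 4.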
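S320: Determine the risk assessment result according to the fault cause and the second prediction model.
The computing device may judge the severity of the fault in the volatile storage medium according to the fault cause and the second prediction model, and thereby determine the risk assessment result of an uncorrectable error occurring in the volatile storage medium.
Optionally, the second prediction model may be a mapping between the fault cause and the risk assessment result.
Optionally, the second prediction model may be a model obtained by machine learning training on a training data set. The training data set may include fault causes, risk assessment results, and the mapping between fault causes and risk assessment results.
Optionally, before step S320, the computing device may obtain an already trained second prediction model. Alternatively, before step S320, the computing device may obtain a training data set and train a model on it to obtain the trained second prediction model.
For example, if the fault cause of the volatile storage medium includes capacitor leakage, the current fault is relatively minor and the probability of an uncorrectable error occurring in the volatile storage medium is low. That is, the risk of an uncorrectable error is low in this case.
For example, if the fault cause of the volatile storage medium includes any one or more of a WL fault, a BL fault, poor contact, or insufficient margin, the severity of the current fault is medium and the probability of an uncorrectable error occurring in the volatile storage medium is medium. That is, the risk of an uncorrectable error is medium in this case.
For example, if the fault cause of the volatile storage medium includes any one or more of an SWD fault, an SA fault, an MWD fault, or a bank control circuit fault, the current fault is relatively severe and the probability of an uncorrectable error occurring in the volatile storage medium is high. That is, the risk of an uncorrectable error is high in this case.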
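Optionally, the computing device may determine the risk assessment result of an uncorrectable error occurring in the volatile storage medium according to the fault cause and a risk assessment table.
For example, the risk assessment table is shown in Table 1. (The risk column was lost in extraction; the values below are restored from the three preceding examples.)
Table 1 Risk assessment table
Fault cause                   Risk assessment result
SWD fault                     High risk
SA fault                      High risk
MWD fault                     High risk
Bank control circuit fault    High risk
WL fault                      Medium risk
BL fault                      Medium risk
Insufficient margin           Medium risk
Poor contact                  Medium risk
Capacitor leakage             Low risk
Optionally, Table 1 indicates the correspondence between each fault cause and a risk assessment result. In some embodiments, other correspondences between fault causes and risk assessment results may exist, which is not limited in the embodiments of the present application.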
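Optionally, if the computing device determines multiple fault causes for the volatile storage medium, the risk assessment result of an uncorrectable error occurring in the volatile storage medium may be determined as the highest-level result among the risk assessment results corresponding to the individual fault causes.
For example, when the fault causes of the volatile storage medium include capacitor leakage, poor contact, and a bank control circuit fault, the highest-level result among the risk assessment results corresponding to the individual fault causes is high risk, so the probability of an uncorrectable error occurring in the volatile storage medium is high. That is, the risk of an uncorrectable error is high in this case.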
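When the second prediction model is a plain mapping, the lookup and the "highest level wins" rule for multiple fault causes can be sketched as below. The dictionary follows the three severity examples given for step S320; the key strings and function name are hypothetical labels for this illustration, and the application notes that other correspondences are possible.

```python
RISK_ORDER = {"low": 0, "medium": 1, "high": 2}

# Fault cause -> risk level, following the severity examples in step S320.
FAULT_RISK = {
    "swd_fault": "high",
    "sa_fault": "high",
    "mwd_fault": "high",
    "bank_control_circuit_fault": "high",
    "wl_fault": "medium",
    "bl_fault": "medium",
    "insufficient_margin": "medium",
    "poor_contact": "medium",
    "capacitor_leakage": "low",
}


def risk_from_fault_causes(causes):
    """Return the highest-level risk among the given fault causes."""
    if not causes:
        raise ValueError("at least one fault cause is required")
    return max((FAULT_RISK[cause] for cause in causes), key=RISK_ORDER.__getitem__)


print(risk_from_fault_causes(
    ["capacitor_leakage", "poor_contact", "bank_control_circuit_fault"]
))  # high
```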
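Optionally, if the computing device determines multiple fault causes for the volatile storage medium, the risk assessment result of an uncorrectable error occurring in the volatile storage medium may be determined according to the probability of occurrence of each fault cause.
For example, when the fault cause of the volatile storage medium includes capacitor leakage or poor contact, and capacitor leakage is the more probable, the probability of an uncorrectable error occurring in the volatile storage medium can be determined to be low. That is, the risk of an uncorrectable error is low in this case.
For example, when the fault causes of the volatile storage medium include a WL fault, a BL fault, capacitor leakage, poor contact, and insufficient margin, and the faults of medium severity among them are the more probable, the probability of an uncorrectable error occurring in the volatile storage medium can be determined to be medium. That is, the risk of an uncorrectable error is medium in this case.
For example, when the fault causes of the volatile storage medium include an SWD fault, an SA fault, an MWD fault, a bank control circuit fault, poor contact, and insufficient margin, and the more severe faults among them are the more probable, the probability of an uncorrectable error occurring in the volatile storage medium can be determined to be high. That is, the risk of an uncorrectable error is high in this case.
Optionally, in some embodiments, when the computing device determines multiple fault causes for the volatile storage medium and every fault cause corresponds to the same risk assessment result, the risk assessment result of an uncorrectable error occurring in the volatile storage medium may be set one level higher than that result.
For example, if the computing device determines that the fault causes of the volatile storage medium include a WL fault and a BL fault, the risk assessment result of an uncorrectable error occurring in the volatile storage medium can be determined to be high risk.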
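The escalation rule in the last embodiment can be sketched as follows. The function name is hypothetical. The text only covers the case where all fault causes share one level, so falling back to the highest level when the levels differ, and capping the result at high risk, are assumptions of this sketch.

```python
LEVELS = ["low", "medium", "high"]  # ascending order


def escalate_if_uniform(per_cause_risks):
    """Raise the result by one level when several causes all share one level."""
    if not per_cause_risks:
        raise ValueError("at least one per-cause result is required")
    if len(per_cause_risks) > 1 and len(set(per_cause_risks)) == 1:
        index = LEVELS.index(per_cause_risks[0])
        return LEVELS[min(index + 1, len(LEVELS) - 1)]  # cap at "high"
    # Assumption: otherwise use the highest level among the causes.
    return max(per_cause_risks, key=LEVELS.index)


# WL fault and BL fault are both medium risk, so the overall result is raised.
print(escalate_if_uniform(["medium", "medium"]))  # high
```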
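Optionally, when the risk of an uncorrectable error occurring in the volatile storage medium is high, the health status of the volatile storage medium is poor and the medium needs to be replaced. When the risk is low, the health status of the volatile storage medium is good and the medium does not need to be replaced.
The computing device can determine the fault cause of the volatile storage medium according to the work information set and the first prediction model, and can determine the risk assessment result of an uncorrectable error occurring in the volatile storage medium according to the fault cause and the second prediction model. Based on the risk assessment result, the computing device can judge the health status of the volatile storage medium and guide the user to replace it, so as to avoid affecting the normal operation of the computing device or the volatile storage medium.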
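FIG. 4 is a schematic flowchart of a method for predicting uncorrectable errors of a volatile storage medium. The method in FIG. 4 includes the following steps.
S410: Determine the error feature set of the volatile storage medium according to the work information set, the number of correctable errors, and the length of the statistical period.
Optionally, the computing device may determine the number of correctable errors that have occurred in the volatile storage medium according to the number of pieces of work information in the work information set obtained in S210.
Optionally, when each piece of work information in the work information set includes the address in the volatile storage medium of the erroneous data of a correctable error, and the work information set further includes the total number of accesses to the volatile storage medium, the computing device may determine the error feature set of the volatile storage medium according to the work information set, the number of correctable errors, and the length of the statistical period. The statistical period is the statistical period of the work information set. The error feature set may include any one or more of the following: the error rate, the number of correctable errors occurring per unit time, or the distribution of correctable errors among the storage units of the volatile storage medium.
Optionally, the computing device may obtain the error feature set of the volatile storage medium from the work information set continuously in real time, or periodically. Alternatively, the computing device may obtain the error feature set after the n-th correctable error occurs in the volatile storage medium, where n is a preset threshold. Alternatively, the computing device may obtain the error feature set after receiving an acquisition instruction. This is not limited in the embodiments of the present application.
Optionally, the computing device may determine the distribution, among the storage units of the volatile storage medium, of the correctable errors that occurred within the statistical period, according to the address of the erroneous data included in each piece of work information. The storage unit may include any one or more of the following: bank, row, column, rank, or DQ. That is, the distribution may include whether any one or more of the following are the same across the addresses of the erroneous data of the correctable errors: the identifier of the bank, the identifier of the row, the identifier of the column, the identifier of the rank, or the identifier of the DQ.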
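For example, the possible distributions of correctable errors in the volatile storage medium are shown in Table 2 below.
Table 2 Distribution of correctable errors
[Table 2 is provided only as images in the source: Figure PCTCN2022111694-appb-000001 and Figure PCTCN2022111694-appb-000002.]
If the distribution of correctable errors in the volatile storage medium is distribution 1 in Table 2, this may indicate that only one error has occurred in the volatile storage medium. Alternatively, distribution 1 may indicate that multiple correctable errors have occurred and that their erroneous data is located in a single DQ, with the same rank identifier, the same bank identifier within the rank, and the same row identifier and column identifier within the bank for every piece of erroneous data.
If the distribution of correctable errors in the volatile storage medium is distribution 10 in Table 2, this may indicate that multiple correctable errors have occurred in the volatile storage medium. The erroneous data of these correctable errors is spread across multiple DQs; every piece of erroneous data has the same rank identifier, the bank identifiers within the rank differ, and the row identifiers or column identifiers within the banks differ.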
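Optionally, the computing device may determine the error rate of the volatile storage medium according to the total number of accesses to the volatile storage medium and the number of correctable errors that have occurred in it.
Optionally, because the volatile storage medium may include one or more banks, the error rate of the volatile storage medium may include the error rates of one or more banks. The error rate of each bank may be the ratio of the number of correctable errors that have occurred in that bank to the total number of accesses to that bank.
Optionally, the computing device may determine the number of correctable errors per unit time according to the number of correctable errors that have occurred in the volatile storage medium and the length of the statistical period of the work information set.
Optionally, the computing device may obtain the number of correctable errors that occurred in the volatile storage medium within a first time range, and thereby determine the number of correctable errors per unit time. The first time range may be the difference between the time at which recording of correctable errors started and the time at which recording of correctable errors ended. Alternatively, the first time range may be the statistical period.
Optionally, the computing device may obtain the number of correctable errors that occurred in the volatile storage medium within a second time range, and thereby determine the number of correctable errors per unit time. The second time range may be the difference between the time at which a first error occurred and the time at which a second error occurred. The first error and the second error are any two correctable errors of the volatile storage medium that did not occur at the same time, and the first error occurred earlier than the second error.
Optionally, the number of correctable errors per unit time of the volatile storage medium may include the number of correctable errors per unit time of one or more banks. The number of correctable errors per unit time of each bank may be the ratio of the number of correctable errors that occurred in that bank to the time range. The time range may be the first time range or the second time range, which is not limited in the embodiments of the present application.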
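The three error features of step S410 can be computed as in the following sketch. The function name, the dictionary keys, and the use of seconds as the unit of time are assumptions made for this illustration; the distribution is reduced to one flag per storage unit indicating whether all errors share the same identifier.

```python
UNITS = ("dq", "rank", "bank", "row", "column")


def error_features(addresses, total_accesses, period_seconds):
    """Compute error rate, errors per unit time, and address distribution.

    addresses: one dict per correctable error, with an identifier for each
    entry of UNITS. total_accesses and period_seconds must be positive.
    """
    if total_accesses <= 0 or period_seconds <= 0:
        raise ValueError("total_accesses and period_seconds must be positive")
    count = len(addresses)  # one address record per correctable error
    same_unit = {
        unit: len({address[unit] for address in addresses}) <= 1
        for unit in UNITS
    }
    return {
        "error_rate": count / total_accesses,
        "errors_per_second": count / period_seconds,
        "same_unit": same_unit,
    }


errors = [
    {"dq": 0, "rank": 0, "bank": 1, "row": 2, "column": 7},
    {"dq": 0, "rank": 0, "bank": 1, "row": 3, "column": 7},
]
features = error_features(errors, total_accesses=1000, period_seconds=10)
print(features["error_rate"], features["errors_per_second"])  # 0.002 0.2
print(features["same_unit"]["bank"], features["same_unit"]["row"])  # True False
```

A per-bank variant, as mentioned in the text, would group the address records by bank identifier and apply the same ratios to each group.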
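S420: Determine the fault cause of the volatile storage medium according to the error feature set and the first prediction model.
When a fault occurs in the volatile storage medium, correctable or uncorrectable errors occur, and errors caused by different faults exhibit different features. The computing device can therefore determine the fault cause of the volatile storage medium according to the error feature set obtained in step S410 and the first prediction model.
Optionally, the computing device may determine the fault cause of the volatile storage medium directly according to the error feature set and the first prediction model. The error feature set may include any one or more of the following: the error rate, the number of correctable errors occurring per unit time, or the distribution of correctable errors among the storage units of the volatile storage medium.
Optionally, if the error rate of the volatile storage medium is lower than a fifth preset threshold, the error rate is low. If the error rate is higher than the fifth preset threshold and lower than a sixth preset threshold, the error rate is medium. If the error rate is higher than the sixth preset threshold, the error rate is high. The fifth preset threshold may be a number greater than or equal to 0 and less than 0.2, for example 0.01, 0.1, or 0.15. The sixth preset threshold may be a positive number greater than or equal to 0.4 and less than or equal to 1, for example 0.5, 0.6, or 0.7. The larger the fifth or sixth preset threshold is set, the more likely it is that an uncorrectable error has already occurred before the risk assessment result is determined, so the accuracy of the risk assessment result is lower. The smaller the fifth or sixth preset threshold is set, the more likely it is that the result is determined to be medium or high risk while the probability of an uncorrectable error is actually low, so the accuracy of the risk assessment result is also lower.
Optionally, if the number of correctable errors per unit time is lower than a seventh preset threshold, the number of correctable errors per unit time is low. If it is higher than the seventh preset threshold and lower than an eighth preset threshold, it is medium. If it is higher than the eighth preset threshold, it is high. The seventh preset threshold may be a positive integer greater than 10 and less than or equal to 40, for example 15, 20, or 25. The eighth preset threshold may be a positive integer greater than 70 and less than or equal to 100, for example 75, 80, or 85. The larger the seventh or eighth preset threshold is set, the more likely it is that an uncorrectable error has already occurred before the risk assessment result is determined, so the accuracy of the risk assessment result is lower. The smaller the seventh or eighth preset threshold is set, the more likely it is that the result is determined to be medium or high risk while the probability of an uncorrectable error is actually low, so the accuracy of the risk assessment result is also lower.
Optionally, the actual value or value range of any one or more of the fifth, sixth, seventh, or eighth preset thresholds depends on the specific volatile storage medium. For different volatile storage media, the actual value or value range of each preset threshold may be the same or different, which is not limited in the embodiments of the present application.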
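For example, if the error rate of the volatile storage medium is low, the fault cause may be directly determined to include a WL fault, a BL fault, capacitor leakage, or insufficient margin.
For example, if the error rate of the volatile storage medium is high, the fault cause may be directly determined to include an SWD fault, an SA fault, an MWD fault, a bank control circuit fault, or poor contact.
For example, if the number of correctable errors per unit time is low, the fault cause may be directly determined to include a WL fault, a BL fault, capacitor leakage, poor contact, or insufficient margin.
For example, if the number of correctable errors per unit time is high, the fault cause may be directly determined to include an SWD fault, an SA fault, an MWD fault, or a bank control circuit fault.
For example, if the distribution of correctable errors in the volatile storage medium is distribution 1 in Table 2, the fault cause may be directly determined to include a WL fault or capacitor leakage.
For example, if the distribution of correctable errors in the volatile storage medium is distribution 10 in Table 2, the fault cause may be directly determined to include an SA fault, an MWD fault, or a bank control circuit fault.
For example, if the error rate is low and the distribution of correctable errors is distribution 1 in Table 2, the fault cause may be directly determined to include a WL fault or capacitor leakage.
For example, if the error rate is high and the distribution of correctable errors is distribution 10 in Table 2, the fault cause may be directly determined to include an SA fault, an MWD fault, or a bank control circuit fault.
For example, if the number of correctable errors per unit time is low and the distribution of correctable errors is distribution 1 in Table 2, the fault cause may be directly determined to include a WL fault or capacitor leakage.
For example, if the number of correctable errors per unit time is high and the distribution of correctable errors is distribution 10 in Table 2, the fault cause may be directly determined to include an SA fault, an MWD fault, or a bank control circuit fault.
For example, if the error rate is low and the number of correctable errors per unit time is low, the fault cause may be directly determined to include capacitor leakage.
For example, if the error rate is high and the number of correctable errors per unit time is high, the fault cause may be directly determined to include an SWD fault, an SA fault, an MWD fault, or a bank control circuit fault.
For example, if the error rate and the number of correctable errors per unit time are both low and the distribution of correctable errors is distribution 1 in Table 2, the fault cause may be directly determined to be capacitor leakage.
For example, if the error rate and the number of correctable errors per unit time are both high and the distribution of correctable errors is distribution 10 in Table 2, the fault cause may be directly determined to be a bank control circuit fault.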
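Optionally, the computing device may determine the fault cause of the volatile storage medium from a fault cause table based on any one or more of the error rate of the volatile storage medium, the number of correctable errors that occur per unit time, or the distribution of correctable errors across the storage cells of the volatile storage medium.
For example, if the possible distributions of correctable errors in the volatile storage medium are as shown in Table 2, the fault cause table of the volatile storage medium may be as shown in Table 3 below:
Table 3 Fault cause table
[Table 3 is provided as an image in the original publication: Figure PCTCN2022111694-appb-000003]
For example, if the error rate of the volatile storage medium is low, the number of correctable errors per unit time is low, and the distribution of correctable errors is distribution 1 in Table 2, the fault cause may be determined to be capacitor leakage according to the fault cause table shown in Table 3.
For example, if the error rate of the volatile storage medium is high, the number of correctable errors per unit time is high, and the distribution of correctable errors is distribution 7 in Table 2, the fault cause may be determined to be an SWD fault according to the fault cause table shown in Table 3.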
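A table-driven lookup of this kind might look like the sketch below. Table 3 itself is only available as an image, so the dictionary holds just the two entries stated in the examples above; all other rows, the key layout, and the names `FAULT_CAUSE_TABLE` and `lookup_fault_cause` are assumptions made for illustration.

```python
from typing import Optional

# Key: (error-rate level, correctable-errors-per-unit-time level, distribution number in Table 2).
# Only the two rows quoted in the text are included; the rest of Table 3 is not reproduced here.
FAULT_CAUSE_TABLE = {
    ("low", "low", 1): "capacitor leakage",
    ("high", "high", 7): "SWD fault",
}


def lookup_fault_cause(error_rate_level: str, ce_rate_level: str, distribution: int) -> Optional[str]:
    """Return the fault cause for a feature combination, or None if the row is unknown."""
    return FAULT_CAUSE_TABLE.get((error_rate_level, ce_rate_level, distribution))
```

Returning `None` for an unknown combination keeps the sketch honest about the rows that are missing; a real table would presumably cover every distribution listed in Table 2.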
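S430: Determine the risk assessment result based on the fault cause and a second prediction model. Step S430 is implemented in a manner similar to step S320, and details are not repeated here.
The computing device may determine the error feature set of the volatile storage medium based on the work information set of the volatile storage medium, and may determine the fault cause of the volatile storage medium based on that error feature set. The computing device may further determine, based on the fault cause and the second prediction model, the risk assessment result for the occurrence of an uncorrectable error in the volatile storage medium. The computing device can therefore judge the health state of the volatile storage medium and guide the user to replace it, so that normal operation of the computing device or the volatile storage medium is not affected.
FIG. 5 is a schematic flowchart of a method for predicting uncorrectable errors in a volatile storage medium. The method in FIG. 5 includes the following steps.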
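S510: Perform a logical operation on the error data of the correctable error included in each piece of work information and the correct data corresponding to the error data, to obtain an operation result corresponding to each piece of work information.
The computing device may perform a logical operation on the error data of the correctable error included in each piece of work information in the work information set and the correct data corresponding to that error data, to obtain the operation result of the error data and the correct data.
Optionally, the logical operation may be any one of logical operations such as XOR, XNOR, AND, or OR.
Optionally, before step S510, the computing device may obtain the work information set of the volatile storage medium. Each piece of work information in the work information set may include the error data of a correctable error, in which case the computing device may obtain the correct data corresponding to the error data based on the error correction algorithm of the volatile storage medium and the error data. Alternatively, each piece of work information in the work information set may include both the error data of a correctable error and the corresponding correct data.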
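S520: Determine the risk assessment result based on an uncorrectable error model, the operation result corresponding to each piece of work information, and a prediction model.
Every error correction algorithm may have certain limitations; that is, for each error correction algorithm there may be one or more data patterns that the algorithm cannot correct. For each error correction algorithm, the data that the algorithm cannot correct can be used as the uncorrectable error model. In addition, an error correction algorithm works by applying an operation rule to the correct data and the error data. A similar operation rule can therefore be applied to the correct data and the error data of a correctable error to obtain an operation result, and the risk assessment result for the occurrence of an uncorrectable error in the volatile storage medium can be determined by comparing the similarity between the operation result and the uncorrectable error model.
Optionally, before step S520, the computing device may obtain the uncorrectable error model of the volatile storage medium. The uncorrectable error model is data determined according to the error correction principle of the error correction algorithm of the volatile storage medium.
Optionally, the computing device may compare the uncorrectable error model with the operation result corresponding to each piece of work information, to obtain a similarity corresponding to each piece of work information. The computing device may further determine the risk assessment result based on the similarity corresponding to each piece of work information and the prediction model.
Optionally, the computing device may compare, bit by bit, the operation result corresponding to each piece of work information with the uncorrectable error model, count the bit positions whose values are the same, and use that count as the similarity corresponding to each piece of work information.
Optionally, the computing device may use, as the similarity corresponding to each piece of work information, the number of bit positions at which both the operation result and the uncorrectable error model are 1.
Optionally, the computing device may use, as the similarity corresponding to each piece of work information, the number of bit positions at which both the operation result and the uncorrectable error model are 0.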
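The three similarity measures listed above can be written down directly. The sketch below assumes, purely for illustration, that the operation result and the uncorrectable error model are given as equal-length strings of `0` and `1`; the function names are hypothetical.

```python
def _check(result: str, model: str) -> None:
    """Validate that both inputs are bit strings of the same length."""
    if len(result) != len(model):
        raise ValueError("operation result and model must have the same bit width")
    if set(result + model) - {"0", "1"}:
        raise ValueError("inputs must contain only '0' and '1'")


def matching_bits(result: str, model: str) -> int:
    """Number of bit positions where the result and the model have the same value."""
    _check(result, model)
    return sum(r == m for r, m in zip(result, model))


def both_one_bits(result: str, model: str) -> int:
    """Number of bit positions where the result and the model are both 1."""
    _check(result, model)
    return sum(r == "1" and m == "1" for r, m in zip(result, model))


def both_zero_bits(result: str, model: str) -> int:
    """Number of bit positions where the result and the model are both 0."""
    _check(result, model)
    return sum(r == "0" and m == "0" for r, m in zip(result, model))
```

Note that the first measure is always the sum of the other two, so the choice mainly affects how the similarity scales against whichever threshold or prediction model is applied afterwards.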
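For example, assume that the error correction algorithm of the volatile storage medium is ECC, that the uncorrectable error model of the ECC is 1101101111010000, and that the operation rule underlying the ECC's error correction is XOR. Table 4 shows, for any three pieces of work information in the work information set of the volatile storage medium, the error data of the correctable error, the correct data corresponding to the error data, the XOR result corresponding to each piece of work information, and the similarity corresponding to each piece of work information.
Table 4 Similarity table
No. Error data Correct data XOR result Similarity
1 1101101111000000 0000000000000000 1101101111000000
2 0100101100010000 0000000000000000 0100101100010000
3 0000000000010000 0000000000000000 0000000000010000
For example, assume that the work information set of the volatile storage medium includes M pieces of work information, where m = 1, ..., M and M is a positive integer greater than or equal to 1. If the error data of the correctable error included in the m-th piece of work information, the correct data corresponding to the error data, and the XOR result corresponding to the m-th piece of work information are the data in row 1 of Table 4, it may be determined that the similarity between the XOR result and the uncorrectable error model is high, that is, the probability that the correctable error cannot be corrected by the error correction algorithm is high.
For example, if the error data of the correctable error included in the m-th piece of work information, the correct data corresponding to the error data, and the XOR result corresponding to the m-th piece of work information are the data in row 3 of Table 4, it may be determined that the similarity between the XOR result and the uncorrectable error model is low, that is, the probability that the correctable error cannot be corrected by the error correction algorithm is low.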
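The Table 4 rows can be worked through numerically. The sketch below assumes the similarity is the count of matching bit positions (one of the three optional definitions given earlier); the similarity values it produces are computed under that assumption and are not taken from the source, whose similarity column is not reproduced. The names `MODEL`, `WIDTH`, `xor_result`, and `similarity` are illustrative.

```python
MODEL = 0b1101101111010000  # uncorrectable error model from the example
WIDTH = 16                  # number of data bits in the example


def xor_result(error_data: int, correct_data: int) -> int:
    """Logical operation of step S510, using XOR as in the example."""
    return error_data ^ correct_data


def similarity(result: int, model: int = MODEL, width: int = WIDTH) -> int:
    """Count of bit positions where the operation result equals the model."""
    mask = (1 << width) - 1
    differing = bin((result ^ model) & mask).count("1")
    return width - differing


# (error data, correct data) for rows 1 to 3 of Table 4
rows = [
    (0b1101101111000000, 0b0000000000000000),
    (0b0100101100010000, 0b0000000000000000),
    (0b0000000000010000, 0b0000000000000000),
]

similarities = [similarity(xor_result(error, correct)) for error, correct in rows]
print(similarities)  # [15, 12, 8]
```

Under this definition, row 1 differs from the model in a single bit (similarity 15 of 16) while row 3 matches in only half of the positions (similarity 8), which agrees with the text's description of row 1 as high similarity and row 3 as low similarity.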
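Optionally, the prediction model may be a mapping between the similarity corresponding to each piece of work information and the risk assessment result.
Optionally, the prediction model may be a model obtained through machine learning training on a training data set. The training data set may include the similarity corresponding to each piece of work information, the risk assessment result, and the mapping between the similarity corresponding to each piece of work information and the risk assessment result.
Optionally, before step S520, the computing device may obtain an already trained prediction model. Alternatively, before step S520, the computing device may obtain the training data set and train a model on it to obtain the trained prediction model.
Optionally, if the similarity corresponding to the m-th piece of work information is high, the probability that the corresponding correctable error cannot be corrected by the error correction algorithm is high, that is, the risk of an uncorrectable error occurring in the volatile storage medium is high. If the similarity corresponding to the m-th piece of work information is low, the probability that the corresponding correctable error cannot be corrected by the error correction algorithm is low, that is, the risk of an uncorrectable error occurring in the volatile storage medium is low.
For example, if the error data of the correctable error included in the m-th piece of work information, the correct data corresponding to the error data, and the XOR result corresponding to the m-th piece of work information are the data in row 1 of Table 4, the probability that the corresponding correctable error cannot be corrected by the error correction algorithm is high. In other words, it may be determined that the risk of an uncorrectable error occurring in the volatile storage medium is high.
For example, if the error data of the correctable error included in the m-th piece of work information, the correct data corresponding to the error data, and the XOR result corresponding to the m-th piece of work information are the data in row 3 of Table 4, the probability that the corresponding correctable error cannot be corrected by the error correction algorithm is low. In other words, it may be determined that the risk of an uncorrectable error occurring in the volatile storage medium is low.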
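Optionally, the computing device may compare the similarity corresponding to each piece of work information with a ninth preset threshold, to determine the risk assessment result for the occurrence of an uncorrectable error in the volatile storage medium. The ninth preset threshold may be a positive integer greater than or equal to 10 and less than or equal to 16, for example 11, 12, or 13. The larger the ninth preset threshold is set, the more likely it is that an uncorrectable error has already occurred in the volatile storage medium before the risk assessment result is determined, that is, the lower the accuracy of the risk assessment result. The smaller the ninth preset threshold is set, the more likely it is that the risk assessment result is determined to be high risk even though the probability of an uncorrectable error is low, which likewise lowers the accuracy of the risk assessment result.
Optionally, the actual value or value range of the ninth preset threshold depends on any one or more of the following: the volatile storage medium, the error correction algorithm, or the number of data bits of the read/write data. For different volatile storage media, error correction algorithms, or numbers of data bits, the actual value or value range of the ninth preset threshold may be the same or different, which is not limited in the embodiments of this application.
For example, if the similarity corresponding to the m-th piece of work information is less than the ninth preset threshold, the probability that the corresponding correctable error cannot be corrected by the error correction algorithm is low, that is, the risk of an uncorrectable error occurring in the volatile storage medium is low.
For example, if the similarity corresponding to the m-th piece of work information is greater than the ninth preset threshold, the probability that the corresponding correctable error cannot be corrected by the error correction algorithm is high, that is, the risk of an uncorrectable error occurring in the volatile storage medium is high.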
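The threshold comparison can be sketched in a few lines. The function name and the default threshold of 11 (one of the example values in the text) are illustrative, and because the text does not cover a similarity exactly equal to the threshold, the sketch assumes that case is graded as low risk.

```python
def risk_from_similarity(similarity: int, ninth_threshold: int = 11) -> str:
    """Map a similarity to a risk level by comparing it with the ninth preset threshold.

    A similarity greater than the threshold gives 'high'; anything else,
    including equality (an assumption), gives 'low'.
    """
    return "high" if similarity > ninth_threshold else "low"


# Per-work-information risks for three hypothetical similarities
risks = [risk_from_similarity(s) for s in (15, 12, 8)]
```

Raising the threshold makes the check less sensitive and lowering it makes false alarms more likely, which is the trade-off the text describes.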
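Optionally, the computing device may determine a risk assessment result corresponding to each piece of work information based on the similarity corresponding to each piece of work information and the prediction model. The computing device may then use the risk assessment result with the highest level as the risk assessment result for the occurrence of an uncorrectable error in the volatile storage medium.
For example, assume that the work information set of the volatile storage medium includes 10 pieces of work information. If the risk assessment result corresponding to one of the 10 pieces of work information is high risk, it may be determined that the highest-level risk assessment result among the 10 pieces of work information is high risk. In other words, the risk assessment result for the occurrence of an uncorrectable error in the volatile storage medium is high risk.
Optionally, the computing device may determine a risk assessment result corresponding to each piece of work information based on the similarity corresponding to each piece of work information and the prediction model. The computing device may then use the most frequent risk assessment result as the risk assessment result for the occurrence of an uncorrectable error in the volatile storage medium.
For example, assume that the work information set of the volatile storage medium includes 10 pieces of work information. If the risk assessment result corresponding to 8 of the 10 pieces of work information is low risk and the risk assessment result corresponding to the other 2 is medium risk, it may be determined that the most frequent risk assessment result among the 10 pieces of work information is low risk, that is, the risk assessment result for the occurrence of an uncorrectable error in the volatile storage medium is low risk.
For example, if the risk assessment result corresponding to 8 of the 10 pieces of work information is high risk and the risk assessment result corresponding to the other 2 is medium risk, it may be determined that the most frequent risk assessment result among the 10 pieces of work information is high risk, that is, the risk assessment result for the occurrence of an uncorrectable error in the volatile storage medium is high risk.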
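The two aggregation strategies described above (highest level, most frequent) can be sketched as follows. The level names and their ordering follow the text; the function names are hypothetical, and breaking frequency ties in favour of the higher risk level is an assumption, since the text does not address ties.

```python
from collections import Counter
from typing import Sequence

RISK_ORDER = {"low": 0, "medium": 1, "high": 2}


def highest_risk(results: Sequence[str]) -> str:
    """Overall result is the highest-level risk among all pieces of work information."""
    if not results:
        raise ValueError("at least one per-work-information result is required")
    return max(results, key=lambda level: RISK_ORDER[level])


def most_frequent_risk(results: Sequence[str]) -> str:
    """Overall result is the most frequent risk; ties go to the higher level (assumption)."""
    if not results:
        raise ValueError("at least one per-work-information result is required")
    counts = Counter(results)
    return max(counts, key=lambda level: (counts[level], RISK_ORDER[level]))
```

The two strategies can disagree: nine low-risk results and one high-risk result yield high risk under the first strategy and low risk under the second, so the first is the more conservative choice.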
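The computing device may obtain, based on the work information set of the volatile storage medium, the operation result of the error data and the correct data of each correctable error that occurs in the volatile storage medium. The computing device may further obtain the uncorrectable error model and determine, based on the uncorrectable error model, the operation result, and the prediction model, the risk assessment result for the occurrence of an uncorrectable error in the volatile storage medium. The computing device can therefore judge the health state of the volatile storage medium and guide the user to replace it, so that normal operation of the computing device or the volatile storage medium is not affected.
The foregoing describes the method for predicting uncorrectable errors in a volatile storage medium according to the embodiments of this application. The following describes, with reference to FIG. 6, a computing apparatus and related devices according to the embodiments of this application.
FIG. 6 is a schematic structural diagram of a computing apparatus according to an embodiment of this application. The computing apparatus 600 includes an obtaining module 610 and a processing module 620.
The obtaining module 610 is configured to obtain the work information set of a volatile storage medium in a storage device. The obtaining module 610 may perform step S210 in the method of FIG. 2.
The processing module 620 is configured to determine, based on the work information set and a prediction model, the risk assessment result for the occurrence of an uncorrectable error in the volatile storage medium. The processing module 620 may perform some or all of step S220 in the method of FIG. 2, steps S310 and S320 in the method of FIG. 3, steps S410 to S430 in the method of FIG. 4, and steps S510 and S520 in the method of FIG. 5.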
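The two-module split of the computing apparatus can be sketched as a small composition of callables. This is an illustrative sketch only: the class name, the field names, and the dictionary-based work information record are assumptions, and the lambda bodies in the usage example stand in for real acquisition and prediction logic.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

WorkInfo = dict  # hypothetical record type for one piece of work information


@dataclass
class ComputingApparatus:
    """Obtaining module (610) plus processing module (620), modelled as callables."""

    obtain: Callable[[], Sequence[WorkInfo]]      # returns the work information set
    process: Callable[[Sequence[WorkInfo]], str]  # returns the risk assessment result

    def assess(self) -> str:
        """Obtain the work information set, then derive the risk assessment result."""
        work_info_set = self.obtain()
        return self.process(work_info_set)


# Usage with stand-in modules: one record whose similarity exceeds a threshold of 11
apparatus = ComputingApparatus(
    obtain=lambda: [{"similarity": 15}],
    process=lambda infos: "high" if any(i["similarity"] > 11 for i in infos) else "low",
)
result = apparatus.assess()
```

Keeping the two modules as separate callables mirrors the logical division in FIG. 6 and allows either one to be swapped, for example replacing the threshold rule with a trained prediction model.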
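An embodiment of this application further provides a computing device. The computing device includes a processor, and the processor is configured to be coupled to a memory and to read and execute instructions and/or program code in the memory, to perform the steps in FIG. 2 to FIG. 5.
An embodiment of this application further provides a chip system. The chip system includes a logic circuit, and the logic circuit is configured to be coupled to an input/output interface and to transmit data through the input/output interface, to perform the steps in FIG. 2 to FIG. 5.
According to the method provided in the embodiments of this application, this application further provides a computer program product. The computer program product includes computer program code, and when the computer program code is run on a computer, the computer is caused to perform the steps in FIG. 2 to FIG. 5.
According to the method provided in the embodiments of this application, this application further provides a computer-readable medium. The computer-readable medium stores program code, and when the program code is run on a computer, the computer is caused to perform the steps in FIG. 2 to FIG. 5.
A person of ordinary skill in the art may be aware that the units and algorithm steps of the examples described with reference to the embodiments disclosed herein can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are performed by hardware or software depends on the specific application and the design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each specific application, but such an implementation should not be considered to go beyond the scope of this application.
A person skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the systems, apparatuses, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments, and details are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. The division into units is merely a division by logical function, and other divisions are possible in an actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of this application in essence, or the part that contributes to the prior art, or part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of this application. The foregoing storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing descriptions are merely specific implementations of this application, but the protection scope of this application is not limited thereto. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.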

Claims (13)

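  1. A method for predicting uncorrectable errors in a volatile storage medium, comprising:
    obtaining a work information set of a volatile storage medium in a storage device, wherein the work information set comprises information about correctable errors that occur in the volatile storage medium, and the information about a correctable error comprises any one or more of the following: a time at which the correctable error occurs, an address in the volatile storage medium of error data of the correctable error, or the error data of the correctable error; and
    determining, based on the work information set and a prediction model, a risk assessment result for the occurrence of an uncorrectable error in the volatile storage medium.
  2. The method according to claim 1, wherein the prediction model comprises a first prediction model and a second prediction model, and the determining, based on the work information set and a prediction model, a risk assessment result for the occurrence of an uncorrectable error in the volatile storage medium comprises:
    determining a fault cause based on the work information set and the first prediction model; and
    determining the risk assessment result based on the fault cause and the second prediction model.
  3. The method according to claim 2, wherein the fault cause of the volatile storage medium comprises any one or more of the following:
    capacitor leakage, a word line fault, a sub-word-line driver fault, a main word-line driver fault, a bit line fault, a sense amplifier fault, a bank control circuit fault, poor contact, or insufficient signal margin.
  4. The method according to claim 1, wherein when each piece of work information in the work information set comprises the error data of the correctable error, the determining, based on the work information set and a prediction model, a risk assessment result for the occurrence of an uncorrectable error in the volatile storage medium comprises:
    performing a logical operation on the error data of the correctable error comprised in each piece of work information and correct data corresponding to the error data, to obtain an operation result corresponding to each piece of work information; and
    determining the risk assessment result based on an uncorrectable error model, the operation result corresponding to each piece of work information, and the prediction model.
  5. The method according to claim 4, wherein the determining the risk assessment result based on an uncorrectable error model, the operation result corresponding to each piece of work information, and the prediction model comprises:
    comparing the uncorrectable error model with the operation result corresponding to each piece of work information, to obtain a similarity corresponding to each piece of work information;
    determining, based on the similarity corresponding to each piece of work information and the prediction model, a risk assessment result corresponding to each piece of work information; and
    using the risk assessment result with the highest level as the risk assessment result for the occurrence of an uncorrectable error in the volatile storage medium.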
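  6. A computing apparatus, comprising:
    an obtaining module, configured to obtain a work information set of a volatile storage medium in a storage device, wherein the work information set comprises information about correctable errors that occur in the volatile storage medium, and the information about a correctable error comprises any one or more of the following: a time at which the correctable error occurs, an address in the volatile storage medium of error data of the correctable error, or the error data of the correctable error; and
    a processing module, configured to determine, based on the work information set and a prediction model, a risk assessment result for the occurrence of an uncorrectable error in the volatile storage medium.
  7. The apparatus according to claim 6, wherein the prediction model comprises a first prediction model and a second prediction model;
    the processing module is configured to determine a fault cause based on the work information set and the first prediction model; and
    the processing module is further configured to determine the risk assessment result based on the fault cause and the second prediction model.
  8. The apparatus according to claim 7, wherein the fault cause of the volatile storage medium comprises any one or more of the following:
    capacitor leakage, a word line fault, a sub-word-line driver fault, a main word-line driver fault, a bit line fault, a sense amplifier fault, a bank control circuit fault, poor contact, or insufficient signal margin.
  9. The apparatus according to claim 6, wherein when each piece of work information in the work information set comprises the error data of the correctable error, the processing module is configured to perform a logical operation on the error data of the correctable error comprised in each piece of work information and correct data corresponding to the error data, to obtain an operation result corresponding to each piece of work information; and
    the processing module is further configured to determine the risk assessment result based on an uncorrectable error model, the operation result corresponding to each piece of work information, and the prediction model.
  10. The apparatus according to claim 9, wherein the processing module is configured to compare the uncorrectable error model with the operation result corresponding to each piece of work information, to obtain a similarity corresponding to each piece of work information;
    the processing module is further configured to determine, based on the similarity corresponding to each piece of work information and the prediction model, a risk assessment result corresponding to each piece of work information; and
    the processing module is further configured to use the risk assessment result with the highest level as the risk assessment result for the occurrence of an uncorrectable error in the volatile storage medium.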
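  11. A computing device, comprising a processor, wherein the processor is configured to be coupled to a memory and to read and execute instructions and/or program code in the memory, to perform the method according to any one of claims 1 to 5.
  12. A chip system, comprising a logic circuit, wherein the logic circuit is configured to be coupled to an input/output interface and to transmit data through the input/output interface, to perform the method according to any one of claims 1 to 5.
  13. A computer-readable medium, wherein the computer-readable medium stores program code, and when the program code is run on a computer, the computer is caused to perform the method according to any one of claims 1 to 5.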
PCT/CN2022/111694 2022-01-29 2022-08-11 Method for predicting uncorrectable errors in a volatile storage medium, and related device WO2023142429A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210111886.1 2022-01-29
CN202210111886.1A CN116560897A (zh) 2022-01-29 2022-01-29 Method for predicting uncorrectable errors in a volatile storage medium, and related device

Publications (1)

Publication Number Publication Date
WO2023142429A1 true WO2023142429A1 (zh) 2023-08-03

Family

ID=87470310

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/111694 WO2023142429A1 (zh) 2022-01-29 2022-08-11 Method for predicting uncorrectable errors in a volatile storage medium, and related device

Country Status (2)

Country Link
CN (1) CN116560897A (zh)
WO (1) WO2023142429A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116820828B (zh) * 2023-08-29 2024-01-09 苏州浪潮智能科技有限公司 可纠正错误阈值设定方法、装置、电子设备及存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1204232A1 (en) * 2000-11-06 2002-05-08 Lucent Technologies Inc. Detection of uncorrectable data blocks in coded communications systems
US20090164872A1 (en) * 2007-12-21 2009-06-25 Sun Microsystems, Inc. Prediction and prevention of uncorrectable memory errors
CN105575434A (zh) * 2014-10-31 2016-05-11 英飞凌科技股份有限公司 非易失性存储器的健康状态
CN105912437A (zh) * 2015-02-19 2016-08-31 发那科株式会社 控制装置的故障预测系统
CN113495815A (zh) * 2020-04-07 2021-10-12 英特尔公司 基于计算机总线的错误记录表征错误相关性

Also Published As

Publication number Publication date
CN116560897A (zh) 2023-08-08

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22923233

Country of ref document: EP

Kind code of ref document: A1