WO2021253708A1 - Method, apparatus, device, and storage medium for processing memory faults - Google Patents

Method, apparatus, device, and storage medium for processing memory faults

Info

Publication number
WO2021253708A1
Authority
WO
WIPO (PCT)
Prior art keywords
memory
failure
fault
row
threshold
Prior art date
Application number
PCT/CN2020/126112
Other languages
English (en)
French (fr)
Inventor
乔光毅
刁阳彬
马剑涛
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to EP20940708.9A (EP3979079A4)
Publication of WO2021253708A1
Priority to US17/582,802 (US20220148674A1)

Classifications

    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C29/00Checking stores for correct operation ; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/04Detection or location of defective memory elements, e.g. cell constructio details, timing of test signals
    • G11C29/08Functional testing, e.g. testing during refresh, power-on self testing [POST] or distributed testing
    • G11C29/48Arrangements in static stores specially adapted for testing by means external to the store, e.g. using direct memory access [DMA] or using auxiliary access paths
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C29/00Checking stores for correct operation ; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/04Detection or location of defective memory elements, e.g. cell constructio details, timing of test signals
    • G11C29/08Functional testing, e.g. testing during refresh, power-on self testing [POST] or distributed testing
    • G11C29/12Built-in arrangements for testing, e.g. built-in self testing [BIST] or interconnection details
    • G11C29/44Indication or identification of errors, e.g. for repair
    • G11C29/4401Indication or identification of errors, e.g. for repair for self repair
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/073Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a memory management context, e.g. virtual memory or cache management
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079Root cause analysis, i.e. error or fault diagnosis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793Remedial or corrective actions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F11/1008Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices
    • G06F11/1048Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices using arrangements adapted for a specific error detection or correction feature
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C29/00Checking stores for correct operation ; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/04Detection or location of defective memory elements, e.g. cell constructio details, timing of test signals
    • G11C29/08Functional testing, e.g. testing during refresh, power-on self testing [POST] or distributed testing
    • G11C29/12Built-in arrangements for testing, e.g. built-in self testing [BIST] or interconnection details
    • G11C29/1201Built-in arrangements for testing, e.g. built-in self testing [BIST] or interconnection details comprising I/O circuitry
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C29/00Checking stores for correct operation ; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/04Detection or location of defective memory elements, e.g. cell constructio details, timing of test signals
    • G11C29/08Functional testing, e.g. testing during refresh, power-on self testing [POST] or distributed testing
    • G11C29/12Built-in arrangements for testing, e.g. built-in self testing [BIST] or interconnection details
    • G11C29/18Address generation devices; Devices for accessing memories, e.g. details of addressing circuits
    • G11C29/24Accessing extra cells, e.g. dummy cells or redundant cells
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C29/00Checking stores for correct operation ; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/04Detection or location of defective memory elements, e.g. cell constructio details, timing of test signals
    • G11C29/08Functional testing, e.g. testing during refresh, power-on self testing [POST] or distributed testing
    • G11C29/12Built-in arrangements for testing, e.g. built-in self testing [BIST] or interconnection details
    • G11C29/38Response verification devices
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C29/00Checking stores for correct operation ; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/70Masking faults in memories by using spares or by reconfiguring
    • G11C29/76Masking faults in memories by using spares or by reconfiguring using address translation or modifications
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C29/00Checking stores for correct operation ; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/70Masking faults in memories by using spares or by reconfiguring
    • G11C29/78Masking faults in memories by using spares or by reconfiguring using programmable devices

Definitions

  • The embodiments of the present application relate to the field of computer technology, and in particular to a method, apparatus, device, and storage medium for processing memory failures.
  • Memory is one of the important components of the device.
  • The memory includes multiple banks (also called storage matrices), and each bank includes multiple memory rows.
  • Repair of a memory row failure can serve as an important repair method for memory failures.
  • After the device performs a cold reset, that is, after the device restarts or the user manually restarts it, the device performs a memory self-test. If a memory failure is detected in a memory row, a memory row failure is considered to have occurred, and the failed memory row is called the faulty row. The device then determines whether the number of correctable error (CE) memory failures recorded in the failure log for the faulty row has reached a threshold. If it has, the hard post package repair (hPPR) condition is met: hPPR is started, and the faulty row is replaced with a redundant row on the bank where the faulty row is located, which repairs the memory row failure.
  • The embodiments of the present application provide a memory failure processing method, apparatus, device, and storage medium, which can repair memory failures in time, prevent system downtime, and reduce the impact on business.
  • The technical solution is as follows:
  • In a first aspect, a method for processing memory failures is provided. The method includes: starting fault analysis of the memory at a first moment, where the fault analysis includes obtaining the current fault analysis result of the memory by analyzing historical fault information, the historical fault information is the fault information accumulated for the memory in a historical time period, and the historical time period is the time period before the first moment, or the time period before and including the first moment; and initiating fault repair of the memory according to the current fault analysis result of the memory.
  • The fault analysis result is obtained by analyzing the historical fault information, and the memory fault repair is then performed according to the fault analysis result.
  • This solution can analyze memory faults more accurately and can start the fault repair of the memory without a cold reset, which prevents system downtime and reduces business impact.
  • The first moment is a moment before a UCE failure occurs in the computer system. That is, fault analysis of the memory is started while the computer system is running, where the running period refers to the period in which the computer system operates normally.
  • The first moment includes: a moment of periodic startup according to a preset condition; and/or a moment, after the computer system is running, at which a memory failure is determined to have occurred in the memory.
  • That is, the computer device may start the analysis in any of the following ways. It may analyze historical failure information when it detects a memory failure. It may analyze historical failure information periodically. It may analyze periodically and, if a memory failure is detected within a period, also analyze at that point and restart the period from the time of that detection. Or it may analyze periodically and, if a memory failure is detected within a period, also analyze at that point without restarting the period, so that the periodic analysis is unaffected.
  • By analyzing historical failure information periodically, the computer device can predict the severity of memory failures and repair them in time.
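
The trigger variants described above (periodic, on detection, and on detection with or without restarting the period) can be sketched as a small scheduler. This is only an illustrative sketch: the class name `AnalysisScheduler` and the parameters `period_s`, `restart_on_fault`, and `next_due` are hypothetical and do not come from the application.

```python
class AnalysisScheduler:
    """Decides whether fault analysis should start now (the "first moment").

    All names here are hypothetical; the source only describes the behavior.
    """

    def __init__(self, period_s: float, restart_on_fault: bool, next_due: float = 0.0):
        self.period_s = period_s                  # interval between periodic analyses
        self.restart_on_fault = restart_on_fault  # True: a detected fault restarts the period
        self.next_due = next_due                  # time of the next periodic analysis

    def should_analyze(self, now: float, fault_detected: bool) -> bool:
        if fault_detected:
            # Fault-triggered analysis; optionally restart the periodic timer
            # from the time the fault was detected.
            if self.restart_on_fault:
                self.next_due = now + self.period_s
            return True
        if now >= self.next_due:
            # Periodic analysis; schedule the next one.
            self.next_due = now + self.period_s
            return True
        return False
```

With `restart_on_fault=False`, a fault-triggered analysis leaves the periodic schedule untouched, which corresponds to the last variant in the text.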
  • The embodiment of the present application analyzes historical fault information through a fault analysis model to determine the fault analysis result. That is, the computer device obtaining the current fault analysis result of the memory by analyzing the historical fault information includes: inputting the historical fault information into the fault analysis model to obtain the current fault analysis result of the memory, where the fault analysis model is an intelligent calculation analysis model.
  • Analyzing historical fault information through a fault analysis model is only one implementation provided by the embodiment of this application. The computer device can also analyze historical fault information in other ways, for example, based on data statistics. This embodiment of the application does not limit this.
  • The following describes how the computer device obtains the failure analysis result, whether through the failure analysis model or by other methods.
  • the computer device starts the failure repair of the memory according to the current failure analysis result of the memory.
  • the repair includes: replacing the faulty row with the redundant row, and repairing the data on the redundant row.
  • The computer device obtaining the current failure analysis result of the memory includes: obtaining a first statistical feature according to the historical failure information, where the first statistical feature indicates the number of failed bits in a first memory row in the historical time period, and the first memory row is any memory row; and, when the first statistical feature is greater than a first threshold, determining that the failure mode is a memory row failure.
  • The first threshold represents the number of failed bits that each memory row can tolerate.
  • The fault analysis model includes the first threshold. The computer device inputs the historical fault information into the fault analysis model, and the fault analysis model obtains the first statistical feature according to the historical fault information.
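
As a rough illustration of the first statistical feature, the sketch below counts failed bits per memory row inside the historical window and flags rows whose count exceeds the first threshold. The record layout (`FaultRecord`) and the choice to count distinct bit positions are assumptions made for the example; the application does not specify either.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple, Set


class FaultRecord(NamedTuple):
    """Hypothetical shape of one entry of historical fault information."""
    timestamp: float  # when the fault was observed
    row: int          # memory row identifier (flattened for simplicity)
    bit: int          # position of the failed bit within the row


def rows_with_row_failure(records: Iterable[FaultRecord], window_start: float,
                          window_end: float, first_threshold: int) -> Set[int]:
    """Return the rows whose failed-bit count in the window exceeds the threshold."""
    bits_per_row = defaultdict(set)
    for rec in records:
        if window_start <= rec.timestamp <= window_end:
            # Assumption: the same bit failing twice counts once.
            bits_per_row[rec.row].add(rec.bit)
    return {row for row, bits in bits_per_row.items() if len(bits) > first_threshold}
```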
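  • The fault analysis result also includes a fault level, and the computer device starting the fault repair of the memory according to the current fault analysis result of the memory includes: when the failure mode is a memory row failure and the fault level is a high risk level, starting the fault repair of the memory.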
  • The computer device obtaining the current failure analysis result of the memory further includes: obtaining a second statistical feature and/or a third statistical feature according to the historical failure information, where the second statistical feature represents the number of failures of each type of failure that occurred in the first memory row in the historical time period, and the third statistical feature represents the number of error corrections that occurred in the first memory row in the historical time period; and, when the second statistical feature is greater than a second threshold, or when the third statistical feature is greater than a third threshold, or when the second statistical feature is greater than the second threshold and the third statistical feature is greater than the third threshold, determining that the fault level is a high risk level.
  • The second threshold represents the number of faults of each type that each memory row can tolerate.
  • The third threshold represents the number of error corrections that each memory row can tolerate.
  • the fault analysis model further includes the second threshold and/or the third threshold.
  • the computer device inputs the historical fault information into the fault analysis model, and the fault analysis model obtains the second statistical feature and/or the third statistical feature according to the historical fault information.
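
The fault-level decision can be sketched in the same spirit. The example below assumes each historical record for a row carries a fault type and a count of error corrections; `RowFault`, the type labels, and the `require_both` switch (which selects between the "or" and "and" variants in the text) are hypothetical.

```python
from collections import Counter
from typing import NamedTuple, Sequence


class RowFault(NamedTuple):
    """Hypothetical record of one fault on the first memory row."""
    fault_type: str   # illustrative labels, e.g. "patrol_ce" or "read_ce"
    corrections: int  # error corrections performed for this fault


def is_high_risk(faults: Sequence[RowFault], second_threshold: int,
                 third_threshold: int, require_both: bool = False) -> bool:
    # Second statistical feature: number of faults of each type on the row.
    per_type = Counter(f.fault_type for f in faults)
    # Third statistical feature: total number of error corrections on the row.
    total_corrections = sum(f.corrections for f in faults)

    # Assumption: the second feature exceeds its threshold if any type does.
    second_hit = any(n > second_threshold for n in per_type.values())
    third_hit = total_corrections > third_threshold
    return (second_hit and third_hit) if require_both else (second_hit or third_hit)
```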
  • the historical fault information also includes the fault type and fault correction information of the memory fault that occurred in the historical time period.
  • the fault types include CE type and UCE type.
  • the CE type includes a patrol CE type, a read CE type, and so on.
  • The error correction information includes information such as the amount of error correction data (in units such as bits), the error correction code, and so on, used to correct each memory failure that occurred.
  • a risk mode option is displayed on the interactive interface, and the risk mode option includes a memory high risk mode option and a memory low risk mode option. That is, the computer equipment provides an interactive interface, and the user can select the risk mode through the interactive interface.
  • the first threshold, the second threshold, and the third threshold are variables set according to the risk mode.
  • The first threshold of the memory high-risk mode is less than the first threshold of the memory low-risk mode; and/or the second threshold of the memory high-risk mode is less than the second threshold of the memory low-risk mode; and/or the third threshold of the memory high-risk mode is less than the third threshold of the memory low-risk mode.
  • the duration of the historical time period is a variable set according to the risk mode, and the duration of the historical time period of the memory high-risk mode is less than the duration of the historical time period of the memory low-risk mode.
  • The user can flexibly select the risk mode as needed. For example, if the user's business risk is high, the high-risk mode can be selected. The first threshold and/or the second threshold and/or the third threshold are then lower, and/or the historical time period is shorter: the computer device obtains the first, second, and/or third statistical features by analyzing the historical fault information from the shorter period and compares them with the smaller thresholds to determine whether there is a memory row failure and a high risk level, so that even less serious memory row failures are recognized in time. If the user's business risk is low, the low-risk mode can be selected, which still ensures that the more serious memory row failures are identified in time.
  • The computer device provides an interactive interface for the user to select the risk mode.
  • The computer device determines the duration of the fault information to be analyzed and/or the threshold values used in the threshold comparison according to the risk mode selected by the user, and performs the calculation over the corresponding period of time.
  • When the failure mode is identified as a memory row failure, the memory failure can be repaired in time. Combining the user-selected risk mode with threshold comparison in this way predicts memory row failures accurately while reducing the computational load on the computer device.
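
One way to picture the risk-mode setting is a lookup from the selected mode to the three thresholds and the length of the historical time period. Every number below is a placeholder chosen only so that the high-risk values are smaller than the low-risk ones, as the text requires; `RiskProfile`, `RISK_PROFILES`, and `profile_for` are hypothetical names.

```python
from typing import NamedTuple


class RiskProfile(NamedTuple):
    first_threshold: int   # tolerated failed bits per row
    second_threshold: int  # tolerated faults of each type per row
    third_threshold: int   # tolerated error corrections per row
    window_s: float        # duration of the historical time period, in seconds


# Placeholder values; the application does not give concrete numbers.
RISK_PROFILES = {
    "high": RiskProfile(first_threshold=2, second_threshold=3,
                        third_threshold=8, window_s=3600.0),
    "low": RiskProfile(first_threshold=8, second_threshold=10,
                       third_threshold=32, window_s=86400.0),
}


def profile_for(mode: str) -> RiskProfile:
    """Return the thresholds and window for the risk mode the user selected."""
    try:
        return RISK_PROFILES[mode]
    except KeyError:
        raise ValueError(f"unknown risk mode: {mode!r}") from None
```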
  • The failure analysis result includes a failure mode, and the computer device starting the memory failure repair according to the current failure analysis result of the memory includes: when the failure mode is a memory row failure, initiating failure repair of the memory, where the failure repair includes replacing the faulty row with a redundant row and repairing the data on the redundant row. That is, when the computer device determines that the failure mode is a memory row failure, it replaces the faulty row with a redundant row in the memory and repairs the faulty data.
  • The failure analysis result also includes a failure level, and the computer device starting the memory failure repair according to the current failure analysis result of the memory includes: when the failure mode is a memory row failure and the failure level is a high risk level, initiating failure repair of the memory. That is, when the computer device determines that the failure mode is a memory row failure and the failure level is a high risk level, it replaces the faulty row with a redundant row in the memory and repairs the faulty data.
  • the redundant row and the failed row are located on the same bank in the memory. That is, the computer equipment replaces the failed row with the redundant row on the bank where the failed row is located.
  • The computer device repairing the data on the redundant row includes: performing a read operation on the redundant row; and, if the data read from the redundant row is erroneous, correcting the erroneous data and writing the corrected data back to the redundant row, thereby restoring the data on the redundant row. That is, in the embodiment of the present application, the faulty data is repaired by reading the redundant row and writing the data back.
  • The read operation is performed on the i-th segment of the row, and so on until i equals M. That is, the computer device repairs the data on the redundant row by reading, correcting, and writing back segment by segment.
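
The segment-by-segment repair can be sketched as a loop over segments 1 to M, assuming M is the number of segments in the redundant row. The three callables stand in for hardware operations (read with error detection, correction, and write-back); their names and signatures are hypothetical.

```python
from typing import Callable, List, Tuple


def repair_redundant_row(
    num_segments: int,
    read: Callable[[int], Tuple[bytes, bool]],
    correct: Callable[[int, bytes], bytes],
    write_back: Callable[[int, bytes], None],
) -> List[int]:
    """Read segments 1..M in turn; correct and write back those read as erroneous.

    Returns the indices of the segments that were repaired.
    """
    repaired = []
    for i in range(1, num_segments + 1):
        data, has_error = read(i)
        if has_error:
            # Correct the erroneous data, then write the corrected data back.
            write_back(i, correct(i, data))
            repaired.append(i)
    return repaired
```

In a real system the correction step would be performed by the ECC logic; the callables only make the control flow testable in isolation.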
  • The method further includes: generating a correctable error (CE); and suppressing the CE.
  • A CE is generated in the computer device, and the computer device suppresses it. That is, because the computer device detects erroneous data when reading the redundant row, it considers that a CE has been detected. Since this CE is not caused by a memory failure of the computer device, it needs to be suppressed: the computer device either does not process the CE or does not record it.
  • The method further includes: releasing the suppression of the CE.
  • A CE generated after the computer device has repaired the redundant row is caused by a real memory failure. Such a CE therefore needs to be processed, that is, the suppression of the CE is released and the CE is recorded.
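
Suppressing CEs during the repair and releasing the suppression afterwards maps naturally onto a scoped guard. The sketch below uses a context manager so that the suppression is released even if the repair raises; `CeLog` and its methods are hypothetical and only model the "do not record" variant.

```python
from contextlib import contextmanager


class CeLog:
    """Hypothetical CE recorder with a suppression switch."""

    def __init__(self):
        self.suppressed = False
        self.records = []

    def report(self, ce: str) -> None:
        # While suppressed, CEs raised by reading the redundant row are dropped.
        if not self.suppressed:
            self.records.append(ce)

    @contextmanager
    def suppress(self):
        self.suppressed = True
        try:
            yield
        finally:
            # Release the suppression once the redundant row has been repaired.
            self.suppressed = False
```

A repair routine would run inside `with log.suppress():`, and any CE reported after the block exits is recorded as usual.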
  • The computer device initiates the failure repair of the memory as follows: when the failure mode is a memory row failure, or when the failure mode is a memory row failure and the failure level is a high risk level, the memory failure repair is initiated.
  • The failure repair is to replace the faulty row with the redundant row and repair the data on the redundant row.
  • The computer device obtains the failure analysis result by analyzing the failure information of a second bank in the historical time period. Accordingly, the computer device initiates the failure repair of the memory as follows: when the failure mode is a memory bank failure, or when the failure mode is a memory bank failure and the failure level is a high risk level, the memory failure repair is initiated.
  • The failure repair is to replace the failed bank with a redundant bank and repair the data on the redundant bank.
  • The computer device starting the memory failure repair according to the current failure analysis result of the memory includes: when the failure mode is a memory bank failure, initiating the memory failure repair, where the failure repair includes replacing the failed bank with a redundant bank and repairing the data on the redundant bank.
  • The failure analysis result includes the failure mode and the failure level, and the computer device starting the memory failure repair according to the current failure analysis result of the memory includes: when the failure mode is a memory bank failure and the failure level is a high risk level, initiating the memory failure repair, where the failure repair includes replacing the failed bank with a redundant bank and repairing the data on the redundant bank.
  • the redundant bank and the failed bank are located on the same channel in the memory.
  • The second bank in this embodiment is a concept one level above the first memory row in the previous embodiment. The previous embodiment analyzes the historical fault information at memory row granularity to obtain the fault analysis result, whereas this embodiment analyzes it at bank granularity.
  • In the previous embodiment, the redundant row is used to replace the failed row, and the redundant row and the failed row are on the same bank.
  • In this embodiment, the redundant bank is used to replace the failed bank, and the redundant bank and the failed bank are located on the same channel in the memory.
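
The row-level and bank-level cases differ only in the unit that is replaced and in where the spare must be located. A minimal sketch of that decision, with hypothetical names (`FailureMode`, `RepairPlan`, `plan_repair`) and a `require_high_risk` flag to cover the variant that also checks the failure level:

```python
from enum import Enum
from typing import NamedTuple, Optional


class FailureMode(Enum):
    ROW = "row"
    BANK = "bank"


class RepairPlan(NamedTuple):
    unit: str         # what gets replaced: "row" or "bank"
    spare_scope: str  # where the redundant unit is located


def plan_repair(mode: FailureMode, high_risk: bool,
                require_high_risk: bool = False) -> Optional[RepairPlan]:
    """Return the repair to start, or None if no repair should be started."""
    if require_high_risk and not high_risk:
        return None
    if mode is FailureMode.ROW:
        # The redundant row is on the same bank as the failed row.
        return RepairPlan(unit="row", spare_scope="same bank")
    if mode is FailureMode.BANK:
        # The redundant bank is on the same channel as the failed bank.
        return RepairPlan(unit="bank", spare_scope="same channel")
    return None
```

In both cases the replacement is followed by repairing the data on the redundant unit, which this sketch leaves out.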
  • In a second aspect, a memory failure processing device is provided, which has the function of implementing the behavior of the memory failure processing method in the first aspect.
  • the memory failure processing device includes one or more modules, and the one or more modules are used to implement the memory failure processing method provided in the above-mentioned first aspect.
  • a memory failure processing device includes:
  • the analysis module is used to start the fault analysis of the memory at the first moment;
  • the fault analysis includes: obtaining the current fault analysis result of the memory by analyzing the historical fault information, where the historical fault information is the fault information accumulated in the memory in the historical time period, and the historical time period is the time period before the first moment or the time period before and including the first moment;
  • the processing module is used to initiate the fault repair of the memory according to the current fault analysis result of the memory.
  • the first moment is a moment before an uncorrectable error UCE failure occurs in the computer system.
  • the first moment includes:
  • the time of periodic startup according to preset conditions; and/or, after the computer system is running, the time when a memory failure occurs in the memory is determined.
  • the analysis module includes:
  • the analysis sub-module is used to input historical fault information into the fault analysis model to obtain the current fault analysis result of the memory.
  • the fault analysis model is an intelligent calculation analysis model.
  • the processing module includes:
  • the first repair sub-module is used to initiate the repair of the memory failure when the failure mode is the memory line failure, where the failure repair includes: replacing the failed line with the redundant line and repairing the data on the redundant line.
  • the analysis module is specifically used for:
  • the first statistical feature represents the number of fault bits in the first memory line in the historical time period, and the first memory line is any memory line;
  • the first threshold indicates the number of fault bits that each memory row can tolerate.
  • the fault analysis result also includes the fault level
  • the processing module includes:
  • the second repair sub-module is used to initiate the repair of the memory failure when the failure mode is a memory line failure and the failure level is a high risk level.
  • the analysis module is also specifically used for:
  • the second statistical feature represents the number of failures of each type of failure in the first memory row in the historical time period;
  • the third statistical feature represents the number of error corrections in the first memory row in the historical time period;
  • the fault level is determined to be a high risk level;
  • the second threshold indicates the number of failures of each type of failure that each memory line can tolerate
  • the third threshold indicates the number of error corrections that each memory line can tolerate.
  • the device further includes:
  • the interactive module is used to display risk mode options on the interactive interface.
  • the risk mode options include memory high-risk mode options and memory low-risk mode options.
  • the first threshold, the second threshold, and the third threshold are variables set according to the risk mode.
  • the first repair submodule is specifically used for:
  • the erroneous data is corrected, and the corrected data is written back to the redundant row to realize the restoration of the data on the redundant row.
  • the device further includes:
  • Suppression module used to suppress CE.
  • the device further includes:
  • the release module is used to release the suppression operation of the CE after the data on the redundant row is repaired.
  • the processing module includes:
  • the third repair sub-module is used to initiate the repair of the memory failure when the failure mode is a memory bank failure.
  • the failure repair includes: replacing the failed bank with a redundant bank and repairing the data on the redundant bank.
  • the redundant bank and the failed bank are located on the same channel in the memory.
  • a computer device is provided, and a computer program is stored in the computer device, and the computer program implements the memory failure processing method provided in the first aspect when the computer program is run by the computer device.
  • the computer device includes a processor and a memory
  • the memory is used to store a program for executing the memory failure processing method provided in the above first aspect, and to store a program used to implement the memory failure provided in the above first aspect.
  • the processor is configured to execute a program stored in the memory to implement the memory failure processing method provided in the above-mentioned first aspect.
  • the operating device of the storage device may further include a communication bus, and the communication bus is used to establish a connection between the processor and the memory.
  • a computer-readable storage medium stores instructions that, when run on a computer, cause the computer to execute the memory failure processing method provided in the first aspect.
  • a computer program product containing instructions which when running on a computer, causes the computer to execute the memory failure processing method described in the first aspect.
  • the failure analysis result is obtained by analyzing the historical failure information, and then the memory failure is repaired according to the failure analysis result.
  • This solution can analyze the memory failure more accurately.
  • this solution can start the memory fault repair without a cold reset, that is, it can repair the memory fault in time to prevent system downtime and reduce business impact.
  • FIG. 1 is a flowchart of a method for processing a memory failure provided by an embodiment of the present application
  • FIG. 2 is a schematic diagram of performing data repair on redundant rows according to an embodiment of the present application
  • FIG. 3 is a flowchart of another method for processing a memory failure provided by an embodiment of the present application.
  • FIG. 4 is a flowchart of another method for processing a memory failure provided by an embodiment of the present application.
  • FIG. 5 is a flowchart of another method for processing a memory failure provided by an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a memory failure processing device provided by an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of another memory fault processing device provided by an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of another memory failure processing device provided by an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • FIG. 1 is a flowchart of a method for processing a memory failure provided by an embodiment of the present application, and the method is applied to a computer device. Referring to FIG. 1, the method includes the following steps.
  • Step 101 Start the fault analysis of the memory at the first moment, and the fault analysis includes: obtaining the current fault analysis result of the memory by analyzing historical fault information.
  • the basic storage unit of a memory (such as dynamic random access memory (DRAM)) usually consists of a transistor and a capacitor, and the amount of charge held on the capacitor determines whether the basic storage unit stores "0" or "1". Ionized particles in the external environment, or semiconductor hardware defects in the internal transistors, can cause memory errors, that is, memory failures.
  • after a memory failure occurs, the memory's own error correction algorithm (such as error checking and correcting (ECC)) corrects the error, and an error corrected in this way is a correctable error (CE).
  • the error correction algorithm has a certain error correction capability, but the capability is limited. If an error exceeds the error correction capability of the algorithm, an uncorrected error (UCE) is generated, resulting in equipment downtime.
  • in order to repair memory failures in a timely manner, reduce UCEs, and reduce equipment downtime and restarts, and thereby reduce the impact on the business, the computer device analyzes the failure information of the memory failures that occurred in a historical time period to obtain a failure analysis result, and then determines, according to the failure analysis result, whether to handle the memory failure and how to handle it.
  • the computer device starts the fault analysis of the memory at the first moment. The fault analysis includes obtaining the current fault analysis result of the memory by analyzing historical fault information.
  • the historical fault information is the fault information of the memory accumulated in the historical time period, and the historical time period is the time period before the first moment, or the time period up to and including the first moment.
  • the first moment is a moment before a UCE occurs in the computer system. That is, the failure analysis of the memory is started while the computer system is running, where the running period of the computer system refers to the period in which the computer system operates normally.
  • the first moment includes: a moment of periodic startup according to a preset condition; and/or a moment at which, while the computer system is running, a memory failure is determined to have occurred in the memory.
  • for example, when the computer device detects that a memory failure has occurred, it starts analyzing the historical failure information to obtain the failure analysis result. Or, the computer device periodically analyzes the historical failure information to obtain the failure analysis result. Or, the computer device periodically analyzes the historical failure information, and if a memory failure is detected within a cycle interval, it also analyzes the historical failure information at that point and restarts the periodic analysis from the time at which this memory failure was detected. Or, the computer device periodically analyzes the historical failure information, and if a memory failure is detected within a cycle interval, it also analyzes the historical failure information at that point but does not restart the periodic analysis from that time, that is, the detection does not affect the periodic schedule.
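  • as a minimal sketch of the trigger policies described above, the following Python function lists the moments at which analysis would start when periodic analysis is combined with fault-triggered analysis. The function name `analysis_times`, the `restart_on_fault` flag, and the use of plain numbers as timestamps are assumptions made for illustration and do not come from the application.

```python
def analysis_times(period, fault_times, horizon, restart_on_fault):
    """Return the moments in (0, horizon] at which fault analysis starts.

    period           -- interval between periodic analyses (hypothetical unit)
    fault_times      -- moments at which a memory failure is detected
    horizon          -- last moment to consider
    restart_on_fault -- True: a detected failure restarts the periodic cycle;
                        False: the periodic cycle is left unchanged
    """
    starts = []
    next_tick = period
    for fault in sorted(t for t in fault_times if t <= horizon):
        # Emit the periodic analyses that fall before this failure.
        while next_tick < fault:
            starts.append(next_tick)
            next_tick += period
        # A detected failure always triggers an analysis.
        starts.append(fault)
        if restart_on_fault:
            next_tick = fault + period
        elif next_tick == fault:
            # The periodic tick coincides with the fault-triggered analysis.
            next_tick += period
    # Emit the remaining periodic analyses up to the horizon.
    while next_tick <= horizon:
        starts.append(next_tick)
        next_tick += period
    return starts


unchanged_cycle = analysis_times(10, [14], 40, restart_on_fault=False)
restarted_cycle = analysis_times(10, [14], 40, restart_on_fault=True)
```

  • with a period of 10 and one failure detected at moment 14, the unchanged cycle keeps its ticks at 20, 30, and 40, while the restarted cycle moves them to 24 and 34.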
  • the computer equipment periodically analyzes historical failure information, predicts the severity of the memory failure in time, and repairs the memory failure in time.
  • the embodiment of the present application obtains the fault analysis result by intelligently analyzing the historical fault information through a fault analysis model. That is, the computer device inputs the historical fault information into the fault analysis model to obtain the current fault analysis result of the memory, and the fault analysis model is an intelligent computational analysis model.
  • analyzing historical fault information through a fault analysis model is only an implementation method for analyzing historical fault information provided by the embodiment of this application, and computer equipment can also analyze historical fault information through other implementation methods, for example, based on data statistics.
  • the embodiment of the application does not limit the analysis method used.
  • the following describes how the computer device obtains the failure analysis result through the failure analysis model or by other methods.
  • when the failure analysis result includes a failure mode and the failure mode is a memory row failure, the computer device initiates memory failure repair, where the failure repair includes: replacing the failed row with a redundant row, and repairing the data on the redundant row. That is, when the computer device determines, by analyzing the historical failure information, that the current failure mode of the memory is a memory row failure, it performs memory row replacement and data repair.
  • the historical fault information includes the fault location and the fault time of each memory fault that occurred in the historical time period, and the computer device collects statistics on the fault locations and fault times included in the historical fault information to analyze the memory fault information and determine the failure mode.
  • the fault location refers to the physical address where the memory fault occurs. It should be noted that each memory failure occurs on a cell. When a memory failure is detected, the fault location identifies which memory row of which bank, or which row and column of which bank, contains the cell where the failure occurred. The failure time refers to the time when the memory failure occurs.
  • a memory failure log is stored in the computer device, and the memory failure log records failure information of memory failures that occurred within a historical period of time, that is, stores historical failure information.
  • obtaining the current failure analysis result of the memory by the computer device includes: obtaining a first statistical feature according to the historical failure information, where the first statistical feature indicates the number of fault bits in a first memory row in the historical time period, and the first memory row is any memory row. When the first statistical feature is greater than a first threshold, it is determined that the failure mode is a memory row failure. The first threshold represents the number of fault bits that each memory row can tolerate.
  • the fault analysis model includes the first threshold. The computer device inputs the historical fault information into the fault analysis model, and the fault analysis model obtains the first statistical feature according to the historical fault information. That is, the computer device counts, through the fault analysis model, the number of fault bits that occurred in the first memory row in the historical time period to obtain the first statistical feature, and determines the failure mode through threshold judgment.
  • the memory includes multiple banks, each bank includes multiple memory rows, and each memory row includes multiple cells.
  • a cell in the memory that has experienced a memory failure is a fault bit. There may be no memory failure, one memory failure, or more than one memory failure occurred on a cell in the historical time period.
  • the historical failure information includes the failure time and failure location of each memory failure in the historical time period, and the computer device counts, among the memory failures in the first memory row in the historical time period, the number of memory failures with different failure locations to obtain the first statistical feature. If the first statistical feature is greater than the first threshold, multiple cells on the first memory row have experienced memory failures, and the computer device determines that the current failure mode of the memory is a memory row failure.
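  • the counting described above can be sketched in Python as follows. The record layout (`FaultRecord` with `bank`, `row`, `column`, and `time` fields), the function names, and the sample values are assumptions made for illustration; the application only states that each record carries a fault location and a fault time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultRecord:
    """One entry of the memory failure log (hypothetical layout)."""
    bank: int
    row: int
    column: int
    time: float


def first_statistical_feature(history, bank, row, window_start, window_end):
    """Count the distinct faulty cells (fault bits) on one memory row.

    Repeated failures on the same cell count once, because the feature is
    the number of different fault locations, not the number of failures.
    """
    cells = {
        rec.column
        for rec in history
        if rec.bank == bank
        and rec.row == row
        and window_start <= rec.time <= window_end
    }
    return len(cells)


def is_row_failure(feature, first_threshold):
    """Threshold judgment: the row fails when the feature exceeds the threshold."""
    return feature > first_threshold


history = [
    FaultRecord(bank=0, row=7, column=3, time=1.0),
    FaultRecord(bank=0, row=7, column=3, time=2.0),   # same cell again
    FaultRecord(bank=0, row=7, column=9, time=3.0),
    FaultRecord(bank=0, row=7, column=12, time=4.0),
    FaultRecord(bank=0, row=8, column=3, time=5.0),   # different row
]
feature = first_statistical_feature(history, bank=0, row=7,
                                    window_start=0.0, window_end=10.0)
```

  • here row 7 of bank 0 has four logged failures but only three distinct fault bits, so it is reported as a memory row failure for a first threshold of 2 and not for a first threshold of 3.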
  • the computer device either starts memory failure analysis periodically or starts it when a memory failure occurs. Accordingly, there are several ways in which the computer device determines the first memory row to be counted, as described next.
  • when the analysis is started because a memory failure occurred, the computer device determines the first memory row according to the fault location of the memory failure that occurred this time, and the first memory row is the memory row where this memory failure occurred. Or, the computer device determines the first bank according to the fault location of the memory failure that occurred this time, and determines a memory row included in the first bank as the first memory row; the first bank is the bank where this memory failure occurred, and the first memory row is one of the memory rows included in the first bank. Or, the computer device determines a memory row included in the memory as the first memory row, that is, the first memory row is one of the memory rows included in the memory.
  • when the analysis is started periodically, the computer device determines the first memory row according to the fault location of the most recent memory failure, and the first memory row is the memory row where the most recent memory failure occurred. Or, the computer device determines the first bank according to the fault location of the most recent memory failure, and determines a memory row included in the first bank as the first memory row; the first bank is the bank where the most recent memory failure occurred, and the first memory row is one of the memory rows included in the first bank. Or, the computer device determines a memory row included in the memory as the first memory row, that is, the first memory row is one of the memory rows included in the memory.
  • where the first memory row is one of the memory rows included in the first bank or in the memory, the computer device also collects statistics for each of the other memory rows in the same way as for the first memory row, and determines the first statistical feature according to the data obtained.
  • where statistics are collected for one memory row, the computer device counts the failure information about the first memory row in the historical failure information to obtain a number, and the number is used directly as the first statistical feature, that is, one first statistical feature is obtained.
  • where statistics are collected for multiple memory rows, the computer device uses the maximum of the numbers obtained as the first statistical feature, or uses each of the numbers as a first statistical feature to obtain multiple first statistical features, each corresponding to one memory row.
  • the computer device compares the first statistical feature with the first threshold to determine the current failure mode of the memory. For example, where one first statistical feature is obtained, when the first statistical feature is greater than the first threshold, it is determined that the failure mode is a memory row failure. Where multiple first statistical features are obtained, when at least one of the multiple first statistical features is greater than the first threshold, it is determined that the failure mode is a memory row failure.
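  • where statistics are collected for several memory rows at once, the comparison can be sketched in Python as follows. The tuple layout `(bank, row, column)` for a fault location and the function names are assumptions for illustration only.

```python
from collections import defaultdict


def fault_bits_per_row(fault_locations):
    """Map each (bank, row) to its number of distinct faulty cells.

    fault_locations is an iterable of (bank, row, column) tuples, one per
    logged memory failure (hypothetical layout).
    """
    cells = defaultdict(set)
    for bank, row, column in fault_locations:
        cells[(bank, row)].add(column)
    return {key: len(columns) for key, columns in cells.items()}


def rows_over_threshold(features, first_threshold):
    """Return the rows whose first statistical feature exceeds the threshold."""
    return sorted(key for key, count in features.items()
                  if count > first_threshold)


locations = [(0, 7, 3), (0, 7, 3), (0, 7, 9), (0, 7, 12), (0, 8, 3), (1, 2, 5)]
features = fault_bits_per_row(locations)
faulty_rows = rows_over_threshold(features, first_threshold=2)
row_failure = bool(faulty_rows)
```

  • taking the maximum of the per-row numbers and comparing it with the first threshold gives the same yes/no answer as checking whether at least one per-row feature exceeds the threshold; keeping the per-row features additionally identifies which row is affected.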
  • the failure analysis result also includes a failure level, and when the failure mode is a memory row failure and the failure level is a high-risk level, the computer device initiates the repair of the memory failure.
  • obtaining the current failure analysis result of the memory by the computer device further includes: obtaining a second statistical feature and/or a third statistical feature according to the historical failure information, where the second statistical feature represents the number of failures of each failure type that occurred in the first memory row in the historical time period, and the third statistical feature represents the number of error corrections that occurred in the first memory row in the historical time period. When the second statistical feature is greater than a second threshold, or when the third statistical feature is greater than a third threshold, or when the second statistical feature is greater than the second threshold and the third statistical feature is greater than the third threshold, it is determined that the failure level is a high-risk level.
  • the second threshold represents the number of failures of each failure type that each memory row can tolerate, and the third threshold represents the number of error corrections that each memory row can tolerate.
  • the fault analysis model further includes the second threshold and/or the third threshold.
  • the computer device inputs the historical fault information into the fault analysis model, and the fault analysis model obtains the second statistical feature and/or the third statistical feature according to the historical fault information. That is, through the fault analysis model, the computer device counts the number of failures of each failure type that occurred in the first memory row in the historical time period to obtain the second statistical feature, and/or counts the number of error corrections that occurred in the first memory row in the historical time period to obtain the third statistical feature.
  • the computer device compares the second statistical feature with the second threshold value through the fault analysis model, and/or compares the third statistical feature with the third threshold value to determine the fault level.
  • the historical fault information also includes the fault type and fault correction information of the memory fault that occurred in the historical time period.
  • the fault types include CE type and UCE type.
  • the CE type includes a patrol CE type, a read CE type, and so on.
  • the error correction information includes information such as the error correction data amount (in units such as bits) and the error correction code used when correcting (for example, by ECC) each memory failure that occurred.
  • the computer device either starts memory failure analysis periodically or starts it when a memory failure occurs. Accordingly, when the computer device counts the failure information of the first memory row in the historical failure information to obtain the second statistical feature and/or the third statistical feature, there are several ways in which it determines the first memory row to be counted. These are the same as the ways of determining the first memory row described above for obtaining the first statistical feature, and are not repeated here.
  • where the computer device collects statistics for one memory row, the data obtained is used directly as the second statistical feature and/or the third statistical feature. Where the computer device collects statistics for multiple memory rows, the data obtained for each memory row is used as the second statistical feature and/or the third statistical feature corresponding to that memory row.
  • the computer device compares the second statistical feature with the second threshold, and/or compares the third statistical feature with the third threshold, to determine the current fault level of the memory.
  • each memory row corresponds to one or more second statistical features, and each second statistical feature corresponds to one fault type. One second threshold or multiple second thresholds are stored in the computer device, or the fault analysis model includes one second threshold or multiple second thresholds.
  • where there is one second threshold, the computer device compares each of the one or more second statistical features corresponding to each memory row with the second threshold. When all or some of the one or more second statistical features are greater than the second threshold, it is determined that the fault level is a high-risk level.
  • where there are multiple second thresholds, each second threshold corresponds to one fault type. For the one or more second statistical features obtained for each memory row, the computer device compares each second statistical feature with the second threshold corresponding to the same fault type, and when all or some of the one or more second statistical features are greater than the corresponding second threshold, it is determined that the fault level is a high-risk level.
  • for example, the failure types include the patrol CE type, the read CE type, and the UCE type. Suppose the memory failures on the first memory row in the historical time period include 3 patrol CE failures and 1 read CE failure. The computer device then obtains two second statistical features for the first memory row, 3 and 1, where 3 corresponds to the patrol CE type and 1 corresponds to the read CE type. Assuming that the computer device stores one second threshold, and the second threshold is 5, the computer device compares both 3 and 1 with 5, and determines that the fault level is a low-risk level.
  • assuming instead that the computer device stores three second thresholds, namely 8, 5, and 2, where 8 corresponds to the patrol CE type, 5 corresponds to the read CE type, and 2 corresponds to the UCE type, the computer device compares 3 with 8 and 1 with 5, and determines that the fault level is a low-risk level.
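  • the worked example above can be reproduced with a short Python sketch. The function name `fault_level`, the string keys for the fault types, and the choice to report a high-risk level as soon as any one feature exceeds its threshold (the text allows "all or some") are assumptions made for illustration.

```python
def fault_level(counts_by_type, second_threshold):
    """Classify the fault level of one memory row from per-type fault counts.

    counts_by_type   -- {fault_type: number of failures}, the second
                        statistical features of the row
    second_threshold -- either one number shared by all fault types, or a
                        {fault_type: threshold} mapping
    """
    for fault_type, count in counts_by_type.items():
        if isinstance(second_threshold, dict):
            limit = second_threshold[fault_type]
        else:
            limit = second_threshold
        if count > limit:
            return "high"
    return "low"


# Second statistical features from the example: 3 patrol CEs and 1 read CE.
counts = {"patrol_ce": 3, "read_ce": 1}

single = fault_level(counts, 5)
per_type = fault_level(counts, {"patrol_ce": 8, "read_ce": 5, "uce": 2})
```

  • both calls return the low-risk level, matching the example; raising the patrol CE count above 8 would make the per-type variant return the high-risk level.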
  • for any memory row, when the second statistical feature corresponding to the memory row is greater than the second threshold and/or the third statistical feature is greater than the third threshold, the fault level is determined to be a high-risk level. If the fault information of the memory row is also analyzed according to the foregoing method and it is determined that the current failure mode of the memory is a memory row failure, the memory row is determined to be a faulty row, and fault repair of the memory needs to be started.
  • when the first memory row is one of the memory rows included in the first bank or in the memory, if the first statistical feature corresponding to a memory row is greater than the first threshold, and the second statistical feature corresponding to the same memory row is greater than the second threshold and/or the third statistical feature is greater than the third threshold, the memory row is determined to be a faulty row, and fault repair of the memory needs to be started.
  • a risk mode option is displayed on the interactive interface, and the risk mode option includes a memory high risk mode option and a memory low risk mode option. That is, the computer equipment provides an interactive interface, and the user can select the risk mode through the interactive interface.
  • the first threshold, the second threshold, and the third threshold are variables set according to the risk mode.
  • the first threshold of the memory high-risk mode is less than the first threshold of the memory low-risk mode; and/or the second threshold of the memory high-risk mode is less than the second threshold of the memory low-risk mode; and/or the third threshold of the memory high-risk mode is less than the third threshold of the memory low-risk mode.
  • the duration of the historical time period is a fixed preset parameter.
  • for example, the historical time period is the time period from when the computer device was installed until the fault information is analyzed. Or, the user configures the historical time period through the computer device; for example, if the historical time period is configured to be one month long, the historical time period is the month preceding the analysis of the fault information.
  • the duration of the historical time period is a variable set according to the risk mode, and the duration of the historical time period of the memory high-risk mode is less than the duration of the historical time period of the memory low-risk mode.
  • the computer device when the computer device analyzes that the failure mode is a memory line failure, or when it analyzes that the failure mode is a memory line failure and the failure level is a high risk level, the computer device prompts through an interactive interface that there is a memory failure risk.
  • the user can also modify one or more of the first threshold, the second threshold, the third threshold, and the duration of the historical time period through the interactive interface.
  • the user can flexibly select the risk mode as needed. For example, if the user's business risk is high, the high-risk mode can be selected. In this mode, the first threshold and/or the second threshold and/or the third threshold are lower, and/or the historical time period is shorter. The computer device obtains the first statistical feature, the second statistical feature, and/or the third statistical feature by analyzing the historical fault information from the shorter time period, and compares them with the smaller thresholds to determine whether there is a memory row failure at a high-risk level, so that even less serious memory row failures are recognized in time. If the user's business risk is low, the low-risk mode can be selected, in which the more serious memory row failures are identified in time.
  • the computer device provides an interactive interface for the user to select the risk mode. According to the risk mode selected by the user, the computer device determines the duration of the fault information to be analyzed and/or the thresholds used in the threshold judgment, and analyzes the fault information within the corresponding duration. When the failure mode is identified as a memory row failure, or when the failure mode is identified as a memory row failure and the failure level is a high-risk level, the memory failure is repaired in time. In this way, combining the user-selected risk mode with threshold comparison predicts memory row failures accurately while reducing the computational load on the computer device.
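  • the relationship between the risk mode and the analysis parameters can be sketched in Python as a small configuration table. All names and numeric values below (`RiskModeConfig`, `RISK_MODES`, the thresholds, and the history durations in days) are hypothetical; the application only requires that the high-risk mode use smaller thresholds and/or a shorter historical time period than the low-risk mode, and that the user can modify individual values.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RiskModeConfig:
    """Analysis parameters selected by a risk mode (hypothetical values)."""
    first_threshold: int    # tolerated fault bits per memory row
    second_threshold: int   # tolerated failures per fault type per row
    third_threshold: int    # tolerated error corrections per row
    history_days: int       # duration of the historical time period


# Illustrative defaults: the high-risk mode is stricter on every parameter.
RISK_MODES = {
    "high": RiskModeConfig(first_threshold=2, second_threshold=3,
                           third_threshold=16, history_days=7),
    "low": RiskModeConfig(first_threshold=4, second_threshold=8,
                          third_threshold=64, history_days=30),
}


def config_for(mode, **overrides):
    """Return the parameters for a risk mode, with optional user overrides."""
    return replace(RISK_MODES[mode], **overrides)


default_high = config_for("high")
tuned_low = config_for("low", history_days=14)
```

  • a frozen dataclass is used here so that a user override produces a new configuration instead of silently changing the shared defaults.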
  • the computer device can also collect statistics in a more fine-grained manner. The historical time period includes multiple time intervals, and the first time interval is one of the multiple time intervals. For example, the computer device counts at least one of the maximum number and the average number of memory failures of each failure type on the first memory row in the first time interval to obtain the second statistical feature, and counts at least one of the maximum error correction data amount and the average error correction data amount of memory failures of each failure type on the first memory row in the first time interval to obtain the third statistical feature.
  • the computer device determines the fault level (or risk level) according to the maximum number and/or the average number, and the maximum error correction data amount and/or the average error correction data amount. For example, where the computer device has determined the maximum number and the maximum error correction data amount, when the maximum number is greater than or equal to the second threshold, and/or when the maximum error correction data amount is greater than or equal to the third threshold, the fault level is determined to be a high-risk level, where the fault level is divided into a low-risk level and a high-risk level. Or, the computer device determines the fault level according to thresholds where the fault level is divided into multiple levels, such as level 1, level 2, and level 3, which indicate memory risks of differing severity.
  • the average number includes one or more of the arithmetic mean, the geometric mean, and the harmonic mean.
  • in addition to the maximum number and/or the average number, and the maximum error correction data amount and/or the average error correction data amount, other data can also be counted, such as the median and the standard deviation of the various data; that is, many statistical methods are available. The embodiment of this application uses the maximum number, the average number, the maximum error correction data amount, and the average error correction data amount only as an example for description.
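  • one possible reading of the fine-grained statistics is sketched below in Python: the historical time period is split into equal time intervals, failures of one type on one memory row are counted per interval, and the maximum and arithmetic mean of those counts are compared with the second threshold. The equal-length intervals, the function names, and the use of "greater than or equal to" for the maximum are assumptions for illustration.

```python
from statistics import mean


def interval_counts(fault_times, start, interval, n_intervals):
    """Count failures falling into each of n_intervals equal time intervals."""
    counts = [0] * n_intervals
    for t in fault_times:
        index = int((t - start) // interval)
        if 0 <= index < n_intervals:
            counts[index] += 1
    return counts


def fine_grained_level(fault_times, start, interval, n_intervals,
                       second_threshold):
    """Return (maximum count, average count, fault level) for one fault type."""
    counts = interval_counts(fault_times, start, interval, n_intervals)
    peak = max(counts)
    average = mean(counts)  # arithmetic mean; other means are possible
    level = "high" if peak >= second_threshold else "low"
    return peak, average, level


# Six failures of one type on one row, spread over three intervals of 10.
times = [1, 2, 3, 12, 25, 26]
peak, average, level = fine_grained_level(times, start=0, interval=10,
                                          n_intervals=3, second_threshold=3)
```

  • the same structure applies to the third statistical feature if the per-failure error correction data amounts are summed per interval in place of the failure counts.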
  • after the computer device determines the fault level, and where a first fault level is stored in the computer device, when a memory row failure is identified and the identified fault level is the same as or exceeds the first fault level, the computer device automatically repairs the memory row failure.
  • or, the computer device first indicates through the interactive interface that a serious memory failure currently exists, to prompt the user to choose whether to repair the memory failure, and the computer device determines whether to repair the memory failure according to the user's selection.
  • the first failure level stored in the computer device is the default configuration.
  • the first fault level is the fault level selected by the user, that is, the user selects the fault level according to business risk requirements through the interactive interface provided by the computer device in advance.
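  • the comparison against the stored first fault level can be sketched in Python as follows. The three-level enumeration, the assumption that a larger number means a more severe level, and the names `FaultLevel` and `repair_action` are hypothetical.

```python
from enum import IntEnum


class FaultLevel(IntEnum):
    """Hypothetical fault levels; a larger value is assumed to be more severe."""
    LEVEL_1 = 1
    LEVEL_2 = 2
    LEVEL_3 = 3


def repair_action(row_failure, identified, first_fault_level,
                  auto_repair=True):
    """Decide what to do once the failure mode and fault level are known.

    Returns "repair" (repair automatically), "prompt_user" (ask through the
    interactive interface), or "none" (conditions for repair are not met).
    """
    if not row_failure or identified < first_fault_level:
        return "none"
    return "repair" if auto_repair else "prompt_user"


action = repair_action(row_failure=True,
                       identified=FaultLevel.LEVEL_3,
                       first_fault_level=FaultLevel.LEVEL_2)
```

  • the stored first fault level may be the default configuration or a level chosen by the user in advance; either way it is passed in as `first_fault_level`.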
  • the computer equipment obtains fine-grained statistical characteristics every time to identify failure modes and failure levels, and more accurately predict memory line failures and risk levels.
  • the computer device can also determine the failure mode and the failure level from the historical fault information as follows: the computer device determines the failure mode through statistics and threshold judgment, and determines the failure level through intelligent analysis by the fault analysis model. That is, the computer device collects statistics on the fault times and fault locations in the historical fault information and identifies the row failure mode through threshold comparison, and the fault analysis model intelligently analyzes the fault locations, fault times, fault types, and error correction information in the historical fault information to identify the fault level.
  • the computer device provides an interactive interface for the user to select and configure the length of the historical time period, the first threshold, the first failure level, and so on, and the computer device accurately predicts the memory row failure and the failure level according to the configuration selected by the user.
  • Step 102 Start the fault repair of the memory according to the current fault analysis result of the memory.
  • when the failure analysis result includes a failure mode and the current failure mode of the memory is a memory row failure, the computer device initiates fault repair of the memory.
  • when the failure analysis result further includes a failure level, and the failure mode is a memory row failure and the failure level is a high-risk level, the computer device initiates fault repair of the memory.
  • the fault repair includes: replacing the faulty row with a redundant row in the memory, and repairing the data on the redundant row.
  • the faulty row refers to the memory row in which the memory row failure occurred.
  • the faulty row is the first memory row, or the faulty row is one memory row in the first bank (or in the memory) that the computer device determines through threshold judgment or intelligent analysis.
  • the redundant row and the failed row are located on the same bank in the memory, and the computer equipment replaces the failed row with the redundant row on the bank where the failed row is located.
  • the computer device generates a fault isolation request after determining that it is necessary to start the fault repair of the memory, and after generating the fault isolation request, replaces the faulty row with a redundant row in the memory.
  • the user can select the risk mode according to business risk requirements. When the row fault isolation request is generated, the current conditions for handling the memory row failure are met, and the computer device performs the memory row replacement.
  • the computer device may further prompt the user to confirm the memory row fault repair, and performs the faulty row replacement after receiving the user's instruction confirming the repair.
  • the technology for performing online faulty row replacement in the embodiment of the present application includes soft post package repair (sPPR) technology.
  • the computer device repairs the data on the redundant row in the following manner: it performs a read operation on the redundant row, and if the data read from the redundant row is erroneous, the erroneous data is corrected and the corrected data is written back to the redundant row to restore the data on the redundant row. That is, in the embodiment of the present application, the faulty data is repaired through the read operation on the redundant row and the data write-back.
  • the computer device reads all the data on the memory chip where the redundant row is located by triggering the read operation on the redundant row. When the redundant row is read, the computer device determines whether the data on the redundant row is erroneous, and corrects the erroneous data based on the other data read.
  • or, the computer device reads the data in the bank where the redundant row is located and in other banks in the memory by triggering a read operation on the redundant row, and performs error correction on the data of the redundant row according to the data read. That is, which banks or memory chips the computer device actually reads in order to correct the data on the redundant row depends on the storage algorithm (such as memory interleaving) used when the memory stores data, and on which banks the chip select signal of the memory read operation is connected to.
  • the memory read operation is performed in a segmented read mode. The computer device is configured by default with a read interval for the memory read operation. For example, the read interval is 4 bits, that is, 4 bits of data are read each time, or the read interval is one or two cells, that is, the data of one or two cells is read each time. The user can also change the default configuration.
  • taking a read interval of 4 bits as an example, the computer device reads 4 bits of data and repairs them each time, then reads the next 4 bits and repairs them, until all the data on the redundant row is repaired. Each time 4 bits of data are read, they are corrected by an error correction algorithm, such as ECC or single device data correction (SDDC), and the corrected data is written back to the location of those 4 bits.
  • FIG. 2 is a schematic diagram of a method for repairing data on redundant rows through a read operation according to an embodiment of the present application. Referring to Figure 2, the method includes the following steps:
  • Step 201 The computer device performs row address resolution. That is, the computer device resolves the row address of the failed row and replaces the failed row with the redundant row, so that the address of the failed row's memory data is mapped to the redundant row; the redundant row holds no data at this point.
  • Step 202 The computer device starts a read operation on the memory area. That is, through the memory read operation on the redundant row, the computer device reads data on multiple banks, including the first bank where the redundant row is located. When the data on the redundant row is read, the computer device determines, according to the data read from the other banks, that the data on the redundant row is erroneous (shown as the black filled squares).
  • Step 203 The computer device performs data error correction. That is, the computer equipment corrects the erroneous data based on the data read from other banks.
  • Step 204 The computer device performs data write-back. That is, the computer device writes the corrected data back to the redundant row to realize the data repair after the redundant row replaces the failed row.
  • a small square in Figure 2 represents 4 bits of data. Each time, the computer device reads 4 bits of data on the redundant row for correction; that is, when reading the redundant row, the computer device reads the small squares of the redundant row one after another.
  • for example, the computer device reads the second small square shown in Figure 2, that is, the 4 bits of data at the position of the black filled square, together with the data on the other banks.
  • the corrected data is obtained and written back to the position of the black filled square on the redundant row.
  • the computer device then reads the small square after the black filled square on the redundant row, that is, the 4 bits of data in the third small square, performs data error correction, and writes the data back to the corresponding position.
  • in this way, the computer device performs the read, correct, and write-back actions step by step to repair the data on the redundant row.
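The step-by-step read, correct, and write-back loop can be sketched as a small simulation. This is only an illustration under stated assumptions, not the method of the application: the function names `correct_chunk` and `repair_redundant_row` are hypothetical, each list element stands for one 4-bit square of Figure 2, and a per-position XOR parity over the other banks is used as a stand-in for the ECC or SDDC algorithm that the text names.

```python
from functools import reduce

CHUNK_BITS = 4                      # read interval from the text: 4 bits per read
CHUNK_MASK = (1 << CHUNK_BITS) - 1  # keeps every value within one 4-bit chunk


def correct_chunk(other_chunks, parity_chunk):
    """Rebuild one 4-bit chunk from the chunks read on the other banks.

    Assumption: parity_chunk is the XOR of all data chunks at this position,
    so the expected chunk is the parity XORed with the other banks' chunks.
    Real hardware would use ECC or SDDC instead.
    """
    return reduce(lambda acc, chunk: acc ^ chunk, other_chunks, parity_chunk) & CHUNK_MASK


def repair_redundant_row(redundant_row, other_rows, parity_row):
    """Read the redundant row chunk by chunk, correct bad chunks, write them back.

    Returns the number of chunks that had to be corrected.
    """
    repaired = 0
    for i, stored in enumerate(redundant_row):
        # Reading chunk i of the redundant row also reads chunk i on the other banks.
        expected = correct_chunk([row[i] for row in other_rows], parity_row[i])
        if stored != expected:           # erroneous data detected on this read
            redundant_row[i] = expected  # write the corrected chunk back in place
            repaired += 1
    return repaired


# The redundant row is empty (all zeros) right after the row replacement.
other_rows = [[0x1, 0x2, 0x3], [0x4, 0x5, 0x6]]
parity_row = [0xF, 0xC, 0x9]
redundant_row = [0x0, 0x0, 0x0]
repaired = repair_redundant_row(redundant_row, other_rows, parity_row)
```

Running the loop a second time on the same row corrects nothing, which mirrors the idea that the row is fully repaired once every chunk has been read once.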
  • a CE will be generated in the computer equipment, and the computer equipment will suppress the CE.
  • when the computer device detects erroneous data while reading the redundant row, it considers that a CE has been detected. Since this CE is not caused by a memory failure of the computer device, it needs to be suppressed; that is, the computer device does not process the CE or does not record it.
  • the computer device suppresses CEs throughout the period from when the read operation on the redundant row is triggered until the data repair of the redundant row is completed.
  • the computer device cancels the suppression operation of the CE after completing the repair of the data on the redundant row. That is, the CE generated by the computer equipment after the redundant row is repaired is caused by a real memory failure. Therefore, the CE needs to be processed, that is, the suppression operation of the CE is released, and the CE is recorded.
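The suppression window around the repair read can be sketched with a context manager. This is a hedged illustration only: `CeReporter`, `report`, and `suppress` are hypothetical names, and the addresses are made-up values; the text does not describe how suppression is configured in hardware.

```python
from contextlib import contextmanager


class CeReporter:
    """Hypothetical CE reporter: records correctable errors unless suppressed."""

    def __init__(self):
        self.suppressed = False
        self.log = []

    def report(self, address):
        # CEs raised while suppression is active are neither processed nor recorded.
        if not self.suppressed:
            self.log.append(address)

    @contextmanager
    def suppress(self):
        """Suppress CEs for the duration of the redundant-row repair."""
        self.suppressed = True
        try:
            yield
        finally:
            # Suppression is released once the repair read has finished.
            self.suppressed = False


reporter = CeReporter()
with reporter.suppress():
    reporter.report(0x1000)  # caused by reading the still-empty redundant row: dropped
reporter.report(0x2000)      # raised after the repair: treated as a real CE and recorded
```

Using `try`/`finally` means suppression is released even if the repair read raises, so later CEs are not silently lost.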
  • the computer equipment implements the above functions through modules.
  • the computer equipment includes an execution module and a fault identification module.
  • the computer equipment implements the above-mentioned memory fault processing method through the execution module and the fault identification module.
  • the method includes the following steps.
  • Step 301 The execution module detects the memory failure, and reports the fault information (including the fault location and the fault time) of the current memory fault to the fault identification module, that is, the CE error report, so as to trigger the fault identification module to start the fault analysis.
  • Step 302 The fault identification module analyzes the memory error, that is, analyzes the historical fault information, such as analyzing the physical address (fault location).
  • Step 303 The fault identification module performs memory fault identification and prediction, that is, it analyzes and determines the failure mode, or determines the failure mode and failure level, based on the historical fault information. When the determined failure mode meets the memory fault repair condition, or when the determined failure mode and failure level meet the memory fault repair condition, the execution module is triggered to perform memory fault repair.
  • Step 304 The execution module executes the sPPR to replace the memory fault row, that is, replace the fault row with the redundant row.
  • Step 305 The execution module initiates the read operation of the memory row area, data error correction and data write-back to repair the data on the redundant row, that is, the faulty data is repaired by the memory read operation of the redundant row.
  • Step 306 The execution module configures CE suppression for the memory row, so as to suppress CEs during the read operation on the redundant row.
  • Step 307 After the execution module finishes the read operation of the redundant row, the CE suppression is released, that is, the CE suppression is released after the data is repaired.
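The interaction between the two modules in steps 301 to 307 can be sketched as follows. This is an assumption-laden illustration: the class names, the `row_threshold` repair condition, and the string trace are hypothetical, and the sketch places the CE suppression of step 306 before the repair read of step 305 because suppression has to be active while the redundant row is read.

```python
from collections import Counter


class FaultIdentificationModule:
    """Hypothetical stand-in for the fault identification module (steps 302-303)."""

    def __init__(self, row_threshold):
        self.row_threshold = row_threshold  # assumed repair condition
        self.history = []                   # (row, time) of every reported fault

    def on_ce_report(self, row, time):
        """Record the fault, analyze the history, return True to trigger repair."""
        self.history.append((row, time))
        faults_on_row = Counter(r for r, _ in self.history)[row]
        return faults_on_row >= self.row_threshold


class ExecutionModule:
    """Hypothetical stand-in for the execution module (steps 301, 304-307)."""

    def __init__(self, identifier):
        self.identifier = identifier
        self.trace = []

    def on_memory_fault(self, row, time):
        self.trace.append("301 report CE to fault identification module")
        if not self.identifier.on_ce_report(row, time):
            return False  # repair condition not met, nothing else happens
        self.trace.append("304 sPPR: replace faulty row with redundant row")
        self.trace.append("306 configure CE suppression for the row")
        self.trace.append("305 read, correct, and write back the redundant row")
        self.trace.append("307 release CE suppression")
        return True


execution = ExecutionModule(FaultIdentificationModule(row_threshold=2))
first = execution.on_memory_fault(row=7, time=1.0)   # below the threshold
second = execution.on_memory_fault(row=7, time=2.0)  # threshold reached, repair runs
```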
  • the above-mentioned execution module is a memory control module in a processor included in the computer device, such as a double data rate dynamic random access memory controller (DDRC).
  • the fault identification module is a newly added module on the chip where the BMC is located.
  • the fault identification module can also be added to any processing device included in the computer device.
  • FIG. 4 is a flowchart of another method for processing a memory failure provided by an embodiment of the present application.
  • the method mainly includes error reporting, fault analysis (identification), row replacement, and data write-back.
  • the error reporting process includes: when the execution module detects a memory failure, it performs hardware error correction (such as ECC), reports the failure information (including failure time and failure location) of the current memory failure to the fault identification module, and also reports the failure information to the module that records the memory fault log, so that the failure information of this memory failure is recorded.
  • the fault analysis process includes: the fault identification module identifies the failure mode of the memory failure (or identifies the failure mode and the failure level) according to the received failure information and the memory fault log, and when the failure mode is identified as a memory row failure (or when the failure mode is identified as a memory row failure and the failure level is a high risk level), triggers the execution module to replace the failed memory row.
  • the row replacement process includes: the execution module triggers memory row replacement, that is, replaces the failed row with the redundant row.
  • the data write-back process includes: the execution module performs a read operation of the memory area on the redundant row, corrects the erroneous data on the redundant row through the error correction algorithm, that is, performs data error correction, and writes the corrected data back to the redundant row.
  • UCE may be generated, causing the computer to report a crash and restart.
  • the failure analysis result is obtained by analyzing historical failure information, and then the memory failure is repaired according to the failure analysis result.
  • This solution can analyze the memory failure more accurately.
  • this solution can start the memory fault repair without a cold reset, that is, it can repair the memory fault in time to prevent system downtime and reduce business impact.
  • after analyzing the fault information of the first memory row in the historical time period to obtain the fault analysis result, the computer device initiates the fault repair of the memory as follows: when the failure mode is a memory row failure, or when the failure mode is a memory row failure and the failure level is a high risk level, the fault repair of the memory is initiated. The fault repair is to replace the failed row with the redundant row and repair the data on the redundant row. In other embodiments, the computer device analyzes the fault information of the second bank in the historical time period to obtain the fault analysis result.
  • in that case, the computer device initiates the fault repair of the memory as follows: when the failure mode is a memory bank failure, or when the failure mode is a memory bank failure and the failure level is a high risk level, the fault repair of the memory is initiated.
  • the fault repair is to replace the failed bank with a redundant bank and repair the data on the redundant bank.
  • the second bank refers to the bank containing the memory row where the current memory failure occurred, or to a bank on the memory chip containing that memory row, or to any bank in the memory.
  • alternatively, the second bank refers to the bank containing the memory row where the most recent memory failure occurred, or to a bank on the memory chip containing that memory row, or to any bank in the memory.
  • FIG. 5 is a flowchart of a method for processing a memory failure provided by an embodiment of the present application, and the method is applied to a computer device. Please refer to Figure 5, the method includes the following steps.
  • Step 501 Start the fault analysis of the memory at the first moment, and the fault analysis includes: obtaining the current fault analysis result of the memory by analyzing historical fault information.
  • when a computer device detects that a memory failure occurs, it analyzes historical failure information to obtain a failure analysis result. Or, the computer device periodically analyzes historical failure information to obtain failure analysis results. Or, the computer device periodically analyzes historical failure information to obtain the failure analysis result, and if a memory failure is detected within a periodic interval, it analyzes the historical failure information at that point to obtain the failure analysis result and restarts the periodic analysis, taking the time at which this memory failure was detected as the reference.
  • or, the computer device periodically analyzes historical failure information to determine the failure mode, and if a memory failure is detected within a periodic interval, it analyzes the historical failure information at that point to obtain the failure analysis result, but does not restart the periodic analysis from the time at which this memory failure was detected; that is, the periodic analysis is not affected.
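The two periodic variants differ only in whether a failure detected inside an interval moves the schedule. A minimal sketch, assuming times are plain numbers in the same unit as the period; the function name and parameters are hypothetical.

```python
def next_periodic_analysis(cycle_start, period, fault_time=None, restart_on_fault=False):
    """Return the time of the next periodic fault analysis.

    cycle_start:      time at which the current cycle began.
    period:           length of one analysis cycle.
    fault_time:       time of a memory failure detected inside the cycle, if any
                      (that failure triggers its own immediate analysis).
    restart_on_fault: if True, the cycle restarts from fault_time; if False,
                      the periodic schedule is left unaffected.
    """
    if fault_time is not None and restart_on_fault:
        return fault_time + period
    return cycle_start + period


# A failure at t=20 inside a cycle that started at t=0 with a period of 60.
restarted = next_periodic_analysis(0, 60, fault_time=20, restart_on_fault=True)
unaffected = next_periodic_analysis(0, 60, fault_time=20, restart_on_fault=False)
```

With these inputs the restarted schedule fires next at 80 and the unaffected schedule at 60.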
  • the historical fault information is fault information of memory faults that occurred in a historical time period, and the duration of the historical time period is the same as or different from the historical time period in the foregoing embodiment. Since it is necessary to analyze whether there is a serious memory bank failure, if the historical time period is longer than the historical time period in the foregoing embodiment, the analysis of the memory bank failure is more accurate to a certain extent.
  • the computer device analyzes the historical fault information through the fault analysis model to obtain the current fault analysis result of the memory; that is, the computer device inputs the historical fault information into the fault analysis model to obtain the current fault analysis result of the memory.
  • the failure analysis model is an intelligent calculation analysis model.
  • the failure analysis result includes the failure mode.
  • the historical fault information includes the fault location and the fault time of the memory fault that occurred in the historical time period.
  • the computer equipment counts the fault location and fault time of historical memory faults, and obtains the number of fault bits in the second bank, that is, obtains the fourth statistical feature.
  • when the number of fault bits in the second bank is greater than or equal to the fourth threshold, the failure mode is determined to be a memory bank failure.
  • the fourth threshold represents the number of fault bits that each bank can tolerate.
  • the fault analysis model includes the fourth threshold.
  • the fault analysis result further includes the fault level
  • the historical fault information also includes the fault type and/or fault correction information of the memory fault that occurred in the historical time period.
  • the computer equipment obtains the fifth statistical feature and/or the sixth statistical feature according to the historical failure information.
  • the fifth statistical feature represents the number of failures of each type of failure in the second bank during the historical time period.
  • the sixth statistical feature represents the number of error corrections in the second bank during the historical time period.
  • the fifth threshold represents the number of failures of each type of failure that each bank can tolerate
  • the sixth threshold represents the number of error corrections that each bank can tolerate
  • the fault analysis model further includes a fifth threshold and/or a sixth threshold.
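The bank-level statistics and thresholds can be sketched as a simple rule-based check. This is an illustration only: the text calls the fault analysis model an intelligent calculation analysis model and does not give its rules, so the record layout, the names `FaultRecord` and `analyze_bank`, the "greater than or equal" comparisons, and the way the fifth and sixth thresholds are combined are all assumptions.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultRecord:
    bank: int        # fault location, reduced to a bank index for this sketch
    bit: int         # faulty bit position within the bank
    time: float      # fault time
    fault_type: str  # fault type, for example "CE"
    corrected: bool  # whether error correction was applied


def analyze_bank(history, bank, window_start, t4, t5, t6):
    """Return (failure_mode, failure_level) for one bank, or (None, None).

    t4, t5, t6 play the role of the fourth, fifth, and sixth thresholds.
    """
    recent = [r for r in history if r.bank == bank and r.time >= window_start]
    fault_bits = len({r.bit for r in recent})         # fourth statistical feature
    per_type = Counter(r.fault_type for r in recent)  # fifth statistical feature
    corrections = sum(r.corrected for r in recent)    # sixth statistical feature

    if fault_bits < t4:
        return None, None  # assumed rule: no bank failure below the fourth threshold
    high_risk = any(n >= t5 for n in per_type.values()) or corrections >= t6
    return "bank_failure", ("high" if high_risk else "low")


history = [
    FaultRecord(bank=0, bit=3, time=10.0, fault_type="CE", corrected=True),
    FaultRecord(bank=0, bit=9, time=11.0, fault_type="CE", corrected=True),
    FaultRecord(bank=0, bit=17, time=12.0, fault_type="CE", corrected=True),
    FaultRecord(bank=1, bit=5, time=12.5, fault_type="CE", corrected=True),
]
result = analyze_bank(history, bank=0, window_start=0.0, t4=3, t5=3, t6=5)
```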
  • the duration of the historical time period and/or the fourth threshold and/or the fifth threshold and/or the sixth threshold are variables set according to the risk mode.
  • the risk mode includes a memory high-risk mode and a memory low-risk mode.
  • the historical time period of the memory high-risk mode is shorter than the historical time period of the memory low-risk mode; and/or the fourth threshold of the memory high-risk mode is less than the fourth threshold of the memory low-risk mode; and/or the fifth threshold of the memory high-risk mode is less than the fifth threshold of the memory low-risk mode; and/or the sixth threshold of the memory high-risk mode is less than the sixth threshold of the memory low-risk mode.
  • the computer device also provides an interactive interface, and the risk mode options are displayed on the interactive interface.
  • Risk mode options include high-risk mode options and low-risk mode options. The user can select the risk mode through the interactive interface according to the business risk requirements.
  • the interactive interface is also used to indicate that there is a risk of memory failure when it is confirmed that the failure mode is a memory bank failure.
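How a selected risk mode could map to the analysis parameters is sketched below. The numbers are placeholders, since the text gives no concrete values; `RiskProfile`, `PROFILES`, and `select_profile` are hypothetical names. The only property taken from the text is that the high-risk mode uses a shorter historical time period and smaller thresholds than the low-risk mode.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskProfile:
    window_hours: float  # duration of the historical time period
    t4: int              # fourth threshold: tolerable fault bits per bank
    t5: int              # fifth threshold: tolerable failures per fault type
    t6: int              # sixth threshold: tolerable error corrections per bank


# Illustrative values only; they are not taken from the source.
PROFILES = {
    "high_risk": RiskProfile(window_hours=24, t4=4, t5=8, t6=16),
    "low_risk": RiskProfile(window_hours=72, t4=8, t5=16, t6=32),
}


def select_profile(mode):
    """Map the option chosen on the interactive interface to its parameters."""
    try:
        return PROFILES[mode]
    except KeyError:
        raise ValueError(f"unknown risk mode: {mode!r}") from None


high = select_profile("high_risk")
low = select_profile("low_risk")
```

Keeping the period and thresholds in one immutable profile means a mode switch changes all of them together, so the analysis never runs with a mix of high-risk and low-risk settings.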
  • the difference from the embodiment in FIG. 1 is that the second bank in this embodiment is a concept analogous to the first memory row in the embodiment in FIG. 1.
  • the embodiment in FIG. 1 analyzes the failure mode of the memory failure at the granularity of the memory row, whereas the embodiment in FIG. 5 analyzes the failure mode of the memory failure at the granularity of the bank.
  • for the implementation of determining the failure mode by the computer device in FIG. 5, refer to the relevant content in the foregoing embodiment in FIG. 1, which will not be repeated here.
  • Step 502 When the failure mode is the memory bank failure, start the failure repair of the memory, where the failure repair includes: replacing the failed bank with the redundant bank, and repairing the data on the redundant bank.
  • the computer device determines that the failure mode is a memory bank failure, it replaces the failed bank with a redundant bank in the memory and repairs the failure data.
  • the failed bank refers to the bank in which the memory failure has occurred.
  • the redundant bank and the failed bank are located on the same channel in the memory.
  • the embodiment in FIG. 5 differs from the embodiment in FIG. 1 in that, in the embodiment in FIG. 1, redundant rows are used to replace failed rows.
  • the redundant row and the failed row are located on the same bank.
  • in the embodiment in FIG. 5, redundant banks are used to replace failed banks.
  • the redundant bank and the failed bank are located on the same channel in the memory.
  • the memory includes multiple channels, and each channel includes multiple dual inline memory modules (DIMMs).
  • a DIMM includes multiple ranks, a rank includes multiple chips (memory particles), and a chip includes multiple banks.
  • the current failure mode of the memory is determined by analyzing historical failure information; when the failure mode is a memory bank failure, the redundant bank is used to replace the failed bank and perform data repair.
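The memory hierarchy (channel, DIMM, rank, chip, bank) and the same-channel constraint on the redundant bank can be sketched as a small data structure. `BankAddress` and `pick_redundant_bank` are hypothetical names, and choosing the first matching spare is an assumption; the text only requires that the redundant bank and the failed bank be on the same channel.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class BankAddress:
    channel: int  # a memory includes multiple channels
    dimm: int     # a channel includes multiple DIMMs
    rank: int     # a DIMM includes multiple ranks
    chip: int     # a rank includes multiple chips (memory particles)
    bank: int     # a chip includes multiple banks


def pick_redundant_bank(failed: BankAddress, spares: List[BankAddress]) -> Optional[BankAddress]:
    """Return the first spare bank on the same channel as the failed bank, if any."""
    for spare in spares:
        if spare.channel == failed.channel:
            return spare
    return None


failed_bank = BankAddress(channel=1, dimm=0, rank=0, chip=2, bank=3)
spare_banks = [
    BankAddress(channel=0, dimm=0, rank=1, chip=0, bank=15),
    BankAddress(channel=1, dimm=1, rank=0, chip=0, bank=7),
]
replacement = pick_redundant_bank(failed_bank, spare_banks)
```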
  • FIG. 6 is a schematic structural diagram of a memory failure processing device 600 provided by an embodiment of the present application.
  • the memory failure processing device 600 can be implemented as part or all of a computer device by software, hardware, or a combination of the two, and the computer device can be the one shown in FIG. 9 below.
  • the device 600 includes: an analysis module 601 and a processing module 602.
  • the analysis module 601 is used to start the fault analysis of the memory at the first moment; the fault analysis includes: obtaining the current fault analysis result of the memory by analyzing the historical fault information, where the historical fault information is the faults accumulated in the memory in the historical time period Information, the historical time period is the time period before the first time or the time period before the first time and includes the first time; for the specific implementation, refer to the detailed introduction of step 201 in the foregoing embodiment of FIG. 1, which will not be repeated here.
  • the processing module 602 is configured to initiate the fault repair of the memory according to the current fault analysis result of the memory.
  • the first moment is a moment before an uncorrectable error UCE failure occurs in the computer system.
  • the first moment includes:
  • the moment of periodic startup according to preset conditions; and/or the moment, after the computer system starts running, at which it is determined that a memory failure has occurred in the memory.
  • the analysis module 601 includes:
  • the analysis sub-module is used to input historical fault information into the fault analysis model to obtain the current fault analysis result of the memory.
  • the fault analysis model is an intelligent calculation analysis model.
  • the processing module 602 includes:
  • the first repair sub-module is used to initiate the repair of the memory failure when the failure mode is the memory line failure, where the failure repair includes: replacing the failed line with the redundant line and repairing the data on the redundant line.
  • analysis module 601 is specifically used for:
  • the first statistical feature represents the number of fault bits in the first memory line in the historical time period.
  • the first memory line is any memory line, and the first threshold indicates the number of fault bits that each memory line can tolerate. For the specific implementation, refer to the detailed introduction of step 101 in the embodiment of FIG. 1, which will not be repeated here.
  • the failure mode is a memory line failure.
  • the fault analysis result also includes the fault level
  • the processing module 602 includes:
  • the second repair sub-module is used to initiate the repair of the memory failure when the failure mode is a memory line failure and the failure level is a high risk level.
  • the analysis module 601 is also specifically configured to:
  • the second statistical feature represents the number of failures of each type of failure in the first memory row in the historical time period.
  • the third statistical feature represents the number of error corrections in the first memory row in the historical time period.
  • the fault level is determined to be a high risk level.
  • the second threshold indicates the number of failures of each type of failure that each memory line can tolerate
  • the third threshold indicates the number of error corrections that each memory line can tolerate.
  • the device 600 further includes:
  • the interaction module 603 is configured to display risk mode options on the interactive interface, and the risk mode options include memory high risk mode options and memory low risk mode options.
  • the first threshold, the second threshold, and the third threshold are variables set according to the risk mode.
  • the first repair submodule is specifically used for:
  • if the data read from the redundant row is erroneous data, the erroneous data is corrected and the corrected data is written back to the redundant row, thereby restoring the data on the redundant row (see step 102).
  • the device 600 further includes:
  • a generating module 604 is used to generate a correctable error CE after the data read from the redundant row is wrong data
  • the suppression module 605 is used to suppress CE.
  • the device 600 further includes:
  • the canceling module 606 is used to cancel the suppression operation of the CE after the data on the redundant row is repaired.
  • the device 600 further includes:
  • the generating module is used to generate a line fault isolation request after determining that the failure mode is a memory line fault.
  • the redundant row and the failed row are located on the same bank in the memory.
  • the processing module 602 includes:
  • the third repair sub-module is used to initiate the repair of the memory failure when the failure mode is a memory bank failure.
  • the failure repair includes: replacing the failed bank with a redundant bank and repairing the data on the redundant bank.
  • the redundant bank and the failed bank are located on the same channel in the memory.
  • the fault analysis result is obtained by analyzing the historical fault information, and the memory fault repair is then performed according to the fault analysis result.
  • this solution can analyze memory faults more accurately and can start the fault repair of the memory without a cold reset, which prevents system downtime and reduces business impact.
  • when the memory failure processing device provided in the above embodiment handles memory failures, the division into the above functional modules is used only as an example for illustration.
  • the above functions can be allocated to different functional modules as needed; that is, the internal structure of the device is divided into different functional modules to complete all or part of the functions described above.
  • the memory failure processing device provided in the foregoing embodiment belongs to the same concept as the memory failure processing method embodiments shown in FIG. 1 to FIG.
  • the embodiment of the present application provides a computer device in which a computer program is stored.
  • when the computer program is executed, the memory failure processing method in the above-mentioned embodiments of FIG. 1 to FIG. 4, or the memory failure processing method in the embodiment of FIG. 5, is implemented.
  • the computer device includes a processor and a chip where the BMC is located. The processor includes a memory controller, the memory controller includes an execution module, and the BMC in the chip where the BMC is located includes a fault identification module. The memory controller runs the execution module to realize the corresponding functions of the execution module in the embodiment of FIG. 3, and the BMC runs the fault identification module to realize the corresponding functions of the fault identification module in the embodiment of FIG. 3.
  • the fault identification module can also be added to other processing devices included in the computer device to implement the corresponding functions.
  • the computer device obtains the failure analysis result by analyzing the historical failure information, and then repairs the memory failure according to the failure analysis result.
  • this solution can analyze memory failures more accurately and can start the fault repair of the memory without a cold reset; that is, memory faults can be repaired in time, which prevents system downtime and reduces business impact.
  • when the computer equipment provided in the above embodiment handles memory failures, the division into the above functional modules is used only as an example for illustration.
  • the above functions can be allocated to different functional modules as needed; that is, the internal structure of the device is divided into different functional modules to complete all or part of the functions described above.
  • the computer device provided in the foregoing embodiment belongs to the same concept as the embodiment of the memory failure processing method shown in FIG. 1 or FIG.
  • FIG. 9 is a schematic structural diagram of a computer device according to an embodiment of the present application.
  • the computer device includes one or more processors 901, a communication bus 902, a memory 903, and one or more communication interfaces 904.
  • the processor 901 is a general-purpose central processing unit (CPU), a network processor (NP), a microprocessor, or one or more integrated circuits used to implement the solutions of the present application, for example, an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof.
  • the above-mentioned PLD is a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL), or any combination thereof.
  • the communication bus 902 is used to transfer information between the above-mentioned components.
  • the communication bus 902 is divided into an address bus, a data bus, a control bus, and the like.
  • only one thick line is used to represent the bus in the figure, but this does not mean that there is only one bus or one type of bus.
  • the memory 903 is a read-only memory (ROM), a random access memory (RAM), an electrically erasable programmable read-only memory (EEPROM), an optical disc (including a compact disc read-only memory (CD-ROM), a compact disc, a laser disc, a digital versatile disc, a Blu-ray disc, and the like), a magnetic disk storage medium or another magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
  • the memory 903 exists independently and is connected to the processor 901 through the communication bus 902, or the memory 903 is integrated with the processor 901.
  • the communication interface 904 uses any transceiver-type apparatus to communicate with other devices or a communication network.
  • the communication interface 904 includes a wired communication interface and, optionally, a wireless communication interface.
  • the wired communication interface is, for example, an Ethernet interface.
  • the Ethernet interface is an optical interface, an electrical interface, or a combination thereof.
  • the wireless communication interface is a wireless local area network (WLAN) interface, a cellular network communication interface, or a combination thereof.
  • the computer device includes multiple processors, and each of these processors is a single-core processor or a multi-core processor.
  • the processor herein refers to one or more devices, circuits, and/or processing cores for processing data (such as computer program instructions).
  • the computer device further includes an output device 906 and an input device 907.
  • the output device 906 communicates with the processor 901 and can display information in a variety of ways.
  • the output device 906 is a liquid crystal display (LCD), a light emitting diode (LED) display device, a cathode ray tube (CRT) display device, or a projector (projector).
  • the input device 907 communicates with the processor 901 and can receive user input in a variety of ways.
  • the input device 907 is a mouse, a keyboard, a touch screen device, a sensor device, or the like.
  • the memory 903 is used to store the program code 910 for executing the solution of the present application, and the processor 901 can execute the program code 910 stored in the memory 903.
  • the program code includes one or more software modules, and the computer device can implement the memory failure processing method provided in the embodiment of FIG. 1 or FIG. 5 through the processor 901 and the program code 910 in the memory 903.
  • the processor 901 stores program code for executing the solution of the present application, and the processor 901 is used to execute the program code to implement the memory failure processing method provided in the embodiment of FIG. 1 or FIG. Includes one or more software modules.
  • the processor 901 includes a memory controller, and program codes are stored in the memory controller.
  • the memory controller includes the execution module and the fault identification module shown in FIG. The memory failure processing method provided in the example.
  • the processor 901 stores part of the program code for executing the solution of the present application.
  • the processor 901 includes a memory controller, and the memory controller includes the execution module shown in FIG. 3.
  • the computer equipment also includes other processing equipment in addition to the processor 901.
  • the other processing equipment stores another part of the program code that executes the solution of the present application.
  • the processor 901 and the other processing equipment together implement the memory failure processing method provided in the above embodiment in FIG. 1 or FIG. 5.
  • for example, the other processing equipment is the chip where the out-of-band baseboard management controller (BMC) is located.
  • the BMC includes the fault identification module shown in Figure 3; by running the fault identification module, the BMC, together with the memory controller, implements the memory fault processing method provided in the embodiment of FIG. 1 or FIG. 5.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium.
  • the computer instructions may be transmitted from a website, computer, server, or data center.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center integrated with one or more available media.
  • the usable medium can be a magnetic medium (for example, a floppy disk, hard disk, or tape), an optical medium (for example, a digital versatile disc (DVD)), a semiconductor medium (for example, a solid state disk (SSD)), or the like.
  • the computer-readable storage medium mentioned in the embodiment of the present application may be a non-volatile storage medium, in other words, it may be a non-transitory storage medium.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Techniques For Improving Reliability Of Storages (AREA)
  • Debugging And Monitoring (AREA)

Abstract

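A memory fault handling method, apparatus, device, and storage medium, belonging to the field of computer technology. The handling method includes obtaining a fault analysis result by analyzing historical fault information (101), and then repairing the memory according to the fault analysis result (102), so that memory faults can be analyzed more accurately. In addition, because fault repair of the memory can be started without a cold reset, memory faults can be repaired in time, which prevents system downtime and reduces service impact.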

Description

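Memory fault handling method, apparatus, device, and storage medium
This application claims priority to Chinese Patent Application No. 202010569797.2, filed on June 20, 2020 and entitled "A memory fault handling method", which is incorporated herein by reference in its entirety. This application also claims priority to Chinese Patent Application No. 202011179463.0, filed on October 29, 2020 and entitled "Memory fault handling method, apparatus, device, and storage medium", which is incorporated herein by reference in its entirety.
Technical Field
Embodiments of this application relate to the field of computer technology, and in particular to a memory fault handling method, apparatus, device, and storage medium.
Background
Memory is one of the key components of a device. Typically, a memory includes multiple banks (also called storage matrices), and each bank includes multiple memory rows. Memory frequently fails during use for various reasons, and memory row faults account for a large share of memory faults. Repairing memory row faults is therefore an important means of repairing memory faults.
In the related art, each bank in a memory has redundant rows. After the device undergoes a cold reset, that is, after the device crashes and restarts or the user manually restarts it, the device performs a memory self-test. If a memory fault is detected on a memory row, a memory row fault is considered to have occurred, and the row on which it occurred is called a faulty row. The device then determines whether the number of correctable error (corrected error, CE) type memory faults recorded in the fault log for that faulty row has reached a threshold. If it has, the condition for starting hard post package repair (hPPR) is met: hPPR is started and the faulty row is replaced with a redundant row in the same bank, which repairs the memory row fault.
However, in the related art, hPPR can be started to replace the faulty row only after a cold reset of the device, which affects services. If memory faults become severe while the device is running and are never repaired, the device crashes, which seriously affects services.
Summary
Embodiments of this application provide a memory fault handling method, apparatus, device, and storage medium that can repair memory faults in time, prevent system downtime, and reduce the impact on services. The technical solutions are as follows: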
第一方面,提供了一种内存故障的处理方法,该方法包括:
在第一时刻启动对内存的故障分析;故障分析包括:通过分析历史故障信息,获得内存当前的故障分析结果,其中,历史故障信息为内存在历史时间段内积累的故障信息,历史时间段为第一时刻之前的时间段或者第一时刻之前且包含第一时刻的时间段;根据内存当前的故障分析结果启动对内存的故障修复。
在本申请实施例中,通过分析历史故障信息得到故障分析结果,进而根据故障分析结果 对内存进行故障修复,本方案能够更加精确地分析内存故障,且无需冷复位即能启动对内存的故障修复,防止系统宕机,减少业务影响。
可选地,第一时刻为计算机系统出现UCE故障之前的时刻。也即是,在计算机系统运行期间启动对内存的故障分析,计算机系统运行期间是指计算机系统正常工作期间。
可选地,第一时刻包括:根据预设的条件周期性启动的时刻;和/或,在计算机系统运行之后,确定内存发生内存故障的时刻。
也即是,计算机设备在检测到发生内存故障时,启动分析历史故障信息,获得故障分析结果。或者,计算机设备周期性地分析历史故障信息,获得故障分析结果。或者,计算机设备周期性地分析历史故障信息,获得故障分析结果,以及如果在周期间隔内检测到发生内存故障,则分析历史故障信息,获得故障分析结果,并以本次检测到内存故障的时间为准重新开始周期分析。或者,计算机设备周期性地分析历史故障信息,获得故障分析结果,以及如果在周期间隔内检测到发生内存故障,则分析历史故障信息,获得故障分析结果,但并不以本次检测到内存故障的时间重新开始周期分析,也即不影响周期分析。
需要说明的是,计算机设备周期性地分析历史故障信息,能够及时预测内存故障的严重性,及时修复内存故障。
可选地,本申请实施例通过故障分析模型分析历史故障信息,确定故障分析结果,也即是,计算机设备通过分析历史故障信息,获得内存当前的故障分析结果,包括:将历史故障信息输入故障分析模型,获得内存当前的故障分析结果,故障分析模型为智能计算分析模型。
需要说明的是,通过故障分析模型分析历史故障信息仅为本申请实施例提供的分析历史故障信息的一种实现方式,计算机设备也能够通过其他实现方式分析历史故障信息,例如基于数据统计的方式,本申请实施例对此不作限定。接下来对计算机设备通过故障分析模型或者通过其他方式获得故障分析结果的实现方式进行介绍。
本申请实施例中,故障分析结果包含故障模式,则计算机设备根据内存当前的故障分析结果启动对内存的故障修复包括:在故障模式为内存行故障时,启动对内存的故障修复,其中,故障修复包括:用冗余行替换故障行,对冗余行上的数据进行修复。
本申请实施例中,计算机设备获得内存当前的故障分析结果,包括:根据历史故障信息获得第一统计特征,第一统计特征表示历史时间段内第一内存行出现的故障位的数量,第一内存行是任意内存行,当第一统计特征大于第一阈值时,确定故障模式为内存行故障,第一阈值表示每个内存行能够容忍的故障位的数量。
可选地,假设计算机设备通过故障分析模型分析历史故障信息,那么故障分析模型包括第一阈值,计算机设备将历史故障信息输入故障分析模型,由故障分析模型根据历史故障信息获得第一统计特征。
可选地,故障分析结果还包含故障级别,则计算机设备根据内存当前的故障分析结果启动对内存的故障修复包括:在故障模式为内存行故障且故障级别为高风险级别时,启动对内存的故障修复。
可选地,计算机设备获得内存当前的故障分析结果,还包括:根据历史故障信息获得第二统计特征和/或第三统计特征,第二统计特征表示历史时间段内第一内存行出现的每种故障类型的故障数量,第三统计特征表示历史时间段内第一内存行出现的纠错数量;当第二统计特征大于第二阈值时,或者,当第三统计特征大于第三阈值时,或者,当第二统计特征大于 第二阈值且第三统计特征大于第三阈值时,确定故障级别为高风险级别。其中,第二阈值表示每个内存行能够容忍的每种故障类型的故障数量,第三阈值表示每个内存行能够容忍的纠错数量。
可选地,假设计算机设备通过故障分析模型分析历史故障信息,那么故障分析模型还包括第二阈值和/或第三阈值。计算机设备将历史故障信息输入故障分析模型,由故障分析模型根据历史故障信息获得第二统计特征和/或第三统计特征。
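Note that the historical fault information also includes the fault type and the error correction information of the memory faults that occurred during the history period. The fault types include the CE type and the UCE type. Optionally, the CE type includes the patrol CE type, the read CE type, and so on. The error correction information includes, for each memory fault that occurs, the amount of data corrected (also called the corrected data, measured for example in bits), the error correction code, and similar information.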
可选地,在交互界面上显示风险模式选项,风险模式选项包括内存高风险模式选项和内存低风险模式选项。也即是,计算机设备提供交互界面,用户可以通过交互界面选择风险模式。
可选地,第一阈值、第二阈值和第三阈值为根据风险模式而设置的变量。
可选地,内存高风险模式的第一阈值小于内存低风险模式的第一阈值;和/或,内存高风险模式的第二阈值小于内存低风险模式的第二阈值;和/或,内存高风险模式的第三阈值小于内存低风险模式的第三阈值。
可选地,历史时间段的时长为根据风险模式而设置的变量,内存高风险模式的历史时间段的时长小于内存低风险模式的历史时间段的时长。
由上述可知,可以由用户灵活地根据需求选择风险模式,例如,如果用户的业务风险较高,则可以选择高风险模式,这样,第一阈值和/或第二阈值和/或第三阈值较低和/或历史时间段较短,计算机设备通过分析较短时间段内的历史故障信息,得到第一统计特征、第二统计特征和/或第三统计特征,将得到的这些数据与较小的阈值进行比较,来分析是否是内存行故障、高风险级别,这样计算机设备可以保证及时识别不太严重的内存行故障。如果用户的业务风险较低,则可以选择低风险模式,这样可以保证高识别,也即及时识别较严重的内存行故障。
在本申请实施例中,计算机设备提供交互界面给用户选择风险模式,计算机设备根据用户选择的风险模式,确定需要分析的故障信息的时长和/或阈值判断时的阈值大小,通过统计相应时长内的故障信息,并进行阈值比较,在识别出故障模式为内存行故障时,及时修复内存故障。这样,将用户选择的风险模式与阈值比较的方法融合,在精准预测内存行故障的同时,减轻计算机设备的计算压力。
在本申请实施例中,由前述可知,故障分析结果包含故障模式,则计算机设备根据内存当前的故障分析结果启动对内存的故障修复包括:在故障模式为内存行故障时,启动对内存的故障修复,其中,故障修复包括:用冗余行替换故障行,对所述冗余行上的数据进行修复。也即是,计算机设备在确定故障模式为内存行故障时,用内存中的冗余行替换故障行,并对故障数据进行修复。
或者,由前述可知,故障分析结果还包含故障级别,则计算机设备根据内存当前的故障分析结果启动对内存的故障修复包括:在故障模式为内存行故障且故障级别为高风险级别时,启动对内存的故障修复。也即是,计算机设备在确定故障模式为内存行故障且故障级别为高 风险级别时,用内存中的冗余行替换故障行,并对故障数据进行修复。
可选地,冗余行和故障行位于内存中的同一个bank上。也即是,计算机设备用故障行所在bank上的冗余行替换故障行。
可选地,计算机设备对冗余行上的数据进行修复,包括:对冗余行执行读操作;如果从冗余行上读取出的数据为错误数据,则对错误数据进行纠正,将纠正后的数据回写到冗余行上,以实现冗余行上的数据的修复。也即是,在本申请实施例中,通过冗余行的读操作以及数据回写,对故障数据进行修复。
可选地,对冗余行执行读操作,如果从冗余行上读取出的数据为错误数据,则对错误数据进行纠正,将纠正后的数据回写到冗余行上,包括:将冗余行划分为M段,每段包括一个或多个存储单元,M为大于1的整数;令i=1,对冗余行上的第i段执行读操作;如果从冗余行上的第i段读取出的数据为错误数据,则对错误数据进行纠正,将纠正后的数据回写到第i段上;如果i不等于M,则令i=i+1,返回对冗余行上的第i段执行读操作,直至i等于M为止。也即是,计算机设备通过分段逐次读取、纠正和回写的方式,对冗余行上的数据进行修复。
可选地,在从冗余行上读取出的数据为错误数据之后,该方法还包括:产生可纠正错误CE;抑制CE。
在本申请实施例中,在从冗余行上读取出的数据为错误数据之后,计算机设备中会产生CE,计算机设备抑制该CE。也即是,由于计算机设备在读取冗余行时,检测到了错误数据,计算机设备会认为检测到了一个CE,由于该CE并非计算机的内存故障导致的,因此需要抑制该CE,也即不处理该CE,或者说计算机设备不记录该CE。
可选地,在对冗余行上的数据修复完成之后,该方法还包括:解除CE的抑制操作。
而在计算机设备在修复完冗余行之后产生的CE是真正内存故障产生的,因此,需要对该CE进行处理,也即解除CE的抑制操作,记录该CE。
前述介绍了在通过分析历史时间段内第一内存行的故障信息获得故障分析结果后,计算机设备启动内存的故障修复的实现方式为:在故障模式为内存行故障时,或者在故障模式为内存行故障且故障级别为高风险级别时,启动对内存的故障修复,故障修复为用冗余行替换故障行,对冗余行上的数据进行修复。在另一些实施例中,计算机设备通过分析历史时间段内第二bank的故障信息来获得故障分析结果,相应地,计算机设备启动对内存的故障修复的实现方式为:在故障模式为内存bank故障时,或者故障模式为内存bank故障且故障级别为高风险级别时,启动对内存的故障修复,故障修复为用冗余bank替换故障bank,对冗余bank上的数据进行修复。
也即是,故障分析结果包含故障模式,则计算机设备根据内存当前的故障分析结果启动对内存的故障修复包括:在故障模式为内存bank故障时,启动对内存的故障修复,其中,故障修复包括:用冗余bank替换故障bank,对冗余bank上的数据进行修复。
或者,故障分析结果包含故障模式和故障级别,则计算机设备根据内存当前的故障分析结果启动对内存的故障修复包括:在故障模式为内存bank故障且故障级别为高风险级别时,启动对内存的故障修复,其中,故障修复包括:用冗余bank替换故障bank,对冗余bank上的数据进行修复。
可选地,冗余bank和故障bank位于内存中的同一channel上。
需要说明的是,该实施例与前述实施例不同的是,该实施例中的第二bank与前述实施例中的第一内存行为一个级别的概念,前述实施例是以内存行的粒度来分析历史故障信息得到故障分析结果,该实施例是以bank的粒度来分析历史故障信息得到故障分析结果。在前述实施例中用冗余行替换故障行,冗余行和故障行在同一个bank上,在该实施例中用冗余bank替换故障bank,冗余bank和故障bank位于内存中的同一channel上。
第二方面,提供了一种内存故障的处理装置,所述内存故障的处理装置具有实现上述第一方面中内存故障的处理方法行为的功能。所述内存故障的处理装置包括一个或多个模块,该一个或多个模块用于实现上述第一方面所提供的内存故障的处理方法。
也即是,提供了一种内存故障的处理装置,该装置包括:
分析模块,用于在第一时刻启动对内存的故障分析;故障分析包括:通过分析历史故障信息,获得内存当前的故障分析结果,其中,历史故障信息为内存在历史时间段内积累的故障信息,历史时间段为第一时刻之前的时间段或者第一时刻之前且包含第一时刻的时间段;
处理模块,用于根据内存当前的故障分析结果启动对内存的故障修复。
可选地,第一时刻为计算机系统出现不可纠正错误UCE故障之前的时刻。
可选地,第一时刻包括:
根据预设的条件周期性启动的时刻;和/或,在计算机系统运行之后,确定内存发生内存故障的时刻。
可选地,分析模块包括:
分析子模块,用于将历史故障信息输入故障分析模型,获得内存当前的故障分析结果,故障分析模型为智能计算分析模型。
可选地,故障分析结果包含故障模式,则处理模块包括:
第一修复子模块,用于在故障模式为内存行故障时,启动对内存的故障修复,其中,故障修复包括:用冗余行替换故障行,对冗余行上的数据进行修复。
可选地,分析模块具体用于:
根据历史故障信息获得第一统计特征,第一统计特征表示历史时间段内第一内存行出现的故障位的数量,第一内存行是任意内存行;
当第一统计特征大于第一阈值时,确定故障模式为内存行故障,第一阈值表示每个内存行能够容忍的故障位的数量。
可选地,故障分析结果还包含故障级别,则处理模块包括:
第二修复子模块,用于在故障模式为内存行故障且故障级别为高风险级别时,启动对内存的故障修复。
可选地,分析模块还具体用于:
根据历史故障信息获得第二统计特征和/或第三统计特征,第二统计特征表示历史时间段内第一内存行出现的每种故障类型的故障数量,第三统计特征表示历史时间段内第一内存行出现的纠错数量;
当第二统计特征大于第二阈值时,或者,当第三统计特征大于第三阈值时,或者,当第二统计特征大于第二阈值且第三统计特征大于第三阈值时,确定故障级别为高风险级别,第二阈值表示每个内存行能够容忍的每种故障类型的故障数量,第三阈值表示每个内存行能够 容忍的纠错数量。
可选地,该装置还包括:
交互模块,用于在交互界面上显示风险模式选项,风险模式选项包括内存高风险模式选项和内存低风险模式选项。
可选地,第一阈值、第二阈值和第三阈值为根据风险模式而设置的变量。
可选地,第一修复子模块具体用于:
对冗余行执行读操作;
如果从冗余行上读取出的数据为错误数据,则对错误数据进行纠正,将纠正后的数据回写到冗余行上,以实现冗余行上的数据的修复。
可选地,该装置还包括:
产生模块,用于从冗余行上读取出的数据为错误数据之后,产生可纠正错误CE;
抑制模块,用于抑制CE。
可选地,该装置还包括:
解除模块,用于在对冗余行上的数据修复完成之后,解除CE的抑制操作。
可选地,故障分析结果包含故障模式,则处理模块包括:
第三修复子模块,用于在故障模式为内存bank故障时,启动对内存的故障修复,其中,故障修复包括:用冗余bank替换故障bank,对冗余bank上的数据进行修复。
可选地,冗余bank和故障bank位于内存中的同一channel上。
第三方面,提供了一种计算机设备,所述计算机设备中存储有计算机程序,所述计算机程序被计算机设备运行时实现上述第一方面所提供的内存故障的处理方法。
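Optionally, the computer device includes a processor and a memory. The memory is configured to store a program for performing the memory fault handling method provided in the first aspect, and to store the data involved in implementing that method. The processor is configured to execute the program stored in the memory to implement the memory fault handling method provided in the first aspect. The computer device may further include a communication bus used to establish a connection between the processor and the memory.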
第四方面,提供了一种计算机可读存储介质,所述计算机可读存储介质中存储有指令,当其在计算机上运行时,使得计算机执行上述第一方面所提供的内存故障的处理方法。
第五方面,提供了一种包含指令的计算机程序产品,当其在计算机上运行时,使得计算机执行上述第一方面所述的内存故障的处理方法。
上述第二方面、第三方面、第四方面和第五方面所获得的技术效果与第一方面中对应的技术手段获得的技术效果近似,在这里不再赘述。
本申请实施例提供的技术方案至少能够带来以下有益效果:
在本申请实施例中,通过分析历史故障信息得到故障分析结果,进而根据故障分析结果对内存进行故障修复,本方案能够更加精确地分析内存故障。另外,本方案无需冷复位即能启动对内存的故障修复,也即能够及时修复内存故障,防止系统宕机,减少业务影响。
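Brief Description of Drawings
FIG. 1 is a flowchart of a memory fault handling method according to an embodiment of this application;
FIG. 2 is a schematic diagram of data repair on a redundant row according to an embodiment of this application;
FIG. 3 is a flowchart of another memory fault handling method according to an embodiment of this application;
FIG. 4 is a flowchart of still another memory fault handling method according to an embodiment of this application;
FIG. 5 is a flowchart of still another memory fault handling method according to an embodiment of this application;
FIG. 6 is a schematic structural diagram of a memory fault handling apparatus according to an embodiment of this application;
FIG. 7 is a schematic structural diagram of another memory fault handling apparatus according to an embodiment of this application;
FIG. 8 is a schematic structural diagram of still another memory fault handling apparatus according to an embodiment of this application;
FIG. 9 is a schematic structural diagram of a computer device according to an embodiment of this application.
Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the implementations of this application are described in further detail below with reference to the accompanying drawings.
FIG. 1 is a flowchart of a memory fault handling method according to an embodiment of this application. The method is applied to a computer device. Referring to FIG. 1, the method includes the following steps.
Step 101: Start fault analysis of the memory at a first moment, where the fault analysis includes obtaining a current fault analysis result of the memory by analyzing historical fault information.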
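In the embodiments of this application, the basic storage cell of a memory (for example, dynamic random access memory, DRAM) usually consists of one transistor and one capacitor, and the amount of charge on the capacitor determines whether the cell holds '0' or '1'. Ionizing particles from the external environment, or semiconductor hardware defects in the internal transistors, can cause memory errors, that is, memory faults.
After a memory fault occurs, the memory's own error correction algorithm (for example, error checking and correcting, ECC) corrects the error. A corrected error is called a correctable error (corrected error, CE). The error correction algorithm has limited capability; if an error exceeds that capability, an uncorrectable error (uncorrected error, UCE) is produced and the device crashes.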
本申请实施例为了及时修复内存故障,减少产生UCE,减少设备宕机重启,以减轻对业务的影响,计算机设备通过分析历史时间段内发生的内存故障的故障信息,得到故障分析结果,之后根据故障分析结果,确定是否处理内存故障,以及如何处理内存故障。
在本申请实施例中,计算机设备在第一时刻启动对内存的故障分析,故障分析包括通过分析历史故障信息,获得内存当前的故障分析结果,其中,历史故障信息为内存在历史时间段内积累的故障信息,历史时间段为第一时刻之前的时间段或者第一时刻之前且包含第一时刻的时间段。
可选地,第一时刻为计算机系统出现UCE故障之前的时刻。也即是,在计算机系统运行期间启动对内存的故障分析,计算机系统运行期间是指计算机系统正常工作期间。
可选地,第一时刻包括:根据预设的条件周期性启动的时刻;和/或,在计算机系统运行之后,确定内存发生内存故障的时刻。
也即是,计算机设备在检测到发生内存故障时,启动分析历史故障信息,获得故障分析结果。或者,计算机设备周期性地分析历史故障信息,获得故障分析结果。或者,计算机设备周期性地分析历史故障信息,获得故障分析结果,以及如果在周期间隔内检测到发生内存故障,则分析历史故障信息,获得故障分析结果,并以本次检测到内存故障的时间为准重新开始周期分析。或者,计算机设备周期性地分析历史故障信息,获得故障分析结果,以及如果在周期间隔内检测到发生内存故障,则分析历史故障信息,获得故障分析结果,但不以本次检测到内存故障的时间为准重新开始周期分析,也即不影响周期分析。
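The four triggering policies described above (on fault only, periodic only, periodic plus on fault with the period restarted, and periodic plus on fault without restarting the period) can be sketched as a small scheduler. This is a minimal illustration, not the implementation of the application: the class `AnalysisScheduler`, its fields, and the use of plain numeric timestamps are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class AnalysisScheduler:
    """Decides when fault analysis starts (all names here are illustrative)."""
    period: float                    # time between periodic analyses; 0 disables them
    on_fault: bool = True            # also analyse whenever a fault is detected
    restart_on_fault: bool = False   # restart the period from the fault time
    next_periodic: float = 0.0       # time of the next periodic analysis

    def tick(self, now: float) -> bool:
        """Return True if a periodic analysis is due at time `now`."""
        if self.period > 0 and now >= self.next_periodic:
            self.next_periodic = now + self.period
            return True
        return False

    def fault_detected(self, now: float) -> bool:
        """Return True if a detected fault should trigger an analysis."""
        if not self.on_fault:
            return False
        if self.restart_on_fault and self.period > 0:
            # Third policy: the periodic schedule restarts from this fault.
            self.next_periodic = now + self.period
        return True
```

Setting `period=0` gives the fault-triggered policy alone, `on_fault=False` gives the purely periodic policy, and `restart_on_fault` selects between the last two policies.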
需要说明的是,计算机设备周期性地分析历史故障信息,以及时预测内存故障的严重性,及时修复内存故障。
可选地,本申请实施例通过故障分析模型来智能分析历史故障信息的方式获得故障分析结果,也即是,计算机设备将历史故障信息输入故障分析模型,获得内存当前的故障分析结果,故障分析模型为智能计算分析模型。
需要说明的是,通过故障分析模型分析历史故障信息仅为本申请实施例提供的分析历史故障信息的一种实现方式,计算机设备也能够通过其他实现方式分析历史故障信息,例如基于数据统计的方式,本申请实施例对采用的分析方法不作限定。接下来对计算机设备通过故障分析模型或者通过其他方式获得故障分析结果的实现方式进行介绍。
在本申请实施例中,故障分析结果包含故障模式,则在故障模式为内存行故障时,计算机设备启动对内存的故障修复,其中,故障修复包括:用冗余行替换故障行,对冗余行上的数据进行修复。也即是,计算机设备通过分析历史故障信息确定内存当前的故障模式为内存行故障时,进行内存行替换以及数据修复。
在本申请实施例中,历史故障信息包括历史时间段内发生的内存故障的故障位置和故障时间,计算机设备统计历史故障信息包括的故障位置和故障时间,来分析内存故障信息,并确定故障模式。
其中,故障位置是指发生内存故障的物理地址。需要说明的是,每次发生的内存故障位于一个cell上,当检测到发生内存故障时,本次发生内存故障的cell位于哪个bank的哪个内存行,或者位于哪个bank的哪行哪列,即为本次发生内存故障的故障位置。故障时间是指发生内存故障的时间。
需要说明的是,计算机设备中存储有内存故障日志,内存故障日志中记录有历史时间段内发生的内存故障的故障信息,也即存储有历史故障信息。
在本申请实施例中,计算机设备获得内存当前的故障分析结果包括:根据历史故障信息获得第一统计特征,第一统计特征表示历史时间段内第一内存行出现的故障位的数量,第一内存行是任意内存行,当第一统计特征大于第一阈值时,确定故障模式为内存行故障,第一阈值表示每个内存行能够容忍的故障位的数量。
可选地,假设计算机设备通过故障分析模型分析历史故障信息,那么故障分析模型包括第一阈值,计算机设备将历史故障信息输入故障分析模型,由故障分析模型根据历史故障信息获得第一统计特征。也即是,计算机设备通过故障分析模型统计历史时间段内第一内存行出现的故障位的数量,获得第一统计特征,通过阈值判断确定故障模式。
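Note that a memory includes multiple banks, each bank includes multiple memory rows, and each memory row includes multiple cells. A cell on which a memory fault has occurred is a faulty bit. During the history period, a given cell may have had no memory fault, exactly one, or more than one. The historical fault information includes the fault time and fault location of every memory fault in the history period. The computer device counts, among the memory faults located on the first memory row during the history period, the number with distinct fault locations, and this count is the first statistical feature. If the first statistical feature is greater than the first threshold, multiple cells on the first memory row have had memory faults, and the computer device determines that the current fault mode of the memory is a memory row fault.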
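As a hedged illustration of the first statistical feature, the sketch below counts distinct faulty cells per row within the history period and compares the count with the first threshold. The record layout (`FaultRecord` with `time`, `bank`, `row`, `col`) and both function names are assumptions introduced for this example; the application does not specify a log format.

```python
from collections import defaultdict
from typing import NamedTuple


class FaultRecord(NamedTuple):
    """One entry of the memory fault log (field names are assumptions)."""
    time: float   # when the fault occurred
    bank: int
    row: int
    col: int      # column of the faulty cell within the row


def faulty_bits_per_row(log, window_start, window_end):
    """Count distinct faulty cells per (bank, row) inside the history period."""
    cells = defaultdict(set)
    for rec in log:
        if window_start <= rec.time <= window_end:
            # Repeated faults on the same cell are counted once.
            cells[(rec.bank, rec.row)].add(rec.col)
    return {row: len(cols) for row, cols in cells.items()}


def rows_with_row_fault(log, window_start, window_end, first_threshold):
    """Rows whose first statistical feature is greater than the threshold."""
    counts = faulty_bits_per_row(log, window_start, window_end)
    return sorted(row for row, n in counts.items() if n > first_threshold)
```

Using a set per row matches the rule that only faults with distinct locations contribute to the count.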
另外,由前述可知,计算机设备周期性地启动内存故障分析,或者发生内存故障时启动内存故障分析,基于此,计算机设备确定需要统计的第一内存行有多种情况,接下来对此进行介绍。
在检测到发生内存故障而启动内存故障分析的情况下,计算机设备根据本次发生的内存故障的故障位置确定第一内存行,第一内存行是指本次发生的内存故障所在的内存行。或者,计算机设备根据本次发生的内存故障的故障位置确定第一bank,将第一bank包括的一个内存行确定为第一内存行,第一bank是指本次发生的内存故障所在的bank,第一内存行是指第一bank包括的内存行中的一个。或者,计算机设备将内存包括的一个内存行确定为第一内存行,也即第一内存行是指内存包括的内存行中的一个。
在计算机设备周期性地启动内存故障分析的情况下,计算机设备根据最近一次发生内存故障的故障位置确定第一内存行,第一内存行是指最近一次发生的内存故障所在的内存行。或者,计算机设备根据最近一次发生的内存故障的故障位置确定第一bank,将第一bank包括的一个内存行确定为第一内存行,第一bank是指最近一次发生的内存故障所在的bank,第一内存行是指第一bank包括的内存行中的一个。或者,计算机设备将内存包括的一个内存行确定为第一内存行,也即第一内存行是指内存包括的内存行中的一个。
需要说明的是,在第一内存行是指第一bank或内存包括的内存行中的一个的情况下,对于第一bank或内存中除第一内存行之外的其他内存行,计算机设备也按照与统计第一内存行相同的方式,统计得到其他内存行中每个内存行对应的数据,并根据统计得到的数据确定第一统计特征。
在第一内存行是指本次或最近一次发生内存故障的内存行的情况下,计算机设备统计历史故障信息中关于第一内存行的故障信息,得到一个数量,将统计得到的数量直接作为第一统计特征,也即得到一个第一统计特征。在第一内存行是指第一bank或内存包括的内存行中的一个的情况下,计算机设备统计历史故障信息中关于多个第一内存行的故障信息,得到多个数量,每个数量对应一个内存行,计算机设备将统计得到的多个数量的最大值作为第一统计特征,或者将该多个数量中的每个数量作为一个第一统计特征,得到多个第一统计特征,每个第一统计特征对应一个内存行。
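In the embodiments of this application, after obtaining the first statistical feature, the computer device compares it with the first threshold to determine the current fault mode of the memory. For example, when a single first statistical feature is obtained, the fault mode is determined to be a memory row fault if that feature is greater than the first threshold. When multiple first statistical features are obtained, the fault mode is determined to be a memory row fault if at least one of them is greater than the first threshold.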
可选地,故障分析结果还包含故障级别,则在故障模式为内存行故障且故障级别为高风险级别时,计算机设备启动对内存的故障修复。接下来介绍计算机设备通过分析历史故障信息确定内存当前的故障级别的实现方式。
在本申请实施例中,计算机设备获得内存当前的故障分析结果,还包括:根据历史故障信息获得第二统计特征和/或第三统计特征,第二统计特征表示历史时间段内第一内存行出现的每种故障类型的故障数量,第三统计特征表示历史时间段内第一内存行出现的纠错数量,当第二统计特征大于第二阈值时,或者,当第三统计特征大于第三阈值时,或者,当第二统计特征大于第二阈值且第三统计特征大于第三阈值时,确定故障级别为高风险级别。其中,第二阈值表示每个内存行能够容忍的每种故障类型的故障数量,第三阈值表示每个内存行能够容忍的纠错数量。
可选地,假设计算机设备通过故障分析模型分析历史故障信息,那么故障分析模型还包括第二阈值和/或第三阈值。计算机设备将历史故障信息输入故障分析模型,由故障分析模型根据历史故障信息获得第二统计特征和/或第三统计特征。也即是,计算机设备通过故障分析模型统计历史时间段内第一内存行出现的每种故障类型的故障数量,获得第二统计特征,和/或,统计历史时间段内第一内存行出现的纠错数量,获得第三统计特征。之后,计算机设备通过故障分析模型将第二统计特征与第二阈值进行比较,和/或,将第三统计特征与第三阈值进行比较,确定故障级别。
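Note that the historical fault information also includes the fault type and the error correction information of the memory faults that occurred during the history period. The fault types include the CE type and the UCE type. Optionally, the CE type includes the patrol CE type, the read CE type, and so on. The error correction information includes, for each memory fault that occurs, the amount of data corrected by error correction such as ECC (also called the corrected data, measured for example in bits), the error correction code, and similar information.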
在本申请实施例中,由前述可知,在计算机设备周期性地启动内存故障分析,或者发生内存故障时启动内存故障分析,基于此,计算机设备统计历史故障信息中第一内存行的故障信息,获得第二统计特征和/或第三统计特征的实现方式有很多,也即计算机设备确定需要统计的第一内存行的有多种情况,与前述介绍的统计得到第一统计特征的过程中,确定第一内存行的多种情况相同,请参照前述介绍,这里不再赘述。
在第一内存行是指本次或最近一次发生内存故障的内存行的情况下,计算机设备统计得到一个内存行对应的数据,将统计得到的数据直接作为第二统计特征和/或第三统计特征。在第一内存行是指第一bank或内存包括的内存行中的一个的情况下,计算机设备统计得到多个内存行对应的数据,计算机设备将统计得到的数据作为相应内存行对应的第二统计特征和/或第三统计特征。
在本申请实施例中,计算机设备在得到第二统计特征和/或第三统计特征之后,将第二统计特征与第二阈值进行比较,和/或,将第三统计特征与第三阈值进行比较,来确定内存当前的故障级别。
需要说明的是,由于故障类型有很多种,历史故障信息中的故障类型可能有一种或多种,因此,计算机设备需要统计第一内存行出现的一种或多种故障类型的故障数量,得到该内存行对应的一个或多个第二统计特征,且每个第二统计特征对应一种故障类型。
可选地,计算机设备中存储有一个第二阈值或多个第二阈值。例如,故障分析模型包括一个第二阈值或多个第二阈值。
在计算机设备存储有一个第二阈值的情况下,计算机设备将得到的每个内存行对应的一个或多个第二统计特征中的每个第二统计特征均与第二阈值进行比较,当该一个或多个第二统计特征中的全部或部分大于第二阈值时,确定故障级别为高风险级别。
在计算机设备存储有多个第二阈值的情况下,该多个第二阈值中的每个第二阈值对应一种故障类型,对于得到的每个内存行对应的一个或多个第二统计特征,计算机设备将每个第二统计特征与对应相同故障类型的第二阈值进行比较,当该一个或多个第二统计特征中的全部或部分大于对应的第二阈值时,确定故障级别为高风险级别。
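For example, suppose the fault types include the patrol CE type, the read CE type, and the UCE type, and the memory faults on the first memory row during the history period comprise three patrol CEs and one read CE. The computer device then obtains two second statistical features for the first memory row, 3 and 1, where 3 corresponds to the patrol CE type and 1 to the read CE type. If the computer device stores a single second threshold of 5, it compares both 3 and 1 with 5 and determines that the fault level is low risk. If the computer device stores three second thresholds, 8, 5, and 2, corresponding to the patrol CE, read CE, and UCE types respectively, it compares 3 with 8 and 1 with 5 and determines that the fault level is low risk.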
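The worked example above can be reproduced with a short sketch. The function name `fault_level`, the string keys for fault types, and the choice to report high risk as soon as any one per-type count exceeds its threshold (the text allows "all or some") are assumptions of this illustration.

```python
def fault_level(type_counts, second_threshold):
    """Return "high" if any per-type fault count exceeds its threshold.

    `second_threshold` is either a single number applied to every fault
    type or a dict mapping each fault type to its own threshold.
    """
    for fault_type, count in type_counts.items():
        if isinstance(second_threshold, dict):
            limit = second_threshold[fault_type]
        else:
            limit = second_threshold
        if count > limit:
            return "high"
    return "low"


# Second statistical features from the example: 3 patrol CEs, 1 read CE.
counts = {"patrol_ce": 3, "read_ce": 1}
single = fault_level(counts, 5)
per_type = fault_level(counts, {"patrol_ce": 8, "read_ce": 5, "uce": 2})
```

Both calls yield the low-risk result of the example, because neither 3 nor 1 exceeds the applicable threshold.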
需要说明的是,在第一内存行是指本次或最近一次发生内存故障的内存行的情况下,由于仅统计一个内存行对应的数据,这样,当该内存行对应的第二统计特征大于第二阈值,和/或第三统计特征大于第三阈值时,确定故障级别为高风险级别,如果根据前述方法分析该内存行的故障信息确定内存当前的故障模式为内存行故障,则确定该内存行为故障行,需要启动对内存的故障修复。
而在第一内存行是指第一bank或内存包括的内存行中的一个的情况下,由于统计多个内存行分别对应的数据,这样,当同一内存行对应的第一统计特征大于第一阈值,且对应的第二统计特征大于第二阈值和/或第三统计特征大于第三阈值时,确定该内存行为故障行,需要启动对内存的故障修复。
可选地,在交互界面上显示风险模式选项,风险模式选项包括内存高风险模式选项和内存低风险模式选项。也即是,计算机设备提供交互界面,用户可以通过交互界面选择风险模式。
可选地,第一阈值、第二阈值和第三阈值为根据风险模式而设置的变量。
可选地,内存高风险模式的第一阈值小于内存低风险模式的第一阈值;和/或,内存高风险模式的第二阈值小于内存低风险模式的第二阈值;和/或,内存高风险模式的第三阈值小于内存低风险模式的第三阈值。
可选地,历史时间段的时长为设置的固定的参数。例如,历史时间段是指从计算机设备装机运行开始到本次分析故障信息之间的时间段,或者,用户通过计算机设备配置历史时间段的时长,例如配置历史时间段的时长为一个月,历史时间段即指本次分析故障信息之前的一个月时间。
可选地,历史时间段的时长为根据风险模式而设置的变量,内存高风险模式的历史时间段的时长小于内存低风险模式的历史时间段的时长。
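One way to picture thresholds and history length as variables set by the risk mode is a small lookup table. The sketch below is illustrative only: `RiskProfile`, `PROFILES`, and every numeric value are placeholders chosen to satisfy the ordering stated above (each high-risk value smaller than its low-risk counterpart); the application gives no concrete numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskProfile:
    first_threshold: int    # tolerated faulty bits per row
    second_threshold: int   # tolerated faults per fault type per row
    third_threshold: int    # tolerated error-correction count per row
    window_days: int        # length of the history period


# Placeholder numbers chosen only to satisfy the ordering in the text.
PROFILES = {
    "high_risk": RiskProfile(2, 3, 16, 7),
    "low_risk": RiskProfile(4, 8, 64, 30),
}


def profile_is_stricter(a: RiskProfile, b: RiskProfile) -> bool:
    """True if every threshold and the window of `a` are smaller than `b`'s."""
    return (a.first_threshold < b.first_threshold
            and a.second_threshold < b.second_threshold
            and a.third_threshold < b.third_threshold
            and a.window_days < b.window_days)
```

A check such as `profile_is_stricter(PROFILES["high_risk"], PROFILES["low_risk"])` can guard against a misconfigured table if users are allowed to edit individual values.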
可选地,计算机设备在分析出故障模式为内存行故障时,或者在分析出故障模式为内存行故障且故障级别为高风险级别时,通过交互界面提示存在内存故障风险。
可选地,用户还可以通过交互界面修改第一阈值、第二阈值、第三阈值和历史时间段的时长中的一个或多个。
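As described above, the user can flexibly select a risk mode as needed. For example, if the user's services are high-risk, the user can select the high-risk mode. In this mode the first, second, and/or third thresholds are lower and/or the history period is shorter: the computer device analyzes historical fault information from a shorter period to obtain the first statistical feature and the second and/or third statistical features, and compares them with smaller thresholds to decide whether there is a memory row fault at the high-risk level. This ensures that even less severe memory row faults are identified in time. If the user's services are low-risk, the user can select the low-risk mode, which ensures that the more severe memory row faults are identified in time.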
在本申请实施例中,计算机设备提供交互界面给用户选择风险模式,计算机设备根据用户选择的风险模式,确定需要分析的故障信息的时长和/或阈值判断时的阈值大小,通过统计相应时长内的故障信息,并进行阈值比较,在识别出故障模式为内存行故障时,或者在识别出故障模式为内存行故障且故障级别为高风险级别时,及时修复内存故障。这样,将用户选择的风险模式与阈值比较的方法融合,在精准预测内存行故障的同时,减轻计算机设备的计算压力。
可选地,在另一些实施例中,对于第二统计特征和第三统计特征,计算机设备以更细粒度的统计方式来统计数据。例如,计算机设备统计第一时间间隔内每种故障类型的内存故障在第一内存行上出现的最大次数和平均次数中的至少一个,得到第二统计特征,以及统计第一时间间隔内针对第一内存行上每种故障类型的内存故障的最大纠错数据量和平均纠错数据量中的至少一个,得到第三统计特征,历史时间段包括多个时间间隔,第一时间间隔为多个时间间隔中的一个。
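The computer device determines the fault level (also called the risk level) from the maximum count and/or the average count together with the maximum and/or average amount of corrected data. For example, when the computer device determines the maximum count and the maximum amount of corrected data, it determines that the fault level is high risk if the maximum count is greater than or equal to the second threshold and/or the maximum amount of corrected data is greater than or equal to the third threshold; here the fault level is either low risk or high risk. Alternatively, the computer device determines the fault level by comparison against thresholds, where the fault level optionally has several grades, for example level 1, level 2, and level 3, with level 1 indicating a more severe memory risk and level 3 a less severe one.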
需要说明的是,在该实施例中,平均次数包括算数平均值、几何平均值、调和平均值中的一个或多个,另外,除了统计最大次数和/或平均次数、最大纠错数据量和/或平均纠错数据量之外,还可以统计其他的数据,例如各种数据的中值、标准差等,也即是统计方式有很多,本申请实施例仅以统计最大次数和平均次数、最大纠错数据量和平均纠错数据量为例进行说明。
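The finer-grained statistics can be sketched as follows, under one possible reading of the text: the history period is split into equal intervals, faults are counted per interval, and the maximum and arithmetic mean of those counts are taken. The function names and the equal-interval assumption are introduced here for illustration only.

```python
from statistics import mean


def interval_counts(fault_times, start, interval, n_intervals):
    """Number of faults falling in each of `n_intervals` equal intervals."""
    counts = [0] * n_intervals
    for t in fault_times:
        idx = int((t - start) // interval)
        if 0 <= idx < n_intervals:   # ignore faults outside the history period
            counts[idx] += 1
    return counts


def max_and_mean(counts):
    """Maximum and arithmetic mean of the per-interval counts."""
    return max(counts), mean(counts)
```

The same two helpers apply to corrected-data amounts if each fault contributes its corrected bit count instead of 1; `statistics` also offers `geometric_mean`, `harmonic_mean`, `median`, and `stdev` for the other statistics mentioned above.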
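Optionally, when the computer device can also determine the fault level, a first fault level is stored in the computer device. When the computer device identifies a memory row fault and the identified fault level equals or exceeds the first fault level, it repairs the memory row fault automatically. Alternatively, the computer device first indicates on the interactive interface that a relatively severe memory fault exists, prompting the user to choose whether to repair it, and then decides whether to repair the memory row fault according to the user's choice.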
可选地,计算机设备中存储的第一故障级别为默认配置。或者,第一故障级别为用户选择的故障级别,也即是,用户预先通过计算机设备提供的交互界面根据业务风险需求选择故障级别。
在该实施例中,计算机设备每次都统计获得细粒度的统计特征,来识别出故障模式和故障级别,更加精准地预测内存行故障以及风险等级。
可选地,在其他一些实施例中,计算机设备分析历史故障信息,确定故障模式以及故障 级别的实现方式也可以为:计算机设备通过统计数据进行阈值判断的方式,确定故障模式,以及通过故障分析模型,基于智能分析的方式确定故障级别。在这种实现方式中,计算机设备统计历史故障信息中的故障时间和故障位置等,通过阈值比较的方式识别故障行模式,另外,通过故障分析模型来智能分析历史故障信息中的故障位置、故障时间、故障类型和故障纠错信息,识别出故障级别。可选地,在这种实现方式中,计算机设备提供交互界面给用户选择配置历史时间段的时长、第一阈值、第一故障级别等,计算机设备根据用户选择的配置,精准预测内存行故障以及故障级别。
步骤102:根据内存当前的故障分析结果启动对内存的故障修复。
在本申请实施例中,在故障分析结果包含故障模式,且内存当前的故障模式为内存行故障时,计算机设备启动对内存的故障修复。可选地,在故障分析结果还包含故障级别,且故障模式为内存行故障且故障级别为高风险级别时,启动对内存的故障修复。
在本申请实施例中,故障修复包括:用内存中的冗余行替换故障行,对冗余行上的数据进行修复。
其中,故障行是指发生内存行故障的内存行。例如,在第一内存行是指本次发生的(或者最近一次发生的)内存故障所在的内存行时,故障行即为第一内存行。在第一内存行是指第一bank(或者内存)包括的内存行中的一个时,计算机设备能够通过阈值判断或者智能分析的方式,确定故障行,故障行为第一bank(或者内存)上的一个内存行。
在本申请实施例中,冗余行和故障行位于内存中的同一个bank上,计算机设备用故障行所在bank上的冗余行替换故障行。
可选地,计算机设备在确定需要启动对内存的故障修复之后,还生成行故障隔离请求,在生成行故障隔离请求之后,用内存中的冗余行替换故障行。
由前述可知,用户可以根据业务风险需求选择风险模式,这样在计算机设备根据用户选择的风险模式,并确定故障模式为内存行故障之后,生成行故障隔离请求,表示当前满足内存行故障处理的条件,计算机设备进行内存行替换。可选地,计算机设备还可以再提示用户选择内存行故障修复,计算机设备在接收到用户确定进行内存行故障修复的指令之后,进行内存故障行替换。
可选地,本申请实施例中在线进行内存故障行替换的技术包括软封装后修复(soft post package repair,sPPR)技术。
在本申请实施例中,计算机设备对冗余行上的数据进行修复的实现方式为:对冗余行执行读操作,如果从冗余行上读取出的数据为错误数据,则对错误数据进行纠正,将纠正后的数据回写到冗余行上,以实现冗余行上的数据的修复。也即是,在本申请实施例中,通过冗余行的读操作以及数据回写,对故障数据进行修复。
需要说明的是,计算机设备通过触发对冗余行的读操作,读取冗余行所在的内存颗粒(chip)上的所有数据,当读到冗余行时,根据读取的该内存颗粒上的其他数据,判断冗余行上的数据是否为错误数据,并根据读取的其他数据,对错误数据进行纠正。在其他一些实施例中,计算机设备通过触发对冗余行的读操作,读取内存中包括冗余行所在的bank以及其他一些bank上的数据,根据读取的数据对冗余行进行数据纠错。也即是,计算机设备实际读取哪些bank或者哪些内存颗粒上的数据来对冗余行进行数据纠错,这与实际内存存储数据时的存储算法(如内存交织)、内存读操作的片选信号连接哪些bank等相关。
在本申请实施例中,内存读操作是以分段读取的方式执行的,计算机设备中默认配置有内存读操作的读间隔,例如读间隔为4bit,也即每次读取4bit数据,或者读间隔为一个或两个cell,也即每次读取一个或两个cell的数据,用户也可以更改默认配置。
例如,读间隔为4bit,对于冗余行的数据,假设冗余行上的数据为100bit,那么计算机设备按照顺序每次读取4bit数据并进行修复,修复之后,再读取下一个4bit数据进行修复,直至将冗余行上的数据全部修复。
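Optionally, the computer device divides the redundant row into M segments, each containing one or more storage cells, where M is an integer greater than 1. Let i = 1 and perform a read operation on the i-th segment of the redundant row. If the data read from the i-th segment is erroneous, correct it and write the corrected data back to the i-th segment. If i is not equal to M, let i = i + 1 and return to the read operation on the i-th segment, until i equals M.
For example, 4 bits of data are read at a time, corrected by the error correction algorithm, and the corrected data is written back to the location of those 4 bits.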
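The segment-by-segment loop can be modelled in a few lines. This is a software sketch under stated assumptions, not controller logic: the redundant row is represented as a Python list of bits, and the hypothetical callback `read_reference` stands in for whatever ECC or SDDC mechanism supplies the correct contents of a segment.

```python
def repair_redundant_row(row, segment_size, read_reference):
    """Read each segment in turn, correct it if wrong, and write it back.

    `row` is a mutable list of bits standing in for the redundant row.
    `read_reference(start, end)` is a stand-in for the ECC/SDDC logic: it
    returns the bits that the segment should hold.
    Returns the number of segments that needed correction.
    """
    corrected = 0
    m = (len(row) + segment_size - 1) // segment_size   # number of segments M
    for i in range(m):                                  # i = 1..M in the text
        start = i * segment_size
        end = min(start + segment_size, len(row))
        data = row[start:end]                           # read segment i
        expected = read_reference(start, end)
        if data != expected:                            # erroneous data
            row[start:end] = expected                   # correct and write back
            corrected += 1
    return corrected
```

With `segment_size=4` this mirrors the 4-bit example; a second pass over an already repaired row finds nothing to correct.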
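Note that while performing the read operation on the redundant row, the computer device corrects the data on the redundant row by using an error correction algorithm (for example, ECC or single device data correction (SDDC)).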
图2是本申请实施例示出的一种通过读操作修复冗余行上数据的方法示意图。参见图2,该方法包括如下步骤:
步骤201:计算机设备进行行地址解析。也即是,计算机设备对故障行进行行地址解析,用冗余行替换故障行,也即将故障行的内存数据的地址映射指向冗余行,此时冗余行的数据为空。
步骤202:计算机设备启动内存区域读操作。也即是,计算机设备通过对冗余行的内存读操作,读取多个bank上的数据,包括冗余行所在的第一bank。在读取到冗余行上的数据时,计算机设备根据读取的其他bank上的数据,确定冗余行上的数据为错误数据(黑色填充方格所示)。
步骤203:计算机设备进行数据纠错。也即是,计算机设备根据读取的其他bank上的数据,对错误数据进行纠正。
步骤204:计算机设备进行数据回写。也即是,计算机设备将纠正后的数据回写到冗余行上,实现冗余行替换故障行后的数据修复。
示例性地,图2所示的一个小方格代表4bit数据,且计算机设备每次读取冗余行上的4bit数据进行纠正,也即计算机设备在读取到冗余行时,依次读取冗余行包括的一个小方格,假设读取到图2所示的第二个小方格,也即黑色填充方格所在位置上的4bit数据,根据读取的其他bank上的数据进行对该黑色填充方格对应的4bit数据纠正之后,得到纠正后的数据,将纠正后的数据回写到冗余行上黑色填充方格所在的位置。之后,再读取冗余行上位于黑色填充方格之后的一个小方格,也即第三个小方格中的4bit数据,并进行数据纠错,以及数据回写到对应的位置。以此类推,计算机设备通过分段逐次的方式,执行读取、纠正和回写的动作,以对冗余行上的数据进行修复。
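In the embodiments of this application, after the data read from the redundant row is found to be erroneous, a CE is generated in the computer device, and the computer device suppresses that CE.
That is, because the computer device detects erroneous data when reading the redundant row, it considers that a CE has been detected. Since this CE is not caused by a memory fault of the computer device, it needs to be suppressed, that is, the computer device does not handle the CE, or in other words does not record it.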
可选地,计算机设备在触发对冗余行的读操作开始,至对冗余行的数据修复完成时,抑制该过程中的CE。
可选地,计算机设备在对冗余行上的数据修复完成之后,解除CE的抑制操作。也即是,计算机设备在修复完冗余行之后产生的CE是真正内存故障产生的,因此,需要对该CE进行处理,也即解除CE的抑制操作,记录该CE。
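Note that under normal circumstances, each time the computer device generates a CE, a CE interrupt occurs and the fault information of that CE is recorded in the memory fault log. In the embodiments of this application, CEs are suppressed during the read operation, so the computer device does not record the fault information of CEs generated during this process in the memory fault log.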
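The suppress-then-lift behaviour maps naturally onto a scoped guard. The toy class below is a sketch only: `CeLogger`, `report_ce`, and `suppress_ce` are hypothetical names, and a real system would configure this in the memory controller rather than in a Python object.

```python
from contextlib import contextmanager


class CeLogger:
    """Toy fault log that can temporarily ignore correctable errors."""

    def __init__(self):
        self.suppressed = False
        self.entries = []

    def report_ce(self, location):
        """Record a CE unless suppression is active; return True if logged."""
        if self.suppressed:
            return False
        self.entries.append(location)
        return True

    @contextmanager
    def suppress_ce(self):
        """Suppress CE logging for the duration of the redundant-row repair."""
        self.suppressed = True
        try:
            yield self
        finally:
            self.suppressed = False   # lift suppression even if repair fails
```

Running the repair inside `with log.suppress_ce():` keeps the CEs produced by reading the still-empty redundant row out of the log, while any CE reported after the block is recorded as a genuine fault.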
在本申请实施例中,计算机设备通过模块实现以上功能,参见图3,计算机设备包括执行模块和故障识别模块,计算机设备通过执行模块与故障识别模块实现上述内存故障的处理方法,该方法包括如下步骤。
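Step 301: The execution module detects a memory fault and reports the fault information of this fault (including the fault location and fault time) to the fault identification module, that is, it reports the CE, to trigger the fault identification module to start fault analysis.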
步骤302:故障识别模块对内存错误进行解析,也即对历史故障信息进行解析,如对物理地址(故障位置)进行解析。
步骤303:故障识别模块进行内存故障识别预测,也即根据历史故障信息,分析确定故障模式,或者确定故障模式和故障级别,并在确定的故障模式满足内存故障修复的条件时,或者在确定的故障模式和故障级别满足内存故障修复的条件时,触发执行模块执行内存的故障修复。
步骤304:执行模块执行sPPR,进行内存故障行替换,也即用冗余行替换故障行。
步骤305:执行模块启动内存行区域读操作、数据纠错以及数据回写以修复冗余行上的数据,也即通过对冗余行的内存读操作对故障数据进行修复。
步骤306:执行模块配置内存行的CE抑制,以在对冗余行的读操作过程中抑制CE。
步骤307:执行模块在对冗余行的读操作结束之后,解除CE抑制,也即在数据修复之后解除CE抑制。
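Steps 301 to 307 can be tied together in a short orchestration sketch. Everything here is assumed for illustration: the function `handle_ce` and its callback parameters are hypothetical, and the sketch enables CE suppression before the repair read begins, since step 306 states that suppression applies during the read operation.

```python
def handle_ce(fault, history, analyse, replace_row, repair_row, set_ce_suppression):
    """Run steps 301-307 for one reported CE; returns the actions taken."""
    actions = []
    history.append(fault)                       # 301: report the fault
    faulty_row = analyse(history)               # 302-303: parse and predict
    if faulty_row is None:
        return actions                          # repair condition not met
    replace_row(faulty_row)                     # 304: sPPR row replacement
    actions.append("replace")
    set_ce_suppression(True)                    # 306: suppress CEs
    try:
        repair_row(faulty_row)                  # 305: read, correct, write back
        actions.append("repair")
    finally:
        set_ce_suppression(False)               # 307: lift suppression
    return actions
```

Passing the analysis and repair steps in as callables reflects the split between the fault identification module and the execution module without fixing where either one runs.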
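Optionally, the execution module is a memory control module in the memory controller (for example, a double data rate synchronous dynamic random access memory controller, DDRC) of the processor included in the computer device, and the fault identification module is a module newly added to the chip on which the BMC is located. Alternatively, the fault identification module may be added to any processing device included in the computer device.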
图4是本申请实施例提供的又一种内存故障的处理方法的流程图。在图3的基础上,参见图4,该方法主要包括错误上报、故障分析(识别)、行替换和数据回写。
其中,错误上报的过程包括:在执行模块检测到发生内存故障时,硬件纠错(如ECC),并上报本次发生内存故障的故障信息(包括故障时间和故障位置)给故障识别模块,以及将该故障信息上报给用于记录内存故障日志的模块,以记录本次内存故障的故障信息。
故障分析的过程包括:故障识别模块根据接收到的故障信息,以及内存故障日志,识别内存故障的故障模式(或者识别故障模式和故障级别),在识别确定故障模式为内存行故障(或者识别确定故障模式为内存行故障且故障级别为高风险级别)时,触发执行模块进行内存故 障行替换。
行替换的过程包括:执行模块触发内存行替换,也即用冗余行替换故障行。
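The data write-back process includes the following: the execution module performs a memory region read operation on the redundant row, corrects the erroneous data on the redundant row by using the error correction algorithm, and writes the corrected data back to the redundant row. Optionally, if the data on the redundant row cannot be repaired by the error correction algorithm, a UCE may be produced, causing the computer device to report the error, crash, and restart.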
综上所述,在本申请实施例中,通过分析历史故障信息得到故障分析结果,进而根据故障分析结果对内存进行故障修复,本方案能够更加精确地分析内存故障。另外,本方案无需冷复位即能启动对内存的故障修复,也即能够及时修复内存故障,防止系统宕机,减少业务影响。
前述介绍了在分析历史时间段内第一内存行的故障信息获得故障分析结果后,计算机设备启动内存的故障修复的实现方式为:在故障模式为内存行故障时,或者在故障模式为内存行故障且故障级别为高风险级别时,启动对内存的故障修复,故障修复为用冗余行替换故障行,对冗余行上的数据进行修复。在另一些实施例中,计算机设备分析历史时间段内第二bank的故障信息来获得故障分析结果,相应地,计算机设备启动对内存的故障修复的实现方式为:在故障模式为内存bank故障时,或者故障模式为内存bank故障且故障级别为高风险级别时,启动对内存的故障修复,故障修复为用冗余bank替换故障bank,对冗余bank上的数据进行修复。
其中,在检测到本次发生内存故障而启动内存故障分析的情况下,第二bank是指本次发生内存故障的内存行所在的bank,或者,第二bank是指本次发生内存故障的内存行所在的内存颗粒上的一个bank,或者,第二bank是指内存中的任意一个bank。在周期性地启动内存故障分析的情况下,第二bank是指最近一次发生内存故障的内存行所在的bank,或者,第二bank是指最近一次发生内存故障的内存行所在的内存颗粒上的一个bank,或者,第二bank是指内存中的任意一个bank。
接下来参照图5对该实施例进行介绍。图5是本申请实施例提供的一种内存故障的处理方法的流程图,该方法应用于计算机设备。请参考图5,该方法包括如下步骤。
步骤501:在第一时刻启动对内存的故障分析,故障分析包括:通过分析历史故障信息,获得内存当前的故障分析结果。
在本申请实施例中,计算机设备在检测到发生内存故障时,分析历史故障信息,获得故障分析结果。或者,计算机设备周期性地分析历史故障信息,获得故障分析结果。或者,计算机设备周期性地分析故障信息,获得故障分析结果,以及如果在周期间隔内检测到发生内存故障,则分析历史故障信息,获得故障分析结果,并以本次检测到内存故障的时间为准重新开始周期分析。或者,计算机设备周期性地分析历史故障信息,确定故障模式,以及如果在周期间隔内检测到发生内存故障,则分析历史故障信息,获得故障分析结果,但不以本次检测到内存故障的时间为准重新开始周期分析,也即不影响周期分析。
需要说明的是,历史故障信息为历史时间段内发生的内存故障的故障信息,历史时间段的时长与前述实施例中的历史时间段相同或不同。由于需要分析是否存在较严重内存bank故障,因此,在历史时间段的时长长于前述实施例中的历史时间段的情况下,对内存bank故障 的分析在一定程度上更加精确。
可选地,在本申请实施例中,计算机设备通过故障分析模型分析历史故障信息,获得内存当前的故障分析结果,也即是,计算机设备将历史故障信息输入故障分析模型,获得内存当前的故障分析结果,故障分析模型为智能计算分析模型。
在本申请实施例中,故障分析结果包含故障模式。
可选地,历史故障信息包括历史时间段内发生的内存故障的故障位置和故障时间。计算机设备统计历史内存故障的故障位置和故障时间,得到第二bank出现的故障位的数量,也即获得第四统计特征,当在历史时间段内,第二bank出现的故障位的数量大于或等于第四阈值时,也即第四统计特征大于第四阈值时,确定故障模式为内存bank故障。其中,第四阈值表示每个bank能够容忍的故障位的数量。
可选地,假设计算机设备通过故障分析模型分析历史故障信息,那么故障分析模型包括第四阈值。
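Optionally, the fault analysis result also includes a fault level, and the historical fault information also includes the fault type and/or error correction information of the memory faults that occurred during the history period. The computer device obtains a fifth statistical feature and/or a sixth statistical feature from the historical fault information. The fifth statistical feature represents the number of faults of each fault type on the second bank during the history period, and the sixth statistical feature represents the error correction count on the second bank during the history period. The fault level is determined to be high risk when the fifth statistical feature is greater than a fifth threshold, or when the sixth statistical feature is greater than a sixth threshold, or when both conditions hold. The fifth threshold represents the number of faults of each fault type that each bank can tolerate, and the sixth threshold represents the error correction count that each bank can tolerate.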
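A bank-granularity analysis can be sketched in the same style as the row-level one. The record layout, the function name `analyse_bank`, the use of a strict "greater than" comparison for the fourth threshold, the reading of the error correction count as a sum of corrected bits, and the choice of the "either threshold exceeded" variant are all assumptions of this illustration.

```python
from collections import Counter


def analyse_bank(records, fourth_threshold, fifth_threshold, sixth_threshold):
    """Return (is_bank_fault, is_high_risk) for one bank's fault records.

    Each record is a (cell, fault_type, corrected_bits) tuple.
    """
    faulty_cells = {cell for cell, _, _ in records}            # fourth feature
    per_type = Counter(ftype for _, ftype, _ in records)       # fifth feature
    corrected = sum(bits for _, _, bits in records)            # sixth feature
    is_bank_fault = len(faulty_cells) > fourth_threshold
    is_high_risk = (any(n > fifth_threshold for n in per_type.values())
                    or corrected > sixth_threshold)
    return is_bank_fault, is_high_risk
```

Bank replacement would then be started when the first flag is set, or when both flags are set if the fault level is also taken into account.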
可选地,假设计算机设备通过故障分析模型分析历史故障信息,那么故障分析模型还包括第五阈值和/或第六阈值。
可选地,历史时间段的时长和/或第四阈值和/或第五阈值和/或第六阈值为根据风险模式而设置的变量。
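Optionally, the risk modes include a memory high-risk mode and a memory low-risk mode. The history period of the high-risk mode is shorter than the history period of the low-risk mode; and/or the fourth threshold of the high-risk mode is smaller than the fourth threshold of the low-risk mode; and/or the fifth threshold of the high-risk mode is smaller than the fifth threshold of the low-risk mode; and/or the sixth threshold of the high-risk mode is smaller than the sixth threshold of the low-risk mode.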
可选地,计算机设备还提供交互界面,在交互界面上显示风险模式选项。风险模式选项包括高风险模式选项和低风险模式选项。用户可以根据业务风险需求通过交互界面选择风险模式。
可选地,该交互界面还用于在确认故障模式是内存bank故障时,提示存在内存故障风险。
需要说明的是,在该实施例中,与上述图1实施例不同的是,该实施例中的第二bank与图1实施例中的第一内存行为一个级别的概念,图1实施例是以内存行的粒度来分析内存故障的故障模式,图5实施例以bank的粒度来分析内存故障的故障模式。对于图5中计算机设备确定故障模式的实现方式参照前述图1实施例中相关内容,这里不再赘述。
步骤502:在故障模式为内存bank故障时,启动对内存的故障修复,其中,故障修复包括:用冗余bank替换故障bank,对冗余bank上的数据进行修复。
在本申请实施例中,计算机设备如果确定故障模式为内存bank故障,则用内存中的冗余bank替换故障bank,并对故障数据进行修复,故障bank是指发生内存故障的bank。
可选地,冗余bank和故障bank位于内存中的同一channel上。
图5所示实施例中与图1实施例不同的是,图1实施例中用冗余行替换故障行,冗余行和故障行在同一个bank上,图5实施例中用冗余bank替换故障bank,冗余bank和故障bank位于内存中的同一channel上。
Note that a memory includes multiple channels, each channel includes multiple dual inline memory modules (DIMMs), a DIMM includes multiple ranks, a rank includes multiple chips (memory devices), and a chip includes multiple banks.
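The hierarchy above suggests a natural way to represent a fault location so that faults can be grouped by row or by bank. The sketch below is an assumption-laden illustration: `CellAddress`, its field names, and the helper methods are not defined by the application.

```python
from typing import NamedTuple


class CellAddress(NamedTuple):
    """Physical location of a cell, ordered from coarsest to finest level."""
    channel: int
    dimm: int
    rank: int
    chip: int
    bank: int
    row: int
    col: int

    def bank_key(self):
        """Identifier of the bank containing this cell."""
        return self[:5]

    def row_key(self):
        """Identifier of the memory row containing this cell."""
        return self[:6]


def same_channel(a: CellAddress, b: CellAddress) -> bool:
    """True if both locations sit on the same channel."""
    return a.channel == b.channel
```

Grouping fault records by `row_key()` supports the row-granularity analysis of FIG. 1, grouping by `bank_key()` supports the bank-granularity analysis of FIG. 5, and `same_channel` expresses the constraint that a redundant bank and the faulty bank it replaces lie on the same channel.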
综上所述,在本申请实施例中,通过分析历史故障信息,来确定内存当前的故障模式,在故障模式为内存bank故障的情况下,用冗余bank替换故障bank,并进行数据修复,本方案能够更加精确地识别故障模式,且无需冷复位即能进行内存bank替换,使内存故障及时得到修复,防止系统宕机,减少业务影响。
图6是本申请实施例提供的一种内存故障的处理装置600的结构示意图,该内存故障的处理装置600可以由软件、硬件或者两者的结合实现成为计算机设备的部分或者全部,该计算机设备可以为下文图9所示的计算机设备。参见图6,该装置600包括:分析模块601和处理模块602。
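The analysis module 601 is configured to start fault analysis of the memory at a first moment. The fault analysis includes obtaining a current fault analysis result of the memory by analyzing historical fault information, where the historical fault information is the fault information accumulated by the memory during a history period, and the history period is a period before the first moment or a period before and including the first moment. For the specific implementation, refer to the detailed description of step 101 in the embodiment of FIG. 1; details are not repeated here.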
处理模块602,用于根据内存当前的故障分析结果启动对内存的故障修复。具体实现方式参照前述图1实施例中步骤102的详细介绍,这里不再赘述。
可选地,第一时刻为计算机系统出现不可纠正错误UCE故障之前的时刻。
可选地,第一时刻包括:
根据预设的条件周期性启动的时刻;和/或,在计算机系统运行之后,确定内存发生内存故障的时刻。
可选地,分析模块601包括:
分析子模块,用于将历史故障信息输入故障分析模型,获得内存当前的故障分析结果,故障分析模型为智能计算分析模型。
可选地,故障分析结果包含故障模式,则处理模块602包括:
第一修复子模块,用于在故障模式为内存行故障时,启动对内存的故障修复,其中,故障修复包括:用冗余行替换故障行,对冗余行上的数据进行修复。具体实现方式参照前述图1实施例中步骤102的详细介绍,这里不再赘述。
可选地,分析模块601具体用于:
根据历史故障信息获得第一统计特征,第一统计特征表示历史时间段内第一内存行出现的故障位的数量,第一内存行是任意内存行,第一阈值表示每个内存行能够容忍的故障位的数量;具体实现方式参照前述图1实施例中步骤101的详细介绍,这里不再赘述。
当第一统计特征大于第一阈值时,确定故障模式为内存行故障。
可选地,故障分析结果还包含故障级别,则处理模块602包括:
第二修复子模块,用于在故障模式为内存行故障且故障级别为高风险级别时,启动对内存的故障修复。
可选地,分析模块601还具体用于:
根据历史故障信息获得第二统计特征和/或第三统计特征,第二统计特征表示历史时间段内第一内存行出现的每种故障类型的故障数量,第三统计特征表示历史时间段内第一内存行出现的纠错数量;具体实现方式参照前述图1实施例中步骤101的详细介绍,这里不再赘述。
当第二统计特征大于第二阈值时,或者,当第三统计特征大于第三阈值时,或者,当第二统计特征大于第二阈值且第三统计特征大于第三阈值时,确定故障级别为高风险级别,第二阈值表示每个内存行能够容忍的每种故障类型的故障数量,第三阈值表示每个内存行能够容忍的纠错数量。
可选地,参见图7,该装置600还包括:
交互模块603,用于在交互界面上显示风险模式选项,风险模式选项包括内存高风险模式选项和内存低风险模式选项。
可选地,第一阈值、第二阈值和第三阈值为根据风险模式而设置的变量。
可选地,第一修复子模块具体用于:
对冗余行执行读操作;
如果从冗余行上读取出的数据为错误数据,则对错误数据进行纠正,将纠正后的数据回写到冗余行上,以实现冗余行上的数据的修复。具体实现方式参照前述图1实施例中步骤102的详细介绍,这里不再赘述。
可选地,参见图8,该装置600还包括:
产生模块604,用于从冗余行上读取出的数据为错误数据之后,产生可纠正错误CE;
抑制模块605,用于抑制CE。具体实现方式参照前述图1实施例中步骤102的详细介绍,这里不再赘述。
可选地,参见图8,该装置600还包括:
解除模块606,用于在对冗余行上的数据修复完成之后,解除CE的抑制操作。具体实现方式参照前述图1实施例中步骤102的详细介绍,这里不再赘述。
可选地,该装置600还包括:
生成模块,用于在确定故障模式为内存行故障之后,生成行故障隔离请求。
可选地,冗余行和故障行位于内存中的同一个bank上。
可选地,故障分析结果包含故障模式,则处理模块602包括:
第三修复子模块,用于在故障模式为内存bank故障时,启动对内存的故障修复,其中,故障修复包括:用冗余bank替换故障bank,对冗余bank上的数据进行修复。
可选地,冗余bank和故障bank位于内存中的同一channel上。
在本申请实施例中,通过分析历史故障信息得到故障分析结果,进而根据故障分析结果对内存进行故障修复,本方案能够更加精确地分析内存故障,且无需冷复位即能启动对内存的故障修复,防止系统宕机,减少业务影响。
需要说明的是:上述实施例提供的内存故障的处理装置在处理内存故障时,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将装置的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功 能。另外,上述实施例提供的内存故障的处理装置与图1至图5所示的内存故障的处理方法实施例属于同一构思,其具体实现过程详见方法实施例,这里不再赘述。
本申请实施例提供了一种计算机设备,该计算机设备中存储有计算机程序,计算机程序被计算机设备运行时实现上述图1至图4实施例中的内存故障的处理方法,或者实现图5实施例中的内存故障的处理方法。具体实现方式参照前述图1至图5所示方法实施例中的详细介绍,这里不再赘述。
可选地,该计算机设备包括处理器和BMC所在的芯片,处理器包括内存控制器,内存控制器中包括执行模块,BMC所在的芯片中的BMC包括故障识别模块,内存控制器运行执行模块,实现上述图3实施例中执行模块相应的功能,BMC运行故障识别模块,实现上述图3实施例中故障识别模块相应的功能。
可选地,故障识别模块除了设置在BMC中,也可以增加在计算机设备包括其他处理设备中,以实现相应功能。
在本申请实施例中,计算机设备通过分析历史故障信息得到故障分析结果,进而根据故障分析结果对内存进行故障修复,本方案能够更加精确地分析内存故障,且无需冷复位即能启动对内存的故障修复,也即能够及时修复内存故障,防止系统宕机,减少业务影响。
需要说明的是:上述实施例提供的计算机设备在处理内存故障时,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将装置的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的计算机设备与图1或图5所示的内存故障的处理方法实施例属于同一构思,其具体实现过程详见方法实施例,这里不再赘述。
请参考图9,图9是根据本申请实施例示出的一种计算机设备的结构示意图。该计算机设备包括一个或多个处理器901、通信总线902、存储器903以及一个或多个通信接口904。
The processor 901 is a general-purpose central processing unit (CPU), a network processor (NP), a microprocessor, or one or more integrated circuits for implementing the solutions of this application, for example, an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. Optionally, the PLD is a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
通信总线902用于在上述组件之间传送信息。可选地,通信总线902分为地址总线、数据总线、控制总线等。为便于表示,图中仅用一条粗线表示,但并不表示仅有一根总线或一种类型的总线。
可选地,存储器903为只读存储器(read-only memory,ROM)、随机存取存储器(random access memory,RAM)、电可擦可编程只读存储器(electrically erasable programmable read-only memory,EEPROM)、光盘(包括只读光盘(compact disc read-only memory,CD-ROM)、压缩光盘、激光盘、数字通用光盘、蓝光光盘等)、磁盘存储介质或者其它磁存储设备,或者是能够用于携带或存储具有指令或数据结构形式的期望的程序代码并能够由计算机存取的任何其它介质,但不限于此。存储器903独立存在,并通过通信总线902与处理器901相连接, 或者,存储器903与处理器901集成在一起。
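The communication interface 904 uses any transceiver-type apparatus to communicate with other devices or communication networks. The communication interface 904 includes a wired communication interface and, optionally, a wireless communication interface. The wired communication interface is, for example, an Ethernet interface. Optionally, the Ethernet interface is an optical interface, an electrical interface, or a combination thereof. The wireless communication interface is a wireless local area network (WLAN) interface, a cellular network communication interface, a combination thereof, or the like.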
可选地,在一些实施例中,计算机设备包括多个处理器,这些处理器中的每一个为一个单核处理器,或者一个多核处理器。可选地,这里的处理器指一个或多个设备、电路、和/或用于处理数据(如计算机程序指令)的处理核。
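In a specific implementation, as an embodiment, the computer device further includes an output device 906 and an input device 907. The output device 906 communicates with the processor 901 and can display information in multiple ways. For example, the output device 906 is a liquid crystal display (LCD), a light-emitting diode (LED) display device, a cathode ray tube (CRT) display device, or a projector. The input device 907 communicates with the processor 901 and can receive user input in multiple ways. For example, the input device 907 is a mouse, a keyboard, a touchscreen device, or a sensing device.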
在一些实施例中,存储器903用于存储执行本申请方案的程序代码910,处理器901能够执行存储器903中存储的程序代码910。该程序代码中包括一个或多个软件模块,该计算机设备能够通过处理器901以及存储器903中的程序代码910,来实现上文图1或图5实施例提供的内存故障的处理方法。
另一些实施例中,处理器901中存储执行本申请方案的程序代码,处理器901用于执行程序代码,实现上文图1或图5实施例提供的内存故障的处理方法,该程序代码中包括一个或多个软件模块。例如处理器901包括内存控制器,内存控制器中存储有程序代码,内存控制器包括图3所示的执行模块和故障识别模块,通过执行模块和故障识别模块实现上文图1或图5实施例提供的内存故障的处理方法。
又一些实施例中,处理器901中存储有执行本申请方案的部分程序代码,例如,处理器901包括内存控制器,内存控制器包括图3所示的执行模块。计算机设备中还包括除处理器901之外的其他处理设备,其他处理设备中存储有执行本申请方案的另一部分程序代码,处理器901与其他处理设备共同实现上文图1或图5实施例提供的内存故障的处理方法,例如,其他处理设备为带外主板管理控制单元(baseboard management controller,BMC)所在的芯片,BMC中包括图3所示的故障识别模块,通过BMC运行故障识别模块,与内存控制器共同实现上文图1或图5实施例提供的内存故障的处理方法。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意结合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络或其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如:同轴电缆、光纤、数据用户线(digital subscriber line,DSL))或无线(例如:红外、无线、微波等)方式向另一个网站站点、计算 机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存取的任何可用介质,或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质(例如:软盘、硬盘、磁带)、光介质(例如:数字通用光盘(digital versatile disc,DVD))或半导体介质(例如:固态硬盘(solid state disk,SSD))等。值得注意的是,本申请实施例提到的计算机可读存储介质可以为非易失性存储介质,换句话说,可以是非瞬时性存储介质。
应当理解的是,本文提及的“至少一个”是指一个或多个,“多个”是指两个或两个以上。在本申请实施例的描述中,除非另有说明,“/”表示或的意思,例如,A/B可以表示A或B;本文中的“和/或”仅仅是一种描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。另外,为了便于清楚描述本申请实施例的技术方案,在本申请的实施例中,采用了“第一”、“第二”等字样对功能和作用基本相同的相同项或相似项进行区分。本领域技术人员可以理解“第一”、“第二”等字样并不对数量和执行次序进行限定,并且“第一”、“第二”等字样也并不限定一定不同。
以上所述为本申请提供的实施例,并不用以限制本申请,凡在本申请的精神和原则之内,所作的任何修改、等同替换、改进等,均应包含在本申请的保护范围之内。

Claims (30)

  1. 一种内存故障的处理方法,其特征在于,所述方法包括:
    在第一时刻启动对内存的故障分析;所述故障分析包括:通过分析历史故障信息,获得所述内存当前的故障分析结果,其中,所述历史故障信息为所述内存在历史时间段内积累的故障信息,所述历史时间段为所述第一时刻之前的时间段或者所述第一时刻之前且包含所述第一时刻的时间段;
    根据所述内存当前的故障分析结果启动对所述内存的故障修复。
  2. 如权利要求1所述的方法,其特征在于,所述第一时刻为计算机系统出现不可纠正错误UCE故障之前的时刻。
  3. 如权利要求1或2所述的方法,其特征在于,所述第一时刻包括:
    根据预设的条件周期性启动的时刻;和/或,在计算机系统运行之后,确定所述内存发生内存故障的时刻。
  4. 如权利要求1-3任一项所述的方法,其特征在于,所述通过分析历史故障信息,获得所述内存当前的故障分析结果,包括:
    将所述历史故障信息输入故障分析模型,获得所述内存当前的故障分析结果,所述故障分析模型为智能计算分析模型。
  5. 如权利要求1-4任一项所述的方法,其特征在于,所述故障分析结果包含故障模式,则所述根据所述内存当前的故障分析结果启动对所述内存的故障修复包括:
    在所述故障模式为内存行故障时,启动对所述内存的故障修复,其中,所述故障修复包括:用冗余行替换故障行,对所述冗余行上的数据进行修复。
  6. 如权利要求5所述的方法,其特征在于,所述获得所述内存当前的故障分析结果,包括:
    根据所述历史故障信息获得第一统计特征,所述第一统计特征表示所述历史时间段内第一内存行出现的故障位的数量,所述第一内存行是任意内存行;
    当所述第一统计特征大于第一阈值时,确定所述故障模式为内存行故障,所述第一阈值表示每个内存行能够容忍的故障位的数量。
  7. 如权利要求5或6所述的方法,其特征在于,所述故障分析结果还包含故障级别,则所述根据所述内存当前的故障分析结果启动对所述内存的故障修复包括:
    在所述故障模式为内存行故障且所述故障级别为高风险级别时,启动对所述内存的故障修复。
  8. 如权利要求7所述的方法,其特征在于,所述获得所述内存当前的故障分析结果,还包括:
    根据所述历史故障信息获得第二统计特征和/或第三统计特征,所述第二统计特征表示所述历史时间段内所述第一内存行出现的每种故障类型的故障数量,所述第三统计特征表示所述历史时间段内所述第一内存行出现的纠错数量;
    当所述第二统计特征大于第二阈值时,或者,当所述第三统计特征大于第三阈值时,或者,当所述第二统计特征大于所述第二阈值且所述第三统计特征大于所述第三阈值时,确定所述故障级别为高风险级别,所述第二阈值表示每个内存行能够容忍的每种故障类型的故障数量,所述第三阈值表示每个内存行能够容忍的纠错数量。
  9. 如权利要求7或8所述的方法,其特征在于,所述方法还包括:
    在交互界面上显示风险模式选项,所述风险模式选项包括内存高风险模式选项和内存低风险模式选项。
  10. 如权利要求9所述的方法,其特征在于,所述第一阈值、第二阈值和第三阈值为根据所述风险模式而设置的变量。
  11. 如权利要求5-10任一项所述的方法,其特征在于,所述对所述冗余行上的数据进行修复,包括:
    对所述冗余行执行读操作;
    如果从所述冗余行上读取出的数据为错误数据,则对所述错误数据进行纠正,将纠正后的数据回写到所述冗余行上,以实现所述冗余行上的数据的修复。
  12. 如权利要求11所述的方法,其特征在于,所述从所述冗余行上读取出的数据为错误数据之后,所述方法还包括:
    产生可纠正错误CE;
    抑制所述CE。
  13. 如权利要求12所述的方法,其特征在于,在对所述冗余行上的数据修复完成之后,所述方法还包括:
    解除所述CE的抑制操作。
  14. 如权利要求1-4任一项所述的方法,其特征在于,所述故障分析结果包含故障模式,则所述根据所述内存当前的故障分析结果启动对所述内存的故障修复包括:
    在所述故障模式为内存bank故障时,启动对所述内存的故障修复,其中,所述故障修复包括:用冗余bank替换故障bank,对所述冗余bank上的数据进行修复。
  15. 一种内存故障的处理装置,其特征在于,所述装置包括:
    分析模块,用于在第一时刻启动对内存的故障分析;所述故障分析包括:通过分析历史 故障信息,获得所述内存当前的故障分析结果,其中,所述历史故障信息为所述内存在历史时间段内积累的故障信息,所述历史时间段为所述第一时刻之前的时间段或者所述第一时刻之前且包含所述第一时刻的时间段;
    处理模块,用于根据所述内存当前的故障分析结果启动对所述内存的故障修复。
  16. 如权利要求15所述的装置,其特征在于,所述第一时刻为计算机系统出现不可纠正错误UCE故障之前的时刻。
  17. 如权利要求15或16所述的装置,其特征在于,所述第一时刻包括:
    根据预设的条件周期性启动的时刻;和/或,在计算机系统运行之后,确定所述内存发生内存故障的时刻。
  18. 如权利要求15-17任一项所述的装置,其特征在于,所述分析模块包括:
    分析子模块,用于将历史故障信息输入故障分析模型,获得所述内存当前的故障分析结果,所述故障分析模型为智能计算分析模型。
  19. 如权利要求15-18任一项所述的装置,其特征在于,所述故障分析结果包含故障模式,则所述处理模块包括:
    第一修复子模块,用于在所述故障模式为内存行故障时,启动对所述内存的故障修复,其中,所述故障修复包括:用冗余行替换故障行,对所述冗余行上的数据进行修复。
  20. 如权利要求19所述的装置,其特征在于,所述分析模块具体用于:
    根据所述历史故障信息获得第一统计特征,所述第一统计特征表示所述历史时间段内第一内存行出现的故障位的数量,所述第一内存行是任意内存行;
    当所述第一统计特征大于第一阈值时,确定所述故障模式为内存行故障,所述第一阈值表示每个内存行能够容忍的故障位的数量。
  21. 如权利要求19或20所述的装置,其特征在于,所述故障分析结果还包含故障级别,则所述处理模块包括:
    第二修复子模块,用于在所述故障模式为内存行故障且所述故障级别为高风险级别时,启动对所述内存的故障修复。
  22. 如权利要求21所述的装置,其特征在于,所述分析模块还具体用于:
    根据所述历史故障信息获得第二统计特征和/或第三统计特征,所述第二统计特征表示所述历史时间段内所述第一内存行出现的每种故障类型的故障数量,所述第三统计特征表示所述历史时间段内所述第一内存行出现的纠错数量;
    当所述第二统计特征大于第二阈值时,或者,当所述第三统计特征大于第三阈值时,或者,当所述第二统计特征大于所述第二阈值且所述第三统计特征大于所述第三阈值时,确定所述故障级别为高风险级别,所述第二阈值表示每个内存行能够容忍的每种故障类型的故障 数量,所述第三阈值表示每个内存行能够容忍的纠错数量。
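  23. The apparatus according to claim 21 or 22, wherein the apparatus further comprises:
    an interaction module, configured to display risk mode options on an interactive interface, wherein the risk mode options comprise a memory high-risk mode option and a memory low-risk mode option.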
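  24. The apparatus according to claim 23, wherein the first threshold, the second threshold, and the third threshold are variables set according to the risk mode.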
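  25. The apparatus according to any one of claims 19 to 24, wherein the first repair submodule is specifically configured to:
    perform a read operation on the redundant row; and
    if the data read from the redundant row is erroneous data, correct the erroneous data and write the corrected data back to the redundant row, so as to repair the data on the redundant row.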
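  26. The apparatus according to claim 25, wherein the apparatus further comprises:
    a generation module, configured to generate a correctable error (CE) after the data read from the redundant row is erroneous data; and
    a suppression module, configured to suppress the CE.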
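  27. The apparatus according to claim 26, wherein the apparatus further comprises:
    a release module, configured to release the suppression of the CE after the repair of the data on the redundant row is completed.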
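  28. The apparatus according to any one of claims 15 to 18, wherein the fault analysis result comprises a fault mode, and the processing module comprises:
    a third repair submodule, configured to start fault repair of the memory when the fault mode is a memory bank fault, wherein the fault repair comprises: replacing a faulty bank with a redundant bank, and repairing data on the redundant bank.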
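  29. A computer device, wherein the computer device comprises a memory and a processor;
    the memory is configured to store a computer program; and
    the processor is configured to execute the computer program to implement the method according to any one of claims 1 to 14.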
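  30. A computer-readable storage medium, wherein the storage medium stores a computer program, and when the computer program is executed by a processor, the method according to any one of claims 1 to 14 is implemented.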
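
The threshold test of claims 20 and 22, the risk-mode-dependent thresholds of claim 24, and the read, correct, write-back repair with CE suppression of claims 25 to 27 can be sketched roughly as follows. This is only an illustrative sketch, not the applicant's implementation: the names `THRESHOLDS`, `RowStats`, `should_repair_row`, `CeReporter` and `repair_redundant_row` are hypothetical, the numeric threshold values are invented, and treating the per-fault-type feature as exceeded when any single fault type exceeds the second threshold is an assumption.

```python
from dataclasses import dataclass

# Hypothetical threshold values per risk mode. The claims only state that the
# three thresholds are variables set according to the selected risk mode.
THRESHOLDS = {
    "high_risk": {"first": 2, "second": 1, "third": 10},
    "low_risk": {"first": 8, "second": 4, "third": 100},
}


@dataclass
class RowStats:
    faulty_bits: int      # first statistical feature
    faults_by_type: dict  # second statistical feature: fault type -> count
    corrections: int      # third statistical feature


def should_repair_row(stats: RowStats, risk_mode: str) -> bool:
    """Return True for a row fault that is also at the high-risk level."""
    t = THRESHOLDS[risk_mode]
    row_fault = stats.faulty_bits > t["first"]
    # Assumption: the per-type feature is exceeded if any fault type exceeds it.
    high_risk = (
        any(n > t["second"] for n in stats.faults_by_type.values())
        or stats.corrections > t["third"]
    )
    return row_fault and high_risk


class CeReporter:
    """Toy stand-in for a platform's correctable-error reporting path."""

    def __init__(self):
        self.suppressed = False
        self.reported = []  # CEs that reached the error log
        self.dropped = 0    # CEs swallowed while suppression was active

    def raise_ce(self, index):
        if self.suppressed:
            self.dropped += 1
        else:
            self.reported.append(index)


def repair_redundant_row(row, ecc_correct, reporter):
    """Read each word, correct it if wrong, and write it back."""
    reporter.suppressed = True
    try:
        for i, word in enumerate(row):
            fixed = ecc_correct(word)  # caller-supplied correction function
            if fixed != word:
                reporter.raise_ce(i)   # CE is generated, then suppressed
                row[i] = fixed         # write the corrected data back
    finally:
        reporter.suppressed = False    # release suppression after the repair
```

Suppressing the CE during the repair keeps errors that the repair itself provokes out of the error log, so they are not counted as new faults in later analysis; the `finally` clause restores normal reporting even if the repair aborts.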
PCT/CN2020/126112 2020-06-20 2020-11-03 Memory fault handling method and apparatus, device, and storage medium WO2021253708A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP20940708.9A EP3979079A4 (en) 2020-06-20 2020-11-03 METHOD AND DEVICE FOR TREATMENT OF MEMORY DEFECTS, DEVICE AND STORAGE MEDIA
US17/582,802 US20220148674A1 (en) 2020-06-20 2022-01-24 Memory fault handling method and apparatus, device, and storage medium

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202010569797.2 2020-06-20
CN202010569797 2020-06-20
CN202011179463.0 2020-10-29
CN202011179463.0A CN113821364A (zh) 2020-06-20 2020-10-29 Memory fault handling method and apparatus, device, and storage medium

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/582,802 Continuation US20220148674A1 (en) 2020-06-20 2022-01-24 Memory fault handling method and apparatus, device, and storage medium

Publications (1)

Publication Number Publication Date
WO2021253708A1 true WO2021253708A1 (zh) 2021-12-23

Family

ID=78912308

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/126112 WO2021253708A1 (zh) 2020-06-20 2020-11-03 Memory fault handling method and apparatus, device, and storage medium

Country Status (4)

Country Link
US (1) US20220148674A1 (zh)
EP (1) EP3979079A4 (zh)
CN (1) CN113821364A (zh)
WO (1) WO2021253708A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
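CN114500235A (zh) * 2022-04-06 2022-05-13 深圳粤讯通信科技有限公司 Internet of Things-based communication device security management system
CN114726713A (zh) * 2022-03-02 2022-07-08 阿里巴巴(中国)有限公司 Node fault model training method, detection method, device, medium, and product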
US20220308768A1 (en) * 2021-03-24 2022-09-29 Yangtze Memory Technologies Co., Ltd. Memory device with failed main bank repair using redundant bank

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7392181B2 (ja) * 2021-03-24 2023-12-05 長江存儲科技有限責任公司 Memory device with failed main bank repair using redundant bank
CN116166459A (zh) * 2021-11-25 2023-05-26 华为技术有限公司 Memory hardware fault detection method and apparatus, and memory controller
US11994951B2 (en) * 2022-02-23 2024-05-28 Micron Technology, Inc. Device reset alert mechanism
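CN115168087B (zh) * 2022-07-08 2024-03-19 超聚变数字技术有限公司 Method and apparatus for determining the repair resource granularity for a memory fault
CN115394344A (zh) * 2022-07-22 2022-11-25 超聚变数字技术有限公司 Method, apparatus, and storage medium for determining a memory fault repair manner
CN115686901B (zh) * 2022-10-25 2023-08-04 超聚变数字技术有限公司 Memory fault analysis method and computer device
CN117672328B (zh) * 2024-02-02 2024-04-09 深圳市奥斯珂科技有限公司 Data recovery method, apparatus, and device for a solid-state drive, and storage medium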

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4899342A (en) * 1988-02-01 1990-02-06 Thinking Machines Corporation Method and apparatus for operating multi-unit array of memories
CN101329918A (zh) * 2008-07-30 2008-12-24 中国科学院计算技术研究所 Memory built-in self-repair system and self-repair method
WO2009126812A1 (en) * 2008-04-09 2009-10-15 Inapac Technology, Inc. Programmable memory repair scheme
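CN103514068A (zh) * 2012-06-28 2014-01-15 北京百度网讯科技有限公司 Method for automatically locating memory faults
CN109086151A (zh) * 2017-06-13 2018-12-25 中兴通讯股份有限公司 Method and apparatus for isolating memory faults on a server
CN110598802A (zh) * 2019-09-26 2019-12-20 腾讯科技(深圳)有限公司 Memory detection model training method, and memory detection method and apparatus
CN111008091A (zh) * 2019-12-06 2020-04-14 苏州浪潮智能科技有限公司 Memory CE fault handling method, system, and related apparatus
CN111312321A (zh) * 2020-03-02 2020-06-19 电子科技大学 Memory device and fault repair method thereof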

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6408401B1 (en) * 1998-11-13 2002-06-18 Compaq Information Technologies Group, L.P. Embedded RAM with self-test and self-repair with spare rows and columns
TW594775B (en) * 2001-06-04 2004-06-21 Toshiba Corp Semiconductor memory device
CN1598780A (zh) * 2003-09-16 2005-03-23 蔚华科技股份有限公司 Fault-mode-oriented memory defect diagnosis method and system
JP2010033678A (ja) * 2008-07-30 2010-02-12 Toshiba Storage Device Corp Disk device, circuit board, and error log information recording method
US8788097B2 (en) * 2009-06-22 2014-07-22 Johnson Controls Technology Company Systems and methods for using rule-based fault detection in a building management system
US8621324B2 (en) * 2010-12-10 2013-12-31 Qualcomm Incorporated Embedded DRAM having low power self-correction capability
US10372551B2 (en) * 2013-03-15 2019-08-06 Netlist, Inc. Hybrid memory system with configurable error thresholds and failure analysis capability
US20190019569A1 (en) * 2016-01-28 2019-01-17 Hewlett Packard Enterprise Development Lp Row repair of corrected memory address
KR20180075218A (ko) * 2016-12-26 2018-07-04 에스케이하이닉스 주식회사 Memory repair method and apparatus
JP7236231B2 (ja) * 2018-09-07 2023-03-09 ルネサスエレクトロニクス株式会社 Semiconductor device and analysis system
US11862271B2 (en) * 2018-12-17 2024-01-02 Arm Limited Memory testing techniques
KR20210026201A (ko) * 2019-08-29 2021-03-10 삼성전자주식회사 Semiconductor memory device, memory system including the same, and repair control method thereof
US11837314B2 (en) * 2020-02-19 2023-12-05 Sk Hynix Nand Product Solutions Corp. Undo and redo of soft post package repair

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4899342A (en) * 1988-02-01 1990-02-06 Thinking Machines Corporation Method and apparatus for operating multi-unit array of memories
WO2009126812A1 (en) * 2008-04-09 2009-10-15 Inapac Technology, Inc. Programmable memory repair scheme
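CN101329918A (zh) * 2008-07-30 2008-12-24 中国科学院计算技术研究所 Memory built-in self-repair system and self-repair method
CN103514068A (zh) * 2012-06-28 2014-01-15 北京百度网讯科技有限公司 Method for automatically locating memory faults
CN109086151A (zh) * 2017-06-13 2018-12-25 中兴通讯股份有限公司 Method and apparatus for isolating memory faults on a server
CN110598802A (zh) * 2019-09-26 2019-12-20 腾讯科技(深圳)有限公司 Memory detection model training method, and memory detection method and apparatus
CN111008091A (zh) * 2019-12-06 2020-04-14 苏州浪潮智能科技有限公司 Memory CE fault handling method, system, and related apparatus
CN111312321A (zh) * 2020-03-02 2020-06-19 电子科技大学 Memory device and fault repair method thereof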

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3979079A4

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220308768A1 (en) * 2021-03-24 2022-09-29 Yangtze Memory Technologies Co., Ltd. Memory device with failed main bank repair using redundant bank
US11726667B2 (en) * 2021-03-24 2023-08-15 Yangtze Memory Technologies Co., Ltd. Memory device with failed main bank repair using redundant bank
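CN114726713A (zh) * 2022-03-02 2022-07-08 阿里巴巴(中国)有限公司 Node fault model training method, detection method, device, medium, and product
CN114726713B (zh) * 2022-03-02 2024-01-12 阿里巴巴(中国)有限公司 Node fault model training method, detection method, device, medium, and product
CN114500235A (zh) * 2022-04-06 2022-05-13 深圳粤讯通信科技有限公司 Internet of Things-based communication device security management system
CN114500235B (zh) * 2022-04-06 2022-07-26 深圳粤讯通信科技有限公司 Internet of Things-based communication device security management system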

Also Published As

Publication number Publication date
EP3979079A4 (en) 2023-02-08
US20220148674A1 (en) 2022-05-12
EP3979079A1 (en) 2022-04-06
CN113821364A (zh) 2021-12-21

Similar Documents

Publication Publication Date Title
WO2021253708A1 (zh) Memory fault handling method and apparatus, device, and storage medium
US10235233B2 (en) Storage error type determination
US7971112B2 (en) Memory diagnosis method
US20210389956A1 (en) Memory error processing method and apparatus
US7971124B2 (en) Apparatus and method for distinguishing single bit errors in memory modules
US20080301530A1 (en) Apparatus and method for distinguishing temporary and permanent errors in memory modules
US9606889B1 (en) Systems and methods for detecting memory faults in real-time via SMI tests
US11080135B2 (en) Methods and apparatus to perform error detection and/or correction in a memory device
US20230185659A1 (en) Memory Fault Handling Method and Apparatus
US9645904B2 (en) Dynamic cache row fail accumulation due to catastrophic failure
Du et al. Predicting uncorrectable memory errors for proactive replacement: An empirical study on large-scale field data
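WO2024007765A1 (zh) Method and apparatus for determining the repair resource granularity for a memory fault
WO2024082844A1 (zh) Memory module fault detection apparatus and detection method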
US9965346B2 (en) Handling repaired memory array elements in a memory of a computer system
US8984333B2 (en) Automatic computer storage medium diagnostics
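JP5618204B2 (ja) Fault handling device, information processing device using the same, and fault handling method for information processing device
CN115394344A (zh) Method, apparatus, and storage medium for determining a memory fault repair manner
CN114996065A (zh) Memory fault prediction method, apparatus, and device
WO2023193396A1 (zh) Memory fault handling method and apparatus, and computer-readable storage medium
CN115421946A (zh) Memory fault handling method, apparatus, and storage medium
CN115421947A (zh) Memory fault handling method, apparatus, and storage medium
CN115269245B (zh) Memory fault handling method and computing device
CN116483612B (zh) Memory fault handling method and apparatus, computer device, and storage medium
WO2023093173A1 (zh) Memory hardware fault detection method and apparatus, and memory controller
CN116069578A (zh) Memory fault early-warning method and computing device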

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2020940708

Country of ref document: EP

Effective date: 20211230

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20940708

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE