WO2023093173A1 - Memory hardware fault detection method, device, and memory controller

Info

Publication number
WO2023093173A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
error
memory controller
target location
memory
Prior art date
Application number
PCT/CN2022/115593
Other languages
English (en)
French (fr)
Inventor
马剑涛
刁阳彬
谈峥
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2023093173A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation, the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/073 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation, the processing taking place on a specific hardware platform or in a specific software environment, in a memory management context, e.g. virtual memory or cache management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3037 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a memory, e.g. virtual memory, cache
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C29/00 Checking stores for correct operation; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/56 External testing equipment for static stores, e.g. automatic test equipment [ATE]; Interfaces therefor
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C29/00 Checking stores for correct operation; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/56 External testing equipment for static stores, e.g. automatic test equipment [ATE]; Interfaces therefor
    • G11C29/56008 Error analysis, representation of errors

Definitions

  • The present application relates to the field of storage technologies, and in particular to a memory hardware fault detection method, a device, and a memory controller.
  • Memory is one of the key components of a computer. It temporarily stores the data the processor computes on and the data exchanged with external storage such as a hard disk.
  • The memory controller is located inside the computer and is used to manage the memory and to exchange data between the memory and the processor.
  • When the memory controller reads data under the instruction of the processor, it can verify the read data to determine whether the data contains an error. If the read data is incorrect, the memory controller reports the error to the processor.
  • The processor then decides how to handle the error, for example by raising an alarm or performing a reset.
  • The present application provides a memory hardware fault detection method, a device, and a memory controller, which are used to improve the accuracy of error reporting by the memory controller.
  • An embodiment of the present application provides a memory hardware fault detection method, and the method may be executed by a memory controller.
  • The memory controller can read data from the target location in the memory under the instruction of the processor. If an error is found in the data, it corrects the data and writes the corrected data back to the target location. To further determine the type of error occurring at the target location, the memory controller may read first data from the target location after the error-corrected data has been written there. If it determines that the first data contains an error, the memory controller may report an error message indicating that a hardware failure occurs at the target location.
  • After the memory controller finds that the data at the target location is wrong, it corrects the data, writes it back, and rereads it, so as to further determine the type of error that occurred at the target location.
  • Error information is reported when it is determined that a hardware failure occurs at the target location. This reduces the frequent reporting of error information due to soft failures and improves the accuracy of the errors reported by the memory controller.
  • The memory controller writes the error-corrected data for the target location in the memory back to the target location.
  • The processor may initiate an instruction to the memory controller to read data from a target location.
  • The memory controller can read the second data from the target location.
  • When the memory controller determines that the second data has an error, it corrects the second data and writes the corrected data into the target location.
  • The memory controller can read data from the target location under the instruction of the processor and, when an error is found in the data, determine the type of error that occurred at the target location through error correction, write-back, and rereading. This ensures the accuracy of the error message reported later.
  • If the memory controller finds that the first data is error-free, the error at the target location may have been caused by a soft failure. In this case, the memory controller can record the target location.
  • The memory controller may report a soft failure message indicating that the number of errors caused by soft failures at the target location has exceeded the threshold.
  • For errors caused by soft failures, the memory controller only records the target location, which reduces the frequency of reporting error messages to a certain extent.
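The record-then-report behavior for soft failures can be sketched as a per-location counter. This is a minimal illustration, not the patented implementation: the class name `SoftFailureLog`, the `record` method, and the threshold value of 3 are all assumptions, since the text does not give a data structure or a concrete threshold.

```python
from collections import Counter


class SoftFailureLog:
    """Hypothetical helper: counts soft failures per target location."""

    def __init__(self, threshold: int):
        self.threshold = threshold  # assumed value; the text gives no number
        self.counts = Counter()

    def record(self, location: int) -> bool:
        """Record one soft failure; return True once the count exceeds the threshold."""
        self.counts[location] += 1
        return self.counts[location] > self.threshold


log = SoftFailureLog(threshold=3)
# Five soft failures at the same location: only the 4th and 5th trigger a report.
reports = [log.record(0x1000) for _ in range(5)]
print(reports)
```

Under these assumptions, the first three events are only recorded, and a soft failure message would be due from the fourth event onward.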
  • A read isolation configuration can be performed on the target location. The read isolation configuration shields read operations on the target location by components other than the memory controller.
  • With the read isolation configuration in place, the memory controller reads the first data from the target location of the memory. After the first data is read, the read isolation configuration may be released.
  • The read isolation configuration is applied to the target location before the first data is read and released after the first data is read, which reduces the chance that other components read first data that contains errors.
  • A write isolation configuration can be performed on the target location. The write isolation configuration shields write operations on the target location by components other than the memory controller. After the memory controller writes the first data to the target location, the write isolation configuration can be released.
  • The write isolation configuration is applied to the target location before the first data is written and released after the first data is written, which reduces the chance that other components modify first data that contains errors.
  • The memory controller may perform error correction, write-back, and rereading on the first data multiple times. If it determines that the reread data still contains an error, it may determine that a hardware failure has occurred at the target location.
  • An embodiment of the present application also provides a fault detection device, which has the function of realizing the behavior in the method example of the first aspect above. For the beneficial effects, refer to the description of the first aspect; they are not repeated here.
  • The functions described above may be implemented by hardware, or by hardware executing corresponding software.
  • The hardware or software includes one or more units corresponding to the above functions.
  • The structure of the fault detection device includes an error correction unit, a reading unit, and a processing unit. These units can perform the corresponding functions in the method example of the first aspect above. For details, refer to the detailed description in the method example, which is not repeated here.
  • An embodiment of the present application also provides a memory controller. The memory controller has the function of implementing the behavior in the method example of the first aspect above. For the beneficial effects, refer to the description of the first aspect; they are not repeated here.
  • The structure of the memory controller may include a processing module and a memory. The processing module is configured to support the memory controller in performing the corresponding functions of the memory controller in the method of the first aspect above.
  • The memory is coupled with the processing module and stores the necessary program instructions and data of the memory controller.
  • The memory controller may also include an interface for communicating with other components or devices, for example, to receive instructions from the processor.
  • The structure of the memory controller may also include a processing module and an interface. The processing module is configured to support the memory controller in performing the corresponding functions in the method of the first aspect above.
  • The processing module can also communicate through the interface, for example, to receive instructions from the processor.
  • The present application also provides a computer-readable storage medium. Instructions are stored in the computer-readable storage medium, and when the instructions are run on a computer, the computer executes the method described in the first aspect above and each possible implementation of the first aspect.
  • The present application further provides a computer program product including instructions which, when run on a computer, cause the computer to execute the method described in the first aspect above and each possible implementation of the first aspect.
  • The present application also provides a computer chip. The chip is connected to a memory and is used to read and execute the software program stored in the memory, so as to implement the method described in the first aspect above and each possible implementation of the first aspect.
  • FIG. 1 is a schematic diagram of a single board architecture.
  • FIG. 2 is a schematic diagram of a system architecture provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of a memory hardware fault detection method provided in an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of a fault detection device provided in an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a memory controller provided by an embodiment of the present application.
  • FIG. 1 is a schematic diagram of a single board in a computer.
  • A central processing unit (CPU), a memory controller, and a memory may be disposed on a single board of a computer.
  • The data that the processor needs to compute on, or the data obtained from the computer's external storage, can be temporarily stored in the memory.
  • The memory includes multiple storage units, and each storage unit includes a transistor and a capacitor. The state of the capacitor determines whether the data stored in the storage unit is 0 or 1.
  • The memory controller can control the memory to realize data exchange between the memory and the CPU.
  • The CPU may initiate an instruction to the memory controller to request the memory controller to read data from the memory.
  • After the memory controller reads the data from the memory, it verifies the read data. If the data passes verification, the memory controller feeds the data back to the processor. If the data fails verification, the data contains errors, and the memory controller can correct the erroneous data. If the error correction succeeds, the memory controller can feed back the corrected data to the processor.
  • Take error checking and correction (ECC) as an example.
  • When the memory controller needs to write data into the memory, it calculates a check code from the data to be written and writes the data and its check code into the memory together.
  • When the memory controller needs to read data from the memory, it also reads the check code stored in the memory.
  • The memory controller can use the data it has read to recalculate the check code.
  • The memory controller compares the check code read from the memory with the recalculated check code. If they are consistent, the data is normal and the check is passed. If they are not consistent, a data error has occurred and the check has failed. If the check fails, the memory controller can correct the read data.
  • Data errors are divided into correctable errors (CE) and uncorrectable errors (UCE).
  • A correctable error usually refers to a single erroneous bit in the data.
  • The memory controller can detect through verification which bit of the data has an error, and the error can be corrected by flipping the erroneous bit.
  • Some correctable errors that occur in memory do not repeat at a fixed location. That is, this type of correctable error is temporary and can usually be considered to be caused by a soft failure.
  • A soft failure arises as follows: when high-energy subatomic particles pass through a storage unit in the memory, free charges are generated, and these free charges gather on the circuit nodes inside the storage unit within a very short time. When the accumulated free charge exceeds a certain level, the data stored in the storage unit changes, and a data error occurs. A soft failure does not permanently damage the circuit in the storage unit. Once the data is corrected and written back to the memory, the data error is recovered.
  • Uncorrectable errors usually refer to multiple erroneous bits in the data.
  • The memory controller cannot correct errors of this type.
  • The cause of a UCE is generally considered to be a hardware failure of the storage unit in the memory, such as the breakdown of a transistor or capacitor in the storage unit.
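The write-time check code, the read-time comparison, and the CE/UCE distinction can be illustrated with a small single-error-correcting, double-error-detecting Hamming code over one byte. This is a sketch under stated assumptions: the text does not say which ECC algorithm is used, and the names `LABELS`, `check_code`, and `read_and_check` are invented for the example.

```python
# Each of the 8 data bits gets a distinct Hamming position that is not a power of two.
LABELS = [3, 5, 6, 7, 9, 10, 11, 12]


def popcount(x: int) -> int:
    return bin(x).count("1")


def check_code(data: int) -> int:
    """Return 5 check bits: 4 Hamming bits plus 1 overall parity bit (bit 4)."""
    hamming = 0
    for i, label in enumerate(LABELS):
        if (data >> i) & 1:
            hamming ^= label
    parity = (popcount(data) + popcount(hamming)) & 1
    return (parity << 4) | hamming


def read_and_check(data: int, stored: int):
    """Compare stored and recalculated check codes; return (status, data)."""
    hamming_stored = stored & 0xF
    parity_stored = stored >> 4
    syndrome = hamming_stored ^ (check_code(data) & 0xF)
    # Parity over everything as read: even when nothing (or an even number of bits) flipped.
    overall = (popcount(data) + popcount(hamming_stored) + parity_stored) & 1
    if syndrome == 0 and overall == 0:
        return "ok", data
    if overall == 1:
        # Odd parity: treated as a single-bit error, which is correctable.
        if syndrome in LABELS:
            data ^= 1 << LABELS.index(syndrome)  # flip the erroneous data bit
        return "CE", data
    return "UCE", data  # even parity but nonzero syndrome: two bits are wrong


data = 0b10110010
code = check_code(data)
clean = read_and_check(data, code)
single = read_and_check(data ^ 0b100, code)   # one flipped bit
double = read_and_check(data ^ 0b101, code)   # two flipped bits
print(clean, single, double[0])
```

In this toy code a single flipped bit is located by the syndrome and flipped back, while two flipped bits are detected but cannot be located, which matches the CE and UCE cases described above.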
  • The memory controller may report any CE or UCE it finds to the processor.
  • How CEs and UCEs are handled is determined by the CPU. Because CEs can usually be corrected, the CPU only needs to record them. If too many CEs are reported and a threshold is exceeded, the CPU can trigger a board reset. For a UCE, the CPU can report an alarm or reset the board.
  • If the memory controller reports all detected CEs and UCEs to the CPU, it places too much burden on the CPU and occupies more CPU resources. If the CPU performs operations such as reporting an alarm or resetting a board, normal services are also affected.
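The CPU-side handling described above can be sketched as a small policy object. The class name `CpuErrorPolicy`, the action strings, and the threshold of 2 are hypothetical; the text only states that CEs are recorded until a threshold is exceeded and that a UCE leads to an alarm or a board reset.

```python
from dataclasses import dataclass


@dataclass
class CpuErrorPolicy:
    """Hypothetical model of how the CPU reacts to reported CEs and UCEs."""
    ce_threshold: int  # assumed; the text gives no concrete value
    ce_count: int = 0

    def handle(self, kind: str) -> str:
        if kind == "UCE":
            return "alarm"  # the text allows an alarm or a board reset here
        if kind == "CE":
            self.ce_count += 1
            return "reset_board" if self.ce_count > self.ce_threshold else "record"
        raise ValueError(f"unknown error kind: {kind}")


policy = CpuErrorPolicy(ce_threshold=2)
actions = [policy.handle(kind) for kind in ["CE", "CE", "CE", "UCE"]]
print(actions)
```

Every reported event costs the CPU one call to `handle` in this model, which is the burden the application aims to reduce by filtering reports in the memory controller.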
  • FIG. 2 is a schematic structural diagram of a system provided by an embodiment of the present application. The system includes a memory controller 110 and a memory 120.
  • The memory controller 110 is used for managing the memory 120, such as managing storage space in the memory 120, reading data from the memory 120, or writing data into the memory 120.
  • The memory controller 110 has the following functions:
  • The memory controller 110 can check data read from the memory 120 and detect erroneous data in it.
  • The embodiment of the present application does not limit the manner in which the memory controller 110 detects that there is erroneous data in the read data.
  • The memory controller 110 may use an ECC algorithm to check the read data, or may use other methods. Any method capable of detecting erroneous data is applicable to this embodiment of the present application.
  • The memory controller 110 can also correct the erroneous data in the data and rewrite the corrected data into the memory 120.
  • The memory controller 110 can report data errors.
  • Erroneous data is usually caused by a failure of the storage unit storing the data, such as a soft failure or a hardware failure of the storage unit.
  • To control its reporting frequency, the memory controller 110 does not need to report all detected data errors, but only reports data errors caused by hardware failures. For example, the memory controller 110 may report that a hardware error occurs at a location storing erroneous data.
  • The memory controller 110 may include a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic devices, discrete hardware components, artificial intelligence chips, chips-on-chip, etc.
  • The memory 120 may include volatile memory, such as random access memory (RAM) or dynamic random access memory (DRAM). It may also include non-volatile memory, such as storage-class memory (SCM), or a combination of volatile memory and non-volatile memory.
  • The embodiment of the present application does not limit where the system shown in FIG. 2 is deployed.
  • The system may reside in a computer, and a processor in the computer may instruct the memory controller 110 to write data into the memory 120 or read data from the memory 120.
  • When the memory controller 110 finds a data error caused by a hardware failure, it may report the error information to the processor and notify the processor that there is a hardware failure at the location where the data is stored.
  • When the memory controller 110 reads data from a certain location in the memory 120 and detects an error in the data, it can correct the data and then write the corrected data back to that location. The memory controller 110 then reads the error-corrected data from the location again. If the memory controller 110 determines that the error-corrected data still has errors, it can report an error message indicating that a hardware failure has occurred at the location storing the data.
  • The memory controller 110 no longer needs to report data errors caused by soft failures, but only reports hardware failures that have a significant impact on the performance of the memory 120, effectively reducing the frequency with which the memory controller 110 reports error messages.
  • Data errors caused by hardware failures can be accurately identified by writing back the error-corrected data and rereading it, which improves the accuracy of the reports of the memory controller 110.
  • A method for detecting a hardware failure of the memory 120 provided by the embodiment of the present application is described below with reference to FIG. 3.
  • Step 301: The processor sends a data read instruction to the memory controller 110. The data read instruction is used to request to read data from a target location.
  • The data read instruction may carry a logical address of the data, and the logical address of the data may indicate a target location in the memory 120.
  • Step 302: After receiving the data read instruction, the memory controller 110 reads the data A from the target location in the memory 120. After receiving the data read instruction, the memory controller 110 may determine the physical address of the data according to the logical address of the data, and read the data A from the target location indicated by the physical address.
  • Step 303: After reading the data A, the memory controller 110 detects whether there is an error in the data A. If there is no error in the data A, step 304 is executed; if there is an error in the data A, step 305 is executed.
  • The memory controller 110 can determine whether there is an error in the data A. Note that an error in the data A means that some or all of the data A differs from the data A as it was before being written into the target location. For convenience of description, the erroneous part of the data A is referred to as error data.
  • Step 304: The memory controller 110 feeds back a data read response to the processor, and the data read response carries the data A.
  • Step 305: The memory controller 110 performs error correction on the data A. If the error correction fails, step 306 is executed; if the error correction succeeds, step 307 is executed.
  • The memory controller 110 can also determine the error data in the data A, and can then determine where in the target location the error data is stored.
  • The data A is measured in bits, and the location where the error data is stored may also be referred to as an error bit in this embodiment of the present application.
  • Step 306: The memory controller 110 feeds back the data A to the processor and reports a first error message to the processor. The first error message indicates that an uncorrectable error occurs on the error bit or that a hardware failure occurs on the error bit.
  • Data errors are classified here into uncorrectable errors and correctable errors, but this embodiment of the present application does not limit the specific classification of data errors.
  • Data errors due to hardware failures can also be given other names.
  • The manner in which the first error message indicates a hardware failure is not limited.
  • The first error message may indicate directly that a hardware failure has occurred on the error bit, or may indicate indirectly that an uncorrectable error (or another data error caused by a hardware failure) has occurred on the error bit.
  • Step 307: The error-corrected data A is referred to as data B. The memory controller 110 feeds back the data B to the processor.
  • Step 308: The memory controller 110 writes the data B back to the target location.
  • The memory controller 110 may write the whole of the data B back to the target location, or may write back data only for the error bits.
  • The memory controller 110 may perform error correction on the error data and write the corrected error data back to the error bits.
  • Step 309: The memory controller 110 reads data from the target location again and detects whether there is an error in the read data.
  • The memory controller 110 performs a data re-read operation. The re-read operation is not performed under the instruction of the processor, but is initiated by the memory controller 110 itself.
  • The memory controller 110 may read all the data stored at the target location and detect whether there is an error in the data stored at the target location.
  • The memory controller 110 may also read only the data stored on the error bit and detect whether there is an error in the data stored on the error bit.
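  • If there is no error in the read data (this data can be the data stored at the target location or the data stored on the error bit), step 310 is executed.
  • If there is an error in the read data (this data can be the data stored at the target location or the data stored on the error bit), step 311 is executed.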
  • Step 310: The memory controller 110 records the error bit. If there is no error in the read data, the error bit may have suffered a soft failure, and the wrong data appeared only temporarily. The memory controller 110 may record the error bit without reporting it.
  • Step 311: The memory controller 110 reports a second error message. The second error message indicates that an uncorrectable error occurs on the error bit or that a hardware failure exists.
  • The second error message indicates this in a similar manner to the first error message. For details, refer to the foregoing content, which is not repeated here.
  • The memory controller 110 can determine the error type of the error bit by writing back the error data after error correction and rereading the data from the error bit. For example, it can determine whether the fault of the error bit is caused by a soft failure or a hardware failure.
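Steps 302 to 311 can be traced end to end on a toy model. The sketch below is illustrative only: `ToyMemory`, its `expected` copy (a stand-in for what a real ECC check code would let the controller reconstruct), the stuck-at-1 fault model, and `handle_read` are all assumptions that do not appear in the application.

```python
HARDWARE_FAILURE = "hardware failure"


class ToyMemory:
    """Toy memory; `expected` stands in for what real ECC could reconstruct."""

    def __init__(self):
        self.cells = {}       # location -> value as physically stored
        self.expected = {}    # location -> value that was meant to be stored
        self.stuck_at_1 = {}  # location -> mask of bits stuck at 1 (hard fault)

    def write(self, loc, value):
        self.expected[loc] = value
        self.cells[loc] = value | self.stuck_at_1.get(loc, 0)

    def read(self, loc):
        return self.cells[loc]


def handle_read(mem, loc, soft_log):
    """Return (data fed back to the processor, error report or None)."""
    data_a = mem.read(loc)                       # step 302
    wrong_bits = data_a ^ mem.expected[loc]      # step 303: detect errors
    if wrong_bits == 0:
        return data_a, None                      # step 304
    if bin(wrong_bits).count("1") > 1:           # step 305: correction fails
        return data_a, HARDWARE_FAILURE          # step 306: first error message
    data_b = data_a ^ wrong_bits                 # steps 305 and 307: data B
    mem.write(loc, data_b)                       # step 308: write back
    if mem.read(loc) != mem.expected[loc]:       # step 309: re-read and check
        return data_b, HARDWARE_FAILURE          # step 311: second error message
    soft_log.append(loc)                         # step 310: record only
    return data_b, None


mem, soft_log = ToyMemory(), []
mem.write(0x20, 0x5A)
mem.cells[0x20] ^= 0b1         # transient single-bit flip (soft failure)
mem.stuck_at_1[0x10] = 0b100   # bit 2 of this location always reads as 1
mem.write(0x10, 0x00)

soft_result = handle_read(mem, 0x20, soft_log)
hard_result = handle_read(mem, 0x10, soft_log)
print(soft_result, hard_result, soft_log)
```

In this model the transient flip disappears after write-back, so the location is only recorded, while the stuck bit survives the write-back and is reported as a hardware failure even though the error itself was correctable.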
  • When the memory controller 110 writes the corrected error data back to the error bit or rereads the data from the error bit, it can apply a write isolation configuration to the error bit. The write isolation configuration is used to shield write operations on the error bit by components other than the memory controller 110.
  • The memory controller 110 may also apply a read isolation configuration to the error bit. The read isolation configuration is used to shield components other than the memory controller 110 from reading the error bit.
  • With these isolation configurations in place, the memory controller 110 can accurately determine the error type of the error bit. In addition, other components are prevented from reading wrong data from the error bit or from corrupting the data written to the error bit.
  • The memory controller 110 may initiate an instruction to the processor, and the instruction is used to request a write isolation configuration or a read isolation configuration for the error bit.
  • The processor can use its internal registers to perform write back pressure configuration and read back pressure configuration for the error bit. In this way, even if the processor needs to write or read the error bit, it will not read or write the error bit through the memory controller 110, because its internal registers are configured with write back pressure and read back pressure for the error bit. This achieves the effect of a write isolation configuration or read isolation configuration for the error bit.
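The apply-then-release pattern of the read and write isolation configurations can be modeled with a context manager. This is a software analogy only: the application describes isolation via back pressure configured in processor registers, and the names `IsolationTable`, `isolate`, and `may_access` are hypothetical.

```python
from contextlib import contextmanager


class IsolationTable:
    """Hypothetical bookkeeping for locations under read/write isolation."""

    def __init__(self):
        self.read_isolated = set()
        self.write_isolated = set()

    @contextmanager
    def isolate(self, loc):
        # Apply both configurations before the write-back and re-read...
        self.read_isolated.add(loc)
        self.write_isolated.add(loc)
        try:
            yield
        finally:
            # ...and release them afterwards, even if the check raises.
            self.read_isolated.discard(loc)
            self.write_isolated.discard(loc)

    def may_access(self, loc, op, requester):
        if requester == "memory_controller":
            return True  # isolation never blocks the memory controller itself
        blocked = self.read_isolated if op == "read" else self.write_isolated
        return loc not in blocked


table = IsolationTable()
with table.isolate(0x40):
    cpu_read_during = table.may_access(0x40, "read", "cpu")
    mc_write_during = table.may_access(0x40, "write", "memory_controller")
cpu_read_after = table.may_access(0x40, "read", "cpu")
print(cpu_read_during, mc_write_during, cpu_read_after)
```

Using a context manager here is a design choice for the sketch: it guarantees that the isolation is released after the re-read, which mirrors the release step described in the text.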
  • In the above description, the memory controller 110 performing a write-back and re-read operation on an error bit is taken as an example.
  • When determining the error type of the error bit, the memory controller 110 may also perform multiple write-back and re-read operations on the error bit. If the data read from the error bit still contains an error after multiple write-backs and re-reads, it can be determined that a hardware failure has occurred on the error bit, which ensures the accuracy of the determined error type.
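The multi-round variant can be sketched as a bounded retry loop. The function `classify_error_bit`, the default of three rounds, and the `Cell` fault model are assumptions for illustration; the text does not state how many rounds are used, and treating the first clean re-read as a soft failure is this sketch's reading of the passage.

```python
from typing import Callable


def classify_error_bit(write_back: Callable[[], None],
                       reread_has_error: Callable[[], bool],
                       rounds: int = 3) -> str:
    """Repeat write-back and re-read; report hardware failure only if every round fails."""
    for _ in range(rounds):
        write_back()
        if not reread_has_error():
            return "soft failure"
    return "hardware failure"


class Cell:
    """Toy one-location memory with optional bits stuck at 1."""

    def __init__(self, value: int, stuck_mask: int = 0):
        self.stuck_mask = stuck_mask
        self.value = value | stuck_mask

    def write(self, value: int) -> None:
        self.value = value | self.stuck_mask


healthy = Cell(value=0b1)              # holds a transient wrong bit; correct value is 0
stuck = Cell(value=0, stuck_mask=0b1)  # bit 0 can never be cleared

soft = classify_error_bit(lambda: healthy.write(0), lambda: healthy.value != 0)
hard = classify_error_bit(lambda: stuck.write(0), lambda: stuck.value != 0)
print(soft, hard)
```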
  • The embodiment of the present application also provides a fault detection device, which is used to execute the method performed by the memory controller in the method embodiment shown in FIG. 3 above. For the relevant features, refer to the foregoing method embodiments; details are not repeated here.
  • The fault detection device 400 may be deployed in a memory controller.
  • The fault detection device 400 includes an error correction unit 401, a reading unit 402, and a processing unit 403.
  • The error correction unit 401 is configured to correct the data at the target location in the memory and write the error-corrected data to the target location.
  • The reading unit 402 is configured to read the first data from the target location after the error correction unit 401 has written the error-corrected data to the target location.
  • The processing unit 403 is configured to report an error message when it is determined that there is an error in the first data. The error message indicates that a hardware failure occurs at the target location.
  • The reading unit 402 can read the second data from the target location under the instruction of the processor. When the error correction unit 401 determines that there is an error in the second data, it can perform error correction on the second data and write the error-corrected data into the target location.
  • If the processing unit 403 determines that there is no error in the first data, it records the target location.
  • Before reading the first data, the reading unit 402 can also perform a read isolation configuration on the target location.
  • The read isolation configuration is used to shield read operations on the target location by components other than the memory controller. The reading unit 402 reads the first data from the target location of the memory and releases the read isolation configuration after reading the first data.
  • Before writing the first data, the error correction unit 401 can also perform a write isolation configuration on the target location. The write isolation configuration is used to shield write operations on the target location by components other than the memory controller. The error correction unit 401 writes the first data to the target location and releases the write isolation configuration after the first data is written.
  • The error correction unit 401 may perform error correction, write-back, and rereading on the data at the target location multiple times, so as to accurately determine the type of error that occurred at the target location.
  • For example, the error correction unit 401 may perform error correction on the first data and write the error-corrected first data into the target location again. After that, the processing unit 403 reads the error-corrected first data from the target location and determines that there is an error in the error-corrected first data.
  • each functional unit in the embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.
  • the above-mentioned embodiments may be implemented in whole or in part by software, hardware, firmware or other arbitrary combinations.
  • the above-described embodiments may be implemented in whole or in part in the form of computer program products.
  • the computer program product includes one or more computer instructions. When the computer program instructions are loaded or executed on the computer, the processes or functions according to the embodiments of the present invention will be generated in whole or in part.
  • the computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center that includes one or more sets of available media.
  • the available media may be magnetic media (eg, floppy disk, hard disk, magnetic tape), optical media (eg, DVD), or semiconductor media.
  • the semiconductor medium may be a solid state drive (SSD).
  • the memory controller 500 shown in FIG. 5 includes a processing module 501 .
  • a memory 502 and an interface 503 may also be included.
  • the memory 502 can be a volatile memory, such as a random access memory; the memory can also be a nonvolatile memory, such as a read-only memory, a flash memory, a hard disk (hard disk drive, HDD) or a solid-state drive (solid-state drive, SSD), or memory 502 is any other medium that can be used to carry or store desired program code in the form of instructions or data structures and can be accessed by a computer, but is not limited thereto.
  • the memory 502 may be a combination of the above-mentioned memories.
  • the processing module 501 may be a general-purpose processor, a DSP, an ASIC, an FPGA or another programmable logic device, discrete gate or transistor logic, discrete hardware components, an artificial intelligence chip, a system-on-chip, or the like.
  • a general purpose processor may be a microprocessor or any conventional processor or the like.
  • the processing module 501 in FIG. 5 may perform the method performed by the memory controller in any of the method embodiments above, or may invoke computer-executable instructions stored in the memory 502 so that the memory controller can perform that method.
  • the functions/implementation process of the error correction unit 401 , the reading unit 402 , and the processing unit 403 in FIG. 4 can all be realized by the processing module 501 in FIG. 5 .
  • the embodiments of the present application may be provided as methods, systems, or computer program products. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising an instruction apparatus, which implements the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Techniques For Improving Reliability Of Storages (AREA)

Abstract

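A memory hardware fault detection method, relating to the field of storage technologies; a memory hardware fault detection apparatus and a memory controller are also disclosed. The method includes: after correcting the data at a target location in memory and writing the corrected data to the target location, the memory controller reads first data from the target location; if the memory controller determines that the first data contains an error, it reports an error message indicating that a hardware fault has occurred at the target location. After finding that the data at the target location is erroneous, the memory controller corrects the data, writes it back, and re-reads it to further determine the type of error at the target location. This reduces frequent error reporting caused by soft failures and improves the accuracy of the error messages the memory controller reports.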

Description

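Memory hardware fault detection method and apparatus, and memory controller
Cross-reference to related applications
This application claims priority to Chinese patent application No. 202111415009.5, filed with the China Patent Office on November 25, 2021 and entitled "Memory hardware fault detection method and apparatus, and memory controller", which is incorporated herein by reference in its entirety.
Technical field
This application relates to the field of storage technologies, and in particular to a memory hardware fault detection method and apparatus, and a memory controller.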
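Background
Memory is one of the key components of a computer. It temporarily holds the data that the computer's processor operates on, as well as data exchanged with external storage such as a hard disk. The memory controller is located inside the computer; it manages the memory and is responsible for data exchange between the memory and the processor.
When the memory controller reads data at the processor's instruction, it can check the data it has read to determine whether the data contains an error. If it does, the memory controller reports the error to the processor, and the processor decides how to handle it, for example by raising an alarm or performing a reset.
However, errors in read data are often caused by soft failures, and such errors are usually temporary. If the memory controller frequently reports errors caused by soft failures, it increases the processor's load and creates unnecessary work.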
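Summary
This application provides a memory hardware fault detection method and apparatus, and a memory controller, to improve the accuracy of the errors reported by the memory controller.
According to a first aspect, an embodiment of this application provides a memory hardware fault detection method that can be performed by a memory controller. At the processor's instruction, the memory controller reads data from a target location in memory; if it finds that the data is erroneous, it corrects the data and writes the corrected data to the target location. To further determine the type of error that occurred at the target location, the memory controller then reads first data from the target location after the write-back. If it determines that the first data contains an error, the memory controller can report an error message indicating that a hardware fault has occurred at the target location.
With this method, after finding that the data at the target location is erroneous, the memory controller corrects the data, writes it back, and re-reads it to further determine the type of error at the target location, and reports an error message when it determines that a hardware fault has occurred there. This reduces frequent error reporting caused by soft failures and improves the accuracy of the errors the memory controller reports.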
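In a possible implementation, the memory controller may come to correct the data at the target location and write it back in many situations. For example, the processor may send the memory controller an instruction to read data from the target location. After receiving the instruction, the memory controller reads second data from the target location. If it determines that the second data is erroneous, the memory controller corrects the second data and writes the corrected data to the target location.
With this method, the memory controller reads data from the target location at the processor's instruction and, if the data contains an error, then determines the type of error at the target location through correction, write-back, and re-read operations, which ensures the accuracy of any error message reported afterwards.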
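In a possible implementation, if the memory controller finds that the first data contains no error, the error at the target location was probably caused by a soft failure. In this case, the memory controller may record the target location.
When the memory controller determines that the number of soft-failure errors at the target location exceeds a threshold, it may report a soft-failure message indicating that the number of soft-failure errors at the target location has exceeded the threshold.
With this method, the memory controller merely records the target location for errors caused by soft failures, which reduces the frequency of error messages to some extent.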
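The record-and-threshold behavior described above can be sketched as follows. This is a minimal illustration only: the class name `SoftFailureLog`, the message text, and the threshold value are assumptions made for the example and do not come from the application.

```python
from collections import Counter


class SoftFailureLog:
    """Counts soft-failure errors per location (hypothetical helper, not from the application)."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.counts = Counter()

    def record(self, location):
        """Record one soft-failure error; return a message once the count exceeds the threshold."""
        self.counts[location] += 1
        if self.counts[location] > self.threshold:
            return f"soft-failure count exceeded at {location:#x}"
        return None


log = SoftFailureLog(threshold=2)
# The first two errors at location 0x40 are only recorded; the third exceeds the threshold.
results = [log.record(0x40) for _ in range(3)]
```

Keeping the counter inside the controller means the processor hears about a location only after repeated soft failures, which matches the stated goal of reporting less often.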
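In a possible implementation, when reading the first data from the target location in memory, the memory controller may apply a read isolation configuration to the target location. The read isolation configuration blocks components other than the memory controller from reading the target location. With the read isolation configuration in place, the memory controller reads the first data from the target location, and may remove the read isolation configuration after the read.
With this method, the read isolation configuration is applied to the target location before the first data is read and removed afterwards, which reduces the chance that other components read erroneous first data.
In a possible implementation, when writing the first data to the target location, the memory controller may apply a write isolation configuration to the target location. The write isolation configuration blocks components other than the memory controller from writing to the target location. After the memory controller has written the first data to the target location, it may remove the write isolation configuration.
With this method, the write isolation configuration is applied to the target location before the first data is written and removed afterwards, which reduces the chance that other components modify first data that is itself erroneous.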
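In a possible implementation, before reporting the error message, the memory controller may correct, write back, and re-read the first data multiple times; if the re-read data still contains an error, it can determine that a hardware fault has occurred at the target location.
With this method, correction, write-back, and re-read are performed multiple times so that the type of error at the target location is determined accurately.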
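According to a second aspect, an embodiment of this application further provides a fault detection apparatus that has the function of implementing the behavior in the method examples of the first aspect. For the beneficial effects, refer to the description of the first aspect; they are not repeated here. The function may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more units corresponding to the function. In a possible design, the fault detection apparatus includes an error correction unit, a reading unit, and a processing unit, which can perform the corresponding functions in the method examples of the first aspect. For details, refer to the method examples; they are not repeated here.
According to a third aspect, an embodiment of this application further provides a memory controller that has the function of implementing the behavior in the method examples of the first aspect. For the beneficial effects, refer to the description of the first aspect; they are not repeated here.
In a possible implementation, the memory controller may include a processing module and a memory. The processing module is configured to support the memory controller in performing the corresponding functions of the memory controller in the method of the first aspect. The memory is coupled to the processing module and stores the program instructions and data required by the memory controller. The memory controller may further include an interface for communicating with other components or apparatuses, for example, to receive instructions from the processor.
In another possible implementation, the memory controller may include a processing module and an interface. The processing module is configured to support the memory controller in performing the corresponding functions in the method of the first aspect, and may communicate through the interface, for example, to receive instructions from the processor.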
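According to a fourth aspect, this application further provides a computer-readable storage medium storing instructions that, when run on a computer, cause the computer to perform the method of the first aspect or of any possible implementation of the first aspect.
According to a fifth aspect, this application further provides a computer program product containing instructions that, when run on a computer, cause the computer to perform the method of the first aspect or of any possible implementation of the first aspect.
According to a sixth aspect, this application further provides a computer chip connected to a memory; the chip is configured to read and execute a software program stored in the memory to perform the method of the first aspect or of any possible implementation of the first aspect.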
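Brief description of the drawings
FIG. 1 is a schematic diagram of a board architecture;
FIG. 2 is a schematic diagram of a system architecture according to an embodiment of this application;
FIG. 3 is a schematic diagram of a memory hardware fault detection method according to an embodiment of this application;
FIG. 4 is a schematic structural diagram of a fault detection apparatus according to an embodiment of this application;
FIG. 5 is a schematic structural diagram of a memory controller according to an embodiment of this application.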
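Detailed description
FIG. 1 is a schematic diagram of a board in a computer. A central processing unit (CPU), a memory controller, and memory may be deployed on the board.
The memory temporarily holds data that the processor needs to operate on, or data obtained from the computer's external storage (such as a hard disk). The memory includes multiple storage cells; each storage cell includes one transistor and one capacitor, and the state of the capacitor determines whether the cell stores 0 or 1.
The memory controller controls the memory to exchange data between the memory and the CPU. When the CPU needs to read data from memory, it sends the memory controller an instruction requesting the read. After reading the data from memory, the memory controller checks it. If the check passes, the memory controller returns the data to the processor. If the check fails, the data contains errors; the memory controller can correct the erroneous data and, if correction succeeds, return the corrected data to the processor.
Error checking and correction (ECC) is an error checking and correction algorithm applicable to memory, caches, and other storage media.
Take a memory controller running ECC as an example. When the memory controller needs to write data to memory, it computes a check code from the data to be written and writes the data and its check code to memory together.
When the memory controller reads data from memory, it also reads the check code stored with it. The memory controller recomputes a check code from the data it has read and compares it with the check code read from memory. If the two match, the data is intact and the check passes. If they do not match, a data error has occurred and the check fails, in which case the memory controller can correct the read data.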
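The write-then-verify pattern described above can be sketched as follows. This is an illustration under stated assumptions: `zlib.crc32` stands in for the check code purely because it is readily available, and it only detects errors, whereas a real ECC code can also locate and correct them. The class `CheckedMemory` and the addresses are hypothetical.

```python
import zlib


class CheckedMemory:
    """Toy memory that stores each value together with a check code."""

    def __init__(self):
        self.cells = {}  # address -> (data, check code)

    def write(self, addr, data):
        # Compute the check code from the data and store both together.
        self.cells[addr] = (data, zlib.crc32(data))

    def read(self, addr):
        # Recompute the check code from the data read and compare with the stored one.
        data, stored_code = self.cells[addr]
        return data, zlib.crc32(data) == stored_code


mem = CheckedMemory()
mem.write(0x10, b"\x0f\xf0")
_, ok_before = mem.read(0x10)

# Simulate a bit flip in the stored data while leaving the check code untouched.
_, code = mem.cells[0x10]
mem.cells[0x10] = (b"\x0f\xf1", code)
_, ok_after = mem.read(0x10)
```

Here `ok_before` reflects a passing check and `ok_after` a failing one, since the recomputed code no longer matches the stored code.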
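Data errors are classified as correctable errors (CE) and uncorrectable errors (UCE).
A correctable error usually means that a single bit of the data is wrong. Through the check, the memory controller can identify which bit is wrong and correct it by flipping that bit. Some of the correctable errors that occur in memory do not recur at a fixed location; such correctable errors are temporary and can generally be attributed to soft failures. A soft failure arises as follows: when a high-energy subatomic particle passes through a storage cell in memory, free charge is generated, and within a very short time this charge accumulates on circuit nodes inside the cell. Once the accumulated charge exceeds a certain level, the data stored in the cell changes, producing a data error. A soft failure does not permanently damage the circuitry in the storage cell; once the corrected data is written back to memory, the data error is resolved.
An uncorrectable error usually means that multiple bits of the data are wrong. The memory controller cannot correct this type of error. UCEs are generally attributed to hardware faults in storage cells, such as breakdown of a cell's transistor or capacitor.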
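The distinction between a single-bit correctable error and a multi-bit uncorrectable error can be illustrated with a small single-error-correcting, double-error-detecting code. The sketch below uses an extended Hamming(8,4) code as an assumption for illustration; the application does not specify which ECC code the memory controller uses, and the function names are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit codeword (list of bits; index 0 is the overall parity)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    w = [0] * 8
    w[3], w[5], w[6], w[7] = d          # data bits sit at positions 3, 5, 6, 7
    w[1] = w[3] ^ w[5] ^ w[7]           # parity over positions with bit 0 set
    w[2] = w[3] ^ w[6] ^ w[7]           # parity over positions with bit 1 set
    w[4] = w[5] ^ w[6] ^ w[7]           # parity over positions with bit 2 set
    w[0] = sum(w[1:]) % 2               # overall parity over positions 1..7
    return w


def decode(word):
    """Return (status, nibble), where status is 'ok', 'CE' or 'UCE'."""
    w = list(word)
    syndrome = 0
    for pos in range(1, 8):
        if w[pos]:
            syndrome ^= pos             # XOR of set positions is 0 for a valid codeword
    overall = sum(w) % 2                # 0 when the overall parity holds
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        w[syndrome] ^= 1                # single-bit error: flip the bit the syndrome points at
        status = "CE"
    else:
        return "UCE", None              # parity holds but syndrome is non-zero: two bits wrong
    return status, w[3] | (w[5] << 1) | (w[6] << 2) | (w[7] << 3)


clean = encode(0b1011)
one_flip = list(clean)
one_flip[5] ^= 1
two_flips = list(one_flip)
two_flips[6] ^= 1
results = [decode(clean), decode(one_flip), decode(two_flips)]
```

With one flipped bit the decoder recovers the original value and reports a CE; with two flipped bits it can only report a UCE.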
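If the memory controller finds a data error while reading data, it can report the CE or UCE to the processor, and the CPU decides how to handle it. Because a CE can usually be corrected, the CPU only needs to record it; if the number of reported CEs exceeds a threshold, the CPU can trigger a board reset. For a UCE, the CPU can raise an alarm or reset the board.
Under the ECC mechanism, the memory controller reports every detected CE and UCE to the CPU, which places an excessive load on the CPU and consumes considerable CPU resources. If the CPU raises alarms or resets the board, normal services are also affected.
As the preceding description of CEs and UCEs shows, CEs caused by soft failures are temporary and do not need to be reported frequently. To reduce how often the memory controller reports errors to the CPU, an embodiment of this application provides a memory hardware fault detection method. The following describes the system, method, and devices of the embodiments of this application with reference to the accompanying drawings.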
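FIG. 2 is a schematic structural diagram of a system according to an embodiment of this application. The system includes a memory controller 110 and a memory 120.
The memory controller 110 manages the memory 120, for example, by managing the storage space in the memory 120, reading data from the memory 120, or writing data to the memory 120.
In the embodiments of this application, the memory controller 110 has the following functions:
1) Detection.
The memory controller 110 can check data read from the memory 120 to detect erroneous data in it. The embodiments of this application do not limit how the memory controller 110 detects erroneous data in the read data. For example, the memory controller 110 may use an ECC algorithm or another method. Any method capable of detecting erroneous data is applicable to the embodiments of this application.
2) Error correction.
If the memory controller 110 detects erroneous data, it can also correct the data to fix the erroneous part, and can write the corrected data back to the memory 120.
3) Reporting.
The memory controller 110 can report data errors. Erroneous data is usually caused by a failure of the storage cell holding the data, such as a soft failure or a hardware fault in the cell. To limit how often the memory controller 110 reports, in the embodiments of this application it does not report every detected data error, but only data errors caused by hardware faults. For example, the memory controller 110 may report that a hardware fault has occurred at the location storing the erroneous data.
The memory controller 110 may include a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, discrete gate or transistor logic, discrete hardware components, an artificial intelligence chip, a system-on-chip, or the like.
The memory 120 may include volatile memory, such as random access memory (RAM) or dynamic random access memory (DRAM). It may also include non-volatile memory, such as storage-class memory (SCM), or a combination of volatile and non-volatile memory.
The embodiments of this application do not limit where the system shown in FIG. 2 is deployed. For example, the system may be located in a computer, whose processor can instruct the memory controller 110 to write data to the memory 120 or read data from it. If, while reading data, the memory controller 110 finds a data error caused by a hardware fault, it can report the error to the processor to notify it that the location storing the data has a hardware fault.
Unlike the ECC mechanism, in the embodiments of this application, when the memory controller 110 reads data from a location in the memory 120 and detects an error in it, it can correct the data and write it back to that location. The memory controller 110 then reads the corrected data from that location again. If it determines that the corrected data still contains an error, it can report an error message indicating that a hardware fault has occurred at the location storing the data. In other words, the memory controller 110 no longer reports data errors caused by soft failures and reports only hardware faults that noticeably affect the performance of the memory 120, which effectively reduces how often it reports error messages. In addition, writing back the corrected data and re-reading it accurately identifies data errors caused by hardware faults, which improves the accuracy of the reports.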
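The following describes a hardware fault detection method for the memory 120 according to an embodiment of this application with reference to FIG. 3.
Step 301: The processor sends a data read instruction to the memory controller 110, requesting that data be read from a target location. The data read instruction may carry the logical address of the data, which indicates the target location in the memory 120.
Step 302: After receiving the data read instruction, the memory controller 110 reads data A from the target location in the memory 120. The memory controller 110 can determine the physical address of the data from its logical address and read data A from the target location indicated by the physical address.
Step 303: After reading data A, the memory controller 110 checks whether data A contains an error. If not, step 304 is performed; if so, step 305 is performed.
For how the memory controller 110 checks data for errors, see the preceding description; details are not repeated here. By checking data A, the memory controller 110 can determine whether it contains an error. Note that an error in data A means that part or all of data A is wrong, that is, differs from what it was before being written to the target location. For ease of description, the wrong part of data A is called the erroneous data.
Step 304: The memory controller 110 returns a data read response carrying data A to the processor.
Step 305: The memory controller 110 corrects data A. If correction fails, step 306 is performed; if correction succeeds, step 307 is performed.
If data A contains erroneous data, the memory controller 110 can also identify the erroneous data and thereby determine where in the target location it is stored. Data A is measured in bits, and in the embodiments of this application the position storing the erroneous data is also called the error bit.
Step 306: The memory controller 110 returns data A to the processor and reports a first error message to the processor. The first error message indicates that an uncorrectable error or a hardware fault has occurred at the error bit.
Note that data errors are divided here into uncorrectable and correctable errors, but the embodiments of this application do not limit the specific classification of data errors. In some scenarios, data errors caused by hardware faults may go by other names. Nor do the embodiments limit how the first error message indicates a hardware fault: it may indicate directly that a hardware fault has occurred at the error bit, or indicate indirectly that an uncorrectable error (or another data error caused by a hardware fault) has occurred at the error bit.
Step 307: Where data B is the result of correcting data A, the memory controller 110 returns data B to the processor.
Step 308: The memory controller 110 writes data B back to the target location. In this step, the memory controller 110 may write all of data B back to the target location, or write back only the error bit, that is, correct the erroneous data and write the corrected data back to the error bit.
Step 309: The memory controller 110 reads data from the target location again and checks whether the data read contains an error. This re-read is not performed at the processor's instruction; the memory controller 110 initiates it on its own.
In step 309, the memory controller 110 may read all the data at the target location and check whether the data stored there contains an error, or it may read only the data stored at the error bit and check that.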
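If the data read (either the data stored at the target location or the data stored at the error bit) contains no error, step 310 is performed; if it contains an error, step 311 is performed.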
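Step 310: The memory controller 110 records the error bit. If the data read contains no error, the error bit probably experienced a soft failure and held erroneous data only temporarily. The memory controller 110 may simply record the error bit without reporting.
Step 311: The memory controller 110 reports a second error message indicating that an uncorrectable error or a hardware fault has occurred at the error bit. If the data re-read from the error bit still contains an error after the corrected data has been written back, a hardware fault exists at the error bit, which causes the corrected data to become erroneous again once written there.
The second error message indicates the fault in a similar way to the first error message; see the preceding description for details.
In steps 308 to 311, by writing back the corrected data and re-reading from the error bit, the memory controller 110 can determine the error type of the error bit, that is, whether the failure at the error bit is caused by a soft failure or is a hardware fault.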
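The flow of steps 302 to 311 can be sketched in a few lines. The model below is an assumption-laden illustration, not the implementation described in the application: each location holds three copies of a byte (a repetition code standing in for ECC), a hardware fault is modeled as a bit stuck at 1 in one copy, and the names `Memory`, `handle_read`, and `soft_log` are hypothetical. Because a single faulty copy is always correctable here, the correction-failure branch (step 306) is omitted.

```python
class Memory:
    """Toy memory: each location holds three copies of a byte."""

    def __init__(self):
        self.cells = {}
        self.stuck = {}  # addr -> (copy index, bit mask forced to 1 on every write)

    def write(self, addr, copies):
        copies = list(copies)
        if addr in self.stuck:
            index, mask = self.stuck[addr]
            copies[index] |= mask
        self.cells[addr] = copies

    def read(self, addr):
        return list(self.cells[addr])


def has_error(copies):
    return len(set(copies)) > 1


def correct(copies):
    a, b, c = copies
    return [(a & b) | (a & c) | (b & c)] * 3  # bitwise majority vote


def handle_read(mem, addr, soft_log):
    """Return (data for the processor, error message or None)."""
    data_a = mem.read(addr)                   # step 302
    if not has_error(data_a):                 # step 303
        return data_a[0], None                # step 304
    data_b = correct(data_a)                  # step 305, assumed to succeed (data B of step 307)
    mem.write(addr, data_b)                   # step 308: write back
    if has_error(mem.read(addr)):             # step 309: re-read and check
        return data_b[0], "hardware fault"    # step 311
    soft_log.append(addr)                     # step 310: record only
    return data_b[0], None


mem = Memory()
soft_log = []
mem.write(0x1, [0x5A] * 3)
mem.cells[0x1][0] ^= 0x01                     # transient bit flip in one copy
soft_result = handle_read(mem, 0x1, soft_log)

mem.stuck[0x2] = (1, 0x80)                    # one copy at this location has a stuck bit
mem.write(0x2, [0x5A] * 3)
hard_result = handle_read(mem, 0x2, soft_log)
```

In both cases the processor receives the corrected value; only the location whose error survives the write-back produces a report.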
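When writing the corrected data back to the error bit or re-reading data from it, the memory controller 110 may apply a write isolation configuration to the error bit, which blocks components other than the memory controller 110 from writing to the error bit. The memory controller 110 may also apply a read isolation configuration to the error bit, which blocks components other than the memory controller 110 from reading the error bit.
Preventing other components (such as the processor) from reading or writing the error bit during write-back or re-read lets the memory controller 110 determine the error type of the error bit accurately. It also prevents other components from reading erroneous data from the error bit or corrupting the data written to it.
Taking the processor as the other component, the following describes how the memory controller 110 applies a write isolation or read isolation configuration to the error bit.
The memory controller 110 may send the processor an instruction requesting write isolation or read isolation for the error bit. After receiving the instruction, the processor can use its internal registers to configure write backpressure and read backpressure for the error bit. Then, even if the processor needs to write or read the error bit, it does not do so through the memory controller 110, because its internal registers have write and read backpressure configured for that bit. This achieves the effect of a write isolation or read isolation configuration on the error bit.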
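One way to picture the apply-then-remove pattern for isolation is a scoped guard that always lifts the isolation when the write-back and re-read are finished. The sketch below is hypothetical: the application does not describe the register layout, so the `Processor` class models the backpressure registers as plain sets of blocked locations, and `isolated` is an invented helper.

```python
from contextlib import contextmanager


class Processor:
    """Toy processor with hypothetical backpressure registers (sets of blocked locations)."""

    def __init__(self):
        self.read_backpressure = set()
        self.write_backpressure = set()

    def can_issue(self, op, location):
        blocked = self.read_backpressure if op == "read" else self.write_backpressure
        return location not in blocked


@contextmanager
def isolated(processor, location):
    """Apply read and write isolation for one location and always remove it afterwards."""
    processor.read_backpressure.add(location)
    processor.write_backpressure.add(location)
    try:
        yield
    finally:
        processor.read_backpressure.discard(location)
        processor.write_backpressure.discard(location)


cpu = Processor()
with isolated(cpu, 0x40):
    # While isolated, the processor may not touch 0x40, but other locations are unaffected.
    during = (cpu.can_issue("read", 0x40), cpu.can_issue("write", 0x40), cpu.can_issue("read", 0x48))
after = (cpu.can_issue("read", 0x40), cpu.can_issue("write", 0x40))
```

The `finally` clause reflects the requirement that isolation is removed after the read or write, even if the guarded operation fails.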
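The embodiment shown in FIG. 3 is described with the memory controller 110 performing one round of write-back and re-read on the error bit. When determining the error type of the error bit, the memory controller 110 may also perform write-back and re-read on the error bit multiple times. If the data read from the error bit still contains an error after multiple rounds, it can determine that a hardware fault has occurred at the error bit, which ensures that the error type is identified accurately.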
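A multi-round variant can be sketched as a small loop. The assumptions here go beyond the text and are labeled as such: the number of rounds, the choice to treat the first clean re-read as a soft failure, the three-copy repetition code standing in for ECC, and all function names (`classify_fault`, `make_cell`) are illustrative only.

```python
def classify_fault(read, write, has_error, correct, rounds=3):
    """Repeat correct / write back / re-read; report a hardware fault only if every round fails."""
    for _ in range(rounds):
        write(correct(read()))
        if not has_error(read()):
            return "soft failure"
    return "hardware fault"


def make_cell(copies, stuck_mask=0):
    """Toy location holding three copies of a byte; stuck_mask forces bits of copy 1 to 1 on write."""
    cell = {"copies": list(copies)}

    def read():
        return list(cell["copies"])

    def write(new_copies):
        new_copies = list(new_copies)
        new_copies[1] |= stuck_mask
        cell["copies"] = new_copies

    return read, write


def has_error(copies):
    return len(set(copies)) > 1


def correct(copies):
    a, b, c = copies
    return [(a & b) | (a & c) | (b & c)] * 3  # bitwise majority vote


# Same initial corruption in both cases; only the second location has a stuck bit.
soft = classify_fault(*make_cell([0x5A, 0x5B, 0x5A]), has_error, correct)
hard = classify_fault(*make_cell([0x5A, 0x5B, 0x5A], stuck_mask=0x01), has_error, correct)
```

Passing the read, write, check, and correct operations as callables keeps the loop independent of the particular code used for detection and correction.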
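Based on the same inventive concept as the method embodiment, an embodiment of this application further provides a fault detection apparatus configured to perform the method performed by the memory controller in the method embodiment shown in FIG. 3. For related features, see the method embodiment; details are not repeated here. As shown in FIG. 4, the fault detection apparatus 400 may be deployed in a memory controller and includes an error correction unit 401, a reading unit 402, and a processing unit 403.
The error correction unit 401 is configured to correct the data at a target location in memory and write the corrected data to the target location.
The reading unit 402 is configured to read first data from the target location after the error correction unit 401 has corrected the data at the target location and written it back.
The processing unit 403 is configured to report an error message if it determines that the first data contains an error; the error message indicates that a hardware fault has occurred at the target location.
In a possible implementation, before the error correction unit 401 corrects the data at the target location and writes it back, the reading unit 402 may read second data from the target location at the processor's instruction; if the error correction unit 401 determines that the second data is erroneous, it may correct the second data and write the corrected data to the target location.
In a possible implementation, if the processing unit 403 determines that the first data contains no error, it records the target location.
In a possible implementation, before reading the first data, the reading unit 402 may also apply a read isolation configuration to the target location, which blocks components other than the memory controller from reading the target location; it then reads the first data from the target location in memory and removes the read isolation configuration after the read.
In a possible implementation, before writing the first data, the error correction unit 401 may also apply a write isolation configuration to the target location, which blocks components other than the memory controller from writing to the target location; it then writes the first data to the target location and removes the write isolation configuration after the write.
In a possible implementation, before the processing unit 403 reports the error message, the error correction unit 401 may correct, write back, and re-read the data at the target location multiple times to determine accurately the type of error at the target location. For example, the error correction unit 401 may correct the first data and write the corrected first data to the target location again; the processing unit 403 then reads the corrected first data from the target location and determines that the corrected first data contains an error.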
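Note that the division into units in the embodiments of this application is illustrative and is merely a division by logical function; other divisions are possible in an actual implementation. The functional units in the embodiments of this application may be integrated into one processing unit, may each exist physically on their own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or as a software functional unit.
The foregoing embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may take the form of a computer program product in whole or in part. The computer program product includes one or more computer instructions. When the computer program instructions are loaded or executed on a computer, the processes or functions according to the embodiments of this application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (such as infrared, radio, or microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center that contains one or more sets of available media. The available medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium. The semiconductor medium may be a solid-state drive (SSD).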
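In a simple embodiment, a person skilled in the art will appreciate that the memory controller of the embodiment shown in FIG. 3 may take the form shown in FIG. 5.
The memory controller 500 shown in FIG. 5 includes a processing module 501. Optionally, it may further include a memory 502 and an interface 503.
The memory 502 may be a volatile memory, such as a random access memory; it may also be a non-volatile memory, such as a read-only memory, a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); or the memory 502 may be any other medium that can carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, without being limited thereto. The memory 502 may also be a combination of the foregoing memories.
The processing module 501 may be a general-purpose processor, a DSP, an ASIC, an FPGA or another programmable logic device, discrete gate or transistor logic, discrete hardware components, an artificial intelligence chip, a system-on-chip, or the like. The general-purpose processor may be a microprocessor or any conventional processor.
When the memory controller 500 takes the form shown in FIG. 5, the processing module 501 in FIG. 5 may perform the method performed by the memory controller in any of the foregoing method embodiments, or may invoke computer-executable instructions stored in the memory 502 so that the memory controller can perform that method.
Specifically, the functions and implementation of the error correction unit 401, the reading unit 402, and the processing unit 403 in FIG. 4 can all be realized by the processing module 501 in FIG. 5.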
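A person skilled in the art should understand that the embodiments of this application may be provided as a method, a system, or a computer program product. Accordingly, this application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
This application is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to this application. It should be understood that each process and/or block in the flowcharts and/or block diagrams, and combinations of processes and/or blocks in them, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.
Obviously, a person skilled in the art can make various changes and variations to this application without departing from its scope. If these modifications and variations fall within the scope of the claims of this application and their technical equivalents, this application is intended to cover them as well.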

Claims (13)

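  1. A memory hardware fault detection method, characterized in that the method comprises:
    after correcting data at a target location in memory and writing the corrected data to the target location, reading, by a memory controller, first data from the target location; and
    if the memory controller determines that the first data contains an error, reporting an error message, wherein the error message indicates that a hardware fault has occurred at the target location.
  2. The method according to claim 1, characterized in that the correcting, by the memory controller, of the data at the target location in memory and writing of the corrected data to the target location comprises:
    reading, by the memory controller, second data from the target location at the instruction of a processor; and
    if the memory controller determines that the second data is erroneous, correcting the second data and writing the corrected data to the target location.
  3. The method according to claim 1 or 2, characterized in that the method further comprises:
    if the memory controller determines that the first data contains no error, recording the target location.
  4. The method according to any one of claims 1 to 3, characterized in that the reading, by the memory controller, of the first data from the target location in memory comprises:
    applying, by the memory controller, a read isolation configuration to the target location, wherein the read isolation configuration is used to block components other than the memory controller from reading the target location; and
    reading, by the memory controller, the first data from the target location, and removing the read isolation configuration.
  5. The method according to claim 2, characterized in that the writing, by the memory controller, of the first data to the target location comprises:
    applying, by the memory controller, a write isolation configuration to the target location, wherein the write isolation configuration is used to block components other than the memory controller from writing to the target location; and
    writing, by the memory controller, the first data to the target location, and removing the write isolation configuration.
  6. The method according to any one of claims 1 to 5, characterized in that, before the memory controller reports the error message, the method further comprises:
    correcting, by the memory controller, the first data, and writing the corrected first data to the target location again; and
    reading, by the memory controller, the corrected first data from the target location, and determining that the corrected first data contains an error.
  7. A fault detection apparatus, characterized in that the apparatus comprises an error correction unit, a reading unit, and a processing unit;
    the error correction unit is configured to correct data at a target location in memory and write the corrected data to the target location;
    the reading unit is configured to read first data from the target location after the error correction unit has corrected the data at the target location and written it to the target location; and
    the processing unit is configured to report an error message if it determines that the first data contains an error, wherein the error message indicates that a hardware fault has occurred at the target location.
  8. The apparatus according to claim 7, characterized in that the reading unit is further configured to read second data from the target location at the instruction of a processor; and
    the error correction unit is further configured to, if it determines that the second data is erroneous, correct the second data and write the corrected data to the target location.
  9. The apparatus according to claim 7 or 8, characterized in that the processing unit is further configured to:
    record the target location if it determines that the first data contains no error.
  10. The apparatus according to any one of claims 7 to 9, characterized in that the reading unit is configured to:
    apply a read isolation configuration to the target location, wherein the read isolation configuration is used to block components other than the memory controller from reading the target location; and
    read the first data from the target location in the memory, and remove the read isolation configuration.
  11. The apparatus according to claim 8, characterized in that the error correction unit is configured to:
    apply a write isolation configuration to the target location, wherein the write isolation configuration is used to block components other than the memory controller from writing to the target location; and
    write the first data to the target location, and remove the write isolation configuration.
  12. The apparatus according to any one of claims 7 to 11, characterized in that, before the processing unit reports the error message, the error correction unit is further configured to:
    correct the first data and write the corrected first data to the target location again; and
    the processing unit is further configured to read the corrected first data from the target location and determine that the corrected first data contains an error.
  13. A memory controller, characterized by comprising a processing module, wherein the processing module runs program instructions to perform the method according to any one of claims 1 to 6.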
PCT/CN2022/115593 2021-11-25 2022-08-29 Memory hardware fault detection method and apparatus, and memory controller WO2023093173A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111415009.5A CN116166459A (zh) 2021-11-25 2021-11-25 Memory hardware fault detection method and apparatus, and memory controller
CN202111415009.5 2021-11-25

Publications (1)

Publication Number Publication Date
WO2023093173A1 true WO2023093173A1 (zh) 2023-06-01

Family

ID=86418765

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/115593 WO2023093173A1 (zh) 2021-11-25 2022-08-29 Memory hardware fault detection method and apparatus, and memory controller

Country Status (2)

Country Link
CN (1) CN116166459A (zh)
WO (1) WO2023093173A1 (zh)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7350101B1 (en) * 2002-12-23 2008-03-25 Storage Technology Corporation Simultaneous writing and reconstruction of a redundant array of independent limited performance storage devices
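CN107643955A (zh) * 2016-07-27 2018-01-30 中电海康集团有限公司 Method for improving non-volatile memory performance based on error-correction write-back technology, and non-volatile memory structure
CN110046061A (zh) * 2019-03-01 2019-07-23 华为技术有限公司 Memory error processing method and apparatus
CN111694691A (zh) * 2020-06-10 2020-09-22 西安微电子技术研究所 SRAM circuit with automatic write-back after error detection and correction, and write-back method
CN113821364A (zh) * 2020-06-20 2021-12-21 华为技术有限公司 Memory fault processing method, apparatus, device, and storage medium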

Also Published As

Publication number Publication date
CN116166459A (zh) 2023-05-26

Similar Documents

Publication Publication Date Title
US9104595B2 (en) Selective remedial action based on category of detected error for a memory read
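TWI468942B (zh) Apparatus and method for providing data integrity
US7971112B2 (en) Memory diagnosis method
US9208027B2 (en) Address error detection
TW201312577A (zh) Apparatus and method for providing data integrity
US20120079346A1 (en) Simulated error causing apparatus
TWI509624B (zh) Flash memory device, memory controller, and flash memory control method
WO2020073691A1 (zh) Flash memory self-test method, solid-state drive, and storage device
CN103218271B (zh) Data error correction method and apparatus
US9086990B2 (en) Bitline deletion
CN102135925A (zh) Method and apparatus for testing error checking and correcting memory
CN117280328A (zh) Memory address protection
US8738989B2 (en) Method and apparatus for detecting free page and a method and apparatus for decoding error correction code using the method and apparatus for detecting free page
US10489244B2 (en) Systems and methods for detecting and correcting memory corruptions in software
US20230325276A1 (en) Error correction method and apparatus
TWI467364B (zh) Memory storage device, memory controller, and data writing method
US7577804B2 (en) Detecting data integrity
WO2023093173A1 (zh) Memory hardware fault detection method and apparatus, and memory controller
CN115543678B (zh) Method, system, storage medium, and device for monitoring DDR5 memory chip errors
US10025652B2 (en) Error location pointers for non volatile memory
CN114203252A (zh) Bad block detection method, apparatus, device, and storage medium for non-volatile memory
TW202105922A (zh) Flash memory controller, storage device, and reading method thereof
US8595570B1 (en) Bitline deletion
WO2023104008A1 (zh) Data error correction method and apparatus, memory controller, and system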
US11809272B2 (en) Error correction code offload for a serially-attached memory device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22897265

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2022897265

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2022897265

Country of ref document: EP

Effective date: 20240524