WO2023106434A1 - DRAM-assisted error correction method using a DDR SDRAM interface - Google Patents

DRAM-assisted error correction method using a DDR SDRAM interface

Info

Publication number
WO2023106434A1
Authority
WO
WIPO (PCT)
Prior art keywords
dram
error
memory
memory controller
errors
Prior art date
Application number
PCT/KR2021/018371
Other languages
English (en)
Korean (ko)
Inventor
변경수
Original Assignee
주식회사 딥아이
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 주식회사 딥아이

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08 Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C11/00 Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor
    • G11C11/21 Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements
    • G11C11/34 Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements using semiconductor devices
    • G11C11/40 Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements using semiconductor devices using transistors
    • G11C11/401 Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements using semiconductor devices using transistors forming cells needing refreshing or charge regeneration, i.e. dynamic cells
    • G11C11/4063 Auxiliary circuits, e.g. for addressing, decoding, driving, writing, sensing or timing
    • G11C11/407 Auxiliary circuits, e.g. for addressing, decoding, driving, writing, sensing or timing for memory cells of the field-effect type
    • G11C11/409 Read-write [R-W] circuits
    • G11C11/4096 Input/output [I/O] data management or control circuits, e.g. reading or writing circuits, I/O drivers or bit-line switches
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C29/00 Checking stores for correct operation; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/04 Detection or location of defective memory elements, e.g. cell construction details, timing of test signals
    • G11C29/08 Functional testing, e.g. testing during refresh, power-on self testing [POST] or distributed testing
    • G11C29/12 Built-in arrangements for testing, e.g. built-in self testing [BIST] or interconnection details
    • G11C29/38 Response verification devices
    • G11C29/42 Response verification devices using error correcting codes [ECC] or parity check

Definitions

  • One or more aspects of embodiments in accordance with the present invention relate to methods and mechanisms for memory error correction.
  • Double data rate synchronous dynamic random-access memory is a type of memory integrated circuit (IC) used in computers.
  • DDR SDRAM can achieve a faster transfer rate through timing control of the electrical data and clock signals, and can transmit data on both the rising edge and the falling edge of the clock signal. Doing so approximately doubles the data bus bandwidth compared to an SDR SDRAM (single data rate synchronous dynamic random-access memory) interface using the same clock frequency.
  • DRAM of various generations may use error-correcting code (ECC) memory during data storage to both detect and sometimes correct common types of data errors.
  • ECC memory is immune to single bit errors through the use of parity checking.
  • A parity check involves storing an extra parity bit indicating the (odd or even) parity of data (e.g., 1 byte of data) in a memory (e.g., a parity device or ECC chip in a DRAM module), independently calculating the parity, and comparing the calculated parity with the stored parity to detect whether a data error/memory error has occurred.
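  • The parity check described above can be sketched as follows. This is a minimal illustration assuming even parity over one byte; the function names `even_parity_bit` and `has_parity_error` are hypothetical and do not come from the patent.

```python
def even_parity_bit(data: int) -> int:
    """Return the parity bit that makes the count of 1s (data + parity) even."""
    return bin(data).count("1") % 2


def has_parity_error(data: int, stored_parity: int) -> bool:
    """Recompute parity independently and compare it with the stored parity."""
    return even_parity_bit(data) != stored_parity


byte = 0b1011_0010                    # example data byte containing four 1-bits
stored = even_parity_bit(byte)        # parity bit written alongside the data
corrupted = byte ^ 0b0000_1000        # the same byte after a single-bit flip
```

A single parity bit of this kind detects any odd number of flipped bits; locating or correcting the flipped bit needs additional redundancy.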
  • DRAM dual in-line memory module
  • ECC Error Correction Code
  • The present invention is intended to solve the above-described technical problem, and can provide a DRAM-assisted error correction mechanism for a DDR SDRAM interface.
  • aspects of embodiments of the present disclosure relate to a new DDR interface that uses ECC in DRAM to correct errors.
  • Embodiments of the present invention provide an architecture capable of providing the same basic chipkill RAS features as provided by DDR4, with reduced ECC chip overhead (i.e., one ECC chip per memory channel), a reduced internal prefetch size, and reduced changes to the DDR interface relative to the DDR4 interface.
  • FIG. 1 is a block diagram illustrating an error correction mechanism for a DDR interface according to an embodiment of the present invention.
  • FIG. 2 is a block diagram illustrating an error correction mechanism for a DDR interface according to another embodiment of the present invention.
  • FIG. 3 is a block diagram illustrating an error correction mechanism for a DDR interface according to another embodiment of the present invention.
  • FIG. 4 is a block diagram illustrating an error correction mechanism for a DDR interface according to another embodiment of the present invention.
  • FIG. 5 is a block diagram illustrating an error correction mechanism for a DDR interface according to another embodiment of the present invention.
  • FIG. 6 is a block diagram illustrating an error correction mechanism for a DDR interface according to another embodiment of the present invention.
  • FIG. 7A and 7B are flow charts illustrating error detection, error type determination, and error handling using a DRAM assist error correction code (DAECC) mechanism in accordance with one or more embodiments of the present invention.
  • DAECC DRAM assist error correction code
  • Although the terms “first,” “second,” “third,” and the like may be used herein to describe various elements, components, regions, layers, and/or sections, these elements, components, regions, layers, and/or sections are not limited by these terms. These terms are used to distinguish one element, component, region, layer, or section from another element, component, region, layer, or section. Thus, a first element, component, region, layer, or section described below could be described as a second element, component, region, layer, or section without departing from the spirit and scope of the present invention.
  • the x-axis, y-axis, and z-axis are not limited to the three axes of the Cartesian coordinate system and can be interpreted in a broad sense.
  • the x-axis, y-axis, and z-axis may be orthogonal to each other, or may represent other directions that are not orthogonal to each other.
  • The terms “approximately”, “about”, and similar terms are used as terms of approximation, not terms of degree, and are intended to account for the inherent variations in measured or calculated values that would be recognized by those skilled in the art. Also, the use of “may” when describing embodiments of the present invention refers to “one or more embodiments of the present invention”. As used herein, the terms “use”, “using”, and “used” may be considered synonymous with the terms “utilize”, “utilizing”, and “utilized”, respectively. Also, the term “exemplary” means an example or illustration.
  • Electronic or electrical devices may be implemented using any suitable hardware, firmware (e.g., an Application Specific Integrated Circuit (ASIC)), software, or a combination of software, firmware, and hardware.
  • ASIC Application Specific Integrated Circuit
  • the various elements of these devices may be formed as one Integrated Circuit (IC) chip or as separate IC chips.
  • Various elements of these devices may be implemented on a flexible printed circuit film, a Tape Carrier Package (TCP), or a Printed Circuit Board (PCB), or may be formed on a single substrate.
  • IC Integrated Circuit
  • PCB Printed Circuit Board
  • the various elements of these devices may be processes or threads running on one or more computing devices or on one or more processors that execute computer program instructions and interact with other system elements to perform the various functions described herein.
  • Computer program instructions are stored in memory implemented in a computing device using standard memory devices, such as, for example, random access memory (RAM).
  • Computer program instructions may also be stored on other non-transitory computer readable media such as, for example, a CD-ROM, flash drive, or the like.
  • RAM random access memory
  • Those skilled in the art will recognize that the functions of various computing devices may be combined or integrated into one computing device, and that the functions of a specific computing device may be distributed across one or more other computing devices, without departing from the spirit and scope of the exemplary embodiments of the present invention.
  • Embodiments of the present invention described below in relation to FIGS. 1 to 3 may provide basic chipkill. Although it lacks the same coverage as DDR4's mechanism, the new DDR interface can provide single die/chip correction while including only one ECC chip per memory channel. Additionally, mechanisms described below may use ECC information provided from an ECC chip to identify a case in which a data chip has failed.
  • FIG. 1 is a block diagram illustrating an error correction mechanism for a DDR interface according to an embodiment of the present invention.
  • The present invention can use 16 bursts per memory transaction, and can use in-DRAM ECC in a dynamic random-access memory module (DRAM) to protect the data of one data chip across multiple bursts.
  • DRAM dynamic random-access memory module
  • Embodiments of the invention may also use system/memory controller ECC in the DIMM to protect data from multiple chips in one or more bursts, by allowing the memory controller to read the in-DRAM ECC information when a system ECC error is determined to have occurred.
  • The system 100 includes a memory controller 110 that includes a DRAM assist error correction code (DAECC) engine 120.
  • The memory controller 110 may transmit data to, and receive data from, the various chips 140a and 140b of a DDR DIMM/DRAM (double data rate synchronous dynamic random-access memory dual in-line memory module, 160). In FIG. 1, one memory channel 150 of a DRAM 160 (or DRAM module) is shown.
  • the memory controller 110 performs memory processes 130 for exchanging “normal” data 170 with the DRAM 160 through normal read and write processes.
  • each chip 140a, 140b of the DRAM 160 provides 4-bit information through 4 pins of the chip.
  • Each memory channel 150 of DRAM 160 includes eight 4-bit data chips (e.g., eight data devices, 140a) for receiving, storing, and transmitting data.
  • The DRAM 160 also uses one 4-bit ECC chip (e.g., one parity device or parity chip, 140b) per memory channel 150 to transmit ECC data to the memory controller 110.
  • The DRAM 160 of an embodiment of the present invention can provide in-system error correction and basic chipkill capabilities, which will be further described below.
  • DRAM 160 performs memory processing(s) 130 with memory controller 110 using bursts, whereby DRAM 160 repeatedly transmits data while omitting other steps that would otherwise normally be required to transfer each portion of data in a separate process. Thus, DRAM 160 can transfer data faster than it could without bursting, although only for limited periods and under certain conditions.
  • DRAM 160 has a burst length of 16, which is double the burst length of 8 used by DIMMs such as DDR4. That is, to compensate for the fact that each of the two memory channels 150 has only half the data width of a memory channel such as that used in DDR4, DRAM 160 may internally prefetch twice the number of bursts of data stored in each of the 4-bit chips 140a and 140b.
  • The data width is 32 bits per memory channel 150, and the ECC width is 4 bits per memory channel 150. This corresponds to a total of 576 bits per memory process 130 per channel (i.e., 36 bits per burst multiplied by 16 bursts per memory process 130): the data block is 512 bits, and the remaining 64 bits are the ECC data of the ECC chip 140b of the memory channel 150.
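  • The bit counts above follow directly from the channel geometry. The short calculation below reproduces them; the constant names are illustrative only, and the values are the ones stated for the embodiment of FIG. 1.

```python
DATA_CHIPS = 8        # 4-bit data chips per memory channel (140a)
ECC_CHIPS = 1         # 4-bit ECC chip per memory channel (140b)
BITS_PER_CHIP = 4     # each chip drives 4 pins
BURST_LENGTH = 16     # bursts per memory process

data_width = DATA_CHIPS * BITS_PER_CHIP        # data bits per burst
ecc_width = ECC_CHIPS * BITS_PER_CHIP          # ECC bits per burst
bits_per_burst = data_width + ecc_width        # all bits per burst
total_bits = bits_per_burst * BURST_LENGTH     # all bits per memory process
data_bits = data_width * BURST_LENGTH          # data block size
ecc_bits = ecc_width * BURST_LENGTH            # ECC bits per memory process

print(data_width, ecc_width, bits_per_burst, total_bits, data_bits, ecc_bits)
```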
  • the system 100 of an embodiment of the present invention can achieve error correction within the DRAM 160 without the assistance of the memory controller 110 .
  • The in-DRAM ECC of DRAM 160 allows data in one data chip 140a to be protected across multiple bursts, whereas memory controller-assisted error correction allows data in multiple data chips 140a to be protected within one or more bursts.
  • internal error correction is performed without communication with the memory controller 110.
  • System 100 may perform error correction using the memory controller 110 and the DRAM 160 operating in combination.
  • In-DRAM ECC is internal to the chip: a one-bit error can be corrected using ECC bits generated internally in the DRAM chip.
  • Embodiments of the present invention require no changes to memory applications beyond those in DDR4, and no different interface between DRAM 160 and memory controller 110.
  • error recovery information is not included in the normal data transfer 170 and is not communicated to the memory controller 110. Therefore, correction of errors by ECC chip 140b will not affect the speed or performance of memory processing 130 with memory controller 110 .
  • Only when memory controller 110 detects an error in the data (e.g., when DRAM 160 indicates the existence of an error to memory controller 110) does memory controller 110 attempt to correct the error by issuing a command to DRAM 160. That is, only when a system/memory controller ECC error occurs will the memory controller 110 send a specific command to the DIMM and receive the information included in the in-DRAM ECC information output 180 from the DRAM 160. After that, the memory controller 110 will determine which of the eight data chips 140a of the corresponding memory channel 150 is the cause of the detected error. A method used by the memory controller 110 to locate a failed data chip will be described below with reference to FIGS. 7A and 7B.
  • FIG. 2 is a block diagram illustrating an error correction mechanism for a DDR interface according to another embodiment of the present invention. Similar to the error correction mechanism of the foregoing embodiment, the embodiment of the present invention performs 16 bursts per memory process.
  • The memory controller 210 includes a DAECC engine 220, and each memory channel 250 of the DRAM 260 includes eight data chips 240a and one ECC chip 240b for general data exchange 270.
  • the DRAM 260 notifies the memory controller 210 of an ECC error by sending a 1-bit ECC flag to the memory controller 210 through an additional pin.
  • system 200 of an embodiment of the present invention is similar to system 100 of an embodiment described with reference to FIG. 1 .
  • the DRAM/DDR DIMM 260 according to an embodiment of the present invention implements an additional pin 290 so that the DRAM 260 can easily transfer ECC information to the memory controller 210 .
  • The DRAM 260 can send an ALERT consisting of a 1-bit ECC flag. Uncorrectable errors in the DRAM 260 are always exposed to the memory controller 210 immediately. Accordingly, when an error occurs, information is provided to the memory controller 210 from the ECC chip 240b among the chips 240a and 240b in the DRAM.
  • The pin 290 of the ECC chip 240b will be used to transmit 1 bit that sets the ECC flag, informing the memory controller 210 that an error has occurred.
  • memory controller 210 Upon sensing the 1-bit ECC flag via pin 290, memory controller 210 will issue a specific command to obtain more detailed information from DRAM 260 related to the error.
  • the memory controller 210 may obtain information from the DRAM 260 using the ECC information output 280 in the DRAM in a manner similar to the system 100 of the embodiment described with reference to FIG. 1 .
  • the DDR interface between the DRAM 260 and the memory controller 210 may be changed from DDR4.
  • Because the 1-bit ECC flag is sent over pin 290 along with the other 576 data bits, performance is not adversely affected, and the time per memory process 230 is therefore not increased.
  • FIG. 3 is a block diagram illustrating an error correction mechanism for a DDR interface according to another embodiment of the present invention.
  • The embodiment of the present invention has a memory controller 310 (with a DAECC engine 320 therein), and each memory channel 350 of the DRAM 360 includes eight data chips 340a and one ECC chip 340b for general data 370 exchange.
  • each memory process 330 of system 300 of an embodiment of the present invention includes an additional burst contributing to a total burst length of 17 bursts.
  • the ECC information determined during in-DRAM error correction is transferred from DRAM/DDR DIMM 360 to memory controller 310 at in-DRAM ECC information output 380.
  • memory controller 310 may identify ECC information in DRAM at each memory process 330 at the cost of reduced performance associated with an additional burst for each memory process 330 .
  • Unlike the system 100 of the embodiment of FIG. 1, in which the memory controller 110 performs the in-DRAM ECC information output 180 only when an error occurs, the memory controller 310 of the embodiment of the present invention performs the in-DRAM ECC information output 380 in every memory process 330 through an additional 17th burst. Additionally, memory controller 310 will always perform the same operations to correct errors as performed by ECC in DRAM at the system level (e.g., in DRAM 360).
  • Although the DAECC mechanism can be used for the new narrow DDR interface using 4-bit DRAM chips, the DAECC mechanism can also be used with other DRAM architectures, or in a DDR4-like interface, provided that in-DRAM ECC is performed inside the DRAM chip.
  • FIG. 4 shows that the DAECC engine 420 can be used in conjunction with the new narrow DDR interface with 8-bit DRAM chips 440a and 440b.
  • each channel 450 includes four 8-bit DRAM chips 440a for storing data and one 8-bit DRAM chip 440b for storing system ECC. Therefore, the system ECC overhead of the embodiment of the present invention is 1/4 or 25%.
  • FIG. 5 shows that a DAECC engine can be used for a DDR interface with 4-bit DRAM chips 540a and 540b.
  • Each channel 550 includes sixteen 4-bit DRAM chips 540a for storing data and one 4-bit DRAM chip 540b for storing system ECC. Therefore, in this embodiment, the system ECC overhead is 1/16 or 6.25%, half of the current ECC overhead of the DDR4 standard.
  • FIG. 6 shows that a DAECC engine can be used in a DDR4 interface with 8-bit DRAM chips 640a and 640b.
  • Each channel 650 includes eight 8-bit DRAM chips 640a to store data and one 8-bit DRAM chip 640b to store system ECC. Therefore, the system ECC overhead is 1/8 or 12.5%.
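  • The system ECC overhead of each configuration is the ratio of ECC chips to data chips per channel. The sketch below recomputes it for the chip counts given for FIGS. 4 to 6 and for the 8-chip channel of FIGS. 1 to 3; the helper name `ecc_overhead` and the dictionary labels are illustrative, not taken from the patent.

```python
from fractions import Fraction


def ecc_overhead(data_chips: int, ecc_chips: int = 1) -> Fraction:
    """System ECC overhead as the ratio of ECC chips to data chips."""
    return Fraction(ecc_chips, data_chips)


# Data chips per channel in each described configuration.
configs = {
    "FIGS. 1-3 (eight 4-bit data chips)": 8,
    "FIG. 4 (four 8-bit data chips)": 4,
    "FIG. 5 (sixteen 4-bit data chips)": 16,
    "FIG. 6 (eight 8-bit data chips)": 8,
}

for name, chips in configs.items():
    overhead = ecc_overhead(chips)
    print(f"{name}: {overhead} = {float(overhead):.2%}")
```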
  • The DAECC of the above embodiments is a general mechanism that uses in-DRAM ECC information together with one additional ECC DRAM chip to provide basic chipkill coverage for a memory system with any kind of structure.
  • FIG. 7A and 7B are flow charts illustrating error detection, error type determination, and error handling using a DRAM assist error correction code (DAECC) mechanism in accordance with one or more embodiments of the present invention.
  • the memory controller may attempt to determine an error pattern. Depending on the error pattern, the memory controller may determine whether the error corresponds to a random error (non-permanent error), a permanent error, or a chipkill error. Depending on the determination of the type of error, the system may perform error correction.
  • a chipkill error generally corresponds to the permanent failure of one chip/die or chip that exceeds a threshold of bit errors.
  • A failure of one 4-bit data chip in a memory channel can cause the corresponding data chip to provide erroneous data in a large number of bursts during memory processing (e.g., a large number of errors within the 4-bit symbols of the corresponding data chip).
  • Embodiments of the present invention can detect when one chip has failed, and can then deactivate that chip while continuing to provide one-chip correction. That is, embodiments of the present invention may group the 4 repeatedly erroneous bits corresponding to one chip into one symbol, and may use a symbol-based mechanism to recover the data corresponding to the failed/dead/erased chip.
  • each chip includes four data pins, and each pin outputs one of the four bits stored in the chip.
  • Each pin can be referred to as DQ. If one of the pins of a chip fails (as opposed to failing the entire chip), the data provided from that pin in each of the bursts is potentially erroneous. Thus, a pin failure can be referred to as a DQ error.
  • the described embodiments may use simple parity algorithms interleaved between each chip.
  • The described embodiments may use a configuration similar to a redundant array of independent disks (RAID) configuration (e.g., a configuration similar to RAID 4). RAID 4 is a RAID configuration that uses block-level striping across multiple disks together with a dedicated parity disk. Disk striping involves dividing a group of data into blocks, and spreading the blocks across two or more storage devices (e.g., data chips).
  • The data stored on the ECC chip will correspond to 4 parity bits: each bit from one of the 8 data chips contributes to a different parity bit stored by the ECC chip, and one bit from the corresponding pin of each of the data chips contributes to one of the parity bits (e.g., one ECC group of the ECC chip).
  • If the host memory controller detects one reoccurring 1-bit error corresponding to one pin using the data and ECC information, the host memory controller will identify the error at the same bit location out of the 4 possible bit locations (that is, the same parity bit of the ECC chip will contain errors that occur in some of the bursts).
  • Information from the detected error pattern can be used to determine where the error is (e.g., which of the 8 data chips contains the pin causing the error).
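  • The interleaved parity described above can be sketched for a single burst as follows. This is an illustration under stated assumptions, not the patented implementation: each chip's 4 bits are modeled as an integer nibble, bit k of the ECC nibble is taken to be the XOR of pin k across all eight data chips, and the function names are hypothetical.

```python
from functools import reduce
from operator import xor


def parity_nibble(data_nibbles: list[int]) -> int:
    """XOR the 4-bit values of all data chips; bit k is the parity of pin k."""
    return reduce(xor, data_nibbles, 0) & 0xF


def failing_pins(data_nibbles: list[int], stored_parity: int) -> list[int]:
    """Return the pin positions (0-3) whose recomputed parity mismatches."""
    diff = parity_nibble(data_nibbles) ^ stored_parity
    return [k for k in range(4) if (diff >> k) & 1]


def recover_erased_chip(data_nibbles: list[int], stored_parity: int,
                        erased_index: int) -> int:
    """Rebuild the nibble of a chip marked as erased from the other chips."""
    others = [n for i, n in enumerate(data_nibbles) if i != erased_index]
    return parity_nibble(others) ^ stored_parity


burst = [0x3, 0xA, 0x5, 0xF, 0x0, 0x9, 0xC, 0x6]   # eight data chips, one burst
ecc = parity_nibble(burst)                          # value held by the ECC chip
bad = burst.copy()
bad[2] ^= 0b0100                                    # pin 2 of chip 2 flips
```

In this model the parity mismatch alone identifies the failing pin position but not the chip; identifying the chip relies on additional information such as the in-DRAM ECC output or the diagnosis routine.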
  • In step S701, an error is detected.
  • An error may be detected by any of the error detection methods described above with reference to FIGS. 1-3 (e.g., an ECC flag sensed by memory controller 210 via additional pin 290, or an error detected by the memory controller 310 in the in-DRAM ECC information output 380 during the 17th burst of memory processing 330) and/or by system ECC performed by the host memory controller (e.g., a parity check similar to RAID).
  • If it is determined in step S701 that there are errors from the DRAM/DDR DIMM, it is determined in step S702 how many bursts have errors. This can be done by running a parity check for each burst of memory processing. Since different types of faults produce different error patterns, a threshold value, or reference number, “n” may be used to classify the type of detected error. For example, a DQ fault will produce only a 1-bit parity error, at the same pin location of the ECC chip, in all or many of the bursts of memory processing.
  • A chip failure will most likely involve a multi-bit parity error (e.g., a 4-bit parity error) in all or many of the bursts.
  • a random error may correspond to a small number of errors, and there may be very few wrong bits in a small number of bursts (e.g., one 1-bit error in one burst).
  • If the number of detected errors exceeds the threshold, it can be predicted that a DQ failure or chip failure has occurred. However, if the number of detected errors is less than or equal to the threshold value, it can be predicted that one or more random errors have occurred in one or more individual chips; these cannot be corrected by in-DRAM ECC and require support from the memory controller for correction.
  • In step S702, it can be determined whether the number of bursts with errors in a given memory process exceeds the threshold “n”. Depending on whether the threshold is exceeded, the process proceeds to either step S703 or step S710. That is, if a sufficiently large number of bursts are determined to have parity errors (e.g., if the number of bursts with errors in a given memory process exceeds the threshold “n”), there is a higher chance of a DQ failure or chip failure, and the system will proceed with the chipkill mechanism.
  • However, if the number of bursts with detected errors does not exceed the threshold “n” (e.g., 4 bursts or less), it is highly likely that the error detected in step S701 is the result of random failures in the individual chip(s). These random failures are not correctable by in-DRAM ECC, but since the error will most likely not reoccur, the chipkill mechanism is not used.
  • Although a threshold of four bursts is used as an example, it should be noted that other numbers may be used in other embodiments of the present invention. Moreover, the number can be adjusted in different embodiments (e.g., to correspond to a particular device). For example, if the corresponding memory device has a relatively high device error rate, and thus multiple random errors occur, the threshold value may be increased.
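  • The threshold decision of step S702 can be sketched as follows. This is a minimal illustration that assumes the per-burst parity check yields a 4-bit mismatch mask for each burst (0 meaning no error); the function names and that representation are assumptions, and the default threshold of 4 is the example value from the text.

```python
def bursts_with_errors(syndromes: list[int]) -> int:
    """Count the bursts whose parity check reported at least one mismatch."""
    return sum(1 for s in syndromes if s != 0)


def next_step(syndromes: list[int], n: int = 4) -> str:
    """Choose the branch taken after step S702 for one memory process."""
    if bursts_with_errors(syndromes) > n:
        return "S703"   # likely DQ or chip failure: continue with chipkill handling
    return "S710"       # likely random error: retry the read


one_random_error = [0] * 15 + [0b0001]     # a single 1-bit error in one burst
repeated_pin_error = [0b0100] * 16         # the same pin fails in every burst
```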
  • In step S703, if it is determined in step S702 that the number of bursts with parity errors is greater than the threshold (e.g., greater than 4 bursts), the system determines whether one of the 8 data chips is already dead or erased, or whether there are errors in more than one of the eight data chips.
  • If either is the case, the system may determine at step S709 that a fatal and uncorrectable error has occurred.
  • The number of chips with errors can be detected by the memory controller 110, 210, 310 from the in-DRAM ECC information output 180, 280, 380, either by using a specific command or during the 17th burst of memory processing 330. For example, if a chip has already died from a DQ failure, chip failure, bank failure, or row failure, any additional DQ failure or chip failure will cause a catastrophic failure: because the system of an embodiment of the present invention contains only one ECC chip per channel, it will no longer have enough resources to perform error correction. Similarly, catastrophic failure occurs if multiple chips contain errors.
  • Otherwise, the process proceeds to step S704. For example, if no chip has been erased yet, and the detected error corresponds to only one chip, the system can identify the type of failure, and the memory controller can record the type of failure and potentially take additional steps to correct the error.
  • In step S704, it may be determined whether the detected errors occur on the same pin. That is, the system determines whether the detected error occurs on the same pin in each of the bursts. If all the errors correspond to the same pin, at step S705a, the system in the DRAM may notify the memory controller so that the memory controller may record the failure type of the error as a DQ failure. If the system determines that the detected errors are not all on the same pin, at step S705b, the system in the DRAM can notify the memory controller so that the memory controller can record the failure type as a chip failure.
  • In step S706, after the failure type is recorded as either a DQ failure or a chip failure, the memory controller is used to support chipkill detection.
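  • The same-pin test of step S704 can be sketched as follows, again assuming a hypothetical 4-bit parity mismatch mask per burst (0 meaning no error). The function name and the returned strings are illustrative only.

```python
def classify_failure(syndromes: list[int]) -> str:
    """Classify a repeated error as a DQ failure or a chip failure."""
    nonzero = [s for s in syndromes if s != 0]
    if not nonzero:
        raise ValueError("no erroneous bursts to classify")
    union = 0
    for s in nonzero:
        union |= s                       # collect every pin position seen in error
    same_pin = bin(union).count("1") == 1
    return "DQ failure" if same_pin else "chip failure"


dq_pattern = [0b0100] * 16               # pin 2 is wrong in every burst
chip_pattern = [0b0100, 0b1111] + [0b0110] * 14   # several pins are wrong
```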
  • the memory controller may start a diagnosis routine to determine how to correct the error. The diagnosis routine will be described later with reference to FIG. 7B.
  • In step S706a, the memory controller has already read, in step S703, the in-DRAM ECC information provided by the in-DRAM ECC (e.g., the ECC bits provided by the in-DRAM ECC information output 180, 280, or 380). Then, in step S706b, the memory controller will put all ongoing memory processes on hold. Then, in step S706c, the memory controller stores the current data (e.g., 512 bits corresponding to the current data D[511:0] of the eight 4-bit data chips). Then, in step S706d, the memory controller writes inverted data corresponding to the current data (e.g., 512 bits corresponding to inverted data D′[511:0]) to the eight 4-bit data chips. Then, in step S706e, the memory controller reads the data out again. Then, in step S706f, the memory controller compares the newly read data with the known inverted data to identify the location(s) of the error(s).
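  • The invert, write, read back, and compare sequence of steps S706c to S706f can be sketched with a toy memory model. Everything here is an assumption made for illustration: the `FaultyMemory` class models faults as bits stuck at a fixed value, and the mapping of a bit index to (burst, chip, pin) assumes 32 data bits per burst with 4 consecutive bits per chip, which the patent does not specify.

```python
WIDTH = 512
MASK = (1 << WIDTH) - 1


class FaultyMemory:
    """Toy 512-bit store in which some bit positions are stuck at a value."""

    def __init__(self, stuck: dict[int, int]):
        self.stuck = stuck               # bit position -> stuck value (0 or 1)
        self.value = 0

    def write(self, data: int) -> None:
        self.value = data & MASK

    def read(self) -> int:
        v = self.value
        for pos, bit in self.stuck.items():
            v = (v | (1 << pos)) if bit else (v & ~(1 << pos))
        return v


def diagnose(mem: FaultyMemory) -> list[int]:
    """Return the bit positions that fail to hold inverted data."""
    current = mem.read()                 # S706c: store the current data
    inverted = ~current & MASK           # S706d: write the inverted data
    mem.write(inverted)
    readback = mem.read()                # S706e: read the data again
    diff = readback ^ inverted           # S706f: compare with known inverted data
    return [i for i in range(WIDTH) if (diff >> i) & 1]


def locate(bit: int) -> tuple[int, int, int]:
    """Map a bit index to (burst, chip, pin) under the assumed layout."""
    burst, offset = divmod(bit, 32)
    chip, pin = divmod(offset, 4)
    return burst, chip, pin


mem = FaultyMemory({70: 1, 198: 1})      # two stuck bits on the same chip and pin
mem.write(0x1234)
faulty_bits = diagnose(mem)
```

Because the inverted data is derived from what was actually read, a stuck bit always disagrees with the value written back, so it shows up in the comparison regardless of the original data.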
  • step S706 the memory controller determines whether all detected errors correspond to one chip in step S707. If only one data chip is erased, or multiple DQ errors are detected, but all multiple DQ errors correspond to the same chip, in step S708, the memory controller will write data back to the remaining, unerased chips. , and all subsequent memory processes will use the parity bits of a simple parity algorithm to recover the data corresponding to the erased chip. For example, upon reading the data, if the memory controller determines that multiple bits corresponding to one of the data chips do not match the corresponding bits of the inverted data written to the data chip in step S706d, the memory controller in step S708 will mark that data chip as erased.
  • the memory controller determines that only one of the bits corresponding to one pin of one of the data chips does not match the corresponding bit of the inverted data written to the data chip in step S706d. Then, in step S708, the memory controller will mark only one pin of one chip as erased.
  • However, if the memory controller determines in step S707 that the errors correspond to more than one chip, then in step S709 the memory controller determines that a fatal, uncorrectable error has occurred. That is, if there are errors in more than one chip, the DRAM no longer has sufficient resources to perform parity and has no additional chip in reserve, so any additional chip error is a fatal error that cannot be corrected and requires replacement of the DRAM.
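  • The decision of steps S707 to S709 and the parity-based recovery can be sketched as follows. This is a minimal illustration under stated assumptions: the "simple parity algorithm" is taken here to be a bytewise XOR across chips, and the function names and return labels are hypothetical.

```python
from functools import reduce
from operator import xor


def classify(faults):
    """S707: faults is a set of (chip, pin) pairs from the diagnosis."""
    chips = {chip for chip, _ in faults}
    if not chips:
        return ("no_error", None)
    if len(chips) > 1:
        return ("fatal", None)                    # S709: more than one chip
    chip = next(iter(chips))
    pins = sorted(pin for _, pin in faults)
    if len(pins) == 1:
        return ("pin_erased", (chip, pins[0]))    # S708: single pin
    return ("chip_erased", chip)                  # S708: whole chip


def parity_of(nibbles):
    """XOR parity over the 4-bit values of all data chips (assumed scheme)."""
    return reduce(xor, nibbles, 0)


def recover(nibbles, parity, erased_chip):
    """Rebuild the erased chip's value from the parity and the other chips."""
    others = (n for i, n in enumerate(nibbles) if i != erased_chip)
    return reduce(xor, others, parity)


data = [0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8]   # one beat, eight chips
parity = parity_of(data)
kind, where = classify({(3, 0), (3, 2)})
print(kind, where, recover(data, parity, erased_chip=where))
```

  • With XOR parity, only one unknown value per beat can be rebuilt, which is consistent with the text treating errors in more than one chip as fatal.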
  • In step S710, the memory controller retries reading the data from the data chips by receiving an additional burst from the DRAM.
  • After retrying to read the data in step S710, the memory controller determines in step S711 whether an error is still detected. If no further error is detected in step S711, the memory controller determines in step S712 that a soft (transient) error has occurred and that no further correction is required. However, if an error is still detected in step S711 after the retry in step S710, it is determined in step S713 whether that error has the same error pattern as the error initially detected in step S701. Comparing the error patterns allows the case of non-identical soft errors (i.e., a soft error initially detected in step S701 followed by a different soft error detected while the memory controller retries reading data from the data chips in step S710) to be disregarded.
  • If the error detected in step S713 does not have the same error pattern as the error initially detected in step S701, the process returns to step S710, and the memory controller again retries reading data from the data chips. Thus, if consecutive, non-identical soft errors occur and produce a different error pattern, the memory controller can continue trying to read the data. However, if the error detected in step S713 has the same error pattern as the error initially detected in step S701, it is determined in step S714 that a hard error (i.e., a non-transient error) has occurred.
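  • The retry loop of steps S710 to S714 can be summarised in a short sketch. The callback `read_error_pattern`, the integer encoding of an error pattern, and the `max_retries` bound are assumptions added for this example; the text itself describes returning to step S710 without stating a limit.

```python
def classify_by_retry(read_error_pattern, initial_pattern, max_retries=8):
    """Classify an error as soft or hard by re-reading the data.

    read_error_pattern() re-reads the data and returns the observed
    error pattern, or None if the read was error-free.
    """
    for _ in range(max_retries):
        pattern = read_error_pattern()    # S710: retry the read
        if pattern is None:               # S711: no error this time
            return "soft"                 # S712: transient error
        if pattern == initial_pattern:    # S713: same pattern as S701?
            return "hard"                 # S714: non-transient error
        # Different pattern: treat as another soft error and retry.
    return "undetermined"                 # assumed bound reached


retries = iter([0b0100, 0b0001])
print(classify_by_retry(lambda: next(retries), initial_pattern=0b0001))
```

  • In the usage line, the first retry shows a different pattern and is retried, and the second retry repeats the initial pattern, so the error is classified as hard.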
  • After it is determined in step S714 that a hard error has occurred, in step S715 the DRAM triggers the memory controller to assist in correcting the hard error using the memory controller ECC mechanism.
  • The memory controller may, for example, assist the error correction by comparing the current address of the error (e.g., chip and pin address) with an error address stored in an error register of one or more chips of the DRAM.
  • The memory controller may also assist the error correction using the in-DRAM ECC information. For example, if the in-DRAM ECC information indicates that only one chip has an in-DRAM ECC error that cannot be corrected, the memory controller can use the ECC chip to repair the error.
  • In step S716, it is determined whether the memory controller has successfully corrected the hard error. If the error correction is successful, in step S717 the operating system may record the error event, and the memory controller may issue another specific command to clear the in-DRAM ECC information (e.g., the chip error register) in each chip.
  • However, if the error correction is not successful in step S716, the memory controller determines in step S709 that a fatal error has occurred.
  • The operating system may then perform system/application-level error recovery. For example, the operating system can retire a physical page by relocating the contents of the page to another physical page, and the retired page can be placed in a list of physical pages that should not subsequently be allocated by the virtual memory system. As the number of retired physical pages increases (e.g., as the number of uncorrectable errors increases), the effective memory capacity of the system decreases.
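  • Page retirement as described above can be illustrated with a toy allocator. This is a sketch only: the `PageAllocator` class and its methods are hypothetical and do not correspond to any particular operating system interface.

```python
class PageAllocator:
    """Toy physical-page allocator with a retired-page list."""

    def __init__(self, num_pages):
        self.free = list(range(num_pages))   # pages available for use
        self.retired = set()                 # pages never to be reused
        self.contents = {}                   # page number -> stored data

    def allocate(self, data):
        page = self.free.pop(0)
        self.contents[page] = data
        return page

    def retire(self, page):
        """Relocate the page's contents, then retire the page."""
        new_page = self.free.pop(0)
        self.contents[new_page] = self.contents.pop(page)
        self.retired.add(page)               # excluded from future allocation
        return new_page

    def usable_pages(self):
        """Effective capacity: free pages plus pages currently in use."""
        return len(self.free) + len(self.contents)


alloc = PageAllocator(num_pages=4)
bad = alloc.allocate("payload")
moved = alloc.retire(bad)
print(moved, alloc.retired, alloc.usable_pages())
```

  • Each retirement removes one page from the pool for good, which is the capacity reduction the text refers to.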
  • Accordingly, embodiments of the present invention provide an architecture capable of providing the same basic chipkill RAS features as DDR4, with reduced (e.g., minimal) ECC chip overhead (i.e., one ECC chip per memory channel), with reduced (e.g., minimal) internal prefetch size (the prefetch size being the same as the in-DRAM ECC size), and with reduced (e.g., minimal) changes. Additionally, embodiments of the present invention provide 12.5% storage overhead (one ECC chip for eight data chips) for a new DDR interface with a narrower channel width than DDR4, support basic chipkill capability and system ECC, and support a memory-controller-assisted error detection mechanism.
  • Accordingly, embodiments of the present invention can provide basic chipkill capability and system ECC despite having only one ECC chip per memory channel.
  • Embodiments of the present invention also provide a mechanism for the memory controller to identify a failed chip with assistance from the DRAM device, a mechanism for outputting the in-DRAM ECC information (e.g., by using an additional burst length, additional pins, or register outputs from the DRAM), a retry mechanism to identify the type of error (e.g., soft error or hard error), and ECC capability using either independent or lock-step memory channels and using either SEC-DED or chipkill ECC.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Human Computer Interaction (AREA)
  • Databases & Information Systems (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Computer Hardware Design (AREA)
  • Techniques For Improving Reliability Of Storages (AREA)

Abstract

The invention relates to a method for correcting memory errors of a dynamic random-access memory (DRAM) module by means of a double data rate (DDR) interface. The method comprises the steps of: performing a memory transaction comprising multiple bursts with a memory controller in order to transfer data from data chips of the DRAM to the memory controller; detecting one or more errors by means of an ECC chip of the DRAM; determining the number of bursts containing the errors by means of the ECC chip of the DRAM; determining whether the number of bursts containing the errors is greater than a threshold number; determining a type of the errors; and providing an instruction to the memory controller on the basis of the determined type of the errors, the DRAM comprising one ECC chip per memory channel.
PCT/KR2021/018371 2021-12-06 2021-12-06 DRAM-assisted error correction method using a DDR SDRAM interface WO2023106434A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR10-2021-0173109 2021-12-06
KR20210173109 2021-12-06

Publications (1)

Publication Number Publication Date
WO2023106434A1 (fr) 2023-06-15

Family

ID=86730649

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2021/018371 WO2023106434A1 (fr) 2021-12-06 2021-12-06 DRAM-assisted error correction method using a DDR SDRAM interface

Country Status (1)

Country Link
WO (1) WO2023106434A1 (fr)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20090118031A (ko) * 2007-06-28 2009-11-17 International Business Machines Corporation System and method for error correction and detection in a memory system
KR20160030978A (ko) * 2013-07-31 2016-03-21 Huawei Technologies Co., Ltd. Access method and device for a message-type memory module
KR20180019473A (ko) * 2016-08-15 2018-02-26 Samsung Electronics Co., Ltd. DRAM-assisted error correction mechanism for DDR SDRAM interface
KR20190049710A (ko) * 2016-09-30 2019-05-09 Intel Corporation Extended application of error checking and correction code in memory
KR20210055793A (ko) * 2018-10-16 2021-05-17 Micron Technology, Inc. Error correction method and device

Similar Documents

Publication Publication Date Title
KR102191223B1 (ko) DRAM-assisted error correction mechanism for DDR SDRAM interface
CN108268340B (zh) Method of correcting errors in a memory
EP2311043B1 (fr) Method and apparatus for repairing high-capacity/high-bandwidth memory devices
US7096407B2 (en) Technique for implementing chipkill in a memory system
US8811065B2 (en) Performing error detection on DRAMs
US6598199B2 (en) Memory array organization
JP2021509499A (ja) Method of operating a memory controller, method of switching from dual-channel mode to single-channel mode, and memory controller
EP1194849B1 (fr) System and method for improving multi-bit error protection in computer memories
WO2023106434A1 (fr) DRAM-assisted error correction method using a DDR SDRAM interface
CN116486891A (zh) Shadow DRAM with CRC+RAID architecture, system, and method for high RAS features in a CXL drive
JPH02159649A (ja) Memory circuit

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 21967307

Country of ref document: EP

Kind code of ref document: A1