WO2016122515A1 - Erasure multi-checksum error correction code - Google Patents

Erasure multi-checksum error correction code

Publication number
WO2016122515A1
Authority
WO
WIPO (PCT)
Prior art keywords
parity
checksum
erasure
row
data
Prior art date
Application number
PCT/US2015/013460
Other languages
French (fr)
Inventor
Han Wang
Patrick A. Raymond
Raghavan V. Venugopal
Original Assignee
Hewlett Packard Enterprise Development Lp
Priority date
Filing date
Publication date
Application filed by Hewlett Packard Enterprise Development Lp
Priority to PCT/US2015/013460
Publication of WO2016122515A1

Classifications

    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C16/00 Erasable programmable read-only memories
    • G11C16/02 Erasable programmable read-only memories electrically programmable
    • G11C16/06 Auxiliary circuits, e.g. for writing into memory
    • G11C16/34 Determination of programming status, e.g. threshold voltage, overprogramming or underprogramming, retention
    • G11C16/3436 Arrangements for verifying correct programming or erasure
    • G11C16/344 Arrangements for verifying correct erasure or for detecting overerased cells
    • G11C16/3445 Circuits or methods to verify correct erasure of nonvolatile memory cells
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C29/00 Checking stores for correct operation; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/04 Detection or location of defective memory elements, e.g. cell construction details, timing of test signals
    • G11C29/08 Functional testing, e.g. testing during refresh, power-on self testing [POST] or distributed testing
    • G11C29/12 Built-in arrangements for testing, e.g. built-in self testing [BIST] or interconnection details
    • G11C29/38 Response verification devices
    • G11C29/42 Response verification devices using error correcting codes [ECC] or parity check
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C29/00 Checking stores for correct operation; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/52 Protection of memory contents; Detection of errors in memory contents

Definitions

  • Advanced error correcting code (ECC) computer memory technology can protect computer memory systems from any single memory chip failure as well as multi-bit errors from any portion of a single memory chip.
  • One example scheme to perform this function scatters the bits of a Hamming code ECC word across multiple memory chips, such that the failure of any single memory chip will affect only one ECC bit per word. This scheme can allow stored data to be reconstructed from memory despite a complete failure of one chip.
  • FIG. 1 illustrates an example system to perform an erasure multi-checksum product code (EraMC) error-correcting code (ECC) scheme.
  • FIG. 2 illustrates another example system to perform an EraMC ECC scheme.
  • FIG. 3 illustrates an example of a redundant array of independent disks (RAID) memory architecture.
  • FIG. 4 illustrates another example RAID memory architecture.
  • FIG. 5 illustrates an example cache line of multiple cartridges.
  • FIG. 6 illustrates an example checksum operation.
  • FIG. 7 illustrates an example erasure code operation.
  • FIG. 8 illustrates an example parity of erasure parity operation.
  • FIG. 9 illustrates an example EraMC ECC scheme operation.
  • FIG. 10 illustrates an example RAID memory architecture experiencing a failure.
  • FIG. 11 illustrates an example cache line of multiple cartridges.
  • FIG. 12 is a flow diagram illustrating an example method of performing an example EraMC ECC scheme operation.
  • FIG. 13 illustrates a table providing a comparison of error correction techniques.
  • FIG. 14 is a block diagram of an example system to perform an EraMC ECC error correction scheme.
  • This disclosure relates to a system and method to provide improved error correction in a memory architecture.
  • Error-correcting code can be used to detect and correct internal data corruption variations. More specifically, the ECC scheme explained herein can be described as an Erasure Multi-Checksum Product Code (EraMC) ECC scheme capable of decreasing the number of chips to be activated in error correction, and increasing the error correction capability with a lower computation complexity. This is achieved by combining and comparing checksum and erasure code operations, thereby leveraging redundancies experienced through each operation. As a result, the EraMC ECC scheme described herein achieves better fault-tolerance capability with relatively less computational complexity, fewer memory assets activated with the associated power savings, and a more efficient architecture format for RAID memory.
  • error conditions e.g., single bit errors, double bit errors, 4-bit dynamic random access memory (DRAM), 8-bit DRAM, single long burst, two-random single bit, multi-random single bits
  • the Inter-Cell-Interference (ICI) may be of greater concern such as, for example, the probability of multiple random small burst errors increases.
  • Some solutions may not be able to correct specific errors due to the nature of the various codes in use.
  • Alternative candidate codes are longer, large-symbol codes that are based on larger finite fields.
  • computational complexity for error correction services increases leading to a much higher decoding latency, requiring increased power consumption.
  • the EraMC ECC scheme disclosed herein exhibits additional robustness and power efficiencies relative to these and other possible error detection and correction techniques.
  • a module can refer to hardware (e.g., one or more discrete components, circuits or circuit systems), firmware or software, or a combination of hardware, software and/or firmware.
  • computing module 100 can be a general purpose computing platform (e.g., computer, server, a data center or the like), an application specific integrated chip (ASIC), a field-programmable gate array (FPGA) integrated circuit, a computer, server, or a combination of hardware and software.
  • Computing module 100 and the modules included therein can also be capable of receiving and executing machine readable instructions and data, and implementing processes as described herein, with respect to a given module.
  • the memory 110 can be a collection of memory cartridges 1 through N of volatile or non-volatile memory devices configured to receive and store data.
  • the memory 110 cartridges 1-N may be implemented as dynamic random access memory (DRAM), such as synchronous DRAM (SDRAM), including single data rate (SDR) SDRAM and double data rate (DDR) SDRAM (e.g., DDR2, DDR3, DDR4, or other DDR versions).
  • the memory 110 can also include other types of memory (e.g., reduced latency DRAM), and in some examples can include a combination of multiple different types of memory.
  • Computing module 100 can be connected to memory 110 via an interface 112 that facilitates data flow, for example, an electrical or optical bus, or a wireless communications network.
  • the computing module 100 includes a parity generation module 120 that includes a checksum module 122 and an erasure parity module 124.
  • the checksum module 122 is configured (e.g., hardware and/or machine readable instructions) to perform a checksum operation (e.g., checksum routine) on each row of data in each of the plurality of memory devices.
  • the checksum module can include (or be configured to employ) a plurality of different checksum routines 123 that can be selectively applied to different rows.
  • different polynomials can be utilized for different rows to help ensure different types of errors can be detected over time.
  • the checksum module 122 thus can provide respective checksum outputs for each row in each of a plurality of memory devices implemented in the memory 110.
  • checksum candidate code is the cyclic redundancy check (CRC) code, although other checksum routines 123 (e.g., Adler-32, Fletcher's checksum or the like) can additionally or alternatively be employed.
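
The idea of applying different checksum routines to different rows can be sketched as follows. This is a minimal illustration, not the patented implementation: the routine pool, the round-robin assignment of routines to rows, and the names `ROUTINES`, `row_checksums`, and `failing_rows` are assumptions made for the example. `zlib.crc32` and `zlib.adler32` come from the Python standard library; Fletcher-16 is written out by hand.

```python
import zlib


def fletcher16(data: bytes) -> int:
    """Fletcher-16 checksum: two running sums, each taken modulo 255."""
    s1 = s2 = 0
    for b in data:
        s1 = (s1 + b) % 255
        s2 = (s2 + s1) % 255
    return (s2 << 8) | s1


# Hypothetical pool of routines. The source only says that different
# routines (e.g., different CRC polynomials) can be applied to different rows.
ROUTINES = [zlib.crc32, zlib.adler32, fletcher16]


def row_checksums(rows: list[bytes]) -> list[int]:
    """Apply routine number (i mod pool size) to row i; assumed assignment."""
    return [ROUTINES[i % len(ROUTINES)](row) for i, row in enumerate(rows)]


def failing_rows(rows: list[bytes], stored: list[int]) -> list[int]:
    """Return indices of rows whose recomputed checksum differs from the stored one."""
    fresh = row_checksums(rows)
    return [i for i, (new, old) in enumerate(zip(fresh, stored)) if new != old]
```

A caller would compute `row_checksums` when the rows are written, store the outputs as checksum parity, and call `failing_rows` on read to locate rows that no longer match.
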
  • the checksum outputs can specify checksum parity and be stored as parity data 140.
  • the parity data 140 can be stored in the memory 110 or, in other examples, a separate memory structure.
  • the erasure parity module 124 (e.g., hardware and/or machine readable instructions) performs a parity operation on each column of data in each of the plurality of memory devices in the memory 110. In some examples, the erasure parity module 124 also performs the parity operation on each row of data in the memory devices of memory 110. The erasure parity module 124 thus can provide respective parity outputs for each row and/or column, which can be stored as parity data 140, to provide an indication of parity for each column of each of the memory devices in memory 110. The particular parity operation implemented by the erasure parity module 124 can depend on whether the data is stored in the memory 110 as single bit or multi-bit data. Thus the type of memory can control the parity operation. As one example, the parity operation for single bit data is an exclusive-OR (XOR) operation, although other parity operations can be utilized, such as disclosed herein.
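
As a rough illustration of an XOR parity operation taken per column, the sketch below treats each row as a byte string and XORs the bytes that share a column position. The function name `column_parity` and the byte-per-column layout are assumptions for the example; the source does not fix a data layout at this point.

```python
from functools import reduce


def column_parity(rows: list[bytes]) -> bytes:
    """XOR together the bytes found at each column position across all rows."""
    width = len(rows[0])
    if any(len(r) != width for r in rows):
        raise ValueError("all rows must have the same length")
    # zip(*rows) yields one tuple of ints per column position.
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*rows))
```

Because XOR is its own inverse, the same function applied to the rows plus the stored parity yields all-zero bytes when nothing has changed.
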
  • the parity generation module 120 is configured to receive outputs from each of the checksum module 122 (e.g., checksum outputs) and erasure parity module 124 (e.g., parity outputs for each column or each column and row), and to perform calculations and determinations thereon. Thus, the parity generation module 120 can analyze the outputs from each of the checksum module 122 and the erasure parity module 124 to evaluate the memory 110 and determine if an error is present. Thus, if the parity generation module 120 determines an error is present, the number and location of each error can be analyzed to determine if and what type of further action may be required.
  • Output from the parity generation module 120 can be stored as parity data 140, such as to specify checksum parity and PEP for the memory 110 in a table or other data structure.
  • the checksum parity and/or PEP is stored in a predefined location of a table in the memory 110 itself.
  • the EraMC ECC scheme employs erasure multi-checksum product codes from the checksum module 122.
  • a checksum is a small-size block of digital data employed for the purpose of detecting errors which may have been introduced during data transmission or data storage.
  • checksum codes can be applied to a file or other forms of data received from a server. If the computed checksum (e.g., checksum outputs from the checksum module 122) for the current data input matches the stored value of a previously computed checksum, there is a high probability that the data has not been altered or corrupted.
  • the checksum module 122 can generate outer codes implemented as multi-checksum codes to detect and locate memory devices with errors.
  • the goal of using different and/or multiple checksum codes for different chips is to ensure that, if a chip encounters an undetectable error pattern, the undetectable error pattern can be propagated into other chips for a different form of checksum parity check.
  • hidden error patterns for a particular chip will not be overlooked by different forms of checksum coding implemented by other chips.
  • EraMC ECC provides a robust error correction scheme.
  • inner codes are
  • outer codes and inner codes operate jointly to provide chipkill (e.g., advanced ECC) recovery on the memory 110.
  • FIG. 2 illustrates another example error correction system 90 that includes a computing module 100 to perform an EraMC ECC scheme on a memory 110.
  • the computing module 100 includes a parity generation module 120, which can be implemented by machine readable instructions executable by a processing resource (e.g., one or more processing units), as hardware or a combination of hardware and software.
  • the parity generation module 120 includes a checksum module 122 and an erasure parity module 124.
  • the parity generation module 120 is configured to process outputs from each of the checksum module 122 and erasure parity module 124, as disclosed with respect to FIG. 1, for example.
  • the checksum module 122 can evaluate checksum parity correctness after the checksum routines 123 of the checksum module 122 have been performed.
  • the erasure parity module 124 can include a parity routine 125 and can evaluate erasure parity after the parity routine 125 has been performed.
  • the parity routine 125 can be implemented as a symmetric Boolean function whose value depends on the number of ones in the input vector (e.g., parity routine for two inputs is an Exclusive-OR function).
  • the parity generation module 120 can compare the outputs from each of the checksum module 122 and the erasure parity module 124 to evaluate the checksum parity and the PEP to ascertain if an error is present in the memory 110, and to implement appropriate corrections.
  • the parity generation module 120 further includes an error pattern detection module 126 to check the checksum parity and the EP of the data in the memory 110.
  • the error pattern detection module 126 thus can ensure a match with the written data, the EP and the checksum parity.
  • the error pattern detection module 126 thus can determine the number of column errors and row errors.
  • the parity generation module 120 can also include a comparison module 128 to compare the determined result from the error pattern detection module 126 of each of the checksum module 122 and the erasure parity module 124 for errors.
  • parity generation module 120 can access the results from comparison module 128 (e.g., stored in parity data 140) and evaluate the number and location of each error to determine if further action is required.
  • the parity generation module 120 can include an erasure decoding module 129 and a Chase decoder module 130.
  • Other ECC can also be used, such as Reed-Solomon (RS) codes, Cauchy-Reed-Solomon (CRS) and Vandermonde-RS to name a few.
  • if the comparison module 128 determines that no column errors or row errors are detected, no further action is taken, as the absence of column errors indicates that either no error exists or any error is uncorrectable. In another example, if the comparison module 128 determines that the number of column and row errors is equal, then there are likely multi-random single bit errors. Thus, the parity generation module 120 invokes the Chase decoder module 130 to perform a Chase routine to correct the detected random single bit errors.
  • if the comparison module 128 determines the number of column errors to be much greater than the number of row errors, then a chipkill condition is identified in the compare output. In response to detecting the chipkill condition from the compare output, erasure decoding is performed via an erasure decoding module 129.
  • if the comparison module 128 determines only column errors are detected by the error pattern detection module 126 (e.g., no row errors detected by the checksum module), then there is likely a miss detection (e.g., a failure to detect an existing error). If a miss detection is identified, the erasure decode module 129 performs erasure decoding on each row of the memory devices in an iterative manner.
  • the checksum parity is re-checked by the checksum module 122 to identify the row that contains a miss detection.
  • an error miss detection module 132 will perform an error miss detection test to identify each row that contains one or more miss detection. For example, the error miss detection module 132 will perform an error miss detection test to evaluate the new output from the checksum (e.g., checksum parity) and the erasure decoding applied to each respective row to identify one or more rows containing a miss detection error. Once a row with the error is identified, the erasure parity module can perform erasure decoding on the identified row.
  • the error miss detection module 132 can be invoked to perform a miss detection test following completion of the error correction to ensure no further miss detection exists. However, if the miss detection test determines that there is no miss detection, then the memory is considered error free and no further action is needed.
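
The comparison logic described in the preceding paragraphs can be summarized as a small decision function. This is a sketch under stated assumptions: the function name `choose_action`, the returned labels, and especially the `ratio` threshold used to stand in for "much greater than" are hypothetical, since the source does not quantify that condition.

```python
def choose_action(column_errors: int, row_errors: int, ratio: int = 2) -> str:
    """Map column/row error counts to the follow-up step described in the text.

    `ratio` is an assumed threshold for "column errors much greater than
    row errors"; the source gives no numeric value.
    """
    if column_errors == 0 and row_errors == 0:
        return "none"              # nothing detected, no further action
    if column_errors > 0 and row_errors == 0:
        return "miss_detection"    # erasure-decode each row, then re-check checksums
    if column_errors == row_errors:
        return "chase"             # likely multi-random single bit errors
    if column_errors >= ratio * row_errors:
        return "chipkill_erasure"  # chipkill condition, run erasure decoding
    return "unclassified"          # case not covered by the description
```

The order of the checks matters: the "only column errors" case is tested before the ratio test, because a row-error count of zero would otherwise satisfy any ratio.
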
  • advanced ECC is a form of error checking and correcting computer memory technology that protects computer memory systems from any single memory chip failure, as well as multi-bit errors from any portion of a single memory chip.
  • chipkill scatters bits of code ECC words across multiple memory chips, such that the failure of any single memory chip will affect only one ECC bit per word. This allows memory contents to be reconstructed despite the complete failure of a single chip.
  • chipkill if a chip fails or has exceeded a threshold of bit errors, a redundant memory chip is used to replace the failed chip.
  • chipkill is able to correct multiple bits with less overhead than conventional ECC methods.
  • chipkill type ECC memory mechanisms have several drawbacks, such that the system requires activation of a large number of chips to provide chipkill level data protection. Increases in the number and complexity of chipkill systems will continue if an unmodified ECC scheme is maintained.
  • a cartridge in a memory array fails (e.g., failure of a cartridge 1-N of memory 110 of FIG. 1)
  • remaining data on the other cartridges within the memory array can be combined with the parity data (e.g., using the parity routine 125 and explained below with reference to FIG. 7) to reconstruct the missing data.
  • the parity routine is an XOR operation performed on a given cartridge's data to calculate parity data for the given cartridges. The resulting parity data is then stored on a redundant cartridge. Should any of the cartridges fail, the contents of the failed cartridge can be reconstructed on a replacement cartridge by subjecting the data from the remaining cartridges to the same parity routine.
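
A minimal sketch of this XOR-based reconstruction follows. The helper names `xor_blocks` and `reconstruct` are assumptions for illustration, and each cartridge is modeled simply as a byte string of equal length.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


def reconstruct(cartridges: list, parity: bytes, failed: int) -> bytes:
    """Rebuild the contents of the failed cartridge from the survivors and parity.

    The entry at index `failed` is ignored, so it may be None or stale data.
    """
    survivors = [c for i, c in enumerate(cartridges) if i != failed]
    return xor_blocks(survivors + [parity])
```

The parity stored on the redundant cartridge is `xor_blocks(data_cartridges)`; running the same routine over the survivors plus that parity cancels every surviving term and leaves only the missing cartridge's data.
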
  • the checksum routine breaks the data message into words, each with a fixed number of bits, and then computes the XOR of the words in each column. The result of the XOR is appended to the message as an extra word (e.g., stored in each column of memory based on the computed XOR for each respective row). To check the integrity of a message, the receiver computes the XOR of the words, including the checksum. If the result is not a word consisting entirely of zeros, the receiver knows a transmission error has occurred for the data.
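
The XOR checksum just described can be illustrated in a few lines. The function names and the choice of plain integers as words are assumptions for the example.

```python
def xor_checksum(words: list[int]) -> int:
    """XOR all fixed-width words together; the result is the checksum word."""
    acc = 0
    for w in words:
        acc ^= w
    return acc


def is_intact(words_with_checksum: list[int]) -> bool:
    """Receiver-side check: the XOR over message plus checksum must be zero."""
    return xor_checksum(words_with_checksum) == 0
```

A sender appends `xor_checksum(message)` to the message. Any error that flips an odd number of bits in one bit position makes the receiver's XOR non-zero, while an even number of flips in the same position goes undetected, which is one reason a stronger routine such as a CRC may be preferred.
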
  • Examples of codes for error correction include Reed-Solomon (RS) codes, which are codes to detect and correct multiple random errors within a memory system.
  • Specialized forms of RS codes specifically Cauchy-Reed-Solomon (CRS) and Vandermonde-RS, can be used to overcome the unreliable nature of data transmission over erasure channels.
  • example codes that can be implemented in the EraMC ECC scheme described herein are Cauchy-Reed-Solomon (CRS) codes using XOR operations on Galois Field GF(2) instead of GF(2^8), which serves to reduce
  • SCM Sparse Check Matrix
  • Parity data can be used in a Redundant Array of Independent Disks (RAID) memory architecture to perform error detection and correction, provide additional fault-tolerance capability, and achieve further system protection, schematic examples of which are demonstrated in FIGS. 3 and 4.
  • RAID data storage virtualization technology combines multiple, independent disk cartridge components into a logical unit for the purposes of data redundancy and/or performance improvements. For instance data is distributed across the individual cartridges in one of several ways, referred to as RAID levels, depending on the desired level of redundancy and performance requirements. Examples of RAID architectures can include RAID 0 (striping), RAID 1 and its variants (mirroring), RAID 5 (distributed parity), and RAID 6 (dual parity).
  • RAID 6 consists of block-level striping with double distributed parity. Double parity provides fault tolerance on up to two failed cartridges. This makes RAID 6 more practical for high-availability systems, as large-capacity drives take longer to restore. For instance, with a RAID 6 array, it is possible to mitigate most of the problems associated with lower RAID levels. The larger the drive capacities and the larger the array size, the more important it becomes to choose RAID 6 over lower RAID levels.
  • a RAID 6, for example can employ parity, which is an error protection scheme to provide fault tolerance in a given set of data. RAID 6 uses two separate parities based respectively on addition and multiplication in a particular Galois Field (GF) or Reed-Solomon (RS) error correction. However, employing RAID memory architecture by itself may add an additional 25% storage overhead or more to a given memory system.
  • the overhead of RAID systems contributing to overall system overhead is quite significant.
  • the RAID memory has five cartridges arranged as a series of columns, A1-D1 to A4-D4, and a fifth cartridge populated for parity redundancy.
  • the overhead introduced by this setup is therefore approximately 25%.
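
One way to arrive at the stated figure, assuming overhead is measured as parity capacity relative to data capacity (the text does not state the denominator), is:

```latex
\text{overhead} = \frac{\text{parity cartridges}}{\text{data cartridges}} = \frac{1}{4} = 25\%
```

Measured against total capacity instead, the same arrangement would give 1/5 = 20%, so the 25% figure is consistent with the first reading.
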
  • Each column is divided into rows (e.g., A, B, C, and D).
  • Overhead refers to the processing time required by system software, which includes the operating system and any utility that supports application programs.
  • overhead refers to the processing time required by codes for error checking and control of transmissions.
  • the combined scheme involving RAID parity with the EraMC ECC code words reduces the RAID memory overhead to approximately zero.
  • the EraMC ECC scheme described herein enables ECC to leverage the redundancies of a RAID architecture to provide error correction that requires less power consumption and fewer computing resources. Accordingly, the EraMC ECC scheme achieves both power and computation efficiency for ECC memory, and format efficiency for a RAID memory architecture.
  • the burst length is eight, and each cache line has 64-bytes.
  • the EraMC ECC scheme described herein divides the cache line (e.g., the whole codeword) into N number of elements, each to be written into different memory cartridges for a RAID 6 memory architecture with eight cartridges. For each memory channel, the probability that dual in-line memory module (DIMM) failure and random errors occur (e.g., in a chip on a different DIMM) concurrently for that particular read is therefore significantly lowered.
  • Another example of a memory architecture implementing the EraMC ECC scheme is illustrated in FIG. 5, expanding the cache line across each of the eight cartridges, shown as rows A1 to A8. Each cartridge has another eight portions arranged as columns. Columns one through six, denoted by 200 to 205, each hold 1 byte of data. Column seven is split between 8 bit data 210 and checksum parity (e.g., determined by checksum module 122), shown as 220. Erasure parity (EP) 230 occupies the eighth column of rows A1-A7. A parity of erasure parity (PEP) at 240 occupies the eighth column of row A8.
  • FIGS. 6-9 illustrate the EraMC ECC scheme being performed on a single cache line, such as the cache line illustrated in FIG. 5.
  • the parity generation module 120 in communication with the plurality of memory devices 110, as described with reference to FIGS. 1 and 2, can be used to implement the EraMC ECC scheme described in FIGS. 6-9.
  • FIG. 6 illustrates an example checksum routine 212 performed on each row A1 to A8 across columns 200-205 of each cartridge (e.g., a checksum routine selected from multi-checksum product codes performed by checksum module 122).
  • the result of the checksum routine is the checksum parity output 220 for each row.
  • an example checksum routine is the multi-CRC checksum with different polynomials, although other checksum routines can be employed.
  • a different checksum routine can be performed on each of the different rows.
  • different CRC checksum polynomials e.g., routines 123 are applied to different rows to enhance the robustness of the results.
  • FIG. 7 illustrates an example of single chipkill operation 232 performed on each column of the memory 110 using a single parity erasure code, the parity of which is generated by the parity routine 125 (e.g., an XOR operation) from the erasure parity module 124. The result is EP for each column (e.g., stored as parity data 140).
  • FIG. 8 provides an example parity routine 234 performed to generate parity of erasure parity (PEP) 240, such as can be computed based on applying a given parity routine to both data in the row EP and the column EP (e.g., XOR'ing the row EP and column EP).
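
One possible reading of the parity of erasure parity is the corner element of a two-dimensional XOR product code, sketched below. This interpretation, the grid-of-integers layout, and the name `product_parity` are assumptions; the source only states that a parity routine is applied to the row EP and column EP.

```python
from functools import reduce
from operator import xor


def product_parity(grid: list[list[int]]):
    """Return (row EP, column EP, PEP) for a rectangular grid of data symbols."""
    row_ep = [reduce(xor, row) for row in grid]        # one parity symbol per row
    col_ep = [reduce(xor, col) for col in zip(*grid)]  # one parity symbol per column
    pep = reduce(xor, row_ep)
    # In an XOR product code the parity of the row parities always equals
    # the parity of the column parities, so a mismatch signals corruption.
    if pep != reduce(xor, col_ep):
        raise AssertionError("row and column parity disagree")
    return row_ep, col_ep, pep
```

Under this reading the PEP protects the EP values themselves: both sets of parity symbols fold down to the same value, so either set can be cross-checked against it.
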
  • FIG. 9 illustrates a PEP product 240 as the result of an EraMC ECC operation.
  • the PEP product 240 can be analyzed by error pattern detection module 126 and comparison module 128 to identify and correct errors (e.g., by use of erasure decode module 129, Chase decoder module 130, and error miss detection module 132).
  • RAID memory using RAID levels 5 and 6 introduces significant RAID overhead.
  • RAID 5 and 6 (illustrated in, e.g., FIGS. 3 and 4, respectively) employ erasure codes (e.g., from erasure parity module 124 of FIG. 1) to generate RAID parity, which is the same coding as employed in the scheme described above. Accordingly, by use of the EraMC ECC scheme described herein, the erasure codes (e.g., generated by erasure parity module 124 of FIG. 1) can overlap with RAID parity. Accordingly, the EraMC ECC scheme fully utilizes the erasure code in ECC with the parity of the RAID memory architecture to limit redundancies and improve both power and processing efficiencies.
  • single DIMM failure contributes essentially one chipkill in codewords creating a failure of the fourth cartridge A4-N4, as illustrated in FIG. 10.
  • the chipkill can be recovered, for example, by implementing the checksum routine 212 and erasure parity code 232 in accordance with the EraMC ECC scheme described herein and provided in FIGS. 6 and 7, respectively. Accordingly, a comparison operation can be performed (e.g., by comparison module 128) to evaluate checksum parity and PEP and identify errors in the memory (e.g., by error pattern detection module 126 and error miss detection module 132).
  • EraMC ECC Erasure decode module 129, Chase decoder module 130 or other correcting coding techniques.
  • the EraMC ECC scheme described herein can use, for example, a Vandermonde matrix on finite fields as the parity routine 125 to generate parity.
  • this implementation can be applied to a RAID 6 configuration by using RAID 6 parity-generation modules (e.g., parity generation module 120 of FIGS. 1 and 2) for the proposed erasure codes. Consequently, up to two DIMM failures can be treated as similar or equivalent to double chipkills and handled in the same way.
  • FIG. 12 illustrates an example method 250 for an ECC decoding method.
  • the method 250 can be performed on the system 90 described in FIG. 1.
  • checksum parity for each row of data in a memory device (e.g., memory 110) is generated by performing a checksum operation by, for example, the checksum module 122 of FIG. 1.
  • erasure parity for each column of data in the memory device is generated by performing a parity operation by, for example, the erasure parity module 124 of FIG. 1 to evaluate parity of EP (PEP) with EP.
  • the checksum parity and the erasure parity are evaluated by, for example, the error pattern detection module 126 of FIG. 1.
  • the number of column errors and row errors are determined and analyzed.
  • a determination is made whether the memory device contains an error. If no errors are detected, then the memory is considered error free. Errors that are detected may be corrected as disclosed herein.
  • FIG. 13 provides a table to compare the capabilities of several error correction techniques with the EraMC ECC described herein.
  • the first column lists error conditions common to memory cartridges.
  • the second column indicates the capabilities of Single Parity error correction, followed by Standard ECC, RAID Memory, and the final column listing the error correction capabilities of the EraMC ECC.
  • Each of Single Parity, Standard ECC, and RAID Memory is subject to unreliable results. In other words, an unreliable result indicates that the error cannot be detected or corrected.
  • a detection indicates that the error exists but the location in the memory structure remains unknown (e.g., a string error).
  • a correct response indicates that the nature and location of the error have both been identified.
  • the EraMC ECC is a more robust error correction scheme, providing far greater capability to correct and/or detect an error.
  • application of the EraMC ECC on a memory device allows for a combination of checksum parity, erasure parity, error detection and miss detection, and error correction that is not found in other error correction techniques.
  • no technique other than EraMC ECC is capable of achieving error detection and/or correction in a reliable manner.
  • FIG. 14 is a block diagram of an example system 300 to perform an EraMC ECC error correction scheme.
  • System 300 may be similar to system 90 of FIGS. 1 and 2, for example.
  • system 300 includes a processor 304 and a non-transitory computer readable medium 306.
  • although the following descriptions refer to a single processor and a single machine-readable storage medium, the descriptions may also apply to a system with multiple processors and multiple machine-readable storage mediums.
  • the instructions may be distributed (e.g., stored) across multiple machine-readable storage mediums and distributed (e.g., executed) across multiple processors.
  • Processor 304 may be one or more central processing units (CPUs), microprocessors, and/or other hardware devices suitable for retrieval and execution of instructions stored in non-transitory computer readable medium 306.
  • processor 304 may fetch, decode, and execute instructions 308, 310, 312, 314, 316, 318, 320, 322 to perform EraMC ECC error correction.
  • processor 304 may include one or more electronic circuits comprising a number of electronic components for performing the functionality of one or more of the instructions in non-transitory computer readable medium 306.
  • executable instruction representations e.g., boxes
  • executable instructions and/or electronic circuits included within one box may, in alternate embodiments, be included in a different box shown in the figures or in a different box not shown.
  • Non-transitory computer readable medium 306 may be any electronic, magnetic, optical, or other physical storage device that stores executable instructions.
  • non-transitory computer readable medium 306 may be, for example, Random Access Memory (RAM), an Electrically-Erasable Programmable Read-Only Memory (EEPROM), a storage drive, an optical disc, and the like.
  • Non-transitory computer readable medium 306 may be disposed within system 300, as shown in FIG. 14. In this situation, the executable instructions may be "installed" on the system 300.
  • non-transitory computer readable medium 306 may be a portable, external or remote storage medium, for example, that allows system 300 to download the instructions from the portable/external/remote storage medium. In this situation, the executable instructions may be part of an "installation package".
  • non-transitory computer readable medium 306 may be encoded with executable instructions for performing EraMC E
  • checksum parity generation instructions 308 when executed by a processor (e.g., processor 304), may cause system 300 to generate checksum parity for a memory 302.
  • Memory 302 may be similar to memory 110 of FIGS. 1 and 2, for example.
  • Erasure parity generation instructions 310 when executed by a processor (e.g., processor 304), may cause system 300 to generate erasure parity for a memory 302.
  • memory device error determination instructions 312 when executed by a processor (e.g., processor 304), may cause system 300 to determine whether memory 302 contains an error. This determination can be the result of row error determination instructions 314 and column error determination instructions 316, which collectively determine the number of row and column errors within memory 302, respectively.
  • erasure decoding instructions 318 when executed by a processor (e.g., processor 304), may cause system 300 to perform erasure decoding for memory 302.
  • a determination from instruction 312 that the number of row errors and column errors is equal may cause system 300 to select other decoding instructions (e.g., Chase decoding instructions) 320, when executed by a processor (e.g., processor 304), to perform Chase decoding for memory 302.
  • a determination from instruction 312 that no row errors exist may cause system 300 to perform miss detection identification instructions 322, when executed by a processor (e.g., processor 304), to identify a miss detection in memory 302.

Landscapes

  • Detection And Correction Of Errors (AREA)
  • Techniques For Improving Reliability Of Storages (AREA)

Abstract

A system includes a plurality of memory devices to store data. The system further includes a parity generation module including a checksum module to perform a checksum operation on each row of data in each of the plurality of memory devices to provide respective checksum outputs. The parity generation module further comprises an erasure parity module to perform an erasure parity operation on each column of data in each of the plurality of memory devices to provide respective erasure parity outputs. The parity generation module can evaluate the checksum output and the erasure parity output for each of the plurality of memory devices to provide parity data to indicate whether a given memory device of the plurality of memory devices contains an error.

Description

ERASURE MULTI-CHECKSUM ERROR CORRECTION CODE
BACKGROUND
[0001] Advanced error correcting code (ECC) computer memory technology can protect computer memory systems from any single memory chip failure as well as multi-bit errors from any portion of a single memory chip. One example scheme to perform this function scatters the bits of a Hamming code ECC word across multiple memory chips, such that the failure of any single memory chip will affect only one ECC bit per word. This scheme can allow stored data to be reconstructed from memory despite a complete failure of one chip.
BRIEF DESCRIPTION OF THE DRAWINGS
[0002] FIG. 1 illustrates an example system to perform an erasure multi-checksum product code (EraMC) error-correcting code (ECC) scheme.
[0003] FIG. 2 illustrates another example system to perform an EraMC ECC scheme.
[0004] FIG. 3 illustrates an example of a redundant array of independent disks (RAID) memory architecture.
[0005] FIG. 4 illustrates another example RAID memory architecture.
[0006] FIG. 5 illustrates an example cache line of multiple cartridges.
[0007] FIG. 6 illustrates an example checksum operation.
[0008] FIG. 7 illustrates an example erasure code operation.
[0009] FIG. 8 illustrates an example parity of erasure parity operation.
[0010] FIG. 9 illustrates an example EraMC ECC scheme operation.
[0011] FIG. 10 illustrates an example RAID memory architecture experiencing a failure.
[0012] FIG. 11 illustrates an example cache line of multiple cartridges experiencing a failure.
[0013] FIG. 12 is a flow diagram illustrating an example method of performing an example EraMC ECC scheme operation.
[0014] FIG. 13 illustrates a table providing a comparison of error correction techniques.
[0015] FIG. 14 is a block diagram of an example system to perform an EraMC ECC scheme.
DETAILED DESCRIPTION
[0016] This disclosure relates to a system and method to provide improved error correction in a memory architecture. Error-correcting code (ECC) can be used to detect and correct internal data corruption variations. More specifically, the ECC scheme explained herein can be described as an Erasure Multi-Checksum Product Code (EraMC) ECC scheme capable of decreasing the number of chips to be activated in error correction, and increasing the error correction capability with a lower computation complexity. This is achieved by combining and comparing checksum and erasure code operations, thereby leveraging redundancies experienced through each operation. As a result, the EraMC ECC scheme described herein achieves better fault-tolerance capability with relatively less computational complexity, fewer memory assets activated with the associated power savings, and a more efficient architecture format for RAID memory. For example, a variety of different error conditions (e.g., single bit errors, double bit errors, 4-bit dynamic random access memory (DRAM), 8-bit DRAM, single long burst, two-random single bit, multi-random single bits) can be detected using this approach and, in turn, corrected in a reliable manner.
[0017] Additionally, as memory cell density increases, the Inter-Cell-Interference (ICI) may be of greater concern; for example, the probability of multiple random small burst errors increases. Some solutions may not be able to correct specific errors due to the nature of the various codes in use. Alternative candidate codes are longer, large-symbol codes that are based on larger finite fields. Thus, computational complexity for error correction services increases, leading to a much higher decoding latency and requiring increased power consumption. In contrast, the EraMC ECC scheme disclosed herein exhibits additional robustness and power efficiencies relative to these and other possible error detection and correction techniques.
[0018] FIG. 1 illustrates an example of an error correction system 90 that includes a computing module 100 capable of performing the EraMC ECC scheme on a memory 110. A module, as described herein, can refer to hardware (e.g., one or more discrete components, circuits or circuit systems), firmware or software, or a combination of hardware, software and/or firmware. For example, computing module 100 can be a general purpose computing platform (e.g., computer, server, a data center or the like), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) integrated circuit, or a combination of hardware and software. Computing module 100 and the modules included therein can also be capable of receiving and executing machine readable instructions and data, and implementing processes as described herein, with respect to a given module.
[0019] The memory 110 can be a collection of memory cartridges 1 through N of volatile or non-volatile memory devices configured to receive and store data. As an example, the memory 110 cartridges 1-N may be implemented as dynamic random access memory (DRAM), such as synchronous DRAM (SDRAM), including single data rate (SDR) SDRAM and double data rate (DDR) SDRAM (e.g., DDR2, DDR3, DDR4, or other DDR versions). The memory 110 can also include other types of memory (e.g., reduced latency DRAM), and in some examples can include a combination of multiple different types of memory. Computing module 100 can be connected to memory 110 via an interface 112 that facilitates data flow, for example, an electrical or optical bus, or a wireless communications network. The computing module 100 includes a parity generation module 120 that includes a checksum module 122 and an erasure parity module 124. The parity generation module 120 therefore can provide a checksum code, an erasure code, or both.
[0020] As an example, the checksum module 122 is configured (e.g., hardware and/or machine readable instructions) to perform a checksum operation (e.g., checksum routine) on each row of data in each of the plurality of memory devices. In some examples, the checksum module can include (or be configured to employ) a plurality of different checksum routines 123 that can be selectively applied to different rows. As an example instance, different polynomials can be utilized for different rows to help ensure different types of errors can be detected over time. The checksum module 122 thus can provide respective checksum outputs for each row in each of a plurality of memory devices implemented in the memory 110. An example checksum candidate code (e.g., checksum routine) is the cyclic redundancy check (CRC) code, although other checksum routines 123 (e.g., Adler-32, Fletcher's checksum or the like) can additionally or alternatively be employed. The checksum outputs can specify checksum parity and be stored as parity data 140. In some examples, the checksum outputs (e.g., checksum parity) for each row of a given memory structure 110 are stored in a specified column of such memory. Additionally, in some examples, the parity data 140 can be stored in the memory 110 or, in other examples, a separate memory structure.
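The per-row checksum idea above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the 8-bit CRC width, the function names `crc8`, `row_checksums` and `failing_rows`, and the particular polynomial values in `ROW_POLYS` are all assumptions chosen for the example. The only point taken from the description is that each row is checked with a different polynomial.

```python
def crc8(data: bytes, poly: int) -> int:
    """Bitwise CRC-8, MSB first, zero initial value, no final XOR."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


# Hypothetical choice: one 8-bit generator polynomial per row (rows A1..A8).
ROW_POLYS = [0x07, 0x31, 0x1D, 0x9B, 0x2F, 0xD5, 0xA7, 0x39]


def row_checksums(rows):
    """Compute checksum parity for every row, cycling through ROW_POLYS."""
    return [crc8(row, ROW_POLYS[i % len(ROW_POLYS)]) for i, row in enumerate(rows)]


def failing_rows(rows, stored):
    """Return indices of rows whose recomputed checksum differs from the stored one."""
    fresh = row_checksums(rows)
    return [i for i, (new, old) in enumerate(zip(fresh, stored)) if new != old]
```

Calling `row_checksums` at write time and `failing_rows` at read time yields the list of row errors that later stages compare against the column errors.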
[0021] As a further example, the erasure parity module 124 (e.g., hardware and/or machine readable instructions) performs a parity operation on each column of data in each of the plurality of memory devices in the memory 110. In some examples, the erasure parity module 124 also performs the parity operation on each row of data in the memory devices of memory 110. The erasure parity module 124 thus can provide respective parity outputs for each row and/or column, which can be stored as parity data 140, to provide an indication of parity for each column of each of the memory devices in memory 110. The particular parity operation implemented by the erasure parity module 124 can depend on whether the data is stored in the memory 110 as single bit or multi-bit data. Thus the type of memory can control the parity operation. As one example, the parity operation for single bit data is an exclusive-OR (XOR) operation, although other parity operations can be utilized, such as disclosed herein.
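As a rough sketch of the XOR parity operation described above, the following assumes the data is held as equal-length byte rows and that parity is taken down each column. The names `column_parity` and `mismatched_columns` are hypothetical and are not taken from the disclosure.

```python
from functools import reduce
from operator import xor


def column_parity(rows):
    """XOR the bytes in each column position across all rows."""
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all rows must have the same length")
    # zip(*rows) walks the rows column by column.
    return bytes(reduce(xor, column) for column in zip(*rows))


def mismatched_columns(rows, stored_parity):
    """Return the column indices whose recomputed parity differs from the stored parity."""
    fresh = column_parity(rows)
    return [i for i, (new, old) in enumerate(zip(fresh, stored_parity)) if new != old]
```

A non-empty result from `mismatched_columns` corresponds to the column errors that are counted when the parity data is evaluated.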
[0022] The parity generation module 120 is configured to receive outputs from each of the checksum module 122 (e.g., checksum outputs) and erasure parity module 124 (e.g., parity outputs for each column or each column and row), and to perform calculations and determinations thereon. Thus, the parity generation module 120 can analyze the outputs from each of the checksum module 122 and the erasure parity module 124 to evaluate the memory 110 and determine if an error is present. If the parity generation module 120 determines an error is present, the number and location of each error can be analyzed to determine if and what type of further action may be required. Output from the parity generation module 120 can be stored as parity data 140, such as to specify checksum parity and parity of erasure parity (PEP) for the memory 110 in a table or other data structure. In some examples, the checksum parity and/or PEP is stored in a predefined location of a table in the memory 110 itself.
[0023] The EraMC ECC scheme employs erasure multi-checksum product codes from the checksum module 122. A checksum is a small-size block of digital data employed for the purpose of detecting errors which may have been introduced during data transmission or data storage. For example, checksum codes can be applied to a file or other forms of data received from a server. If the computed checksum (e.g., checksum outputs from the checksum module 122) for the current data input matches the stored value of a previously computed checksum, there is a high probability that the data has not been altered or corrupted.
[0024] By way of example, the checksum module 122 can generate outer codes implemented as multi-checksum codes to detect and locate memory devices with errors. The goal of using different and/or multiple checksum codes for different chips is to ensure that, if a chip encounters an undetectable error pattern, the undetectable error pattern can be propagated into other chips for a different form of checksum parity check. Thus, hidden error patterns for a particular chip will not be overlooked by different forms of checksum coding implemented by other chips. As a result, EraMC ECC provides a robust error correction scheme. Additionally, inner codes are generated by the erasure parity module 124 and implemented as erasure codes to perform error recovery and/or correction. In the EraMC ECC scheme disclosed herein, outer codes and inner codes operate jointly to provide chipkill (e.g., advanced ECC) recovery on the memory 110.
[0025] FIG. 2 illustrates another example error correction system 90 that includes a computing module 100 to perform an EraMC ECC scheme on a memory 110. For sake of consistency, like reference numbers are utilized in FIG. 2 to identify elements previously introduced with respect to FIG. 1. Briefly stated, the computing module 100 includes a parity generation module 120, which can be implemented by machine readable instructions executable by a processing resource (e.g., one or more processing units), as hardware or a combination of hardware and software. The parity generation module 120 includes a checksum module 122 and an erasure parity module 124. The parity generation module 120 is configured to process outputs from each of the checksum module 122 and erasure parity module 124, as disclosed with respect to FIG. 1, for example. The checksum module 122 can evaluate checksum parity correctness following performance of the checksum routines 123. Additionally, the erasure parity module 124 can include a parity routine module 125 to evaluate erasure parity following performance of the parity routine 125. As an example, the parity routine 125 can be implemented as a symmetric Boolean function whose value depends on the number of ones in the input vector (e.g., parity routine for two inputs is an Exclusive-OR function). Additionally, the parity generation module 120 can compare the outputs from each of the checksum module 122 and the erasure parity module 124 to evaluate the checksum parity and the PEP to ascertain if an error is present in the memory 110, and to implement appropriate corrections.
[0026] As demonstrated in the example of FIG. 2, the parity generation module 120 further includes an error pattern detection module 126 to check the checksum parity and the erasure parity (EP) of the data in the memory 110. The error pattern detection module 126 thus can ensure a match with the written data, the EP and the checksum parity. The error pattern detection module 126 thus can determine the number of column errors and row errors. The parity generation module 120 can also include a comparison module 128 to compare the determined result from the error pattern detection module 126 of each of the checksum module 122 and the erasure parity module 124 for errors. For example, parity generation module 120 can access the results from comparison module 128 (e.g., stored in parity data 140) and evaluate the number and location of each error to determine if further action is required.
[0027] Depending on the type of error detected (e.g., the number of columns and rows with detected errors), various decoding routines may be performed to correct the errors and/or reevaluate an identified error. In the example of FIG. 2, the parity generation module 120 can include an erasure decoding module 129 and a Chase decoder module 130. Other ECC can also be used, such as Reed-Solomon (RS) codes, Cauchy-Reed-Solomon (CRS) and Vandermonde-RS to name a few. The parity generation module 120 thus can selectively employ a different correction module depending on the compare output provided by the comparison module 128 based on comparing the outputs from the error pattern detection module 126.
[0028] As one example, if the comparison module 128 determines no column errors or row errors are detected, no further action is taken, as the absence of column errors indicates that either no error exists or any error is uncorrectable. In another example, if the comparison module 128 determines that the number of column and row errors is equal, then there are likely multi-random single bit errors. Thus, the parity generation module 120 invokes the Chase decoder module 130 to perform a Chase routine to correct the detected random single bit errors.
[0029] In yet another example, if the comparison module 128 determines the number of column errors to be much greater than the number of row errors, then a chipkill condition is identified in the compare output. In response to detecting the chipkill condition from the compare output, erasure decoding is performed via an erasure decoding module 129. Alternatively, if the comparison module 128 determines only column errors are detected by the error pattern detection module 126 (e.g., no row errors detected by the checksum module), then there is likely a miss detection (e.g., a failure to detect an existing error). If a miss detection is identified, the erasure decode module 129 performs erasure decoding on each row of the memory devices in an iterative manner.
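The selection among decoding routines described in the preceding paragraphs can be summarized as a small dispatch function. This is a sketch under stated assumptions: the description does not quantify "much greater", so the `chipkill_ratio` threshold is a made-up parameter, and the `Action` enum, the `select_action` name and the `UNCLASSIFIED` fallback are hypothetical.

```python
from enum import Enum


class Action(Enum):
    NONE = "no further action"
    CHASE = "Chase decoding for random single bit errors"
    ERASURE = "erasure decoding for a chipkill condition"
    MISS_DETECTION = "iterative erasure decoding, then checksum re-check"
    UNCLASSIFIED = "combination not covered by the description"


def select_action(row_errors: int, col_errors: int, chipkill_ratio: int = 4) -> Action:
    """Map row/column error counts to a decoding action.

    chipkill_ratio is an assumed stand-in for "much greater than".
    """
    if row_errors < 0 or col_errors < 0:
        raise ValueError("error counts cannot be negative")
    if row_errors == 0 and col_errors == 0:
        return Action.NONE
    if row_errors == 0:
        # Only column errors were seen: a checksum likely missed an error.
        return Action.MISS_DETECTION
    if row_errors == col_errors:
        return Action.CHASE
    if col_errors >= chipkill_ratio * row_errors:
        return Action.ERASURE
    return Action.UNCLASSIFIED
```

For instance, `select_action(1, 8)` returns `Action.ERASURE`, while `select_action(2, 2)` returns `Action.CHASE`.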
[0030] Following the erasure decoding step, the checksum parity is re-checked by the checksum module 122 to identify the row that contains a miss detection. Because the patterns for each checksum code differ, the new checksum operation by checksum module 122 compared to the original checksum will indicate a corresponding miss detection error. In response, an error miss detection module 132 will perform an error miss detection test to identify each row that contains one or more miss detections. For example, the error miss detection module 132 will perform an error miss detection test to evaluate the new output from the checksum (e.g., checksum parity) and the erasure decoding applied to each respective row to identify one or more rows containing a miss detection error. Once a row with the error is identified, the erasure parity module can perform erasure decoding on the identified row. Thus, if column/row errors or miss detections have been identified, the error miss detection module 132 can be invoked to perform a miss detection test following completion of the error correction to ensure no further miss detection exists. However, if the miss detection test determines that there is no miss detection, then the memory is considered error free and no further action is needed.
[0031] One method of ECC to perform error correction is an advanced ECC technique (e.g., also known as chipkill or extended ECC). For instance, advanced ECC is a form of error checking and correcting computer memory technology that protects computer memory systems from any single memory chip failure, as well as multi-bit errors from any portion of a single memory chip. One form of chipkill scatters bits of code ECC words across multiple memory chips, such that the failure of any single memory chip will affect only one ECC bit per word. This allows memory contents to be reconstructed despite the complete failure of a single chip. By employing chipkill, if a chip fails or has exceeded a threshold of bit errors, a redundant memory chip is used to replace the failed chip. Thus, chipkill is able to correct multiple bits with less overhead than conventional ECC methods. However, chipkill type ECC memory mechanisms have several drawbacks; for example, the system requires activation of a large number of chips to provide chipkill level data protection. Increases in the number and complexity of chipkill systems will continue if an unmodified ECC scheme is maintained.
[0032] Current solutions employ a large number of activated memory chips concurrently, leading to huge demands on power consumption. A common result is an over-heated data center, which requires an extensive cooling system and higher costs. With the EraMC ECC scheme disclosed herein, a predetermined plurality of memory devices (e.g., eight memory cartridges) are activated for effective ECC. Thus, far fewer chips are activated than other ECC operations in chipkill memory may employ. As one example, where 36 chips may be used for 4 DRAM in existing chipkill architecture, 18 chips may be used for 8 DRAM in the EraMC ECC scheme disclosed herein.
[0033] If a cartridge in a memory array fails (e.g., failure of a cartridge 1-N of memory 110 of FIG. 1), remaining data on the other cartridges within the memory array can be combined with the parity data (e.g., using the parity routine 125 and explained below with reference to FIG. 7) to reconstruct the missing data. In some examples, the parity routine is an XOR operation performed on a given cartridge's data to calculate parity data for the given cartridges. The resulting parity data is then stored on a redundant cartridge. Should any of the cartridges fail, the contents of the failed cartridge can be reconstructed on a replacement cartridge by subjecting the data from the remaining cartridges to the same parity routine. Thus, if one of the cartridges were to fail, its data could be rebuilt using the parity routine results of the contents of the remaining cartridges. The result of that parity routine calculation yields the damaged cartridge's contents, which are then stored on a remaining cartridge, fully repairing the array of independent cartridges. This same parity routine concept can be applied to larger arrays, using any number of cartridges distributed across any number of memory devices.
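The reconstruction step described above relies on the fact that XOR is its own inverse: XOR-ing the parity with every surviving cartridge leaves exactly the missing cartridge's contents. A minimal sketch, assuming each cartridge holds an equal-length byte string and that the index of the failed cartridge is already known (the names `xor_all` and `rebuild` are hypothetical):

```python
def xor_all(blocks):
    """XOR equal-length byte blocks together position by position."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        if len(block) != len(out):
            raise ValueError("all blocks must have the same length")
        for i, value in enumerate(block):
            out[i] ^= value
    return bytes(out)


def rebuild(cartridges, parity, failed_index):
    """Recover the failed cartridge from the survivors and the stored parity."""
    survivors = [c for i, c in enumerate(cartridges) if i != failed_index]
    return xor_all(survivors + [parity])
```

Here `xor_all(data)` produces the parity written to the redundant cartridge, and `rebuild` is applied after a failure has been located.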
[0034] In some examples, the checksum routine breaks the data message into words, each with a fixed number of bits, and then computes the XOR of the words in each column. The result of the XOR is appended to the message as an extra word (e.g., stored in each column of memory based on the computed XOR for each respective row). To check the integrity of a message, the receiver computes the XOR of the words, including the checksum. If the result is not a word consisting entirely of zeros, the receiver knows a transmission error has occurred for the data.
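The word-wise XOR checksum described above can be illustrated as follows. The two-byte word size and the function names are assumptions made for the example. The check relies on the fact that XOR-ing all words together with their own XOR always gives zero.

```python
def xor_checksum(message: bytes, word_size: int = 2) -> bytes:
    """XOR all fixed-size words of the message into a single word."""
    if len(message) % word_size:
        raise ValueError("message length must be a multiple of the word size")
    check = bytearray(word_size)
    for start in range(0, len(message), word_size):
        for i in range(word_size):
            check[i] ^= message[start + i]
    return bytes(check)


def append_checksum(message: bytes, word_size: int = 2) -> bytes:
    """Append the XOR checksum to the message as one extra word."""
    return message + xor_checksum(message, word_size)


def is_intact(framed: bytes, word_size: int = 2) -> bool:
    """A framed message is intact when the XOR over every word, checksum included, is zero."""
    return xor_checksum(framed, word_size) == bytes(word_size)
```

Note that this simple checksum cannot detect two errors that cancel in the same bit position, which is one reason stronger routines such as CRC are mentioned as candidates.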
[0035] Examples of codes for error correction include Reed-Solomon (RS) codes, which are codes to detect and correct multiple random errors within a memory system. Specialized forms of RS codes, specifically Cauchy-Reed-Solomon (CRS) and Vandermonde-RS, can be used to overcome the unreliable nature of data transmission over erasure channels. Thus, example codes that can be implemented in the EraMC ECC scheme described herein are Cauchy-Reed-Solomon (CRS) codes using XOR operations on Galois Field (GF)(2) instead of GF(2^8), which serves to reduce computational complexity. As an alternative example, Sparse Check Matrix (SCM) codes can be implemented on GF(2). This SCM scheme can be readily implemented in hardware solutions in the system 90 by employing non-Galois Field operations for both single chipkill and double chipkill, for example.
[0036] Parity data can be used in a Redundant Array of Independent Disks (RAID) memory architecture to perform error detection and correction, provide additional fault-tolerance capability, and achieve further system protection, schematic examples of which are demonstrated in FIGS. 3 and 4. In these examples, the RAID data storage virtualization technology combines multiple, independent disk cartridge components into a logical unit for the purposes of data redundancy and/or performance improvements. For instance, data is distributed across the individual cartridges in one of several ways, referred to as RAID levels, depending on the desired level of redundancy and performance requirements. Examples of RAID architectures can include RAID 0 (striping), RAID 1 and its variants (mirroring), RAID 5 (distributed parity), and RAID 6 (dual parity).
[0037] As an example, RAID 6 consists of block-level striping with double distributed parity. Double parity provides fault tolerance on up to two failed cartridges. This makes RAID 6 more practical for high-availability systems, as large-capacity drives take longer to restore. For instance, with a RAID 6 array, it is possible to mitigate most of the problems associated with lower RAID levels. The larger the drive capacities and the larger the array size, the more important it becomes to choose RAID 6 over lower RAID levels. A RAID 6, for example, can employ parity, which is an error protection scheme to provide fault tolerance in a given set of data. RAID 6 uses two separate parities based respectively on addition and multiplication in a particular Galois Field (GF) or Reed-Solomon (RS) error correction. However, employing RAID memory architecture by itself may add an additional 25% storage overhead or more to a given memory system.
[0038] In comparison, the overhead of RAID systems contributing to overall system overhead is quite significant. For an example RAID 5 setup, as illustrated in FIG. 3, the RAID memory has five cartridges arranged as a series of columns, A1-D1 to A4-D4, and a fifth cartridge populated for parity redundancy. The overhead introduced by this setup is therefore approximately 25%. Each column is divided into rows (e.g., A, B, C, and D). Overhead refers to the processing time required by system software, which includes the operating system and any utility that supports application programs. For example, overhead refers to the processing time required by codes for error checking and control of transmissions. In such a case, the combined scheme involving RAID parity with the EraMC ECC code words, in which the ECC redundancies overlap with RAID redundancies, reduces the RAID memory overhead to approximately zero. For example, the EraMC ECC scheme described herein enables ECC to leverage the redundancies of a RAID architecture to provide error correction that requires less power consumption and fewer computing resources. Accordingly, the EraMC ECC scheme achieves both power and computation efficiency for ECC memory, and format efficiency for a RAID memory architecture.
[0039] As a further example, for a double data rate fourth generation (DDR4) x8 dynamic random-access memory (DRAM) configuration, the burst length is eight, and each cache line has 64-bytes. As illustrated in the example of FIG. 4, the EraMC ECC scheme described herein divides the cache line (e.g., the whole codeword) into N number of elements, each to be written into different memory cartridges for a RAID 6 memory architecture with eight cartridges. For each memory channel, the probability that dual in-line memory module (DIMM) failure and random errors occur (e.g., in a chip on a different DIMM) concurrently for that particular read is therefore significantly lowered. Thus, when a DIMM fails, it is equivalent to a situation where a single chipkill condition occurs.
[0040] Another example of a memory architecture implementing the EraMC ECC scheme is illustrated in FIG. 5, expanding the cache line across each of the eight cartridges, shown as rows A1 to A8. Each cartridge has another eight portions arranged as columns. Columns one through six are denoted by 200 to 205, each being 1 byte of data. Column seven is split between 8 bit data 210 and checksum parity (e.g., determined by checksum module 122), shown as 220. Erasure parity (EP) 230 occupies the eighth column of rows A1-A7. A parity of erasure parity (PEP) at 240 occupies the eighth column of row A8.
[0041] FIGS. 6-9 illustrate the EraMC ECC scheme being performed on a single cache line, such as the cache line illustrated in FIG. 5. For example, the parity generation module 120 in communication with the plurality of memory devices 110, as described with reference to FIGS. 1 and 2, can be used to implement the EraMC ECC scheme described in FIGS. 6-9. FIG. 6 illustrates an example checksum routine 212 performed on each row A1 to A8 across columns 200-205 of each cartridge (e.g., a checksum routine selected from multi-checksum product codes performed by checksum module 122). The result of the checksum routine is the checksum parity output 220 for each row. As mentioned, an example checksum routine is the multi-CRC checksum with different polynomials, although other checksum routines can be employed. Moreover, a different checksum routine can be performed on each of the different rows. In the example of a multi-CRC checksum, different CRC checksum polynomials (e.g., routines 123) are applied to different rows to enhance the robustness of the results.
[0042] FIG. 7 illustrates an example of single chipkill operation 232 performed on each column of the memory 110 using a single parity erasure code, the parity of which is generated by the parity routine 125 (e.g., an XOR operation) from the erasure parity module 124. The result is EP for each column (e.g., stored as parity data 140). FIG. 8 provides an example parity routine 234 performed to generate parity of erasure parity (PEP) 240, such as can be computed based on applying a given parity routine to both data in the row EP and the column EP (e.g., XOR'ing the row EP and column EP).
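One plausible reading of the EP and PEP relationship is the usual product-code arrangement, in which a single corner value protects both the row parities and the column parities. The sketch below is an assumption about that arrangement rather than the disclosed routine 234; the grid-of-bytes layout and the names `fold` and `erasure_parities` are illustrative only.

```python
from functools import reduce
from operator import xor


def fold(values):
    """XOR a sequence of integers into one value (0 for an empty sequence)."""
    return reduce(xor, values, 0)


def erasure_parities(grid):
    """Return (row_ep, col_ep, pep) for a rectangular grid of byte values."""
    row_ep = [fold(row) for row in grid]
    col_ep = [fold(col) for col in zip(*grid)]
    # Folding the row parities and folding the column parities both XOR
    # every cell of the grid exactly once, so they give the same PEP value.
    pep = fold(row_ep)
    return row_ep, col_ep, pep
```

Because `fold(row_ep)` and `fold(col_ep)` agree for uncorrupted data, a disagreement between them signals that one of the stored parity values has itself been damaged.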
[0043] FIG. 9 illustrates a PEP product 240 as the result of an EraMC ECC operation. Thus, the PEP product 240 can be analyzed by error pattern detection module 126 and comparison module 128 to identify and correct errors (e.g., by use of erasure decode module 129, Chase decoder module 130, and error miss detection module 132).
[0044] RAID memory using RAID levels 5 and 6 introduces significant RAID overhead. RAID 5 and 6 (illustrated in, e.g., FIGS. 3 and 4, respectively) employ erasure codes (e.g., from erasure parity module 124 of FIG. 1) to generate RAID parity, which is the same coding as employed in the scheme described above. Accordingly, by use of the EraMC ECC scheme described herein, the erasure codes (e.g., generated by erasure parity module 124 of FIG. 1) can overlap with RAID parity. Accordingly, the EraMC ECC scheme fully utilizes the erasure code in ECC with the parity of the RAID memory architecture to limit redundancies and improve both power and processing efficiencies.
[0045] As an example, a single DIMM failure contributes essentially one chipkill in codewords, creating a failure of the fourth cartridge A4-N4, as illustrated in FIG. 10. The chipkill can be recovered, for example, by implementing the checksum routine 212 and erasure parity code 232 in accordance with the EraMC ECC scheme described herein and provided in FIGS. 6 and 7, respectively. Accordingly, a comparison operation can be performed (e.g., by comparison module 128) to evaluate checksum parity and PEP and identify errors in the memory (e.g., by error pattern detection module 126 and error miss detection module 132). Once identified, errors can then be corrected using one or more decoding routines (e.g., erasure decode module 129, Chase decoder module 130 or other correcting coding techniques). Thus, in response to a failure of the fourth cartridge A4, application of the EraMC ECC can leverage the data written into cartridges A1 to A3 and A5 to A8 to recover the data from failed cartridge A4, as illustrated in FIG. 11. For a double chipkill situation, the EraMC ECC scheme described herein can use, for example, a Vandermonde matrix on finite fields as the parity routine 125 to generate parity. Thus, this implementation can be applied to a RAID 6 configuration by using RAID 6 parity-generation modules (e.g., parity generation module 120 of FIGS. 1 and 2) for the proposed erasure codes. Consequently, up to two DIMM failures are similar or equivalent to double chipkills, and the same parity can be leveraged to deal with them.
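The recovery described above can be sketched per symbol as follows. This is a minimal illustration under stated assumptions, not the scheme's actual parity routine: it assumes one 8-bit symbol per device, the field GF(2^8) with primitive polynomial 0x11D and generator 2, and a two-row Vandermonde-style parity (P with all-ones coefficients, Q with coefficients 2^i), which is the construction commonly used for RAID 6. All names are hypothetical.

```python
# GF(2^8) log/antilog tables for the primitive polynomial 0x11D, generator 2.
EXP = [0] * 510
LOG = [0] * 256
_x = 1
for _i in range(255):
    EXP[_i] = _x
    LOG[_x] = _i
    _x <<= 1
    if _x & 0x100:
        _x ^= 0x11D
for _i in range(255, 510):
    EXP[_i] = EXP[_i - 255]


def gf_mul(a, b):
    """Multiply two field elements."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def gf_div(a, b):
    """Divide field element a by nonzero field element b."""
    if b == 0:
        raise ZeroDivisionError("division by zero in GF(2^8)")
    if a == 0:
        return 0
    return EXP[(LOG[a] - LOG[b]) % 255]


def encode(data):
    """Return (P, Q): P = XOR of d_i, Q = XOR of 2^i * d_i."""
    p = q = 0
    for i, d in enumerate(data):
        p ^= d
        q ^= gf_mul(EXP[i], d)
    return p, q


def recover_one(data, p, x):
    """Single chipkill: rebuild symbol x from P and the surviving symbols."""
    r = p
    for i, d in enumerate(data):
        if i != x:
            r ^= d
    return r


def recover_two(data, p, q, x, y):
    """Double chipkill: rebuild symbols x and y (x != y) from P and Q."""
    pxy, qxy = p, q
    for i, d in enumerate(data):
        if i not in (x, y):
            pxy ^= d                    # leaves d_x ^ d_y
            qxy ^= gf_mul(EXP[i], d)    # leaves 2^x*d_x ^ 2^y*d_y
    gx, gy = EXP[x], EXP[y]
    dx = gf_div(qxy ^ gf_mul(gy, pxy), gx ^ gy)
    return dx, pxy ^ dx
```

The values stored at the failed positions are ignored during recovery, so the caller only needs to know which positions were erased, which is what distinguishes erasure decoding from general error correction.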
[0046] FIG. 12 illustrates an example ECC decoding method 250. The method 250 can be performed on the system 90 described in FIG. 1. At step 252, checksum parity for each row of data in a memory device (e.g., memory 110) is generated by performing a checksum operation by, for example, the checksum module 122 of FIG. 1. At step 254, erasure parity for each column of data in the memory device is generated by performing a parity operation by, for example, the erasure parity module 124 of FIG. 1 to evaluate parity of EP (PEP) with EP. At step 256, the checksum parity and the erasure parity are evaluated by, for example, the error pattern detection module 126 of FIG. 1. Thus, the numbers of column errors and row errors are determined and analyzed. At step 258, a determination is made whether the memory device contains an error. If no errors are detected, then the memory is considered error free. Errors that are detected may be corrected as disclosed herein.
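The steps of method 250 can be sketched as a recompute-and-compare pass. This is a simplified illustration: it assumes a grid of byte values, uses the standard-library CRC-32 as a stand-in for the row checksum routine, uses XOR as the column parity routine, and counts a row or column as erroneous when its recomputed value differs from the stored one. The function names are hypothetical.

```python
import zlib
from functools import reduce
from operator import xor


def row_checks(grid):
    """Step 252 (sketch): one checksum per row; CRC-32 is a stand-in."""
    return [zlib.crc32(bytes(row)) for row in grid]


def column_parities(grid):
    """Step 254 (sketch): one XOR parity per column."""
    return [reduce(xor, col, 0) for col in zip(*grid)]


def count_errors(grid, stored_row_checks, stored_col_parities):
    """Step 256 (sketch): count rows and columns whose parity mismatches."""
    row_errors = sum(
        1 for new, old in zip(row_checks(grid), stored_row_checks) if new != old
    )
    col_errors = sum(
        1 for new, old in zip(column_parities(grid), stored_col_parities) if new != old
    )
    return row_errors, col_errors


def contains_error(grid, stored_row_checks, stored_col_parities):
    """Step 258 (sketch): the memory is flagged if any mismatch is found."""
    rows, cols = count_errors(grid, stored_row_checks, stored_col_parities)
    return rows > 0 or cols > 0
```

A single corrupted symbol shows up as one row mismatch and one column mismatch, and the intersection of the two identifies its position in the grid.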
[0047] FIG. 13 provides a table to compare the capabilities of several error correction techniques with the EraMC ECC described herein. The first column lists error conditions common to memory cartridges. The second column indicates the capabilities of Single Parity error correction, followed by Standard ECC, RAID Memory, and the final column listing the error correction capabilities of the EraMC ECC. Each of Single Parity, Standard ECC, and RAID Memory is subject to unreliable results, where an unreliable result indicates that the error cannot be detected or corrected. A detection, on the other hand, indicates that the error exists but the location in the memory structure remains unknown (e.g., a string error). A correct response indicates that the nature and location of the error have both been identified. In comparison with the other error correction techniques, the EraMC ECC is a more robust error correction scheme, providing far greater capability to correct and/or detect an error. In the particular example of a multi-random Sb error condition, application of the EraMC ECC on a memory device allows for a combination of checksum parity, erasure parity, error detection and miss detection, and error correction that is not found in other error correction techniques. As shown in FIG. 13, no technique other than EraMC ECC is capable of achieving error detection and/or correction in a reliable manner.
[0048] FIG. 14 is a block diagram of an example system 300 to perform an EraMC ECC error correction scheme. System 300 may be similar to system 90 of FIGS. 1 and 2, for example. In the embodiment of FIG. 14, system 300 includes a processor 304 and a non-transitory computer readable medium 306. Although the following descriptions refer to a single processor and a single machine-readable storage medium, the descriptions may also apply to a system with multiple processors and multiple machine-readable storage mediums. In such examples, the instructions may be distributed (e.g., stored) across multiple machine-readable storage mediums and may be distributed across (e.g., executed by) multiple processors.
[0049] Processor 304 may be one or more central processing units (CPUs), microprocessors, and/or other hardware devices suitable for retrieval and execution of instructions stored in non-transitory computer readable medium 306. In the particular example shown in FIG. 14, processor 304 may fetch, decode, and execute instructions 308, 310, 312, 314, 316, 318, 320, 322 to perform EraMC ECC error correction. As an alternative or in addition to retrieving and executing instructions, processor 304 may include one or more electronic circuits comprising a number of electronic components for performing the functionality of one or more of the instructions in non-transitory computer readable medium 306. With respect to the executable instruction representations (e.g., boxes) described and shown herein, it should be understood that part or all of the executable instructions and/or electronic circuits included within one box may, in alternate embodiments, be included in a different box shown in the figures or in a different box not shown.
[0050] Non-transitory computer readable medium 306 may be any electronic, magnetic, optical, or other physical storage device that stores executable instructions. Thus, non-transitory computer readable medium 306 may be, for example, Random Access Memory (RAM), an Electrically-Erasable Programmable Read-Only Memory (EEPROM), a storage drive, an optical disc, and the like. Non-transitory computer readable medium 306 may be disposed within system 300, as shown in FIG. 14. In this situation, the executable instructions may be "installed" on the system 300. Alternatively, non-transitory computer readable medium 306 may be a portable, external or remote storage medium, for example, that allows system 300 to download the instructions from the portable/external/remote storage medium. In this situation, the executable instructions may be part of an "installation package". As described herein, non-transitory computer readable medium 306 may be encoded with executable instructions for performing EraMC ECC error correction.
[0051] Referring to FIG. 14, checksum parity generation instructions 308, when executed by a processor (e.g., processor 304), may cause system 300 to generate checksum parity for a memory 302. Memory 302 may be similar to memory 110 of FIGS. 1 and 2, for example. Erasure parity generation instructions 310, when executed by a processor (e.g., processor 304), may cause system 300 to generate erasure parity for the memory 302. Based on results from instructions 308 and 310, memory device error determination instructions 312, when executed by a processor (e.g., processor 304), may cause system 300 to determine whether memory 302 contains an error. This determination can be the result of row error determination instructions 314 and column error determination instructions 316, which determine the number of row errors and the number of column errors within memory 302, respectively.
[0052] Based on a determination from instructions 312 that the determined number of row errors is less than the determined number of column errors, erasure decoding instructions 318, when executed by a processor (e.g., processor 304), may cause system 300 to perform erasure decoding for memory 302. Alternatively, a determination from instructions 312 that the number of row errors and the number of column errors are equal may cause system 300 to select other decoding instructions 320 (e.g., Chase decoding instructions) that, when executed by a processor (e.g., processor 304), perform Chase decoding for memory 302. Additionally, a determination from instructions 312 that no row errors exist may cause system 300 to execute miss detection identification instructions 322 that, when executed by a processor (e.g., processor 304), identify a miss detection in memory 302.
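The selection among decoding instructions 318, 320, and 322 can be summarized as a small dispatch function. The sketch below is an interpretation, not the claimed logic: the names are hypothetical, the zero-row-error case is checked before the less-than comparison (an ordering the description does not state), and the case of more row errors than column errors is returned as unspecified because the description does not address it.

```python
from enum import Enum


class Action(Enum):
    NONE = "no error detected"
    MISS_DETECTION = "miss detection identification (instructions 322)"
    ERASURE_DECODE = "erasure decoding (instructions 318)"
    CHASE_DECODE = "Chase decoding (instructions 320)"
    UNSPECIFIED = "not covered by the description"


def select_action(row_errors: int, col_errors: int) -> Action:
    """Pick a decoding path from the row and column error counts."""
    if row_errors == 0 and col_errors == 0:
        return Action.NONE
    if row_errors == 0:
        # Assumed ordering: no row errors but some column errors suggests
        # that a row checksum missed an error.
        return Action.MISS_DETECTION
    if row_errors < col_errors:
        return Action.ERASURE_DECODE
    if row_errors == col_errors:
        return Action.CHASE_DECODE
    return Action.UNSPECIFIED
```

For example, select_action(1, 3) selects erasure decoding, while select_action(2, 2) selects Chase decoding.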
[0053] What have been described above are examples. It is, of course, not possible to describe every conceivable combination of components or methods, but one of ordinary skill in the art will recognize that many further combinations and permutations are possible. Accordingly, the invention is intended to embrace all such alterations, modifications, and variations that fall within the scope of this application, including the appended claims. Additionally, where the disclosure or claims recite "a," "an," "a first," or "another" element, or the equivalent thereof, it should be interpreted to include one or more than one such element, neither requiring nor excluding two or more such elements. As used herein, the term "includes" means includes but not limited to, and the term "including" means including but not limited to. The term "based on" means based at least in part on.

Claims

CLAIMS What is claimed is:
1. A system comprising:
a plurality of memory devices to store data for the system;
a parity generation module comprising:
a checksum module to perform a checksum operation on each row of data in each of the plurality of memory devices to provide respective checksum outputs; and
an erasure parity module to perform an erasure parity operation on each column of data in each of the plurality of memory devices to provide respective erasure parity outputs,
wherein the parity generation module is to evaluate the checksum outputs and the erasure parity outputs for each of the plurality of memory devices to provide parity data to indicate whether a given memory device of the plurality of memory devices contains an error.
2. The system of claim 1, the parity generation module further comprising an error pattern detection module to determine the number of row errors based on the checksum operation, and to determine the number of column errors based on the erasure parity operation.
3. The system of claim 2, the parity generation module further comprising a comparison module to determine the number of row errors relative to the number of column errors.
4. The system of claim 3, wherein, if the comparison module determines the number of column errors is greater than the number of row errors, the parity generation module further comprises an erasure decode module to perform erasure decoding on each column of data in each of the plurality of memory devices to correct the error.
5. The system of claim 3, wherein, if the comparison module determines the number of column errors is equal to the number of row errors, the parity generation module further comprising a Chase routine module to perform a Chase routine on each column of data and each row of data in each of the plurality of memory devices to correct the error.
6. The system of claim 3, wherein, if the determined number of row errors is zero, the parity generation module further comprising an erasure decode module to perform erasure decoding on each row of each of the plurality of memory devices in an iterative manner.
7. The system of claim 6, wherein the checksum module is to perform a second checksum operation on each row of data in each of the plurality of memory devices after the erasure decoding to provide respective second checksum outputs, the parity generation module further comprising a miss detection module to identify one or more rows of each of the plurality of memory devices that contains a miss detection based on a comparison of the second checksum output to the corresponding checksum outputs for each respective row.
8. The system of claim 7, wherein, in response to the miss detection module determining a miss detection for the one or more rows, the erasure decode module is to perform the erasure decoding on each of the identified one or more rows.
9. A non-transitory computer readable medium comprising instructions to perform error correction in a plurality of memory devices, the instructions executable by a processor of a system comprising:
generate checksum parity for each row of data in each of the plurality of memory devices by performing a checksum operation;
generate erasure parity for each column of data in each of the plurality of memory devices by performing an erasure parity operation;
determine whether a memory device of the plurality of memory devices contains an error based on an evaluation of the checksum parity and the erasure parity.
10. The non-transitory computer readable medium of claim 9, wherein the determining further comprises:
determining the number of row errors based on the checksum operation; and
determining the number of column errors based on the erasure parity operation.
11. The non-transitory computer readable medium of claim 10, wherein the instructions further comprise:
performing an erasure decoding on each row of data in each of the plurality of memory devices to correct the determined error if the determined number of column errors is greater than the determined number of row errors.
12. The non-transitory computer readable medium of claim 10, wherein the instructions further comprise:
performing a Chase routine on each column of data and row of data in each of the plurality of memory devices to correct the determined error if the determined number of column errors is equal to the determined number of row errors.
13. The non-transitory computer readable medium of claim 10, wherein if the determined number of row errors is zero, the instructions further comprise:
perform erasure decoding on each row of each of the plurality of memory devices in an iterative manner;
generate a second checksum parity for each row of data in each of the plurality of memory devices by performing a second checksum operation;
compare the second checksum parity to the checksum parity; and
identify the one or more rows of each of the plurality of memory devices that contains a miss detection based on the comparison.
14. The non-transitory computer readable medium of claim 13, wherein the instructions further comprise further performing erasure decoding on the identified one or more rows to correct the error.
15. A method comprising:
generating checksum parity for each row of data in each of a plurality of memory devices by performing a checksum operation;
generating erasure parity for each column of data in each of the plurality of memory devices by performing a parity operation;
evaluating the checksum parity relative to the erasure parity; and
determining whether a memory device of the plurality of memory devices contains an error based on the evaluation.
PCT/US2015/013460 2015-01-29 2015-01-29 Erasure multi-checksum error correction code WO2016122515A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2015/013460 WO2016122515A1 (en) 2015-01-29 2015-01-29 Erasure multi-checksum error correction code

Publications (1)

Publication Number Publication Date
WO2016122515A1 true WO2016122515A1 (en) 2016-08-04

Family

ID=56543940

Country Status (1)

Country Link
WO (1) WO2016122515A1 (en)

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15880413

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15880413

Country of ref document: EP

Kind code of ref document: A1