WO2016122515A1

WO2016122515A1 - Erasure multi-checksum error correction code

Info

Publication number: WO2016122515A1
Application number: PCT/US2015/013460
Authority: WO
Inventors: Han Wang; Patrick A. Raymond; Raghavan V. Venugopal
Original assignee: Hewlett Packard Enterprise Development Lp
Priority date: 2015-01-29
Filing date: 2015-01-29
Publication date: 2016-08-04

Abstract

A system includes a plurality of memory devices to store data. The system further includes a parity generation module including a checksum module to perform a checksum operation on each row of data in each of the plurality of memory devices to provide respective checksum outputs. The parity generation module further comprises an erasure parity module to perform an erasure parity operation on each column of data in each of the plurality of memory devices to provide respective erasure parity outputs. The parity generation module can evaluate the checksum output and the erasure parity output for each of the plurality of memory devices to provide parity data to indicate whether a given memory device of the plurality of memory devices contains an error.

Description

ERASURE MULTI-CHECKSUM ERROR CORRECTION CODE

BACKGROUND

[0001] Advanced error correcting code (ECC) computer memory technology can protect computer memory systems from any single memory chip failure as well as multi-bit errors from any portion of a single memory chip. One example scheme to perform this function scatters the bits of a Hamming code ECC word across multiple memory chips, such that the failure of any single memory chip will affect only one ECC bit per word. This scheme can allow stored data to be reconstructed from memory despite a complete failure of one chip.

BRIEF DESCRIPTION OF THE DRAWINGS

[0002] FIG. 1 illustrates an example system to perform an erasure multi-checksum product code (EraMC) error-correcting code (ECC) scheme.

[0003] FIG. 2 illustrates another example system to perform an EraMC ECC scheme.

[0004] FIG. 3 illustrates an example of a redundant array of independent disks (RAID) memory architecture.

[0005] FIG. 4 illustrates another example RAID memory architecture.

[0006] FIG. 5 illustrates an example cache line of multiple cartridges.

[0007] FIG. 6 illustrates an example checksum operation.

[0008] FIG. 7 illustrates an example erasure code operation.

[0009] FIG. 8 illustrates an example parity of erasure parity operation.

[0010] FIG. 9 illustrates an example EraMC ECC scheme operation.

[0011] FIG. 10 illustrates an example RAID memory architecture experiencing a failure.

[0012] FIG. 11 illustrates an example cache line of multiple cartridges experiencing a failure.

[0013] FIG. 12 is a flow diagram illustrating an example method of performing an example EraMC ECC scheme operation.

[0014] FIG. 13 illustrates a table providing a comparison of error correction techniques.

[0015] FIG. 14 is a block diagram of an example system to perform an EraMC ECC scheme.

DETAILED DESCRIPTION

[0016] This disclosure relates to a system and method to provide improved error correction in a memory architecture. Error-correcting code (ECC) can be used to detect and correct internal data corruption. More specifically, the ECC scheme explained herein can be described as an Erasure Multi-Checksum Product Code (EraMC) ECC scheme capable of decreasing the number of chips to be activated in error correction, and increasing the error correction capability with a lower computation complexity. This is achieved by combining and comparing checksum and erasure code operations, thereby leveraging redundancies obtained through each operation. As a result, the EraMC ECC scheme described herein achieves better fault-tolerance capability with relatively less computational complexity, fewer memory assets activated with the associated power savings, and a more efficient architecture format for RAID memory. For example, a variety of different error conditions (e.g., single bit errors, double bit errors, 4-bit dynamic random access memory (DRAM), 8-bit DRAM, single long burst, two-random single bit, multi-random single bits) can be detected using this approach and, in turn, corrected in a reliable manner.

[0017] Additionally, as memory cell density increases, Inter-Cell-Interference (ICI) may be of greater concern because, for example, the probability of multiple random small burst errors increases. Some solutions may not be able to correct specific errors due to the nature of the various codes in use. Alternative candidate codes are longer, large-symbol codes that are based on larger finite fields. Thus, computational complexity for error correction services increases, leading to a much higher decoding latency and requiring increased power consumption. In contrast, the EraMC ECC scheme disclosed herein exhibits additional robustness and power efficiencies relative to these and other possible error detection and correction techniques.

[0018] FIG. 1 illustrates an example of an error correction system 90 that includes a computing module 100 capable of performing the EraMC ECC scheme on a memory 110. A module, as described herein, can refer to hardware (e.g., one or more discrete components, circuits or circuit systems), firmware or software, or a combination of hardware, software and/or firmware. For example, computing module 100 can be a general purpose computing platform (e.g., computer, server, a data center or the like), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) integrated circuit, or a combination of hardware and software. Computing module 100 and the modules included therein can also be capable of receiving and executing machine readable instructions and data, and implementing processes as described herein, with respect to a given module.

[0019] The memory 110 can be a collection of memory cartridges 1 through N of volatile or non-volatile memory devices configured to receive and store data. As an example, the memory 110 cartridges 1-N may be implemented as dynamic random access memory (DRAM), such as synchronous DRAM (SDRAM), including single data rate (SDR) SDRAM and double data rate (DDR) SDRAM (e.g., DDR2, DDR3, DDR4, or other DDR versions). The memory 110 can also include other types of memory (e.g., reduced latency DRAM), and in some examples can include a combination of multiple different types of memory. Computing module 100 can be connected to memory 110 via an interface 112 that facilitates data flow, for example, an electrical or optical bus, or a wireless communications network. The computing module 100 includes a parity generation module 120 that includes a checksum module 122 and an erasure parity module 124. The parity generation module 120 therefore can provide a checksum code, an erasure code, or both.

[0020] As an example, the checksum module 122 is configured (e.g., hardware and/or machine readable instructions) to perform a checksum operation (e.g., checksum routine) on each row of data in each of the plurality of memory devices. In some examples, the checksum module can include (or be configured to employ) a plurality of different checksum routines 123 that can be selectively applied to different rows. As an example instance, different polynomials can be utilized for different rows to help ensure different types of errors can be detected over time. The checksum module 122 thus can provide respective checksum outputs for each row in each of a plurality of memory devices implemented in the memory 110. An example checksum candidate code (e.g., checksum routine) is the cyclic redundancy check (CRC) code, although other checksum routines 123 (e.g., Adler-32, Fletcher's checksum or the like) can additionally or alternatively be employed. The checksum outputs can specify checksum parity and be stored as parity data 140. In some examples, the checksum outputs (e.g., checksum parity) for each row of a given memory structure 110 are stored in a specified column of such memory. Additionally, in some examples, the parity data 140 can be stored in the memory 110 or, in other examples, a separate memory structure.
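
As a minimal sketch of the multi-CRC idea in paragraph [0020], the following Python applies a bitwise CRC-8 with a different generator polynomial to each row. The function names, the 8-bit width, the zero initial value, and the polynomial list ROW_POLYS are illustrative assumptions; the disclosure does not specify any of them.

```python
def crc8(data: bytes, poly: int) -> int:
    """Bitwise CRC-8, most significant bit first, zero initial value."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


# Hypothetical per-row generator polynomials (assumed, not from the disclosure).
ROW_POLYS = [0x07, 0x31, 0x1D, 0x9B]


def row_checksums(rows):
    """Return one checksum per row, cycling through the polynomial list."""
    return [crc8(row, ROW_POLYS[i % len(ROW_POLYS)]) for i, row in enumerate(rows)]
```

Because each row uses its own polynomial, an error pattern that one row's CRC cannot see is unlikely to be invisible to the CRCs of the other rows as well.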

[0021] As a further example, the erasure parity module 124 (e.g., hardware and/or machine readable instructions) performs a parity operation on each column of data in each of the plurality of memory devices in the memory 110. In some examples, the erasure parity module 124 also performs the parity operation on each row of data in the memory devices of memory 110. The erasure parity module 124 thus can provide respective parity outputs for each row and/or column, which can be stored as parity data 140, to provide an indication of parity for each column of each of the memory devices in memory 110. The particular parity operation implemented by the erasure parity module 124 can depend on whether the data is stored in the memory 110 as single bit or multi-bit data. Thus the type of memory can control the parity operation. As one example, the parity operation for single bit data is an exclusive-OR (XOR) operation, although other parity operations can be utilized, such as disclosed herein.

[0022] The parity generation module 120 is configured to receive outputs from each of the checksum module 122 (e.g., checksum outputs) and erasure parity module 124 (e.g., parity outputs for each column or each column and row), and to perform calculations and determinations thereon. Thus, the parity generation module 120 can analyze the outputs from each of the checksum module 122 and the erasure parity module 124 to evaluate the memory 110 and determine if an error is present. If the parity generation module 120 determines an error is present, the number and location of each error can be analyzed to determine whether further action is required and, if so, what type. Output from the parity generation module 120 can be stored as parity data 140, such as to specify checksum parity and parity of erasure parity (PEP) for the memory 110 in a table or other data structure. In some examples, the checksum parity and/or PEP is stored in a predefined location of a table in the memory 110 itself.

[0023] The EraMC ECC scheme employs erasure multi-checksum product codes from the checksum module 122. A checksum is a small-size block of digital data employed for the purpose of detecting errors which may have been introduced during data transmission or data storage. For example, checksum codes can be applied to a file or other forms of data received from a server. If the computed checksum (e.g., checksum outputs from the checksum module 122) for the current data input matches the stored value of a previously computed checksum, there is a high probability that the data has not been altered or corrupted.

[0024] By way of example, the checksum module 122 can generate outer codes implemented as multi-checksum codes to detect and locate memory devices with errors. The goal of using different and/or multiple checksum codes for different chips is to ensure that, if a chip encounters an undetectable error pattern, the undetectable error pattern can be propagated into other chips for a different form of checksum parity check. Thus, hidden error patterns for a particular chip will not be overlooked by different forms of checksum coding implemented by other chips. As a result, EraMC ECC provides a robust error correction scheme. Additionally, inner codes are generated by the erasure parity module 124 and implemented as erasure codes to perform error recovery and/or correction. In the EraMC ECC scheme disclosed herein, outer codes and inner codes operate jointly to provide chipkill (e.g., advanced ECC) recovery on the memory 110.

[0025] FIG. 2 illustrates another example error correction system 90 that includes a computing module 100 to perform an EraMC ECC scheme on a memory 110. For sake of consistency, like reference numbers are utilized in FIG. 2 to identify elements previously introduced with respect to FIG. 1. Briefly stated, the computing module 100 includes a parity generation module 120, which can be implemented by machine readable instructions executable by a processing resource (e.g., one or more processing units), as hardware or a combination of hardware and software. The parity generation module 120 includes a checksum module 122 and an erasure parity module 124. The parity generation module 120 is configured to process outputs from each of the checksum module 122 and erasure parity module 124, as disclosed with respect to FIG. 1, for example. The checksum module 122 can evaluate checksum parity correctness following performance of the checksum routines 123. Additionally, the erasure parity module 124 can include a parity routine 125 and can evaluate erasure parity following performance of the parity routine 125. As an example, the parity routine 125 can be implemented as a symmetric Boolean function whose value depends on the number of ones in the input vector (e.g., parity routine for two inputs is an Exclusive-OR function). Additionally, the parity generation module 120 can compare the outputs from each of the checksum module 122 and the erasure parity module 124 to evaluate the checksum parity and the PEP to ascertain if an error is present in the memory 110, and to implement appropriate corrections.
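
The parity routine described in paragraph [0025] can be illustrated with a short Python sketch. The function name parity is an assumption made for illustration. The result depends only on how many ones the input holds, so reordering the inputs does not change it, and for two inputs it reduces to XOR.

```python
def parity(bits):
    """Return 1 when the input holds an odd number of ones, else 0."""
    return sum(bits) % 2
```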

[0026] As demonstrated in the example of FIG. 2, the parity generation module 120 further includes an error pattern detection module 126 to check the checksum parity and the erasure parity (EP) of the data in the memory 110. The error pattern detection module 126 thus can ensure that the written data, the EP and the checksum parity match. The error pattern detection module 126 thus can determine the number of column errors and row errors. The parity generation module 120 can also include a comparison module 128 to compare the results determined by the error pattern detection module 126 for each of the checksum module 122 and the erasure parity module 124. For example, parity generation module 120 can access the results from comparison module 128 (e.g., stored in parity data 140) and evaluate the number and location of each error to determine if further action is required.

[0027] Depending on the type of error detected (e.g., the number of columns and rows with detected errors), various decoding routines may be performed to correct the errors and/or reevaluate an identified error. In the example of FIG. 2, the parity generation module 120 can include an erasure decoding module 129 and a Chase decoder module 130. Other ECC can also be used, such as Reed-Solomon (RS) codes, Cauchy-Reed-Solomon (CRS) and Vandermonde-RS to name a few. The parity generation module 120 thus can selectively employ a different correction module depending on the compare output provided by the comparison module 128 based on comparing the outputs from the error pattern detection module 126.

[0028] As one example, if the comparison module 128 determines no column errors or row errors are detected, no further action is taken, as the absence of column errors indicates that either no error exists or any error is uncorrectable. In another example, if the comparison module 128 determines that the number of column and row errors is equal, then there are likely multi-random single bit errors. Thus, the parity generation module 120 invokes the Chase decoder module 130 to perform a Chase routine to correct the detected random single bit errors.

[0029] In yet another example, if the comparison module 128 determines the number of column errors to be much greater than the number of row errors, then a chipkill condition is identified in the compare output. In response to detecting the chipkill condition from the compare output, erasure decoding is performed via the erasure decoding module 129. Alternatively, if the comparison module 128 determines only column errors are detected by the error pattern detection module 126 (e.g., no row errors detected by the checksum module), then there is likely a miss detection (e.g., a failure to detect an existing error). If a miss detection is identified, the erasure decoding module 129 performs erasure decoding on each row of the memory devices in an iterative manner.
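
The case analysis in paragraphs [0028] and [0029] can be summarized as a small dispatch function. This is a sketch only: the names Action and select_action are hypothetical, and the chipkill_ratio threshold is an assumed stand-in for "much greater", which the disclosure does not quantify.

```python
from enum import Enum


class Action(Enum):
    NONE = "no further action"
    CHASE = "Chase decoding for multi-random single bit errors"
    ERASURE = "erasure decoding for a chipkill condition"
    ITERATIVE_ERASURE = "row-by-row erasure decoding for a suspected miss detection"
    UNCLASSIFIED = "pattern not covered by the described cases"


def select_action(column_errors: int, row_errors: int, chipkill_ratio: int = 4) -> Action:
    """Map error counts to a decoding step; chipkill_ratio is an assumed threshold."""
    if column_errors == 0 and row_errors == 0:
        return Action.NONE
    if column_errors > 0 and row_errors == 0:
        # Column parity fails although no row checksum does: likely miss detection.
        return Action.ITERATIVE_ERASURE
    if column_errors == row_errors:
        return Action.CHASE
    if column_errors >= chipkill_ratio * row_errors:
        return Action.ERASURE
    return Action.UNCLASSIFIED
```

The miss detection case is tested before the chipkill case because zero row errors would otherwise also satisfy the "much greater" comparison.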

[0030] Following the erasure decoding step, the checksum parity is re-checked by the checksum module 122 to identify the row that contains a miss detection. Because the patterns for each checksum code differ, the new checksum operation by checksum module 122 compared to the original checksum will indicate a corresponding miss detection error. In response, an error miss detection module 132 will perform an error miss detection test to identify each row that contains one or more miss detections. For example, the error miss detection module 132 will perform an error miss detection test to evaluate the new output from the checksum (e.g., checksum parity) and the erasure decoding applied to each respective row to identify one or more rows containing a miss detection error. Once a row with the error is identified, the erasure parity module can perform erasure decoding on the identified row. Thus, if column/row errors or miss detections have been identified, the error miss detection module 132 can be invoked to perform a miss detection test following completion of the error correction to ensure no further miss detection exists. However, if the miss detection test determines that there is no miss detection, then the memory is considered error free and no further action is needed.
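
One possible reading of the miss detection test in paragraph [0030] is sketched below in Python. It assumes a single XOR column parity, a zero-initialized CRC-8 per row with a distinct polynomial for each row, and the hypothetical helper names crc8, xor_bytes and find_miss_detected_rows; none of these details are fixed by the disclosure. Each row is erasure decoded in turn from the other rows and the column parity. A row is flagged when the rebuilt contents differ from the stored contents yet still satisfy that row's stored checksum.

```python
from functools import reduce


def crc8(data: bytes, poly: int) -> int:
    """Bitwise CRC-8, most significant bit first, zero initial value."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def find_miss_detected_rows(rows, checksums, column_parity, polys):
    """Erasure decode every row in turn and re-check its checksum."""
    suspects = []
    for i, stored in enumerate(rows):
        others = [r for j, r in enumerate(rows) if j != i]
        rebuilt = reduce(xor_bytes, others, column_parity)
        # The rebuilt row disagrees with memory but matches the stored checksum.
        if rebuilt != stored and crc8(rebuilt, polys[i]) == checksums[i]:
            suspects.append(i)
    return suspects
```

In this sketch a corrupted row whose error pattern is a multiple of its own generator polynomial passes its row checksum, while the same pattern folded into any other row fails that row's different polynomial, so only the truly corrupted row is flagged.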

[0031] One method of ECC to perform error correction is an advanced ECC technique (e.g., also known as chipkill or extended ECC). For instance, advanced ECC is a form of error checking and correcting computer memory technology that protects computer memory systems from any single memory chip failure, as well as multi-bit errors from any portion of a single memory chip. One form of chipkill scatters the bits of ECC code words across multiple memory chips, such that the failure of any single memory chip will affect only one ECC bit per word. This allows memory contents to be reconstructed despite the complete failure of a single chip. By employing chipkill, if a chip fails or has exceeded a threshold of bit errors, a redundant memory chip is used to replace the failed chip. Thus, chipkill is able to correct multiple bits with less overhead than conventional ECC methods. However, chipkill type ECC memory mechanisms have several drawbacks; for example, the system requires activation of a large number of chips to provide chipkill level data protection. Increases in the number and complexity of chipkill systems will continue if an unmodified ECC scheme is maintained.

[0032] Current solutions employ a large number of concurrently activated memory chips, leading to huge demands on power consumption. A common result is an over-heated data center, which requires an extensive cooling system and higher costs. With the EraMC ECC scheme disclosed herein, a predetermined plurality of memory devices (e.g., eight memory cartridges) is activated for effective ECC. Thus, far fewer chips are activated than other ECC operations in chipkill memory may employ. As one example, where 36 chips may be used for x4 DRAM in existing chipkill architecture, 18 chips may be used for x8 DRAM in the EraMC ECC scheme disclosed herein.

[0033] If a cartridge in a memory array fails (e.g., failure of a cartridge 1-N of memory 110 of FIG. 1), remaining data on the other cartridges within the memory array can be combined with the parity data (e.g., using the parity routine 125 and explained below with reference to FIG. 7) to reconstruct the missing data. In some examples, the parity routine is an XOR operation performed on a given cartridge's data to calculate parity data for the given cartridges. The resulting parity data is then stored on a redundant cartridge. Should any of the cartridges fail, the contents of the failed cartridge can be reconstructed on a replacement cartridge by subjecting the data from the remaining cartridges to the same parity routine. Thus, if one of the cartridges were to fail, its data could be rebuilt using the parity routine results of the contents of the remaining cartridges. The result of that parity routine calculation yields the damaged cartridge's contents, which are then stored on a remaining cartridge, fully repairing the array of independent cartridges. This same parity routine concept can be applied to larger arrays, using any number of cartridges distributed across any number of memory devices.
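
The reconstruction described in paragraph [0033] can be sketched in Python for the XOR case. The helper names xor_bytes, parity_of and rebuild are illustrative assumptions, and each cartridge is modeled simply as an equal-length byte string.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def parity_of(cartridges):
    """XOR all cartridges together; the result is kept on the redundant cartridge."""
    return reduce(xor_bytes, cartridges)


def rebuild(surviving, parity):
    """Apply the same XOR routine to the survivors and the parity."""
    return reduce(xor_bytes, surviving, parity)
```

Because XOR is its own inverse, combining the parity with every surviving cartridge cancels their contributions and leaves exactly the contents of the single missing cartridge.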

[0034] In some examples, the checksum routine breaks the data message into words, each with a fixed number of bits, and then computes the XOR of the words in each column. The result of the XOR is appended to the message as an extra word (e.g., stored in each column of memory based on the computed XOR for each respective row). To check the integrity of a message, the receiver computes the XOR of the words, including the checksum. If the result is not a word consisting entirely of zeros, the receiver knows a transmission error has occurred for the data.
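
A minimal Python sketch of the XOR word checksum in paragraph [0034] follows. It assumes words are plain integers of a common width and that the integrity check is whether the XOR of all words, checksum included, is the all-zero word. The function names are hypothetical.

```python
def xor_checksum(words):
    """XOR all words together to form the checksum word."""
    checksum = 0
    for word in words:
        checksum ^= word
    return checksum


def is_intact(words_with_checksum):
    """A message with its checksum appended should XOR to zero."""
    return xor_checksum(words_with_checksum) == 0
```

For example, appending xor_checksum(words) to words yields a message that passes is_intact, and flipping any single bit of any word makes the check fail.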

[0035] Examples of codes for error correction include Reed-Solomon (RS) codes, which are codes to detect and correct multiple random errors within a memory system. Specialized forms of RS codes, specifically Cauchy-Reed-Solomon (CRS) and Vandermonde-RS, can be used to overcome the unreliable nature of data transmission over erasure channels. Thus, example codes that can be implemented in the EraMC ECC scheme described herein are Cauchy-Reed-Solomon (CRS) codes using XOR operations on Galois Field (GF)(2) instead of GF(2⁸), which serves to reduce computational complexity. As an alternative example, Sparse Check Matrix (SCM) codes can be implemented on GF(2). This SCM scheme can be readily implemented in hardware solutions in the system 90 by employing non-Galois Field operations for both single chipkill and double chipkill, for example.

[0036] Parity data can be used in a Redundant Array of Independent Disks (RAID) memory architecture to perform error detection and correction, provide additional fault-tolerance capability, and achieve further system protection, schematic examples of which are demonstrated in FIGS. 3 and 4. In these examples, the RAID data storage virtualization technology combines multiple, independent disk cartridge components into a logical unit for the purposes of data redundancy and/or performance improvements. For instance, data is distributed across the individual cartridges in one of several ways, referred to as RAID levels, depending on the desired level of redundancy and performance requirements. Examples of RAID architectures can include RAID 0 (striping), RAID 1 and its variants (mirroring), RAID 5 (distributed parity), and RAID 6 (dual parity).

[0037] As an example, RAID 6 consists of block-level striping with double distributed parity. Double parity provides fault tolerance on up to two failed cartridges. This makes RAID 6 more practical for high-availability systems, as large-capacity drives take longer to restore. For instance, with a RAID 6 array, it is possible to mitigate most of the problems associated with lower RAID levels. The larger the drive capacities and the larger the array size, the more important it becomes to choose RAID 6 over lower RAID levels. A RAID 6, for example, can employ parity, which is an error protection scheme to provide fault tolerance in a given set of data. RAID 6 uses two separate parities based respectively on addition and multiplication in a particular Galois Field (GF) or Reed-Solomon (RS) error correction. However, employing RAID memory architecture by itself may add an additional 25% storage overhead or more to a given memory system.

[0038] In comparison, the overhead of RAID systems contributing to overall system overhead is quite significant. For an example RAID 5 setup, as illustrated in FIG. 3, the RAID memory has five cartridges: four arranged as a series of columns, A1-D1 to A4-D4, and a fifth cartridge populated for parity redundancy. The overhead introduced by this setup is therefore approximately 25%. Each column is divided into rows (e.g., A, B, C, and D). Overhead refers to the processing time required by system software, which includes the operating system and any utility that supports application programs. For example, overhead refers to the processing time required by codes for error checking and control of transmissions. In such a case, the combined scheme involving RAID parity with the EraMC ECC code words, in which the ECC redundancies overlap with RAID redundancies, reduces the RAID memory overhead to approximately zero. For example, the EraMC ECC scheme described herein enables ECC to leverage the redundancies of a RAID architecture to provide error correction that requires less power consumption and fewer computing resources. Accordingly, the EraMC ECC scheme achieves both power and computation efficiency for ECC memory, and format efficiency for a RAID memory architecture.

[0039] As a further example, for a double data rate fourth generation (DDR4) x8 dynamic random-access memory (DRAM) configuration, the burst length is eight, and each cache line has 64-bytes. As illustrated in the example of FIG. 4, the EraMC ECC scheme described herein divides the cache line (e.g., the whole codeword) into N number of elements, each to be written into different memory cartridges for a RAID 6 memory architecture with eight cartridges. For each memory channel, the probability that dual in-line memory module (DIMM) failure and random errors occur (e.g., in a chip on a different DIMM) concurrently for that particular read is therefore significantly lowered. Thus, when a DIMM fails, it is equivalent to a situation where a single chipkill condition occurs.

[0040] Another example of a memory architecture implementing the EraMC ECC scheme is illustrated in FIG. 5, expanding the cache line across each of the eight cartridges, shown as rows A1 to A8. Each cartridge has another eight portions arranged as columns. Columns one through six are denoted by 200 to 205, each being 1 byte of data. Column seven is split between 8-bit data 210 and checksum parity (e.g., determined by checksum module 122), shown as 220. Erasure parity (EP) 230 occupies the eighth column of rows A1-A7. A parity of erasure parity (PEP) at 240 occupies the eighth column of row A8.

[0041] FIGS. 6-9 illustrate the EraMC ECC scheme being performed on a single cache line, such as the cache line illustrated in FIG. 5. For example, the parity generation module 120 in communication with the plurality of memory devices 110, as described with reference to FIGS. 1 and 2, can be used to implement the EraMC ECC scheme described in FIGS. 6-9. FIG. 6 illustrates an example checksum routine 212 performed on each row A1 to A8 across columns 200-205 of each cartridge (e.g., a checksum routine selected from multi-checksum product codes performed by checksum module 122). The result of the checksum routine is the checksum parity output 220 for each row. As mentioned, an example checksum routine is the multi-CRC checksum with different polynomials, although other checksum routines can be employed. Moreover, a different checksum routine can be performed on each of the different rows. In the example of a multi-CRC checksum, different CRC checksum polynomials (e.g., routines 123) are applied to different rows to enhance the robustness of the results.

[0042] FIG. 7 illustrates an example of single chipkill operation 232 performed on each column of the memory 110 using a single parity erasure code, the parity of which is generated by the parity routine 125 (e.g., an XOR operation) from the erasure parity module 124. The result is EP for each column (e.g., stored as parity data 140). FIG. 8 provides an example parity routine 234 performed to generate parity of erasure parity (PEP) 240, such as can be computed based on applying a given parity routine to both data in the row EP and the column EP (e.g., XOR'ing the row EP and column EP).
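
As a hedged illustration of paragraph [0042], the Python sketch below computes an XOR erasure parity for every row and every column of a small grid and then a single parity over those EP values. Treating PEP as the parity of the EP values is an assumed reading of the text, and the function name and grid layout are hypothetical. With XOR as the parity routine, folding the row EP values and folding the column EP values give the same result, since both equal the XOR of every data word.

```python
from functools import reduce
from operator import xor


def erasure_parities(grid):
    """grid: list of rows of integers. Returns (row_ep, col_ep, pep)."""
    row_ep = [reduce(xor, row) for row in grid]
    col_ep = [reduce(xor, col) for col in zip(*grid)]
    pep = reduce(xor, col_ep)  # equals reduce(xor, row_ep) for XOR parity
    return row_ep, col_ep, pep
```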

[0043] FIG. 9 illustrates a PEP product 240 as the result of an EraMC ECC operation. Thus, the PEP product 240 can be analyzed by error pattern detection module 126 and comparison module 128 to identify and correct errors (e.g., by use of erasure decode module 129, Chase decoder module 130, and error miss detection module 132).

[0044] RAID memory using RAID levels 5 and 6 introduces significant RAID overhead. RAID 5 and 6 (illustrated in, e.g., FIGS. 3 and 4, respectively) employ erasure codes (e.g., from erasure parity module 124 of FIG. 1) to generate RAID parity, which is the same coding as employed in the scheme described above. Accordingly, by use of the EraMC ECC scheme described herein, the erasure codes (e.g., generated by erasure parity module 124 of FIG. 1) can overlap with RAID parity. As a result, the EraMC ECC scheme fully utilizes the erasure code in ECC with the parity of the RAID memory architecture to limit redundancies and improve both power and processing efficiencies.

[0045] As an example, a single DIMM failure contributes essentially one chipkill in codewords, creating a failure of the fourth cartridge A4-N4, as illustrated in FIG. 10. The chipkill can be recovered, for example, by implementing the checksum routine 212 and erasure parity code 232 in accordance with the EraMC ECC scheme described herein and provided in FIGS. 6 and 7, respectively. Accordingly, a comparison operation can be performed (e.g., by comparison module 128) to evaluate checksum parity and PEP and identify errors in the memory (e.g., by error pattern detection module 126 and error miss detection module 132). Once identified, errors can then be corrected using one or more decoding routines (e.g., erasure decode module 129, Chase decoder module 130 or other correcting coding techniques). Thus, in response to a failure of the fourth cartridge A4, application of the EraMC ECC can leverage the data written into cartridges A1 to A3 and A5 to A8 to recover the data from failed cartridge A4, as illustrated in FIG. 11. For a double chipkill situation, the EraMC ECC scheme described herein can use, for example, a Vandermonde matrix on finite fields as the parity routine 125 to generate parity. Thus, this implementation can be applied to a RAID 6 configuration by using RAID 6 parity-generation modules (e.g., parity generation module 120 of FIGS. 1 and 2) for the proposed erasure codes. Consequently, up to two DIMM failures will be similar or equivalent to double chipkills and can be handled in the same way.
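
For the double chipkill case in paragraph [0045], the following Python sketch shows one common Vandermonde-style construction: two parities P and Q over GF(2^8), as used in typical RAID 6 arithmetic. The field polynomial 0x11D, the generator 2, and all function names are assumptions chosen for illustration; the disclosure does not state which field or matrix is used.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x^2 + 1 (0x11D)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return result


def gf_pow(a: int, n: int) -> int:
    result = 1
    for _ in range(n):
        result = gf_mul(result, a)
    return result


def gf_inv(a: int) -> int:
    # For nonzero a, a^255 == 1, so a^254 is the multiplicative inverse.
    return gf_pow(a, 254)


def encode_pq(data):
    """Return P = XOR of d_i and Q = XOR of 2^i * d_i, bytewise."""
    p, q = bytearray(len(data[0])), bytearray(len(data[0]))
    for i, block in enumerate(data):
        coeff = gf_pow(2, i)
        for k, byte in enumerate(block):
            p[k] ^= byte
            q[k] ^= gf_mul(coeff, byte)
    return bytes(p), bytes(q)


def recover_two(data, x, y, p, q):
    """Rebuild data cartridges x and y; the entries at x and y are ignored."""
    survivors = [bytes(len(p)) if i in (x, y) else b for i, b in enumerate(data)]
    p_part, q_part = encode_pq(survivors)
    gx, gy = gf_pow(2, x), gf_pow(2, y)
    denom_inv = gf_inv(gx ^ gy)
    dx, dy = bytearray(len(p)), bytearray(len(p))
    for k in range(len(p)):
        p_xy = p[k] ^ p_part[k]  # d_x XOR d_y
        q_xy = q[k] ^ q_part[k]  # 2^x * d_x XOR 2^y * d_y
        dx[k] = gf_mul(denom_inv, q_xy ^ gf_mul(gy, p_xy))
        dy[k] = p_xy ^ dx[k]
    return bytes(dx), bytes(dy)
```

The two parity rows (all ones, and successive powers of the generator) form a two-row Vandermonde matrix, which is why any two missing data cartridges lead to a solvable pair of equations.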

[0046] FIG. 12 illustrates an example method 250 for an ECC decoding method. The method 250 can be performed on the system 90 described in FIG. 1. At step 252, checksum parity for each row of data in a memory device (e.g., memory 110) is generated by performing a checksum operation by, for example, the checksum module 122 of FIG. 1. At step 254, erasure parity for each column of data in the memory device is generated by performing a parity operation by, for example, the erasure parity module 124 of FIG. 1 to evaluate parity of EP (PEP) with EP. At step 256, the checksum parity and the erasure parity are evaluated by, for example, the error pattern detection module 126 of FIG. 1. Thus, the number of column errors and row errors are determined and analyzed. At step 258, a determination is made whether the memory device contains an error. If no errors are detected, then the memory is considered error free. Errors that are detected may be corrected as disclosed herein.
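
The steps of method 250 can be sketched end to end in Python. This sketch swaps in zlib.crc32 from the Python standard library as the row checksum and a bytewise XOR as the column parity; both choices, along with the function names, are assumptions for illustration rather than the routines of the disclosure.

```python
import zlib
from functools import reduce


def column_parity(rows):
    """XOR equal-length rows together, one parity byte per column."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), rows)


def encode(rows):
    """Steps 252 and 254: per-row checksum parity and per-column erasure parity."""
    return [zlib.crc32(r) for r in rows], column_parity(rows)


def evaluate(rows, stored_checksums, stored_parity):
    """Steps 256 and 258: count columns and rows whose parity no longer matches."""
    row_errors = sum(zlib.crc32(r) != c for r, c in zip(rows, stored_checksums))
    syndrome = bytes(a ^ b for a, b in zip(column_parity(rows), stored_parity))
    column_errors = sum(1 for s in syndrome if s)
    return column_errors, row_errors
```

A result of (0, 0) corresponds to the error-free outcome of step 258; any other pair of counts would be handed to whichever decoding routine the comparison selects.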

[0047] FIG. 13 provides a table to compare the capabilities of several error correction techniques with the EraMC ECC described herein. The first column lists error conditions common to memory cartridges. The second column indicates the capabilities of Single Parity error correction, followed by Standard ECC, RAID Memory, and the final column listing the error correction capabilities of the EraMC ECC. Each of Single Parity, Standard ECC, and RAID Memory is subject to unreliable results. An unreliable result indicates that the error cannot be detected or corrected. A detection, on the other hand, indicates that the error exists but the location in the memory structure remains unknown (e.g., a string error). A correct response indicates that the nature and location of the error have both been identified. In comparison with the other error correction techniques, the EraMC ECC is a more robust error correction scheme, providing far greater capability to correct and/or detect an error. In the particular example of a multi-random single bit (Sb) error condition, application of the EraMC ECC on a memory device allows for a combination of checksum parity, erasure parity, error detection and miss detection, and error correction that is not found in other error correction techniques. As shown in FIG. 13, no technique other than EraMC ECC is capable of achieving error detection and/or correction in a reliable manner.

[0048] FIG. 14 is a block diagram of an example system 300 to perform an EraMC ECC error correction scheme. System 300 may be similar to system 90 of FIGS. 1 and 2, for example. In the embodiment of FIG. 14, system 300 includes a processor 304 and a non-transitory computer readable medium 306. Although the following descriptions refer to a single processor and a single machine-readable storage medium, the descriptions may also apply to a system with multiple processors and multiple machine-readable storage mediums. In such examples, the instructions may be distributed across (e.g., stored on) multiple machine-readable storage mediums and distributed across (e.g., executed by) multiple processors.

[0049] Processor 304 may be one or more central processing units (CPUs), microprocessors, and/or other hardware devices suitable for retrieval and execution of instructions stored in non-transitory computer readable medium 306. In the particular example shown in FIG. 14, processor 304 may fetch, decode, and execute instructions 308, 310, 312, 314, 316, 318, 320, 322 to perform EraMC ECC error correction. As an alternative or in addition to retrieving and executing instructions, processor 304 may include one or more electronic circuits comprising a number of electronic components for performing the functionality of one or more of the instructions in non-transitory computer readable medium 306. With respect to the executable instruction representations (e.g., boxes) described and shown herein, it should be understood that part or all of the executable instructions and/or electronic circuits included within one box may, in alternate embodiments, be included in a different box shown in the figures or in a different box not shown.

[0050] Non-transitory computer readable medium 306 may be any electronic, magnetic, optical, or other physical storage device that stores executable instructions. Thus, non-transitory computer readable medium 306 may be, for example, Random Access Memory (RAM), an Electrically-Erasable Programmable Read-Only Memory (EEPROM), a storage drive, an optical disc, and the like. Non-transitory computer readable medium 306 may be disposed within system 300, as shown in FIG. 14. In this situation, the executable instructions may be "installed" on the system 300. Alternatively, non-transitory computer readable medium 306 may be a portable, external or remote storage medium, for example, that allows system 300 to download the instructions from the portable/external/remote storage medium. In this situation, the executable instructions may be part of an "installation package". As described herein, non-transitory computer readable medium 306 may be encoded with executable instructions for performing EraMC ECC error correction.

[0051] Referring to FIG. 14, checksum parity generation instructions 308, when executed by a processor (e.g., processor 304), may cause system 300 to generate checksum parity for a memory 302. Memory 302 may be similar to memory 110 of FIGS. 1 and 2, for example. Erasure parity generation instructions 310, when executed by a processor (e.g., processor 304), may cause system 300 to generate erasure parity for memory 302. Based on results from instructions 308 and 310, memory device error determination instructions 312, when executed by a processor (e.g., processor 304), may cause system 300 to determine whether memory 302 contains an error. This determination can be the result of row error determination instructions 314 and column error determination instructions 316, which determine the number of row errors and the number of column errors within memory 302, respectively.

[0052] Based on a determination from instructions 312 that the determined number of row errors is less than the determined number of column errors, erasure decoding instructions 318, when executed by a processor (e.g., processor 304), may cause system 300 to perform erasure decoding for memory 302. Alternatively, a determination from instructions 312 that the number of row errors and the number of column errors are equal may cause system 300 to select other decoding instructions 320 (e.g., Chase decoding instructions), which, when executed by a processor (e.g., processor 304), perform Chase decoding for memory 302. Additionally, a determination from instructions 312 that no row errors exist may cause system 300 to execute miss detection identification instructions 322, which, when executed by a processor (e.g., processor 304), identify a miss detection in memory 302.
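The selection among instructions 318, 320, and 322 can be sketched as a small dispatch function. This is a hedged illustration only: the names `Action` and `select_action` are hypothetical, the zero-row-error case is tested first as an assumed ordering (it would otherwise also satisfy the "fewer row errors than column errors" condition), and the case of more row errors than column errors is marked unspecified because this passage does not address it.

```python
from enum import Enum


class Action(Enum):
    NONE = "no error detected"
    MISS_DETECTION = "identify miss detection (instructions 322)"
    ERASURE = "erasure decoding (instructions 318)"
    CHASE = "Chase decoding (instructions 320)"
    UNSPECIFIED = "not covered by this passage"


def select_action(row_errors: int, column_errors: int) -> Action:
    """Pick a decoding path from the row and column error counts."""
    if row_errors == 0 and column_errors == 0:
        return Action.NONE
    if row_errors == 0:
        # Column parity flags an error that no row checksum caught.
        return Action.MISS_DETECTION
    if row_errors < column_errors:
        return Action.ERASURE
    if row_errors == column_errors:
        return Action.CHASE
    # More row errors than column errors: assumption, left open here.
    return Action.UNSPECIFIED
```

For example, `select_action(1, 3)` selects erasure decoding and `select_action(2, 2)` selects Chase decoding, mirroring the two comparisons described above.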

[0053] What have been described above are examples. It is, of course, not possible to describe every conceivable combination of components or methods, but one of ordinary skill in the art will recognize that many further combinations and permutations are possible. Accordingly, the invention is intended to embrace all such alterations, modifications, and variations that fall within the scope of this application, including the appended claims. Additionally, where the disclosure or claims recite "a," "an," "a first," or "another" element, or the equivalent thereof, it should be interpreted to include one or more than one such element, neither requiring nor excluding two or more such elements. As used herein, the term "includes" means includes but not limited to, and the term "including" means including but not limited to. The term "based on" means based at least in part on.

Claims

What is claimed is:

1. A system comprising:

a plurality of memory devices to store data for the system;

a parity generation module comprising:

a checksum module to perform a checksum operation on each row of data in each of the plurality of memory devices to provide respective checksum outputs; and

an erasure parity module to perform an erasure parity operation on each column of data in each of the plurality of memory devices to provide respective erasure parity outputs,

wherein the parity generation module is to evaluate the checksum outputs and the erasure parity outputs for each of the plurality of memory devices to provide parity data to indicate whether a given memory device of the plurality of memory devices contains an error.

2. The system of claim 1, the parity generation module further comprising an error pattern detection module to determine the number of row errors based on the checksum operation, and to determine the number of column errors based on the erasure parity operation.

3. The system of claim 2, the parity generation module further comprising a comparison module to determine the number of row errors relative to the number of column errors.

4. The system of claim 3, wherein, if the comparison module determines the number of column errors is greater than the number of row errors, the parity generation module further comprises an erasure decode module to perform erasure decoding on each column of data in each of the plurality of memory devices to correct the error.

5. The system of claim 3, wherein, if the comparison module determines the number of column errors is equal to the number of row errors, the parity generation module further comprises a Chase routine module to perform a Chase routine on each column of data and each row of data in each of the plurality of memory devices to correct the error.

6. The system of claim 3, wherein, if the determined number of row errors is zero, the parity generation module further comprises an erasure decode module to perform erasure decoding on each row of each of the plurality of memory devices in an iterative manner.

7. The system of claim 6, wherein the checksum module is to perform a second checksum operation on each row of data in each of the plurality of memory devices after the erasure decoding to provide respective second checksum outputs, the parity generation module further comprising a miss detection module to identify one or more rows of each of the plurality of memory devices that contain a miss detection based on a comparison of the second checksum outputs to the corresponding checksum outputs for each respective row.

8. The system of claim 7, wherein, in response to the miss detection module determining a miss detection for the one or more rows, the erasure decode module is to perform the erasure decoding on each of the identified one or more rows.

9. A non-transitory computer readable medium comprising instructions to perform error correction in a plurality of memory devices, the instructions executable by a processor of a system to:

generate checksum parity for each row of data in each of the plurality of memory devices by performing a checksum operation;

generate erasure parity for each column of data in each of the plurality of memory devices by performing an erasure parity operation; and

determine whether a memory device of the plurality of memory devices contains an error based on an evaluation of the checksum parity and the erasure parity.

10. The non-transitory computer readable medium of claim 9, wherein the determining further comprises:

determining the number of row errors based on the checksum operation; and

determining the number of column errors based on the erasure parity operation.

11. The non-transitory computer readable medium of claim 10, wherein the instructions further comprise:

performing an erasure decoding on each row of data in each of the plurality of memory devices to correct the determined error if the determined number of column errors is greater than the determined number of row errors.

12. The non-transitory computer readable medium of claim 10, wherein the instructions further comprise:

performing a Chase routine on each column of data and row of data in each of the plurality of memory devices to correct the determined error if the determined number of column errors is equal to the determined number of row errors.

13. The non-transitory computer readable medium of claim 10, wherein if the determined number of row errors is zero, the instructions further comprise:

perform erasure decoding on each row of each of the plurality of memory devices in an iterative manner;

generate a second checksum parity for each row of data in each of the plurality of memory devices by performing a second checksum operation;

compare the second checksum parity to the checksum parity; and

identify one or more rows of each of the plurality of memory devices that contain a miss detection based on the comparison.

14. The non-transitory computer readable medium of claim 13, wherein the instructions further comprise performing erasure decoding on the identified one or more rows to correct the error.

15. A method comprising:

generating checksum parity for each row of data in each of a plurality of memory devices by performing a checksum operation;

generating erasure parity for each column of data in each of the plurality of memory devices by performing a parity operation;

evaluating the checksum parity relative to the erasure parity; and

determining whether a memory device of the plurality of memory devices contains an error based on the evaluation.