US7895502B2  Error control coding methods for memories with subline accesses  Google Patents
 Publication number
 US7895502B2
 Authority
 US
 Grant status
 Grant
 Patent type
 Prior art keywords
 code
 data
 memory
 error
 level
 Prior art date
 Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
 Expired  Fee Related, expires
Classifications

 G—PHYSICS
 G06—COMPUTING; CALCULATING; COUNTING
 G06F—ELECTRICAL DIGITAL DATA PROCESSING
 G06F11/00—Error detection; Error correction; Monitoring
 G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
 G06F11/08—Error detection or correction by redundancy in data representation, e.g. by using checking codes
 G06F11/10—Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
 G06F11/1008—Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices
 G06F11/1012—Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices using codes or arrangements adapted for a specific type of error
Abstract
Description
1. Field of the Invention
The present application generally relates to error control coding methods for computer memory systems and, more particularly, to accessing error-control coded data in pieces smaller than one line while maintaining a high level of reliability. Such a mode of operation will be called subline access; the name derives from computer memory systems, where a cache line has a fixed size of, say, 128 bytes for IBM systems. One motivation for such a mode of data access is increased efficiency, e.g., allowing more concurrent accesses, reducing contention and latency, and conserving power, although depending on the application there may be many other reasons to consider such an option.
2. Background Description
Codes protect data against errors.
When we read a portion of a codeword, the idea is of course that small pieces of data shall still be protected with certain error detection/correction capabilities. A trivial “solution” to the problem is to have each subline by itself have enough error correction capability as required for the whole codeword in the worst case, as shown, for example, in
In
A standard [19,16] shortened Reed-Solomon code constructed on GF(256) may be used as follows. Assign one (byte) symbol to each individual DRAM transfer over its eight I/O channels. The code is applied across the nineteen DRAMs in parallel, and independently on each of the eight transfers. An illustration of this coding scheme can be found in
Now suppose that one desires to access information from the DIMM in a granularity of 64 B (sixty-four bytes), instead of 128 B. Since at the present time the most common cache line in PowerPC® microprocessors is 128 B, one may say that one desires to make 64 B subline accesses. Reference is made to
Since the number of DRAM devices that are devoted to redundancy is odd (three in this example), we cannot distribute them evenly among the two groups. In this example, Group 1 retains one redundant DRAM, whereas Group 2 is allocated two redundant DRAMs. Now, let us analyze the level of reliability that one obtains by using shortened Reed-Solomon codes applied independently on each of the groups. For Group 2, one may employ a [10,8] shortened Reed-Solomon code for each of the transfers of the group of DRAMs (as described above for the first setting discussed). This code has minimum distance 3, which enables the system to correct any single chip error that may arise in that group. For Group 1, on the other hand, we can only use a [9,8] shortened Reed-Solomon code. It is well known in the art that this code, which has minimum distance 2, can only detect single symbol errors, and therefore the reliability characteristics of the DRAM devices of Group 1 are such that a single chip error may be detected but not corrected. It is worth noting that using a [18,16] code on the Group 1 transfers by taking two DRAM transfers instead of one does not result in the desired effect of correcting a single chip error, because a failing chip may corrupt both of its transfers and thus produce up to two symbol errors, whereas the [18,16] code can only correct up to one error, that is, if 100% error correction is desired. Longer codes applied over larger fractions of a DRAM burst have similar inconveniences.
The above illustrates that accessing smaller amounts of data in a memory in some instances results in a loss of available reliability. In the case of 128 B granularity of access, there is single chip error correction and double error detection, whereas in the case of 64 B granularity of access, a simple application of independent codes results in one of the groups not being able to correct all single chip errors. This is not an artificial result of having selected an odd number for the total number of redundant chips. If one had chosen four chips total, then it is easy to see that the system with 128 B access granularity would be able to do double chip error corrections, whereas 64 B access granularity (with two redundant chips on each group) would only be able to do single chip error correction.
The phenomenon described above is further exacerbated as the desired unit of access becomes smaller. Taking again the example in which a total of four additional redundant chips are given, if the desired unit of access is 32 B, then only one chip is allocated for every 32 B group, and only single chip error detection is attained.
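The reliability figures quoted above follow from the fact that a (shortened) Reed-Solomon code is maximum distance separable, so an [n,k] code has minimum distance n−k+1. The following Python sketch tabulates this for the configurations discussed; the function names and the rule for assigning the odd redundant chip to the last group are illustrative assumptions, not part of the original text.

```python
def mds_capability(n, k):
    """Return (d, t, e) for an MDS [n, k] code.

    d: minimum distance (Singleton bound met with equality),
    t: symbol errors guaranteed correctable, floor((d - 1) / 2),
    e: symbol errors guaranteed detectable when no correction is attempted.
    """
    d = n - k + 1
    return d, (d - 1) // 2, d - 1


def split_redundancy(total_data, total_redundant, groups):
    """Split data and redundant chips over `groups` sublines.

    Assumption for illustration: leftover redundant chips go to the last groups.
    Returns a list of (n, k) pairs, one per group.
    """
    k = total_data // groups
    base, extra = divmod(total_redundant, groups)
    reds = [base + (1 if g >= groups - extra else 0) for g in range(groups)]
    return [(k + r, k) for r in reds]


if __name__ == "__main__":
    print("128 B line, [19,16]:", mds_capability(19, 16))
    for n, k in split_redundancy(16, 3, 2):      # 64 B sublines, three redundant chips
        print(f"64 B subline, [{n},{k}]:", mds_capability(n, k))
    for n, k in split_redundancy(16, 4, 4):      # 32 B sublines, four redundant chips
        print(f"32 B subline, [{n},{k}]:", mds_capability(n, k))
```

Under these assumptions the [9,8] and [5,4] groups report zero correctable symbols, which is the loss of reliability described in the text.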
As a result of the discussion above, it is often the case that one chooses to access information in sufficiently large lines so that reliability is not an issue, which in turn is associated with a number of drawbacks. For example, in memories where concurrent requests can be serviced, it may be that fewer such requests can in principle be serviced due to the fact that the larger line results in more resources from the memory being in a busy state. Other drawbacks include increased power consumption, due to the activation of a larger number of resources in the memory, and/or an increased usage of the communication channels that connect the memory with the system that uses it. A recent trend in adding more processing cores in a processor chip strains the buses that connect the processor chip with its memory subsystem and in some instances the result is a trend to design memories with smaller access granularities, with the reliability drawbacks noted above.
The description of the issues above serves as a motivation for this invention, in which we disclose a memory augmented with special error control codes and read/write algorithms to improve upon the problem exposed. In order to maximize the scope of our invention, we also disclose novel error control methods that in some instances result in improved redundancy/reliability tradeoffs. We include a detailed description of the optimality properties that one in general may desire from codes for this application. We phrase our invention using the terminology “line/subline”, where subline is the desired (smaller) common access granularity and line is the access granularity that is used during an error correction stage. The general aspect of the error control coding techniques that we use is that a two level coding structure is applied with a first level for the sublines permitting reliable subline accesses correcting and detecting possible errors up to a prescribed threshold, and then a second level permitting further correction of errors found. We note that in the future what we are currently calling a subline may be referred to as a line in microprocessors and what we call a line will necessitate a different terminology; for example “block of lines”.
It is noted that in the related field of hard drive storage technology a number of inventions have been made that employ error control. The known inventions are listed and discussed below.
In U.S. Pat. No. 4,525,838 for “Multibyte Error Correcting System Involving A Two-Level Code Structure” by Arvind M. Patel and assigned to IBM, a method is disclosed whereby small data chunks are protected with a first level of code and then multiple such small data chunks are protected using a shared, second level of code. The motivation cited for the invention is that conventional coding techniques impose restrictions on the blocklength of the code that stem from algebraic considerations of their construction. For example, when the Galois Field that is used to construct the code has cardinality q, it is known that Reed-Solomon codes have maximum blocklength q−1, and doubly extended Reed-Solomon codes only increase this blocklength by 2. In typical applications q=256, which in the storage application of Patel would in some instances lead to undesirable restrictions.
In U.S. Pat. No. 5,946,328 for “Method and Means for Efficient Error Detection and Correction in Long Byte Strings Using Integrated Interleaved Reed-Solomon Codewords” by Cox et al. and assigned to IBM, a method is disclosed whereby a block composed of a plurality of interleaved codewords is such that one of the codewords is constructed through a certain logical sum of the other codewords. The procedure indicated is claimed to further enhance the reliability of the stored data above the reliability levels attained by the patent of Patel, U.S. Pat. No. 4,525,838. We note that the error detection/correction procedure is applied to blocks of stored data. This is because the main motivation for this invention is not to provide individual access to codewords of the block but rather to provide for an integrated interleaving of the block that is more efficient than that provided by Patel.
In U.S. Pat. No. 6,275,965 for “Method and Apparatus for Efficient Error Detection and Correction in Long Byte Strings Using Generalized, Integrated Interleaving Reed-Solomon Codewords” by Cox et al. and assigned to IBM, the earlier U.S. Pat. No. 5,946,328 is further augmented with the capability of multiple codewords within a block benefiting from the shared redundancy when their own redundancy is insufficient to correct errors.
In U.S. Pat. No. 6,903,887 for “Multiple Level (ML), Integrated Sector Format (ISF), Error Correction Code (ECC) Encoding and Decoding Processes for Data Storage or Communication Devices and Systems” by Asano et al. and assigned to IBM, the idea of an integrated interleave in a sector is further extended with multiple levels of code to cover integrated sectors. We note that a change in terminology has come into effect in this patent whereby what was previously called a block in earlier patents is now identified with a sector together with its redundant check bytes, and a group of sectors is now called a block. Using the new terminology, a notable aspect of the invention in discussion is that the basic unit of access of this storage memory is a sector (typically 512 bytes), and not the block of sectors to which the shared redundancy is applied, which differs from the previously cited inventions. This feature creates an issue with writing individual sectors to the storage device, the main cited problem being that such individual sector write operations need to be preceded by a read operation that reads the other sectors participating in the overall block, followed by an encoding and writing of the entire block back to the storage. This is referred to as the “Read-Modify-Write” (RMW) problem and is highlighted as an undesirable problem that can potentially reduce the performance of hard disks. The Asano et al. patent addresses this problem through its multiple levels of coding whereby in some instances protection by higher levels is disabled but a certain level of reliability is maintained by the lower levels of coding (for example, by coding within the sector as discussed by earlier patents). Another aspect of the Asano et al. patent is that redundant check bytes computed for a block are computed using only certain summations of check bytes at the sector level (as opposed to the actual data contents of sectors), which is cited as a key property that enables high drive performance by avoiding the need to have the entire sector data present during the check computations.
As we shall see, our invention's preferred embodiment is concerned with memories that are used as the first main level of storage in a computer system, although the techniques are also applicable to microprocessor caches and other settings. As such, distinct considerations are of the essence. In one aspect of this invention beyond those already stated, our coding techniques enable a memory with the capacity of executing an efficient Read-Modify-Write operation. In another aspect of this invention, novel error control coding techniques are disclosed that have the desirable property that the minimum distance of the second level code can exceed twice the minimum distance of the first level code yet pay the smallest possible theoretical cost in terms of allocated redundant resources (the minimum distance of a code is a technical term that is often used to describe an important aspect of the error correction capacity of a code). In a third aspect of this invention, subline accesses are employed during common system operation but line accesses are employed during a process that is commonly known as “memory scrubbing”, whereby a background system process periodically scans the memory to read and write back the contents, thereby preventing the accumulation of errors in the memory.
According to the present invention, we provide a two-level error control protocol, in which errors are detected on the subline level through private redundancy, and corrected using the overall redundancy for the entire line. For example, a system may normally read small pieces of coded data and only check them for errors before accepting them; in case errors are detected, the whole codeword will be read for error correction. This is well suited to memory systems, for example, where errors are extremely rare and persistent errors are logged and usually serviced in a limited amount of time. In another similar example, the system may, upon detection of errors in a small piece of data, enter a more “cautious” mode with longer latency, in which data is decoded only after a full codeword is received; and after detection of no errors for a certain amount of time, revert back to the more “aggressive” state to read data in subline mode. This would make sense for a communication link with distinctive “good” and “bad” states and strong temporal correlation.
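The “cautious”/“aggressive” policy described in the second example can be pictured as a small state machine. The Python sketch below is only one possible reading: the class name `AccessModeController`, the parameter `quiet_period`, and the use of a count of clean reads as a stand-in for “a certain amount of time” are all assumptions made for illustration.

```python
class AccessModeController:
    """Chooses between subline ("aggressive") and full-line ("cautious") reads."""

    def __init__(self, quiet_period):
        # Hypothetical parameter: number of consecutive error-free reads
        # required before reverting to subline mode.
        self.quiet_period = quiet_period
        self.cautious = False
        self.clean_reads = 0

    def next_mode(self):
        """Access granularity to use for the next read."""
        return "line" if self.cautious else "subline"

    def report(self, error_detected):
        """Update the state with the outcome of the read that just completed."""
        if error_detected:
            self.cautious = True      # enter (or stay in) the cautious state
            self.clean_reads = 0
        elif self.cautious:
            self.clean_reads += 1
            if self.clean_reads >= self.quiet_period:
                self.cautious = False  # revert to the aggressive state
                self.clean_reads = 0


if __name__ == "__main__":
    ctl = AccessModeController(quiet_period=2)
    for outcome in (False, True, False, False, False):
        print(ctl.next_mode(), "-> error detected:", outcome)
        ctl.report(outcome)
```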
One of our primary concerns is that of achieving the best tradeoff between “local” and “global” error control capabilities and the overall overhead (redundancy) of the code. For simplicity, we focus on random errors and guaranteed error correction, and therefore focus on minimum (Hamming) distances and algebraic constructions.
The foregoing and other objects, aspects and advantages will be better understood from the following detailed description of a preferred embodiment of the invention with reference to the drawings, in which:
In order to describe our preferred embodiment, we will rely on the discussion in the background section above. We will maintain our assumption that our memory system is built using discrete DRAM devices, packaged as DIMMs, where each DRAM is a ×8, burst 8 device. As previously stated, for the design shown in
Instead of using two independent codes, one for each of Group 1 and Group 2, we will employ an error correcting code whose parity check matrix can be written as H = [H_{1,d} H_{1,r} H_{2,d} H_{2,r}], as illustrated in
The reader conversant with the theory of error control codes will recognize that the matrix above is presented in nonsystematic form, in that it is not immediately evident from the structure of the matrix how to select the redundant symbols given the information that one wishes to store in the memory so that the standard parity check equation Hc = 0 is satisfied. In order to facilitate our discussion, the codeword c will sometimes be written as (c_{1,1} c_{1,2} c_{1,3} ... c_{1,9} c_{2,1} c_{2,2} c_{2,3} ... c_{2,9} c_{2,10}), where the notation makes explicit whether a symbol belongs to Group 1 or Group 2. Additionally, we will assume that c_{1,9} is the redundant symbol for Group 1 and (c_{2,9}, c_{2,10}) are the redundant symbols for Group 2.
It is obvious from the equation Hc = 0 that the redundant symbol for Group 1 can be immediately computed from the associated information symbols using
We shall say that this redundant symbol is private, in light of the fact that its value is completely determined by the group's user data symbols. After this computation, it is possible to compute c_{2,9} and c_{2,10} as follows. Let b denote all but the last two elements of c (i.e., we are excluding c_{2,9} and c_{2,10}). The value of b is completely known at this point, since we know both the information symbols for the two groups as well as the redundant symbol for the first group. One can easily see that
The above represents three linear equations, but the first one is the trivial 0=0. It can be checked that the other two equations have a unique solution, which can be easily obtained using well known techniques. Since the values for these two redundant symbols in principle depend on every user data symbol in both groups, we shall call these shared redundant symbols. This completes the discussion on how to compute the three redundant symbols given the pair of 8 B data chunks.
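The encoding steps just described can be sketched in Python. The concrete parity check matrix is not reproduced in this text, so the sketch assumes a stand-in H over GF(256) with three rows: ones on the nine Group 1 columns, ones on the ten Group 2 columns, and the powers α^0 ... α^18 of α = 0x02 (field polynomial 0x11D). This H is hypothetical; it was chosen only because it yields minor codes of distance 2 and an overall code of distance 3, as the text requires.

```python
def gf_mul(a, b):
    """Multiply in GF(256) modulo x^8 + x^4 + x^3 + x^2 + 1 (0x11D)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return r


def gf_inv(a):
    """Inverse of a nonzero element: a^254, because a^255 = 1."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r


def xor_all(symbols):
    """Addition in GF(256) is bitwise XOR."""
    r = 0
    for s in symbols:
        r ^= s
    return r


# ALPHA[j] = alpha^j: column j of the third row of the assumed (hypothetical) H.
ALPHA = [1]
for _ in range(18):
    ALPHA.append(gf_mul(ALPHA[-1], 2))


def encode(group1_data, group2_data):
    """Return the 19-symbol codeword (c_{1,1..9}, c_{2,1..10})."""
    assert len(group1_data) == 8 and len(group2_data) == 8
    c19 = xor_all(group1_data)                          # private symbol of Group 1
    b = list(group1_data) + [c19] + list(group2_data)   # all but the last two symbols
    s1 = xor_all(group2_data)                           # row 2 applied to b
    s2 = xor_all(gf_mul(ALPHA[j], b[j]) for j in range(17))  # row 3 applied to b
    # Solve x + y = s1 and alpha^17 x + alpha^18 y = s2 for the shared symbols.
    y = gf_mul(s2 ^ gf_mul(ALPHA[17], s1), gf_inv(ALPHA[17] ^ ALPHA[18]))
    x = s1 ^ y
    return b + [x, y]


def syndrome(c):
    """H c for the assumed H; (0, 0, 0) for a valid codeword."""
    return (xor_all(c[:9]), xor_all(c[9:]),
            xor_all(gf_mul(ALPHA[j], c[j]) for j in range(19)))


if __name__ == "__main__":
    c = encode([1, 2, 3, 4, 5, 6, 7, 8], [10, 11, 12, 13, 14, 15, 16, 17])
    print(c, syndrome(c))
```

With this stand-in matrix the Group 1 row is zero on the two shared columns, which is why one of the three equations for (c_{2,9}, c_{2,10}) reduces to 0=0.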
A code is a set of symbol vectors all of the same length. Symbol vectors that belong to a code are called codewords. When one reads all of the symbols coming from one group (but not anything else), the resulting vector is said to be a codeword of a minor code (assuming that there were no errors in the data and redundancy retrieved). Different groups may be associated with different minor codes. In fact, in this example, both minor codes are distinct as their blocklengths clearly differ (one of them has blocklength 9, the other one has blocklength 10). For brevity, one shall refer to the ensemble of minor codes associated with all of the sublines in one line as the “first level code”. Similarly, if one reads all of the symbols of all of the groups the resulting vector is a valid codeword of the “second level code”. This code is obviously defined by the entire parity check matrix H. We will also say that the first level code provides local protection (to each of the sublines) whereas the second level code provides global protection to the entire line.
It can be shown that either minor code has minimum distance 2, and that the overall code has minimum distance 3. Accordingly, when applying this code to the memory storage problem that we are discussing, we are able to do single chip error detection when either of the two groups is retrieved (but not the other), and single chip error correction when both groups are retrieved.
The above findings motivate the following two-level protocol for a subline read operation:
a) read only the information of the corresponding group,
b) check for the presence of errors in the group by using the first level code, and
c) if an error is found, read the rest of the codeword and attempt correction of the error using the second level code.
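Steps a) through c) can be sketched as follows in Python. As the actual parity check matrix is not reproduced in this text, the sketch assumes a hypothetical H over GF(256) (rows: ones on Group 1, ones on Group 2, and powers of α = 0x02 with field polynomial 0x11D); the function name `read_subline` and the list-based memory model are likewise illustrative.

```python
def gf_mul(a, b):
    """Multiply in GF(256) modulo x^8 + x^4 + x^3 + x^2 + 1 (0x11D)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return r


def gf_inv(a):
    """Inverse of a nonzero element: a^254."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r


def xor_all(symbols):
    r = 0
    for s in symbols:
        r ^= s
    return r


ALPHA = [1]                       # ALPHA[j] = alpha^j, third row of the assumed H
for _ in range(18):
    ALPHA.append(gf_mul(ALPHA[-1], 2))

GROUPS = {1: range(0, 9), 2: range(9, 19)}   # symbol positions of each subline


def read_subline(line, group):
    """Return (user_data, used_line_access) for one group of a stored 19-symbol line."""
    idx = GROUPS[group]
    sub = [line[j] for j in idx]                  # a) read only this group
    if xor_all(sub) == 0:                         # b) first level check (group parity)
        return sub[:8], False
    word = list(line)                             # c) read the rest of the codeword
    s0, s1 = xor_all(word[:9]), xor_all(word[9:])
    s2 = xor_all(gf_mul(ALPHA[j], word[j]) for j in range(19))
    if (s0 != 0) == (s1 != 0):
        raise ValueError("uncorrectable: more than one symbol in error")
    e = s0 or s1                                  # value of the single symbol error
    loc = gf_mul(s2, gf_inv(e))                   # equals alpha^j for error position j
    if loc not in ALPHA:
        raise ValueError("uncorrectable: no consistent error position")
    j = ALPHA.index(loc)
    if (j < 9) != (s0 != 0):
        raise ValueError("uncorrectable: position and group disagree")
    word[j] ^= e                                  # second level correction
    return [word[i] for i in idx][:8], True


if __name__ == "__main__":
    line = [0x3A, 0x3A] + [0] * 7 + [1, 1] + [0] * 8   # a valid codeword of the assumed H
    line[4] ^= 0x77                                     # single chip error in Group 1
    print(read_subline(line, 1))   # escalates to a line access and corrects
    print(read_subline(line, 2))   # Group 2 is still served as a subline
```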
Errors in memories can be categorized according to their frequency of incidence. Errors which are the result of permanent physical damage to some part of memory (for example, an I/O signal driver) will occur with such a high frequency that a system can recognize them and take appropriate actions to remove them from the system. One technique for doing that is to copy the affected data (which is being corrected by error control codes) to a spare DRAM chip. Another technique is simply to replace the failing DIMM through external intervention.
As a result of the above, during normal system operation errors are very rare. As a consequence of this, the two-level protocol that we have disclosed in this invention results in the vast majority of reads being successfully serviced using subline operations, which is one of the stated goals for this invention. An illustration of the fact that this memory can serve two concurrent read operations at different addresses can be found in
Memory systems are sometimes enhanced with a technique called “memory scrubbing”, where periodically memory is scanned, reading the memory contents and correcting (and writing back the corrected data) if errors are found in the memory. This is done to prevent the accumulation of errors in the system. An aspect of this invention is that scrubbing is not done using subline operations, but rather line operations. In particular, whole lines are interpreted with respect to the redundancy of the second level code to test for errors, which is advantageous in instances in which the error correction strength of the second level error control code is much superior to the available error detection capacity of the first level code. For example, a double error may exist in a subline which the first level code cannot detect, but which can be detected and corrected using the second level redundancy.
Another policy that may be practiced with this memory is to not only do error detection but also attempt some degree of error correction using the first level code (as done in some of the prior art in the hard disk storage technology application). For example, an extended Hamming code may be used as a first level code in order to do single bit error correction using the first level code, and simultaneously, detect double bit errors, but rely on the second level code to correct for double bit errors.
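As a toy illustration of this policy, the Python sketch below uses the (8,4) extended Hamming code as a first level code: it corrects any single bit error locally and flags double bit errors so that they can be handed to the second level code. The code length, bit layout and function names are assumptions chosen for brevity; the text does not specify which extended Hamming code would be used.

```python
def hamming84_encode(nibble):
    """Encode 4 data bits into 8 bits.

    Index 0 holds the overall parity bit; indices 1..7 are the usual
    Hamming(7,4) positions with check bits at 1, 2 and 4.
    """
    code = [0] * 8
    for pos, i in zip((3, 5, 6, 7), range(4)):
        code[pos] = (nibble >> i) & 1
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def hamming84_decode(code):
    """Return (status, nibble).

    status is "ok", "corrected" (single bit error fixed by the first level),
    or "double" (detected only; the caller should fall back to the second level).
    """
    syn = 0
    for pos in range(1, 8):
        if code[pos]:
            syn ^= pos                 # XOR of the positions holding a 1
    odd = sum(code) % 2                # overall parity violated?
    fixed = list(code)
    if syn == 0 and not odd:
        status = "ok"
    elif odd:
        fixed[syn] ^= 1                # syn == 0 means the parity bit itself flipped
        status = "corrected"
    else:
        return "double", None
    nibble = fixed[3] | fixed[5] << 1 | fixed[6] << 2 | fixed[7] << 3
    return status, nibble


if __name__ == "__main__":
    word = hamming84_encode(0b1011)
    word[6] ^= 1
    print(hamming84_decode(word))      # single error handled locally
    word[2] ^= 1
    print(hamming84_decode(word))      # double error: escalate
```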
We now turn our attention to the problem of performing write operations. The essential problem here is that a change to even one symbol in any group can affect redundant symbols in both groups. As a result, write operations are more complex than read operations. Nevertheless, when one takes into account typical ratios of the number of read operations to write operations, overall system operation is still improved by this invention.
We first discuss the problem of executing a read-modify-write (RMW) on the memory as this becomes the basis of understanding the write problem and is also of independent interest. To do this, suppose that we request a RMW to the second group (the one that has two redundant checks). In this case no changes to the first group are needed whatsoever.
The contents of the memory before the RMW request are denoted by (c_{1,1} c_{1,2} ... c_{1,8} c_{1,9} c_{2,1} c_{2,2} ... c_{2,8} c_{2,9} c_{2,10}). The contents of the memory after the RMW request are denoted by (c_{1,1} c_{1,2} ... c_{1,8} c_{1,9} a_{2,1} a_{2,2} ... a_{2,8} a_{2,9} a_{2,10}). Our task is to select (a_{2,9} a_{2,10}) by knowing only (c_{2,1} c_{2,2} ... c_{2,8} c_{2,9} c_{2,10}), which is retrieved during the read operation, and (a_{2,1} a_{2,2} ... a_{2,8}), the new data that will be stored in the memory. The following holds for the old data:
We want the following to hold for the new data:
Note that in both equations the expression in the second parenthesis is identical. From these equations, we can obtain the formula:
This formula represents three equations, but the first one is the trivial 0=0. The other two equations can be easily solved for a_{2,9} and a_{2,10}, which are the new values for the old symbols c_{2,9}, c_{2,10}. We note that the fact that these symbols are shared redundancy does not imply that all of the data symbols that participate in their calculation need to be retrieved in order to update them. This is because some symbols (in this case those of Group 1) do not change their value, and one can use the technique described above to avoid their retrieval.
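The Group 2 update can be sketched in Python as follows. The sketch again assumes a hypothetical parity check matrix over GF(256) (rows: ones on Group 1, ones on Group 2, and powers of α = 0x02 with field polynomial 0x11D), since the matrix itself is not reproduced in this text; `rmw_group2` is an illustrative name. Only the ten Group 2 symbols are read.

```python
def gf_mul(a, b):
    """Multiply in GF(256) modulo x^8 + x^4 + x^3 + x^2 + 1 (0x11D)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return r


def gf_inv(a):
    """Inverse of a nonzero element: a^254."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r


def xor_all(symbols):
    r = 0
    for s in symbols:
        r ^= s
    return r


ALPHA = [1]                       # ALPHA[j] = alpha^j, third row of the assumed H
for _ in range(18):
    ALPHA.append(gf_mul(ALPHA[-1], 2))


def rmw_group2(old_sub, new_data):
    """old_sub = (c_{2,1..10}) as read; new_data = (a_{2,1..8}).

    Returns the new ten-symbol subline (a_{2,1..10}) without touching Group 1.
    """
    delta = [o ^ n for o, n in zip(old_sub[:8], new_data)]
    t1 = xor_all(delta)                                      # Group 2 parity row
    t2 = xor_all(gf_mul(ALPHA[9 + i], delta[i]) for i in range(8))  # alpha-power row
    # Differential update (dx, dy) of the shared symbols:
    #   dx + dy = t1,   alpha^17 dx + alpha^18 dy = t2
    dy = gf_mul(t2 ^ gf_mul(ALPHA[17], t1), gf_inv(ALPHA[17] ^ ALPHA[18]))
    dx = t1 ^ dy
    return list(new_data) + [old_sub[8] ^ dx, old_sub[9] ^ dy]


def syndrome(c):
    """H c for the assumed H; (0, 0, 0) for a valid codeword."""
    return (xor_all(c[:9]), xor_all(c[9:]),
            xor_all(gf_mul(ALPHA[j], c[j]) for j in range(19)))


if __name__ == "__main__":
    line = [0x3A, 0x3A] + [0] * 7 + [1, 1] + [0] * 8    # valid for the assumed H
    new_sub = rmw_group2(line[9:], [9, 8, 7, 6, 5, 4, 3, 2])
    print(new_sub, syndrome(line[:9] + new_sub))
```

In this sketch the pair (dx, dy) plays the role of the differential update: it depends only on the difference between old and new Group 2 data.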
It is interesting to note that it is in principle possible to simply send the “differential update”
to the memory devices and have the memory devices compute internally the new value of the redundant symbols if the memory is structured to be able to do this. Under these conditions, we would not need to physically convey (c_{2,9}, c_{2,10}) to a separate location to compute the new redundant symbol values. While this observation may seem trivial, basic memory technology improvements to support these general types of operations may be important to support error handling schemes like the one described in this invention (we will say more about this when we discuss operations to Group 1).
Another point worth noting is that it is not necessary to transport the old values c_{2,1}, ..., c_{2,8} physically outside of the memory devices if we know beforehand the value of the difference (c_{2,1}−a_{2,1}, ..., c_{2,8}−a_{2,8}), and if the memory is equipped to do differential updates as described above. Sending the difference to the memory and indicating the fact that the update should be differential would have the desired effect.
In some instances it is possible to construct the Galois Field so that the “+” operation is simply a logical XOR operation. Thus, the differential updates alluded to above are particularly easy to accomplish; a memory which allows its information contents to be updated by receiving a pattern of bits and XORing the received pattern of bits with a chosen data content shall be referred to as an XOR update memory.
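A minimal software model of such an XOR update memory is sketched below in Python. The class name, the byte-addressed interface and the method names are assumptions for illustration only; the text does not describe a concrete device interface.

```python
class XorUpdateMemory:
    """Byte-addressable memory that supports plain writes and XOR updates."""

    def __init__(self, size):
        self.cells = bytearray(size)

    def read(self, addr, length):
        return bytes(self.cells[addr:addr + length])

    def write(self, addr, data):
        """Conventional write: the stored bytes are replaced."""
        self.cells[addr:addr + len(data)] = data

    def xor_update(self, addr, pattern):
        """Differential write: the stored bytes are XORed with `pattern` in place,
        so the old contents never have to leave the memory."""
        for i, p in enumerate(pattern):
            self.cells[addr + i] ^= p


if __name__ == "__main__":
    mem = XorUpdateMemory(16)
    mem.write(0, b"\x0f\xf0")
    mem.xor_update(0, b"\xff\x01")     # send only the difference
    print(mem.read(0, 2).hex())
```

Applying the same pattern twice restores the original contents, which is the property that makes XOR a convenient differential update.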
This demonstrates that the RMW operation to Group 2 can be performed using strictly subline operations.
We now discuss how to do a RMW operation to Group 1. In this case, changing any of the eight data symbols from Group 1 requires an update of not only the redundant symbol for Group 1, but also the redundant symbols for Group 2. Using a similar notation as before, the contents of the memory after the RMW request are denoted by (a_{1,1} a_{1,2} ... a_{1,8} a_{1,9} c_{2,1} c_{2,2} ... c_{2,8} a_{2,9} a_{2,10}). In this case, the following equation must be satisfied via proper selection of (a_{1,9}, a_{2,9}, a_{2,10})
There are two ways of accomplishing this. The first way is to simply read c_{2,1}, c_{2,2}, . . . , c_{2,8 }and encode the standard way, since now we would have access to all 16 B of user information that we wish to store in the memory. In the new method according to this invention, we make use of the fact that
The 3×3 matrix premultiplying the first vector can be inverted to solve for the differential updates required for the redundant symbols, using only the difference between the new and old values of the memory contents in Group 1. Thus, in reality it is not necessary to retrieve c_{2,1}, c_{2,2}, ..., c_{2,8} at all. Retrieving c_{2,9} and c_{2,10} from the second group (in addition to retrieving all of the information from the first group) is in fact sufficient (such a retrieval pattern would be allowed by a special DIMM configuration described at the beginning of this section). This technique for updating shared redundant symbols during a write can be generally extended and will be referred to as a differential write in subsequent discussions. The key property is that stored data that is not changing in value need not be read in order to update the shared redundancy. Furthermore, the availability of an XOR update memory device would further facilitate the update of the redundant symbols in Group 2 and/or the actual new data and redundant symbols for Group 1.
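A Python sketch of the Group 1 differential RMW follows. It assumes a hypothetical parity check matrix over GF(256) (rows: ones on Group 1, ones on Group 2, and powers of α = 0x02 with field polynomial 0x11D), because the actual matrix is not reproduced in this text, and it solves the resulting 3×3 system by substitution rather than by an explicit matrix inverse. The name `rmw_group1` is illustrative.

```python
def gf_mul(a, b):
    """Multiply in GF(256) modulo x^8 + x^4 + x^3 + x^2 + 1 (0x11D)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return r


def gf_inv(a):
    """Inverse of a nonzero element: a^254."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r


def xor_all(symbols):
    r = 0
    for s in symbols:
        r ^= s
    return r


ALPHA = [1]                       # ALPHA[j] = alpha^j, third row of the assumed H
for _ in range(18):
    ALPHA.append(gf_mul(ALPHA[-1], 2))


def rmw_group1(old_sub1, old_shared, new_data):
    """old_sub1 = (c_{1,1..9}); old_shared = (c_{2,9}, c_{2,10}); new_data = (a_{1,1..8}).

    Returns (new Group 1 subline, new shared symbols). The Group 2 user data
    c_{2,1..8} is never read.
    """
    delta = [o ^ n for o, n in zip(old_sub1[:8], new_data)]
    d19 = xor_all(delta)                        # row 1: update of the private symbol
    t2 = xor_all(gf_mul(ALPHA[i], delta[i]) for i in range(8)) ^ gf_mul(ALPHA[8], d19)
    # Row 2 gives dx + dy = 0 (Group 2 data unchanged), so dx = dy.
    # Row 3 then gives (alpha^17 + alpha^18) dy = t2.
    dy = gf_mul(t2, gf_inv(ALPHA[17] ^ ALPHA[18]))
    dx = dy
    new_sub1 = list(new_data) + [old_sub1[8] ^ d19]
    return new_sub1, [old_shared[0] ^ dx, old_shared[1] ^ dy]


def syndrome(c):
    """H c for the assumed H; (0, 0, 0) for a valid codeword."""
    return (xor_all(c[:9]), xor_all(c[9:]),
            xor_all(gf_mul(ALPHA[j], c[j]) for j in range(19)))


if __name__ == "__main__":
    line = [0x3A, 0x3A] + [0] * 7 + [1, 1] + [0] * 8    # valid for the assumed H
    sub1, shared = rmw_group1(line[:9], line[17:], [1, 2, 3, 4, 5, 6, 7, 8])
    print(sub1, shared, syndrome(sub1 + line[9:17] + shared))
```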
This finishes the discussion on how to make subline RMW operations in Groups 1 and 2. As for the write operation, if the old value of the memory contents is not known (which would convert the write effectively into an RMW for the purposes of our coding discussion), then in one implementation every subline write needs to be preceded by a subline read from the memory. Thus, a write operation effectively becomes a RMW and when the differential RMW technique described above is used, we shall call the write a differential write.
The methods for reading and writing a memory in sublines (and yet maintaining a high level of reliability) are extensible to significantly more general situations than described above. In particular, the number of groups that share redundant symbols can be larger than two. In the example given, the redundant symbol in Group 1 is said to be private in that its value can be set solely by knowing only the contents of Group 1, and the redundant symbols in Group 2 are said to be shared due to the fact that changes in either group need to be reflected in the changes of those redundant symbols in order to maintain a valid overall codeword.
More generally, in a general error control coding scheme for two or more groups, a redundant symbol in a group is said to be private if its contents are completely determined by the contents of the user data stored in the associated group, and shared otherwise. We note that the shared label does not imply that the shared symbols can only be interpreted when reading from the memory all of the information that affects those shared symbols. To further illustrate this point we note that the shared symbols in Group 2 can be read together with the rest of the symbols in Group 2 and the result is a code with minimum distance 2, which suffices to do single symbol error detection; yet we did not have to read all of the information of the two groups in order for the shared symbols to be used effectively for error detection.
Generally, one may have any number of groups assigned shared symbols. For example, in a symmetric situation all of the groups may have the same number of redundant symbols and all of them can be designated as shared. The write operation in this case also becomes symmetric in that a change to any group in general results in updates needed at all of the shared redundant symbols.
We next present an extension of the differential RMW technique (which is the foundation for the differential write technique) to multiple groups. We choose for the error control code in this section a parity check matrix that is derived and extensively justified below as Construction 1 and Theorem 6, to which the reader is referred.
For the parity check matrix H_{b}, we choose a Reed-Solomon code. This ensures that the codes C_{a} and C_{b} as described in Construction 1 and Theorem 6 referred to above are Maximum-Distance-Separable, which, as we show in a separate section, translates into a certain desirable optimality property for the overall two-level coding structure. The choice for H_{b} as a Reed-Solomon code comes with the restriction that n, which is the total number of symbols of data stored in a line plus the number of redundant symbols (both for the first level code and the second level code), must be less than q, the cardinality of the Galois Field used to construct the code. The total amount of information that may be stored on this memory will then depend on the desired reliability parameters d_{a}, d_{b} and the desired number of groups M, along with the amount of user data that needs to be stored in any given group, which will be denoted by k (and is counted in the unit of Galois Field symbols), with the exception of the last group as we shall see.
The construction above does not specify which coordinates of a codeword belong to data versus redundancy; in fact it even allows for different minor code blocklengths. We shall assume that the last group (indexed by M) will not contain any user data and rather will contain all of the shared redundant symbols; correspondingly we shall call it the shared redundancy group. The blocklength of the M^{th} group therefore needs to be adjusted so that it is equal to d_{b}−1. As for the first M−1 groups, within each group the total blocklength will be equal to d_{a}+k−1, where, as previously stated, k is the number of user data symbols that one desires to store in each of the groups.
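The blocklength bookkeeping above can be checked with a few lines of Python. The function name `line_layout` and the returned field names are illustrative; the formulas are those stated in the text: M−1 data groups of length d_{a}+k−1, one shared redundancy group of length d_{b}−1, and the constraint n < q.

```python
def line_layout(M, k, d_a, d_b, q=256):
    """Blocklengths for M groups, the last of which holds only shared redundancy."""
    data_group_len = d_a + k - 1          # k user symbols plus d_a - 1 private checks
    shared_group_len = d_b - 1            # the M-th group: shared checks only
    n = (M - 1) * data_group_len + shared_group_len
    if n >= q:
        raise ValueError(f"n = {n} violates the restriction n < q = {q}")
    user = (M - 1) * k
    return {
        "data_group_len": data_group_len,
        "shared_group_len": shared_group_len,
        "n": n,
        "user_symbols": user,
        "redundant_symbols": n - user,
    }


if __name__ == "__main__":
    # Two 8-symbol data groups, d_a = 2 (local detection), d_b = 3 (global correction).
    print(line_layout(M=3, k=8, d_a=2, d_b=3))
```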
In order to describe the differential RMW technique, we write
H _{ba} =[H _{ba} ^{(1) } H _{ba} ^{(2, . . . , M−1) } H _{ba} ^{(M)}]
where H_{ba} ^{(1) }denotes the first d_{a}+k−1 columns of H_{ba }and H_{ba} ^{(M) }denotes the last d_{b}−1 columns of H_{ba}. Now suppose a vector c is stored at the memory (c denotes the contents of a line together with its first and second level redundant checks). We first assume that c does not have any errors and thus the parity check equation Hc=0 holds. We assume that we desire to update the contents of the first group, without any loss of generality. To this effect, we partition
c=[c _{1 } c _{2, . . . , M−1 } c _{M}]^{T }
where our partition for c coincides with the partition defined for H_{ba}. The new memory contents are denoted by
[a _{1 } c _{2, . . . , M−1 } a _{M}]^{T }
where a_{1} denotes the new codeword from the first minor code, which consists of the new contents of the user data in the first group augmented with valid first level code (or, equivalently, minor code) redundant check symbols. We note that the computation of such first level code check symbols is a straightforward exercise for those skilled in the field of error control coding. In the above, a_{M} denotes the new values for the shared redundant check symbols; it is the value of these that we need to compute.
As taught by the previous example, since both vectors above are valid codewords of the code H, the following must be satisfied:
H _{ba} ^{(1)} c _{1} +H _{ba} ^{(2, . . . , M−1)} c _{2, . . . , M−1} +H _{ba} ^{(M)} c _{M}=0
H _{ba} ^{(1)} a _{1} +H _{ba} ^{(2, . . . , M−1)} c _{2, . . . , M−1} +H _{ba} ^{(M)} a _{M}=0
In addition, the M^{th }group's minor code parity check equation states that
H _{a} ^{(M)} a _{M}=0
From these, it is easy to see (subtracting the second codeword equation from the first and stacking the result under the M^{th} group's minor code check) that
\[ \begin{bmatrix} H_{a}^{(M)} \\ H_{ba}^{(M)} \end{bmatrix} a_{M} = \begin{bmatrix} 0 \\ H_{ba}^{(M)} c_{M} + H_{ba}^{(1)} (c_{1} - a_{1}) \end{bmatrix} \]
The matrix on the left is a square matrix with dimensions (d_{b}−1)×(d_{b}−1). By our choice of construction it can be shown that this matrix is invertible, and hence a unique choice for a_{M} exists and can be computed in this manner. We note that in order to execute the above computation it was unnecessary to know c_{2, . . . , M−1}, which shows the key differential RMW property. The formula thus derived for a_{M} is termed the differential RMW formula. When M is large, the procedure above can result (for example) in significant energy savings by reading and writing sublines instead of entire lines. The write operation is obtained by converting it to a RMW as described earlier. We note that in the construction above the M^{th} group may also be read in isolation from the rest of the groups, and errors in it may be detected, up to the prescribed minimum distance d_{a}.
This is an important feature which allows us to handle errors that may arise in the shared redundancy group. With this in mind, a subline RMW operation for this setting may be summarized as follows:
1) Read the group containing the subline to be modified.
2) Read the shared redundancy group.
3) Determine whether there are errors in either of the two groups read above. If there are errors, read the entire line (together with its first and second level redundancy) and correct the errors. Update the subline and store back the entire line.
4) If there are no errors, compute the new values for the shared redundancy group using the differential RMW formula and modify the subline and the shared redundancy group in the memory.
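The differential RMW formula and steps 1) to 4) above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field GF(17), the primitive root 3, the parameters d_{a}=2, d_{b}=4, k=3, M=3, and all function names are hypothetical choices made for the example, and the full-line correction of step 3) is not implemented (the sketch merely signals that it would be needed).

```python
# Toy parameters (illustrative assumptions, not values taken from the text).
P, ALPHA = 17, 3                   # GF(17); 3 is a primitive root modulo 17
R_A, R_B, K, M = 1, 3, 3, 3        # d_a = 2, d_b = 4, k = 3 data symbols, M groups
LEN = [R_A + K] * (M - 1) + [R_B]  # blocklengths d_a+k-1 (data groups), d_b-1 (shared)
OFF = [sum(LEN[:g]) for g in range(M + 1)]

# Reed-Solomon style H_b = [alpha^(i*j)], split row-wise into H_a and H_ba.
H_B = [[pow(ALPHA, i * j, P) for j in range(OFF[M])] for i in range(1, R_B + 1)]
H_A, H_BA = H_B[:R_A], H_B[R_A:]

def block(H, g):
    """Columns of H that belong to group g (0-based)."""
    return [row[OFF[g]:OFF[g + 1]] for row in H]

def matvec(A, x):
    return [sum(a * v for a, v in zip(row, x)) % P for row in A]

def solve(A, b):
    """Solve an invertible square system A x = b over GF(P) by Gauss-Jordan."""
    n = len(A)
    aug = [row[:] + [v] for row, v in zip(A, b)]
    for c in range(n):
        p = next(r for r in range(c, n) if aug[r][c])
        aug[c], aug[p] = aug[p], aug[c]
        inv = pow(aug[c][c], -1, P)
        aug[c] = [v * inv % P for v in aug[c]]
        for r in range(n):
            if r != c and aug[r][c]:
                f = aug[r][c]
                aug[r] = [(x - f * y) % P for x, y in zip(aug[r], aug[c])]
    return [row[n] for row in aug]

def encode_minor(g, data):
    """Append first-level checks so that H_a^(g) times the codeword is zero."""
    Hg = block(H_A, g)
    rhs = [-s % P for s in matvec([row[:K] for row in Hg], data)]
    return data + solve([row[K:] for row in Hg], rhs)

def shared_from_scratch(data_groups):
    """Reference computation of the shared group from all M-1 data groups."""
    acc = [0] * (R_B - R_A)
    for g, c in enumerate(data_groups):
        acc = [(a + s) % P for a, s in zip(acc, matvec(block(H_BA, g), c))]
    return solve(block(H_B, M - 1), [0] * R_A + [-a % P for a in acc])

def has_error(g, word):
    """First-level detection: nonzero H_a syndrome of group g."""
    return any(matvec(block(H_A, g), word))

def subline_rmw(line, g, new_data):
    """Steps 1) to 4): only group g and the shared redundancy group are touched."""
    c_old, c_shared = line[g], line[M - 1]                 # steps 1) and 2)
    if has_error(g, c_old) or has_error(M - 1, c_shared):  # step 3), not implemented
        raise RuntimeError("errors detected: full-line correction needed")
    a_new = encode_minor(g, new_data)
    diff = [(o - n) % P for o, n in zip(c_old, a_new)]
    rhs = [(x + y) % P for x, y in zip(matvec(block(H_BA, M - 1), c_shared),
                                       matvec(block(H_BA, g), diff))]
    line[g] = a_new                                        # step 4)
    line[M - 1] = solve(block(H_B, M - 1), [0] * R_A + rhs)

c1, c2 = encode_minor(0, [1, 2, 3]), encode_minor(1, [4, 5, 6])
line = [c1, c2, shared_from_scratch([c1, c2])]
subline_rmw(line, 0, [7, 8, 9])   # group 1 (index 0) is rewritten; c2 is never read
```

Because the stacked matrix is invertible, the shared group produced by `subline_rmw` should coincide with the one obtained by recomputing it from all data groups, which is what makes the update "differential".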
Suppose that the special group that holds the shared redundant symbols is physically implemented using a memory that allows for multiple concurrent read/write operations. For simplicity, let us assume that the memory where the actual (nonredundant) information is stored only allows for one read or write operation per group. Then in principle (assuming a sufficiently high level of concurrency for the shared redundancy memory), one may have a number of writes to different groups being executed at any given time, with the resulting updates to the shared redundant symbols supported by the assumed concurrency capability. Moreover, multiple reads can also be executed on the groups that are not performing writes, since the private redundancies are not affected by the write operations.
We note in the next section that prior published two-level coding structures satisfy certain desirable optimality properties only under restricted scenarios. Therefore, this invention discloses new error control codes (as evidenced, for example, by Construction 1 above) as well as their application to the present problem of subline access. In Construction 2 we also demonstrate a technique that in certain instances allows us to obtain optimum codes for choices of parameters for which Construction 1 cannot yield any codes; in particular, it addresses the restriction that the field size in Construction 1 needs to be sufficiently large.
A generally accepted way of describing an error control code is to show its parity check matrix H. This is because such a matrix identifies precisely which vectors are valid codewords of the code; that is, a vector c is a valid codeword if the linear equation Hc=0 is satisfied. In interpreting this invention, we note that a given error control code in principle admits multiple parity check matrices as a characterization of its valid codewords. In particular, multiplying any given row of the matrix by a nonzero factor or adding one row of the matrix H to another one does not change the set of vectors that are valid codewords of a code. When one matrix H′ can be obtained from another matrix H via one or more row multiplications and/or row additions, we say that the two check matrices are equivalent.
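As a small illustration of equivalent check matrices, the following Python sketch enumerates the codewords of two tiny codes before and after a row operation. The matrices and the helper name `codewords` are hypothetical examples, not taken from the text.

```python
from itertools import product

def codewords(H, q):
    """All vectors c over GF(q), q prime, that satisfy H c = 0."""
    n = len(H[0])
    return {c for c in product(range(q), repeat=n)
            if all(sum(h * x for h, x in zip(row, c)) % q == 0 for row in H)}

# Row addition over GF(2): replace the second row by the sum of both rows.
H = [[1, 0, 1, 1, 0],
     [0, 1, 1, 0, 1]]
H_prime = [H[0], [(a + b) % 2 for a, b in zip(H[0], H[1])]]
same_after_addition = codewords(H, 2) == codewords(H_prime, 2)

# Row multiplication over GF(5): scale the single row by the nonzero factor 3.
G = [[1, 2, 3, 4]]
G_prime = [[3 * a % 5 for a in G[0]]]
same_after_scaling = codewords(G, 5) == codewords(G_prime, 5)
```

Both comparisons should hold, since neither operation changes the solution set of the linear system.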
In light of this, we will disclose novel error control codes through one particular form of their check matrices, with the understanding that the same code may be described by an equivalent check matrix. Such an alternative representation of the code does not represent an essential departure from the present invention. A concept that we will use in our description is that of Maximum Distance Separable (MDS) codes, a technical term well known in the art of error control coding. A Maximum Distance Separable code is one which is optimum in the sense of attaining the maximum possible code minimum distance given the number of redundant check symbols. It is known that any Reed-Solomon code is Maximum Distance Separable.
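The MDS property can be checked by brute force on a toy Reed-Solomon style code. The sketch below is only an illustration: the field GF(7), the primitive root 3 and the length 6 are assumed values, and `min_distance` is a hypothetical helper that simply enumerates all vectors.

```python
from itertools import product

P, ALPHA, N, R = 7, 3, 6, 2   # GF(7); 3 is a primitive root modulo 7; r = 2 checks
# Reed-Solomon style check matrix with entries alpha^(i*j), i = 1..R, j = 0..N-1.
H = [[pow(ALPHA, i * j, P) for j in range(N)] for i in range(1, R + 1)]

def min_distance(H):
    """Smallest Hamming weight of a nonzero codeword (exhaustive search)."""
    best = None
    for c in product(range(P), repeat=len(H[0])):
        if any(c) and all(sum(h * x for h, x in zip(row, c)) % P == 0 for row in H):
            w = sum(1 for x in c if x)
            best = w if best is None else min(best, w)
    return best

d = min_distance(H)
is_mds = (d == R + 1)   # MDS: minimum distance equals redundancy plus one
```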
It is also noted that the present application of two-level coding structures need not be restricted to a problem of storage. In particular, communication systems may also benefit from such structures, whereby, for example, high speed transmission of a data stream may be attained by performing only local error detection on what we currently term sublines (which would be, say, a section of a transmission bus) and then performing error correction in a secondary, slower mode. This would have the benefit of employing low complexity and low latency coding structures during most of the link's operation (assuming a relatively clean link) and resorting to more complex and time consuming procedures only in special circumstances.
Let C be a code of length n (not necessarily linear) over an alphabet of size q. The dimension of C is k=log_{q}|C|, and the redundancy of C is r=n−k. The minimum (Hamming) distance of C is
d=min{d_{H}(c,c′): c,c′∈C, c≠c′},
where d_{H}(c,c′) denotes the number of coordinates in which c and c′ differ.
We say that C is an [n,k] code, or an [n,k,d] code, as a handy reference to its basic parameters. Each codeword of C consists of n symbols. Let them be indexed by 1, 2, . . . , n. Denote [n]={1, 2, . . . , n}. Given I_{i}⊂[n], i=1, . . . , M, let M minor codes, C_{i}=C(I_{i}), i=1, . . . , M, be defined as the projections of C onto the coordinates in I_{i}. Let n_{i}=|I_{i}|, and denote I_{i}={l_{1}^{(i)}, l_{2}^{(i)}, . . . , l_{n_{i}}^{(i)}}, such that l_{1}^{(i)}<l_{2}^{(i)}< . . . <l_{n_{i}}^{(i)}. Then the above definition is to say that c^{(i)}=(c_{1}^{(i)}, . . . , c_{n_{i}}^{(i)})∈C_{i} if and only if there exists c=(c_{1}, . . . , c_{n})∈C such that c_{j}^{(i)}=c_{l_{j}^{(i)}} for all l_{j}^{(i)}∈I_{i}. We denote the minimum distance of C_{i} by d_{i}. Sometimes, we refer to a codeword of C as a line, and a codeword of C_{i} as a subline^{a}. Thus, C_{i} will also be called a subline code. In this section, we assume that all vectors are row vectors if labeled without the “T” superscript (which stands for transpose). Thus, a vector a is a row vector, and the vector a^{T} is a column vector (i.e., the transposed version of vector a). This convention is different from that adopted in the earlier section of this description.
^{a} To be exact, lines and sublines refer to the uncoded data that are embedded in the corresponding codewords. We choose to use these terms to refer to the corresponding codewords as well for convenience. The meanings of these terms will become clear from the contexts within which they are used.
Given the definition above, the question is why not one level of coding?
Theorem 1. If there exists a permutation π: [M]→[M], such that for all i, |I_{π(i)}∖∪_{j<i}I_{π(j)}|≧d_{π(i)}−1, then
Σ_{i=1}^{M}(d_{i}−1)≦r.
Proof. Consider the following experiment. Take a codeword of C, erase d_{1}−1 symbols from those indexed by I_{1}, d_{2}−1 from I_{2}, . . . , d_{M}−1 from I_{M}. This is possible as long as |I_{i}∖∪_{j<i}I_{j}|≧d_{i}−1 for all i. Now, by definition, the d_{1}−1 erasures from I_{1} can be corrected using C_{1}, after which the d_{2}−1 erasures from I_{2} can be corrected using C_{2}, etc., until all erasures are corrected. By the Singleton bound, we must have Σ_{i}(d_{i}−1)≦r. Finally, note that if the condition |I_{i}∖∪_{j<i}I_{j}|≧d_{i}−1 is not satisfied for all i, we can try any relabelling of the I_{i}'s, and as long as one works, we arrive at the same inequality.
The message from Theorem 1 is that unless the minor codes have rather significant overlaps, the sum of minimum distances will be bounded by what is realizable using disjoint minor codes (disjoint in terms of the I_{i}'s). A natural question is whether, by employing moderate overlaps among the minor codes, we can gain enough in the minor codes' minimum distances so that subline access is economical without going to a two-level coding scheme.
Let us formulate the problem. Suppose all minor codes are to have the same length n_{i}=n_{1}, and the same minimum distance d_{i}=d_{1}. WLOG, assume ∪_{i}I_{i}=[n]. In the regime of Theorem 1, we have d_{1}≦r/M+1. We know for sure that with enough overlaps we potentially are constrained only by d_{1}≦r+1. Depending on how ambitious we are in getting larger and larger d_{1}, we would like to know what this has to imply about n_{1}.
Theorem 2. If d_{1}>r/m+1, where m∈{2, 3, . . . , M}, then
Proof. Consider the same experiment we did in the proof of Theorem 1, in which we erase min{d_{1}−1, |I_{i}∖∪_{j<i}I_{j}|} symbols from the coordinates indexed by I_{i}∖∪_{j<i}I_{j}. Such an erasure pattern can be decoded successively using C_{1}, C_{2}, . . . , C_{M}. This implies in general, that
For ease of notation, define Ĩ_{i}=I_{i}∖∪_{j<i}I_{j}. Let A={j: |Ĩ_{j}|≧d_{1}−1}, and m′=|A|. We have
which implies m′<m. Now, noting that {Ĩ_{i}}_{i=1} ^{M }is a set partition of [n], we have
Hence,
The result of Theorem 2 can be rephrased as in the following corollary.
Corollary 1. If d_{1}>r/M+1, then
The moral of Theorem 2 is that overlapping does not help. Suppose r/m+1<d_{1}≦r/(m−1)+1. Then ignoring integer effects, the lower bound of Theorem 2 is achieved by coding m−1 disjoint minor codes, each with redundancy r/(m−1) (assuming that the field size is large enough). We then get up to M minor codes by duplicating any of the m−1 codes M−m+1 times.
An alternative interpretation can be as follows. Again, suppose r/m+1<d_{1}≦r/(m−1)+1. We have
In other words (also evident from Corollary 1),
This shows that for a given line size (n) and total redundancy (r), there is a tradeoff between subline code length (n_{1}) and subline minimum distance (d_{1}). The tradeoff is such that, at least in the bound, the “effective redundancy” for a subline (i.e., d_{1}−1) grows at most proportionally to n_{1}. That is, to get twice the effective redundancy, the subline code has to be at least twice as long. This implies that we can always do as well using minor codes coded individually on disjoint sets.
The conclusion of this section is that economical subline access is quite hopeless for a single “level” of codes, without incurring at least the same overhead penalty as required for a naïve solution.
This motivates the need for a change of paradigm, namely, to use a two-level coding scheme, in which the overall overhead is reduced by reducing d_{1} to just enough for error detection and having some global redundancy shared by all the sublines for error correction when needed. The vision is that errors are rare (as they are in memory systems), so most of the time (when no errors are detected) subline access is possible, which will have the benefits we have outlined earlier.
In a two-level coding setup, we are interested in the tradeoff between r, d, and d_{i}, i=1, . . . , M. For simplicity, suppose d_{i}=d_{1} for all i, and all minor codes are disjoint. We have the following theorem.
Theorem 3. If d≦max_{i}n_{i}+1, then
r≧d−1+(M−1)(d _{1}−1). (1)
Proof. Since d≦max_{i}n_{i}+1, in the procedure we described in the proof of Theorem 1, we can erase d−1 symbols in the last step instead.
The above bound can be generalized.
Theorem 4.
r≧(M−└(d−1)/n_{1}┘−1)(d_{1}−1)+└(d−1)/n_{1}┘n_{1}+max{d_{1}−1, (d−1) mod n_{1}}.
Proof. Similar to the proof of Theorem 3, but note that we in general can erase (d_{1}−1) symbols each in some (M−└(d−1)/n_{1}┘−1) subline codes, n_{1 }symbols each in some other └(d−1)/n_{1}┘ subline codes, and a choice of either (d_{1}−1) or ((d−1) mod n_{1}) symbols in the remaining subline code.
When we are not limited by the size of the alphabet over which the code is defined, Theorem 3 and Theorem 4 are the relevant bounds. When the alphabet size is small, however, other bounds become significant, for example, a sphere-packing type of bound as given in the following.
Suppose C can correct up to t errors and each subline code C_{i} can correct t_{i} errors. For simplicity, assume t_{i}=t_{1} for all i.
Theorem 5. We have
where
is the volume of a Hamming ball in GF(q)^{n} with radius t, and P_{M}(i) is the set of M-way ordered integer partitions of i whose components are all less than or equal to t_{1}.
Proof. The exclusive region around any codeword of C is at least a radius-t Hamming ball plus some additional regions that are at Hamming distance larger than t from the codeword but must be exclusive from similar regions near any other codeword because of the error correction capability of the subline codes.
It turns out that (1) can be achieved quite easily if the field over which C is defined is large enough. One such construction is as follows. We shall always require that ∪_{i}I_{i}=[n], so that the shared redundancy is also protected by the minor codes. Without loss of generality, we will also assume that I_{i}={Σ_{j<i}n_{j}+1, . . . , Σ_{j≦i}n_{j}}. Whenever we speak of a parity-check matrix, unless otherwise noted, we assume one that has full row rank.
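Construction 1. Let C_{a} and C_{b}⊂C_{a} be linear codes of length n and minimum distances d_{a} and d_{b}, respectively. Let H_{a} be a parity-check matrix for C_{a}, and
\[ H_{b} = \begin{bmatrix} H_{a} \\ H_{ba} \end{bmatrix} \]
be a parity-check matrix for C_{b}. If we write
H _{a} =[H _{a} ^{(1) } H _{a} ^{(2) } . . . H _{a} ^{(M)}],
where H_{a}^{(1)} contains the first n_{1} columns, H_{a}^{(2)} the next n_{2} columns, etc., and partition H_{ba} in the same way, then let C be constructed by the following parity-check matrix:
\[ H = \begin{bmatrix} H_{a}^{(1)} & & & \\ & H_{a}^{(2)} & & \\ & & \ddots & \\ & & & H_{a}^{(M)} \\ H_{ba}^{(1)} & H_{ba}^{(2)} & \cdots & H_{ba}^{(M)} \end{bmatrix} \]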
Theorem 6. For Construction 1, all minor codes have minimum distances at least d_{a}, and C has minimum distance at least d_{b}.
Proof. From the form of H, it is clear that all C_{i}, i=1, . . . , M, are subcodes of shortened versions of C_{a}. It is also clear that C is a subcode of C_{b}.
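Theorem 6 can be checked numerically on a toy instance. The sketch below is a hedged illustration: it assumes that the check matrix of Construction 1 stacks the block-diagonal arrangement of the H_{a}^{(i)} on top of H_{ba} (consistent with the partitioned check equations used for the differential RMW formula), and it uses assumed values GF(7), primitive root 3, M=2 groups of length 3, d_{a}=2 and d_{b}=3.

```python
from itertools import product

P, ALPHA, N = 7, 3, 6                  # GF(7); 3 is a primitive root modulo 7
GROUPS = [range(0, 3), range(3, 6)]    # I_1 and I_2 (0-based), M = 2
R_A, R_B = 1, 2                        # MDS choices: d_a = 2, d_b = 3

H_B = [[pow(ALPHA, i * j, P) for j in range(N)] for i in range(1, R_B + 1)]
H_A, H_BA = H_B[:R_A], H_B[R_A:]

# Assumed form of H: block-diagonal copies of H_a^(i), followed by the rows of H_ba.
H = [[row[j] if j in grp else 0 for j in range(N)]
     for grp in GROUPS for row in H_A] + H_BA

def weight(v):
    return sum(1 for x in v if x)

code = [c for c in product(range(P), repeat=N)
        if not any(sum(h * x for h, x in zip(row, c)) % P for row in H)]

d = min(weight(c) for c in code if any(c))          # distance of the line code C
minor_d = []
for grp in GROUPS:                                  # distances of the projections C_i
    proj = {tuple(c[j] for j in grp) for c in code}
    minor_d.append(min(weight(v) for v in proj if any(v)))
```

For this instance the line code has redundancy 3, so d=3 together with minor distances of 2 meets d−1+(M−1)(d_{1}−1)=3 with equality.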
Corollary 2. If both C_{a }and C_{b }are MDS codes, and n_{i}≧d_{a}, for all i, then the code given by Construction 1 achieves the bound of (1) with equality. Proof. For all j,
Therefore, d=d_{b}, d_{i}=d_{a }for all i, and
d=r+1−(M−1)(d _{1}−1).
In order to achieve the bound of (1), Construction 1 requires the existence of a linear MDS code of length n and minimum distance d, which implies that q, the size of the field, cannot be much smaller than n. For Reed-Solomon codes (and their single and double extensions), we will need q≧n−1.
In cases where we do not have the luxury of a larger field size, we can try to do something similar. A closely related class of codes is known as integrated interleaving, which we describe as follows. With a small loss of generality, let us assume that n_{i}=n_{1} for all i. Suppose we have two length-n_{1} linear codes C_{a}⊃C_{b} over GF(q), with dimensions k_{a}>k_{b} and minimum distances d_{a}<d_{b}, respectively. If H_{a} is the parity-check matrix for C_{a}, H_{b} is the parity-check matrix for C_{b}, and
\[ H_{b} = \begin{bmatrix} H_{a} \\ H_{ba} \end{bmatrix}, \]
then the parity-check matrix for C can be written as
\[ H = \begin{bmatrix} I \otimes H_{a} \\ \Gamma \otimes H_{ba} \end{bmatrix} \]
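where I is the M×M identity matrix, M<q, ⊗ denotes the Kronecker product, and
\[ \Gamma = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ 1 & \alpha & \cdots & \alpha^{M-1} \\ \vdots & \vdots & & \vdots \\ 1 & \alpha^{B-1} & \cdots & \alpha^{(B-1)(M-1)} \end{bmatrix} \]
where α is a primitive root in GF(q), and 1≦B<M.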
It is shown that C is a [Mn_{1}, (M−B)k_{a}+Bk_{b}] code with minimum distance d=min{(B+1)d_{a}, d_{b}}. If we check these parameters against (1), we can see that integrated interleaving will achieve the bound of (1) with equality if and only if both C_{a }and C_{b }are MDS, B=1, and d=d_{b}. In particular, if it is so desired that d>2d_{a}, then integrated interleaving is not optimal in the sense that it does not achieve the bound of (1).
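The parameter check described above is simple arithmetic and can be sketched as follows. The function name and the two sample parameter sets are hypothetical; bound (1) is used in the form r≧d−1+(M−1)(d_{1}−1), which is the form consistent with Corollary 2, and both component codes are assumed to be MDS.

```python
def integrated_interleaving(M, B, n1, ka, da, kb, db):
    """Parameters of the integrated interleaving code and the bound (1) value."""
    n = M * n1
    k = (M - B) * ka + B * kb
    d = min((B + 1) * da, db)
    r = n - k
    bound = d - 1 + (M - 1) * (da - 1)   # right-hand side of (1) with d_1 = d_a
    return n, k, d, r, bound

# Assumed MDS component codes over GF(16): C_a = [15,13,3], C_b = [15,11,5].
tight = integrated_interleaving(M=4, B=1, n1=15, ka=13, da=3, kb=11, db=5)
# Same C_a with a stronger C_b = [15,7,9], so that d_b > 2 d_a.
loose = integrated_interleaving(M=4, B=1, n1=15, ka=13, da=3, kb=7, db=9)
```

In the first case d=d_{b} and the redundancy equals the bound; in the second case d is capped at 2d_{a}, so the extra redundancy spent on C_{b} is not reflected in the line code's minimum distance.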
Essentially, to form the codebook of C, integrated interleaving puts M codewords of C_{a} side by side and then expurgates from the codebook of all such (longer) codewords by requiring that certain weighted sums of the shorter codewords must lie in C_{b}. It is noteworthy that the expurgation is done so that there always exist codewords with weight (B+1)d_{a}. More specifically, if γΓ^{T}=0, and c_{a}∈C_{a}, then γ⊗c_{a} is a codeword of C. Note that γΓ^{T}=0 if and only if γ is a codeword in an [M,M−B] RS code (defined by parity-check matrix Γ). Therefore, C has at least
codewords of weight (B+1)d_{1}, where A_{d} _{ a }is the number of minimum weight codewords in C_{a}.
To see why we may be able to do better, consider first the simplest case where M=2, B=1. In this case, the parity-check matrix for the integrated interleaving construction is
\[ H = \begin{bmatrix} H_{a} & 0 \\ 0 & H_{a} \\ H_{ba} & H_{ba} \end{bmatrix} \]
Clearly, if c_{1},c_{2}∈C_{a} have the same H_{ba}-syndrome, then (c_{1},c_{2})∈C. In particular, this allows minimum weight codewords in C_{a} to pair up. Intuitively, we would rather like to keep those (c_{1},c_{2}) pairs such that when c_{1} has low weight, c_{2} has large weight, or at least is likely to. Motivated by this observation, we instead construct C to conform to the following parity-check constraints:
\[ H = \begin{bmatrix} H_{a} & 0 \\ 0 & H_{a} \\ H_{ba} & QH_{ba} \end{bmatrix} \]
where Q is a full-rank square matrix. Q allows us to pair up codewords of C_{a} that lie in different cosets of C_{b}, rather than those in the same coset. In a sense, it “scrambles” the coset associations so that we may hope to have a “spectrum thinning” effect similar to what interleavers have on Turbo codes.
In general, our construction is as follows.
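Construction 2. Let C_{a} and C_{b}⊂C_{a} be linear codes over GF(q) of length n_{1} and minimum distances d_{a} and d_{b}, respectively. Let H_{a} be a parity-check matrix for C_{a}, and
\[ H_{b} = \begin{bmatrix} H_{a} \\ H_{ba} \end{bmatrix} \]
be a parity-check matrix for C_{b}. The constructed code C is given by the following parity-check matrix:
\[ H = \begin{bmatrix} H_{a} & & & \\ & H_{a} & & \\ & & \ddots & \\ & & & H_{a} \\ H_{ba} & Q_{1}H_{ba} & \cdots & Q_{M-1}H_{ba} \end{bmatrix} \]
where Q_{i}, i=1, . . . , M−1, are full-rank square matrices.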
Theorem 7. For Construction 2, we have
min{d _{b},2d _{a} }≦d≦d _{b},
d _{i} =d _{a} ,∀i,
r=r _{b}+(M−1)r _{a}.
Proof. To show the bounds on d, first note that for all i,
\[ \begin{bmatrix} H_{a} \\ Q_{i}H_{ba} \end{bmatrix} \]
is a parity-check matrix for C_{b}. Now, if a nonzero codeword of C has weight in only one minor code, then its weight must be at least d_{b}. If a nonzero codeword of C has weight in m>1 minor codes, then its weight must be at least md_{a}. On the other hand, the minimum distance of C cannot be greater than d_{b}, since any minimum-weight codeword of C_{b} followed by all zeros is a codeword in C.
To show d_{i}=d_{a} for all i, it suffices to show C_{i}=C_{a} for all i. Clearly, C_{i}⊆C_{a}. Note that GF(q)^{n_{1}} is partitioned into q^{r_{b}} cosets of C_{b}, corresponding to the q^{r_{b}} H_{b}-syndromes. Out of these, those cosets whose corresponding H_{b}-syndromes start with r_{a} zeros form a set partition of C_{a}. The above observation shows that there exist (an equal number of) codewords in C_{a} corresponding to any H_{ba}-syndrome. As a consequence, for all c∈C_{a}, and i≠j, there exists c′∈C_{a}, such that c′H_{ba}^{T}=cH_{ba}^{T}Q_{i}^{T}(Q_{j}^{T})^{−1}. (We assume that Q_{0} is the identity matrix for consistency.) Therefore, there exists a codeword in C whose projection to the ith minor code is c, to the jth minor code is −c′, and to all other minor codes is zero, which implies that c∈C_{i}.
The claim about r follows directly from the construction.
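Theorem 7 can be illustrated on a small instance of Construction 2 with M=2. The sketch below is a non-authoritative example: the field GF(5), the primitive root 2, the component codes (an assumed [4,3,2] code C_{a} and [4,1,4] code C_{b}) and the matrix Q are all hypothetical choices, and the line code is built directly from the syndrome constraint s_{1}+s_{2}Q^{T}=0.

```python
from itertools import product

P, ALPHA, N1 = 5, 2, 4                        # GF(5); 2 is a primitive root modulo 5
XS = [pow(ALPHA, j, P) for j in range(N1)]    # distinct nonzero locators
H_A = [XS]                                    # C_a: a [4,3,2] code
H_BA = [[pow(x, i, P) for x in XS] for i in (2, 3)]  # [H_a; H_ba] checks C_b: [4,1,4]
Q = [[1, 1], [0, 1]]                          # hypothetical full-rank matrix Q_1

def syndrome(H, c):
    return tuple(sum(h * x for h, x in zip(row, c)) % P for row in H)

def times_q_transpose(s):
    """Row vector s times Q^T."""
    return tuple(sum(q * v for q, v in zip(row, s)) % P for row in Q)

def weight(v):
    return sum(1 for x in v if x)

C_A = [c for c in product(range(P), repeat=N1) if not any(syndrome(H_A, c))]
S = {c: syndrome(H_BA, c) for c in C_A}       # H_ba-syndromes of all C_a codewords

# Keep the pairs (c1, c2) of C_a codewords that satisfy s_1 + s_2 Q^T = 0.
C = [c1 + c2 for c1 in C_A for c2 in C_A
     if all((a + b) % P == 0 for a, b in zip(S[c1], times_q_transpose(S[c2])))]

d = min(weight(c) for c in C if any(c))
minor_d = [min(weight(v) for v in {c[g * N1:(g + 1) * N1] for c in C} if any(v))
           for g in range(2)]
r = 2 * N1 - 4                                # n - k, with |C| = 5**4 expected
```

Here d_{b}=2d_{a}=4, so the two bounds of Theorem 7 coincide and d is pinned to 4 for any full-rank Q; the choice of Q only starts to matter when d_{b}>2d_{a}.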
Corollary 3. Construction 2 achieves (1) if and only if d=d_{b }and C_{a }and C_{b }are both MDS codes.
In particular, if d_{b}≦2d_{a}, then Construction 2 achieves (1) as long as C_{a} and C_{b} are MDS. Recall that in this case, integrated interleaving (with B=1) can do as well.^{b}
^{b} In terms of minimum distance. Construction 2, when properly designed, will have a better weight distribution for correcting random errors.
On the other hand, if d_{b}>2d_{a}, then d≧2d_{a} (note that for integrated interleaving, d=2d_{a} in this case). But it is not clear how close Construction 2 can get to the bound. To start, the following theorems give conditions on when, and only when, Construction 2 can achieve d>2d_{a}.
Theorem 8. For Construction 2, if d>2d_{a}, then
MA_{d_{a}}<q^{r_{b}−r_{a}},
where A_{d_{a}} is the number of minimum weight codewords in C_{a}, and r_{b} and r_{a} are the redundancies of C_{b} and C_{a}, respectively.
Proof. Let Q_{0} be the (r_{b}−r_{a})×(r_{b}−r_{a}) identity matrix. Let S_{i}={s: s=cH_{ba}^{T}Q_{i}^{T}, c∈C_{a}, wt(c)=d_{a}}, i=0, . . . , M−1. We note that S_{i}∩S_{j}=Ø for all i≠j, for were it true that c_{1}H_{ba}^{T}Q_{i}^{T}=c_{2}H_{ba}^{T}Q_{j}^{T} for some c_{1}, c_{2}∈C_{a}, wt(c_{1})=wt(c_{2})=d_{a}, and 0≦i<j<M, then
(0, . . . , 0, c_{1}, 0, . . . , 0, −c_{2}, 0, . . . , 0),
with c_{1} in the ith and −c_{2} in the jth subline position, and where each 0 is an all-zero n_{1}-row vector, would be a codeword in C with weight 2d_{a}. Note also that by Theorem 7, d_{b}>2d_{a}. Thus, all minimum-weight codewords of C_{a} are correctable error patterns in C_{b}, so they must have distinct H_{b}-syndromes and hence distinct H_{ba}-syndromes. Therefore, |S_{0}|=A_{d_{a}}. Since the Q_{i}'s are full-rank, we have |S_{i}|=A_{d_{a}} for all i. The claim of the theorem then follows from the fact that
Lemma 1. The number of n×n full-rank matrices over GF(q) is
Π_{i=0}^{n−1}(q^{n}−q^{i}).
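The count in Lemma 1 is the standard product (q^{n}−1)(q^{n}−q) · · · (q^{n}−q^{n−1}), and it can be compared against exhaustive enumeration for very small cases. The sketch below is an illustration only; the function names are hypothetical and the brute-force part handles prime q only, since it uses integer arithmetic modulo q.

```python
from itertools import product

def count_formula(n, q):
    """Number of n x n full-rank matrices over GF(q): product of (q^n - q^i)."""
    out = 1
    for i in range(n):
        out *= q ** n - q ** i
    return out

def is_full_rank(mat, p):
    """Gaussian elimination modulo a prime p."""
    m = [list(row) for row in mat]
    n = len(m)
    for c in range(n):
        piv = next((r for r in range(c, n) if m[r][c] % p), None)
        if piv is None:
            return False
        m[c], m[piv] = m[piv], m[c]
        inv = pow(m[c][c], -1, p)
        for r in range(c + 1, n):
            f = m[r][c] * inv % p
            m[r] = [(x - f * y) % p for x, y in zip(m[r], m[c])]
    return True

def count_brute(n, p):
    """Enumerate all n x n matrices over GF(p) and count the invertible ones."""
    return sum(is_full_rank([flat[i * n:(i + 1) * n] for i in range(n)], p)
               for flat in product(range(p), repeat=n * n))
```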
Theorem 9. For Construction 2, there exist Q_{i}, i=1, . . . , M−1, such that d>2d_{a}, if d_{b}>2d_{a }and the following is satisfied:
where A_{d_{a}} is the number of minimum-weight codewords in C_{a}, and ρ=r_{b}−r_{a}.
Proof. For clarity, first consider the case where M=2. Since there is only Q_{1}, we drop the subscript and denote Q=Q_{1}. Let S be the set of H_{ba}-syndromes of all minimum weight codewords of C_{a}, i.e., S={s: s=cH_{ba}^{T}, c∈C_{a}, wt(c)=d_{a}}. Then d>2d_{a} if and only if S∩SQ^{T}=Ø. Let S′⊂S denote the set of syndromes whose first nonzero element is 1. Then, due to the linearity of the code, S∩SQ^{T}=Ø if and only if S∩S′Q^{T}=Ø. For all s′∈S′ and s∈S, let Ω_{s′,s}={Q: s′Q^{T}=s}. If |∪_{s′∈S′,s∈S}Ω_{s′,s}| is strictly less than the number of full-rank ρ×ρ matrices, then we are done. Note that |S|=A_{d_{a}} and |S′|=A_{d_{a}}/(q−1), and for all s, s′, |Ω_{s′,s}|=q^{ρ(ρ−1)}, so by Lemma 1 and the union bound, a sufficient condition for the desired Q to exist is
which is equivalent to
For M>2, let S_{i}=SQ_{i} ^{T}, i=0, . . . , M−1, (as in previous proofs, assume Q_{0 }is the identity matrix). Let S′_{i }denote the subset of S_{i }whose first nonzero element is 1. Then we have d>2d_{a }if and only if S′_{i}∩S_{j}=Ø for all i≠j, which is equivalent to that S′_{i}∩∪_{j<i}S_{j}=Ø for all i. For all s′∈S′ and s∈∪_{j<i}S_{j}, let Ω_{ s′,s } ^{(i)}={Q:s′Q^{T}=s}. Denote
Then d>2d_{a} if and only if Q_{i}∉Ω^{(i)}, i=1, . . . , M−1. Let Γ_{ρ} be the set of all ρ×ρ matrices over GF(q) of full rank. Now, suppose the Q_{i}'s are chosen according to a “greedy” algorithm, where we successively choose Q_{1}∈Γ_{ρ}∖Ω^{(1)}, Q_{2}∈Γ_{ρ}∖Ω^{(2)}, . . . , Q_{M−1}∈Γ_{ρ}∖Ω^{(M−1)}. This algorithm will succeed if |Γ_{ρ}∖Ω^{(i)}|>0 for all i, which is ensured if |Γ_{ρ}∖Ω^{(M−1)}|>0, since clearly Ω^{(j)}⊂Ω^{(i)} for all j<i. Using the union bound, we have
Setting the above expression to be positive gives a sufficient condition for the desired Q_{i}'s to exist and completes the proof.
Note that the above proof can also be done by using probabilistic methods, i.e., consider Q_{i }as random matrices and show the expected amount of overlap amongst certain syndrome sets are small, hence the existence of at least one such matrix such that the corresponding syndrome sets will be disjoint. The result of Theorem 9 can be slightly improved by noticing multiple counting of certain matrices in applying the union bound to
Corollary 4. For Construction 2, there exist Q_{i}, i=1, . . . , M−1, such that d>2d_{a}, if d_{b}>2d_{a }and the following is satisfied:
where A_{d_{a}} is the number of minimum-weight codewords in C_{a}, and ρ=r_{b}−r_{a}.
Proof. Note that for all 0≠a∈GF(q) and i=0, . . . , M−2, there are at least |S′| pairs (s′,s), s′∈S′, s∈∪_{j<M−1}S_{j}, such that aQ_{i}∈Ω_{s′,s}^{(M−1)}; namely, one for each s′∈S′, with s=as′Q_{i}^{T}. Therefore, when we appeal to the union bound in the final derivation in the proof of Theorem 9, there are (q−1)(M−1) matrices each of which is counted |S′| times. The claimed result is then shown by adding a correction term of (q−1)(M−1)(|S′|−1) to expression (3) and following through.
Let us consider some examples. Suppose M=2, and C_{a} is a [7,6,2] code over GF(8). The bounds of Theorem 8 and Theorem 9 imply that to get d>4 it is necessary that ρ≧3, and it is sufficient to have ρ>3. Through computer search, we find a code that achieves d=5 for ρ=3. It is easy to verify that in this case, the bound of (1) is achieved with equality. This results in a [14,9,5] code over GF(8). The parity-check matrix for this code is as follows:
where
and α is a primitive root in GF(8), α^{3}+α+1=0.
As another example, let M=2 and C_{a} be a [6,5,2] code over GF(8). We show here a code that achieves d=6 with ρ=4. This is a [12,6,6] code over GF(8), and its parity-check matrix is as follows:
where
and α is a primitive root in GF(8), α^{3}+α+1=0.
Note that for the same length and distance parameters, Construction 1 would have required a larger field size for both the example codes we have shown.
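The claims about the two example codes reduce to simple arithmetic, sketched below. The helper names are hypothetical; bound (1) is used in the form r≧d−1+(M−1)(d_{1}−1), and the field-size test uses the restriction n<q stated earlier for the Reed-Solomon based Construction 1.

```python
def bound_1(d, d1, M):
    """Right-hand side of bound (1): d - 1 + (M - 1)(d_1 - 1)."""
    return d - 1 + (M - 1) * (d1 - 1)

def check_example(n, k, d, d1, M, q):
    r = n - k
    return {
        "redundancy": r,
        "bound": bound_1(d, d1, M),
        "meets_bound_with_equality": r == bound_1(d, d1, M),
        "construction_1_field_ok": n < q,   # RS-based Construction 1 needs n < q
    }

# The two codes of the text: both have M = 2, d_1 = d_a = 2, and live over GF(8).
first = check_example(n=14, k=9, d=5, d1=2, M=2, q=8)
second = check_example(n=12, k=6, d=6, d1=2, M=2, q=8)
```

Both examples meet the bound with equality while having n larger than 8, which is the sense in which Construction 1 would have needed a larger field.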
A few further remarks on the code constructions in this section.

 1) Consider a special case of Construction 1, using RS codes. Let n=Mn_{1} and H_{b}=[α^{i(j−1)}]_{i=1,j=1}^{i=r_{b},j=n}, where α is a primitive root in GF(q), q>n. Then the code given by Construction 1 fits in Construction 2 as well, where all Q_{i} have a diagonal form. This is interesting because it shows the connection between Construction 1 and Construction 2. Moreover, such a code can be decoded using an algorithm for decoding either class of codes.
 2) In integrated interleaving, the design parameter B corresponds to the number of “bursty subblocks”. If bursty errors are of concern to us, Construction 2 can be readily extended in the following way, in which case C is defined by the following parity-check matrix.

 where

 3) Theorem 9 is nonconstructive. It is an open question to systematically find good Q_{i}'s for Construction 2.
 4) It turns out that Construction 2 fits in the general framework of Generalized Concatenated (GC) codes, or that of Generalized Error Location (GEL) codes, the two of which have been shown to be equivalent. Hence, in principle, most of the general results regarding these two classes of codes can be applied to Construction 2. Still, there are things that the general results do not provide for. For example, the decoding algorithms for these codes generally only decode up to half the lower bound of the minimum distance. So it is not clear how we may take advantage of the fact when the minimum distance is higher (an example is when d>2d_{a}). Also, in these frameworks, the idea is to build longer, stronger codes using smaller, simpler codes. There, whether the smaller codes would facilitate subline access or not is not of concern. In our case, the idea of local error detection and global error correction will have an impact on how these codes are used, especially on how they can be decoded. The basic idea, as we already discussed in the motivation for Construction 2, is to have properly designed constraints on the syndromes that correspond to the C_{b}-cosets in C_{a}.
 5) Construction 2 may be useful in constructing what are known as Almost-MDS (AMDS) codes. A code is Almost-MDS if its redundancy is equal to its minimum distance. For a given field size and minimum distance, people have found upper and lower bounds on the largest code length possible. In Construction 2, if M=2, d_{a}=2, both C_{a} and C_{b} are MDS, and d=d_{b}, then C will be an AMDS code. As an example, note that the [14,9,5] code over GF(8) we showed earlier is AMDS, and it can be lengthened in 4 steps to obtain an [18,13,5] code, whose parity-check matrix is given as follows:
H_{18,13,5}=[H H′],
 where H is the parity-check matrix for the [14,9,5] code, and
Decoding is in principle two steps for subline access.
1. Error detection in C_{i}, and if errors are detected,
2. Error correction in C.
The first step is usually easy. A straightforward way is to check if the syndrome of the received subline is zero in the corresponding minor code. The complexity of such a check is very manageable. For the second step, brute force bounded distance decoding would involve a table lookup for all error patterns of length n and weight less than d/2. This may not be practical for complexity reasons. In addition, note that there may be error patterns that are easily correctable due to the subline structure of the code, but whose weights are d/2 or greater. For example, if each minor code has a minimum distance of 3, then any error pattern that involves at most one error in each subline can be easily corrected by decoding the sublines in their respective minor codes. On the other hand, such error patterns can have weights up to M, the number of sublines in a codeword, irrespective of d.
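The remark about minor codes of minimum distance 3 can be made concrete with a toy sketch: each subline is decoded on its own, so one error per subline is corrected regardless of the total error weight. Everything below is an assumption made for illustration (GF(7), two length-3 sublines protected by two Reed-Solomon style checks each, and the function name); it is not the decoder of either construction.

```python
P = 7
# Locators of the two sublines (powers of the primitive root 3 modulo 7).
LOCATORS = [[1, 3, 2], [6, 4, 5]]

def decode_subline(r, xs):
    """Detect, and correct at most one error, using the checks sum(r_j x_j) and sum(r_j x_j^2)."""
    s1 = sum(a * x for a, x in zip(r, xs)) % P
    s2 = sum(a * x * x for a, x in zip(r, xs)) % P
    if s1 == 0 and s2 == 0:
        return list(r)                       # step 1: zero syndrome, nothing detected
    if s1 == 0 or s2 == 0:
        raise ValueError("more than one error in this subline")
    loc = s2 * pow(s1, -1, P) % P            # single error e at x_j: s1 = e x_j, s2 = e x_j^2
    if loc not in xs:
        raise ValueError("more than one error in this subline")
    j = xs.index(loc)
    e = s1 * pow(loc, -1, P) % P
    out = list(r)
    out[j] = (out[j] - e) % P
    return out

stored = [[6, 2, 1], [6, 2, 1]]              # one valid codeword per subline
received = [[6, 5, 1], [6, 2, 6]]            # one symbol error in each subline
decoded = [decode_subline(r, xs) for r, xs in zip(received, LOCATORS)]
```

The error pattern here has weight 2, that is, one error in each of the M=2 sublines, and it is corrected purely by the minor codes.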
The above problems can be addressed partly by using a decoding algorithm that takes advantage of the subline structure of the code. For that, we consider the two constructions that have been proposed. Here, we assume that C_{a }and C_{b }are well structured codes with simple decoding algorithms (e.g. RS codes, Hamming codes, etc.).
For Construction 1, the second decoding step can be simplified by decoding the received word in C_{b }instead of in C. This may incur some performance loss (as it really does not take advantage of the minor codes) but will likely have a complexity advantage. Another option is to use an algorithm similar to the one described below (for codes of Construction 2).
Now, consider Construction 2. First, let us see how we may decode a very specific code. Let C be the [14,9,5] code over GF(8) that we constructed in the previous section. Let c be the transmitted/stored codeword, which is received/read in error as r=c+e. Or, denoting the first and last 7 symbols separately, we write (r_{1},r_{2})=(c_{1},c_{2})+(e_{1},e_{2}). If wt(e)≦2, then the following decoding algorithm is guaranteed to correct all errors. In what follows, all decoding is bounded distance decoding.

 1. Let R:={i:r_{i}H_{a}^{T}≠0}. Let s_{i}=r_{i}H_{ba}^{T}, i=1, 2.
 2. If R is empty,
 (a) If s_{1}+s_{2}Q^{T}=0, output ĉ=r, return.
 (b) If s_{1}+s_{2}Q^{T}≠0,
 i. Let ŝ_{1}:=−s_{2}Q^{T}. Decode r_{1} in the coset of C_{b} with syndrome (0,ŝ_{1}). That is, solve ê_{1}H_{b}^{T}=(0,s_{1}−ŝ_{1}) for ê_{1} such that wt(ê_{1})≤2. If such an ê_{1} is found, then let ĉ_{1}:=r_{1}−ê_{1}, output ĉ=(ĉ_{1},r_{2}), return.
 ii. If no such ê_{1} can be found, then let ŝ_{2}:=−s_{1}(Q^{T})^{−1}, and decode r_{2} in the coset of C_{b} with syndrome (0,ŝ_{2}). If r_{2} is successfully decoded to ĉ_{2}, then output ĉ=(r_{1},ĉ_{2}), return. Otherwise, declare a decoding failure.
 3. If |R|=1,
 (a) If R={1}, then let ŝ_{1}:=−s_{2}Q^{T}, and decode r_{1} in the coset of C_{b} with syndrome (0,ŝ_{1}). If r_{1} is successfully decoded to ĉ_{1}, then output ĉ=(ĉ_{1},r_{2}), return. Otherwise, declare a decoding failure.
 (b) If R={2}, then let ŝ_{2}:=−s_{1}(Q^{T})^{−1}, and decode r_{2} in the coset of C_{b} with syndrome (0,ŝ_{2}). If r_{2} is successfully decoded to ĉ_{2}, then output ĉ=(r_{1},ĉ_{2}), return. Otherwise, declare a decoding failure.
 4. If R={1,2}, then
 (a) Let ε_{i}:=r_{i}H_{a}^{T}, i=1,2. For j=1, . . . , 7, calculate t_{j}=(t_{j,1},t_{j,2},t_{j,3}):=s_{2}−ε_{2}h_{j}, where h_{j} is the jth row of H_{ba}^{T}Q^{T}.
 (b) Find j* such that t_{j*,2}/t_{j*,1}=t_{j*,3}/t_{j*,2}. Let k*=log_{α}(t_{j*,2}/t_{j*,1}). If no such j* exists, declare a decoding failure.
 (c) Let u_{j} denote the (length-7) unit vector whose elements are all zeros except the jth, which is a one. Let ĉ_{1}=r_{1}−ε_{1}u_{k*}, and ĉ_{2}=r_{2}−ε_{2}u_{j*}. Output ĉ=(ĉ_{1},ĉ_{2}). Return.
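The ratio test in step 4(b) can be sketched in a few lines. The sketch assumes that GF(8) is represented with the primitive polynomial x^3+x+1 and that field elements are stored as integers 0..7; the text does not state the representation, and the names `gf_mul`, `gf_div` and `locate_by_ratio` are hypothetical. How the returned exponent maps to a symbol position depends on the column ordering of H_{ba}, which is defined in the previous section and is not reproduced here.

```python
# GF(8) built from the primitive polynomial x^3 + x + 1 (an assumption):
# EXP[k] is alpha^k written as a 3-bit integer, LOG is the inverse map.
EXP = [1, 2, 4, 3, 6, 7, 5]
LOG = {value: k for k, value in enumerate(EXP)}


def gf_mul(a, b):
    """Multiply two GF(8) elements."""
    if a == 0 or b == 0:
        return 0
    return EXP[(LOG[a] + LOG[b]) % 7]


def gf_div(a, b):
    """Divide a by b in GF(8); b must be nonzero."""
    if b == 0:
        raise ZeroDivisionError("division by zero in GF(8)")
    if a == 0:
        return 0
    return EXP[(LOG[a] - LOG[b]) % 7]


def locate_by_ratio(t):
    """Return k = log_alpha(t2/t1) if t2/t1 == t3/t2, otherwise None.

    A vector that passes the test is a geometric progression with
    common ratio alpha^k, which is what step 4(b) looks for."""
    t1, t2, t3 = t
    if 0 in (t1, t2, t3):
        return None
    ratio = gf_div(t2, t1)
    if ratio != gf_div(t3, t2):
        return None
    return LOG[ratio]


# Assumed example: error value eps = alpha^2 with ratio alpha^3.
eps, k = EXP[2], 3
t_example = tuple(gf_mul(eps, EXP[(m * k) % 7]) for m in (1, 2, 3))
k_star = locate_by_ratio(t_example)
```

In the decoder, this test would be applied to each t_{j}, j=1, . . . , 7, and the first j that passes gives j* and k*.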
A major part of the above algorithm can be generalized to all codes of Construction 2. The generalized algorithm is given below and resembles the decoding algorithm used for integrated interleaving schemes. In the algorithm, the “syndrome constraint” refers to the fact that for all c=(c_{0}, . . . , c_{M−1})∈C, we have Σ_{i=0}^{M−1} s_{i}Q_{i}^{T}=0, where s_{i}=c_{i}H_{ba}^{T} is the H_{ba}-syndrome for subline i.

 1. Decode each subline in C_{a}. Record the number of errors corrected for subline i as τ_{i}. Set τ_{i}=∞ if decoding failed for subline i.
 2. If there is more than one i such that τ_{i}=∞, declare a decoding failure.
 3. If there is exactly one i such that τ_{i}=∞, then, assuming all other sublines are correctly decoded, solve for the H_{ba}-syndrome of subline i using the syndrome constraint, and decode r_{i} in the corresponding coset of C_{b}. If successful, return; otherwise, declare a decoding failure.
 4. If τ_{i}<∞ for all i, then check whether the syndrome constraint is satisfied. If it is, accept the outputs from the subline decoders and return. Otherwise, sort {τ_{i}} as {τ_{i_{j}}}_{1≤j≤M}, such that τ_{i_{1}}≥τ_{i_{2}}≥ . . . ≥τ_{i_{M}}. For j=1 to M, assume that the H_{ba}-syndrome for subline i_{j} is an erasure and solve for it using the syndrome constraint. Assuming this H_{ba}-syndrome for subline i_{j}, decode r_{i_{j}} in the corresponding coset of C_{b}; if successful, break and return. If the above procedure fails for all j, declare a decoding failure.
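The four steps above can be traced on a deliberately small binary instance. Everything specific in this sketch is an assumption chosen for illustration and is not taken from the text: C_{a} is the [8,7,2] single-parity-check code (H_{a} is the all-ones row), H_{ba} has the 3-bit binary representations of 0..7 as its columns, so C_{b} is the [8,4,4] extended Hamming code, and every Q_{i} is the identity. The function names are hypothetical.

```python
from functools import reduce
from operator import xor

INF = float("inf")


def syn_a(sub):
    """H_a-syndrome: overall parity of the subline."""
    return sum(sub) % 2


def syn_ba(sub):
    """H_ba-syndrome: XOR of the 0-based positions that hold a 1."""
    return reduce(xor, (j for j, bit in enumerate(sub) if bit), 0)


def decode_line(received):
    """Return the corrected line, or None on a decoding failure."""
    r = [list(sub) for sub in received]
    # Step 1: decode each subline in C_a. With d_a = 2 nothing can be
    # corrected, so tau_i is 0 (parity holds) or INF (parity fails).
    tau = [0 if syn_a(sub) == 0 else INF for sub in r]
    failed = [i for i, t in enumerate(tau) if t == INF]
    # Step 2: more than one failed subline is a decoding failure.
    if len(failed) > 1:
        return None
    # With Q_i = I the syndrome constraint says that the XOR of all
    # H_ba-syndromes is zero; `residual` is the amount by which it fails.
    residual = reduce(xor, (syn_ba(sub) for sub in r), 0)
    if len(failed) == 1:
        # Step 3: the inferred H_ba-syndrome of subline i is the XOR of
        # the other sublines' syndromes, so the error in subline i has
        # H_b-syndrome (1, residual): a single bit at position `residual`.
        r[failed[0]][residual] ^= 1
        return r
    # Step 4: all sublines passed C_a decoding.
    if residual == 0:
        return r
    # Treating any subline's H_ba-syndrome as an erasure would require an
    # even-weight, nonzero error in that subline. Its weight is at least 2,
    # which is outside the bounded-distance radius of C_b (d_b = 4), so
    # every trial fails in this toy instance.
    return None


# Assumed example line with M = 4 sublines that satisfies the constraint.
line = [[0, 1, 1, 0, 0, 0, 0, 0],   # H_ba-syndrome 1 ^ 2 = 3
        [1, 0, 0, 1, 0, 0, 0, 0],   # H_ba-syndrome 0 ^ 3 = 3
        [0] * 8,
        [0] * 8]

one_error = [sub[:] for sub in line]
one_error[1][5] ^= 1
```

In this instance a clean subline needs only a parity check on a subline read, while a single bit error anywhere in the line is corrected by combining the failed subline with the H_{ba}-syndromes of the others.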
Theorem 10. Let t_{i} denote the number of errors that occurred in subline i, i=1, . . . , M. Let E_{a}={i:t_{i}≥d_{a}/2}. The above algorithm corrects all error patterns such that
 1. E_{a}=Ø, or
 2. E_{a}={i}, t_{i}<d_{b}/2, and
Proof. If E_{a}=Ø, then all sublines will be corrected and decoding will be successful. If E_{a}={i} and τ_{i}=∞, then all sublines except the ith one will be correctly decoded. The H_{ba}-syndrome for subline i is thus correctly inferred, and the t_{i} errors will be corrected since t_{i}<d_{b}/2. Finally, if E_{a}={i} but τ_{i}<∞ for all i, suppose that, when the H_{ba}-syndrome for subline j is assumed to be an erasure, the decoding algorithm miscorrects the received word to a codeword other than the original one. By the way the algorithm works, this only happens when τ_{i}≤τ_{j}=t_{j}. Let τ′_{j} denote the number of “errors” found when subline j is miscorrected in the last step, when it is decoded in a coset of C_{b}. We note that the Hamming distance between the correct codeword and the miscorrected output is at most
which is a contradiction.
We remark that with appropriate modifications the above algorithm can also be used to decode codes of Construction 1.
Finally, note that, depending on the requirements of the system to which the code is applied, the two-step decoding principle introduced at the beginning of this section may be generalized so that the first step performs a combination of error correction and error detection instead of pure error detection. For example, one may use the minor code alone to correct the retrieved subline up to a certain (small) number of errors, and invoke the line code for further error correction only if more than that many errors are detected in the subline. Such a modified decoding principle will generally trade some reliability for the benefit of falling back less often on decoding the full line. Depending on the particular system scenario, this may be a reasonable trade-off.
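The modified principle amounts to a threshold on how many errors the minor code is trusted to correct by itself. The following control-flow sketch makes that explicit; the function `read_subline`, its parameters and the return conventions of the two decoders are hypothetical and are stated in the comments.

```python
def read_subline(line, i, decode_minor, decode_line, max_minor_errors=1):
    """Return subline i of `line`, or None on a decoding failure.

    Assumed conventions (not from the text):
      decode_minor(subline) -> (corrected_subline, errors_fixed), or None
                               if the minor code cannot decode the subline
      decode_line(line)     -> list of corrected sublines, or None
    """
    outcome = decode_minor(line[i])
    # Accept the minor-code result only if it corrected few enough errors.
    if outcome is not None and outcome[1] <= max_minor_errors:
        return outcome[0]
    # Otherwise fall back on decoding the full line with the line code.
    corrected_line = decode_line(line)
    return None if corrected_line is None else corrected_line[i]
```

Setting `max_minor_errors=0` recovers the original two-step principle, in which the minor code is used for detection only.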
While the invention has been described in terms of a single preferred embodiment, those skilled in the art will recognize that the invention can be practiced with modification within the spirit and scope of the appended claims. For example, while the invention has been described in terms of memories using dynamic random access memory (DRAM) devices, those skilled in the art will recognize that the invention can be practiced using other memory technologies. Further, the invention has been described in the context of redundant symbols stored together with and/or within each line of data, but this is not necessary for the practice of the invention. Some or all of the redundant symbols may be stored in a special-purpose memory distinct from the memory that stores the sublines of data. For example, the shared redundancy might be stored using static random access memory (SRAM) technology. Moreover, the invention is not limited to main memories but may be advantageously applied to storage media incorporated on a processor chip having one or more processor cores.
Priority Applications (1)
Application Number  Priority Date  Filing Date  Title 

US11619929 US7895502B2 (en)  20070104  20070104  Error control coding methods for memories with subline accesses 
Applications Claiming Priority (2)
Application Number  Priority Date  Filing Date  Title 

US11619929 US7895502B2 (en)  20070104  20070104  Error control coding methods for memories with subline accesses 
CN 200810001928 CN101231891A (en)  20070104  20080103  Error control method and memory system 
Publications (2)
Publication Number  Publication Date 

US20080168329A1 true US20080168329A1 (en)  20080710 
US7895502B2 true US7895502B2 (en)  20110222 
Family
ID=39595313
Family Applications (1)
Application Number  Title  Priority Date  Filing Date 

US11619929 Expired  Fee Related US7895502B2 (en)  20070104  20070104  Error control coding methods for memories with subline accesses 
Country Status (2)
Country  Link 

US (1)  US7895502B2 (en) 
CN (1)  CN101231891A (en) 
Families Citing this family (15)
Publication number  Priority date  Publication date  Assignee  Title 

US7930611B2 (en) *  20070309  20110419  Microsoft Corporation  Erasureresilient codes having multiple protection groups 
US7904782B2 (en) *  20070309  20110308  Microsoft Corporation  Multiple protection group codes having maximally recoverable property 
US7996710B2 (en) *  20070425  20110809  HewlettPackard Development Company, L.P.  Defect management for a semiconductor memory system 
US8468416B2 (en) *  20070626  20130618  International Business Machines Corporation  Combined group ECC protection and subgroup parity protection 
US8041990B2 (en) *  20070628  20111018  International Business Machines Corporation  System and method for error correction and detection in a memory system 
US8041989B2 (en) *  20070628  20111018  International Business Machines Corporation  System and method for providing a high fault tolerant memory system 
US8281216B2 (en) *  20090331  20121002  Motorola Solutions, Inc.  Method for assigning and utilizing forward error correcting (FEC) codes 
JP5723967B2 (en)  20100330  20150527  International Business Machines Corporation  The method for recording input data to the s-level storage of solid state storage device, encoder apparatus, and solid-state storage device 
US8484529B2 (en)  20100624  20130709  International Business Machines Corporation  Error correction and detection in a redundant memory system 
US8549378B2 (en)  20100624  20131001  International Business Machines Corporation  RAIM system using decoding of virtual ECC 
US8631271B2 (en)  20100624  20140114  International Business Machines Corporation  Heterogeneous recovery in a redundant memory system 
US8898511B2 (en)  20100624  20141125  International Business Machines Corporation  Homogeneous recovery in a redundant memory system 
US8522122B2 (en)  20110129  20130827  International Business Machines Corporation  Correcting memory device and memory channel failures in the presence of known memory device failures 
WO2015016877A1 (en) *  20130731  20150205  HewlettPackard Development Company, L.P.  Memory unit 
US20160147598A1 (en) *  20130731  20160526  HewlettPackard Development Company, L.P.  Operating a memory unit 
Citations (8)
Publication number  Priority date  Publication date  Assignee  Title 

US4951284A (en) *  19881214  19900821  International Business Machines Corporation  Method and means for correcting random and burst errors 
US6275965B1 (en) *  19971117  20010814  International Business Machines Corporation  Method and apparatus for efficient error detection and correction in long byte strings using generalized, integrated, interleaved reedsolomon codewords 
US20050058199A1 (en) *  20010305  20050317  Lifeng Zhao  Systems and methods for performing bit rate allocation for a video data stream 
US6903887B2 (en)  20020103  20050607  International Business Machines Corporation  Multiple level (ML), integrated sector format (ISF), error correction code (ECC) encoding and decoding processes for data storage or communication devices and systems 
US20070226592A1 (en) *  20060320  20070927  Micron Technology, Inc.  Variable sectorcount ECC 
US7409623B2 (en) *  20041104  20080805  Sigmatel, Inc.  System and method of reading nonvolatile computer memory 
US7512864B2 (en) *  20050930  20090331  Josef Zeevi  System and method of accessing nonvolatile computer memory 
US7739576B2 (en) *  20060831  20100615  Micron Technology, Inc.  Variable strength ECC 
NonPatent Citations (9)
Title 

Arvind M. Patel, Two-Level Coding for Error Control in Magnetic Disk Storage Products, IBM J. Res. Develop., vol. 33, No. 4, Jul. 1989, pp. 470-484. 
Chinese Office Action. 
Hassner et al., Integrated Interleaving - A Novel ECC Architecture, IEEE Transactions on Magnetics, vol. 37, No. 2, Mar. 2001, pp. 773-775. 
Khaled A.S. Abdel-Ghaffar et al., Multilevel Error-Control Codes for Data Storage Channels, IEEE Transactions on Information Theory, vol. 37, No. 3, May 1991, pp. 735-741. 
Mario A. de Boer, Almost MDS Codes, Designs, Codes and Cryptography, 9, pp. 143-155 (1996). 
Maucher et al., On the Equivalence of Generalized Concatenated Codes and Generalized Error Location Codes, IEEE Transactions on Information Theory, vol. 46, No. 2, Mar. 2000, pp. 642-649. 
Wolf, On Codes Derivable From the Tensor Product of Check Matrices, IEEE Transactions on Information Theory, Apr. 1965, pp. 281-284. 
Yves Edel et al., Lengthening and the Gilbert-Varshamov Bound, IEEE Transactions on Information Theory, vol. 43, No. 3, May 1997, pp. 991-992. 
Cited By (12)
Publication number  Priority date  Publication date  Assignee  Title 

US20070255999A1 (en) *  20060407  20071101  Qimonda Ag  Memory Arrangement And Method For Error Correction 
US8910013B1 (en)  20070108  20141209  Marvell International Ltd.  Methods and apparatus for providing multilayered coding for memory devices 
US8627167B1 (en) *  20070108  20140107  Marvell International Ltd.  Methods and apparatus for providing multilayered coding for memory devices 
US8239731B1 (en) *  20070706  20120807  Marvell International Ltd.  Methods and apparatus for providing multilevel coset coding and probabilistic error correction 
US8402345B1 (en)  20070706  20130319  Marvell International Ltd.  Methods and apparatus for providing multilevel coset coding and probabilistic error correction 
US20120159283A1 (en) *  20101220  20120621  Lu ShihLien L  Low overhead error correcting code protection for stored information 
US8539303B2 (en) *  20101220  20130917  Intel Corporation  Low overhead error correcting code protection for stored information 
US20140108889A1 (en) *  20110606  20140417  Rambus Inc.  Memory system for error detection and correction coverage 
US9218243B2 (en) *  20110606  20151222  Rambus Inc.  Memory system for error detection and correction coverage 
US20140068391A1 (en) *  20120901  20140306  Manish Goel  Memory with Segmented Error Correction Codes 
US8745472B2 (en) *  20120901  20140603  Texas Instruments Incorporated  Memory with segmented error correction codes 
US9898365B2 (en)  20130731  20180220  Hewlett Packard Enterprise Development Lp  Global error correction 
Also Published As
Publication number  Publication date  Type 

US20080168329A1 (en)  20080710  application 
CN101231891A (en)  20080730  application 
Legal Events
Date  Code  Title  Description 

AS  Assignment 
Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW Y  Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNORS: HAN, JUNSHENG; LASTRAS-MONTANO, LUIS A.; TROMBLEY, MICHAEL R.; REEL/FRAME: 018721/0096; SIGNING DATES FROM 20061212 TO 20061213 

REMI  Maintenance fee reminder mailed  
LAPS  Lapse for failure to pay maintenance fees  
FP  Expired due to failure to pay maintenance fee 
Effective date: 20150222 