WO2016038673A1 - Error correction device, error correction method, and error correction system - Google Patents
Error correction device, error correction method, and error correction system
- Publication number
- WO2016038673A1 (application PCT/JP2014/073760)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- code
- error
- data
- type
- memory
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/16—Protection against loss of memory contents
Definitions
- the present invention relates to an error correction device, an error correction method, and an error correction system.
- FIG. 1 shows a configuration example of an 8-layer 8-channel stacked memory.
- a stacked memory 100 in which a plurality of memory chips 110 are connected to each other by a TSV (through silicon via) 140 as shown in FIG.
- TSV: through-silicon via
- HBM: High Bandwidth Memory
- HMC: Hybrid Memory Cube
- the stacked memory 100 may include not only the memory chip 110 but also a control chip 120 as shown in FIG. An interface between the stacked memory 100 and the outside is called a channel 130.
- the stacked memory 100 may include a plurality of channels 130.
- FIG. 1 illustrates an example in which eight channels 130 are mounted on the stacked memory 100, which includes eight memory chips 110 and one control chip 120.
- the first type of failure is a transient failure in which data in the memory is temporarily destroyed when, for example, neutrons or α rays collide with and pass through the memory chip.
- the second type of failure is a permanent failure in which the circuit cannot satisfy a desired function due to, for example, circuit wear or the like, and data is permanently destroyed after the failure occurs. Transient faults are also called soft errors.
- error detection and correction codes, which can detect and correct an arbitrary 1-bit error in data by adding redundant check bits to the data, have been used so far as a countermeasure against soft errors.
- the error correction code and the error detection code are collectively referred to as an error control code.
- a 1-bit error correction, 2-bit error detection code (SEC-DED code: Single Error Correction-Double Error Detection Code), which can correct an arbitrary 1-bit error and detect an arbitrary 2-bit error, is widely known.
- FIG. 2 shows an example of the channel format of HBM.
- a channel that is an interface to a memory has a configuration in which a 16-bit check bit 220 can be added to a 128-bit data bit 210.
- FIG. 3 shows an example of an HMC channel format.
- the HMC includes four sets of 32-bit data bits 310 and 4-bit check bits 320, giving a total of 128 data bits and 16 check bits. Therefore, in both the HBM and HMC examples, the 128 data bits and 16 check bits can be regarded as two sets of 64 data bits and 8 check bits, and a method similar to the conventional one can be applied.
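- As a rough check of why 128 data bits with 16 check bits can be handled as two 64 + 8 words, the sketch below computes the minimum number of check bits for a SEC-DED code using the standard extended-Hamming bound k + r <= 2^(r-1). The bound is general coding theory rather than something stated in this document, and the function name `secded_check_bits` is hypothetical.

```python
def secded_check_bits(k: int) -> int:
    """Return the smallest r such that k + r <= 2**(r - 1).

    This is the usual bound for an extended Hamming (or Hsiao) SEC-DED code
    protecting k data bits with r check bits.
    """
    r = 2
    while k + r > 2 ** (r - 1):
        r += 1
    return r


if __name__ == "__main__":
    per_half = secded_check_bits(64)
    print("check bits per 64-bit half:", per_half)
    print("check bits for two halves:", 2 * per_half)
```

- Under this bound, each 64-bit half needs 8 check bits, so two halves use exactly the 16 check bits that the channel formats provide.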
- byte error detection and correction codes, which can detect and correct errors even when a plurality of bits within a block are erroneous at the same time, are used as permanent failure countermeasure techniques.
- the byte is a unit composed of a plurality of consecutive bits, and the number of bits constituting the byte is called a byte length.
- FIG. 4 shows a configuration example of x4 DIMM.
- a DIMM 400 (Dual Inline Memory Module) combines output bits from a plurality of memory chips 410 mounted on the DIMM 400 as shown in FIG. 4 to form a desired data width.
- when a failure that affects the entire memory chip 410 occurs, for example due to a failure of the row address decoder or of the power supply circuit, a plurality of bits output from the failed memory chip become erroneous.
- a DIMM 400 that outputs 4 bits from each memory chip 410 constitutes 64-bit data by collecting output bits from 16 memory chips 410.
- a 4-bit block output from the failed memory chip out of 64-bit data is erroneous.
- a byte error control code capable of detecting and correcting these byte errors is applied as a countermeasure against permanent memory chip failures. For example, for errors with a byte length of 4, adding 16 check bits to 128 data bits makes it possible to correct an arbitrary 1-byte error in the total of 144 data and check bits.
- one such code is the 1-byte error correction, 2-byte error detection code for 4-bit bytes (S4EC-D4ED code: Single 4-bit Error Correction-Double 4-bit Error Detection Code).
- Patent Document 1 describes a cross-interleaved Reed-Solomon code (CIRC), which applies two codes in combination to form a code with higher error control capability than when each code is applied independently.
- Non-Patent Document 1 describes a specific construction method for the SEC-DED-SbED code, which has error control capability equivalent to that of the SEC-DED code and can also detect a byte error with a byte length of b bits.
- for the stacked memory 100 used in high-reliability application fields such as HPC and data centers, it is desirable to apply a countermeasure against permanent failures of a memory unit or channel.
- the stacked memory 100, such as an HBM or HMC, is provided as a module in which a plurality of memory chips 110 are stacked and connected by the TSV 140, so the computer system designer cannot add a separate chip that outputs check bits.
- the bit widths of the data bits and check bits in the channel 130, which is the interface to the stacked memory 100, are defined in advance by the specifications, and the computer system designer cannot increase the number of check bits in the channel. Therefore, when an error control code is applied as a countermeasure against a permanent failure of a memory unit or channel in a stacked memory 100 such as an HBM or an HMC, only codes that can be constructed with the predetermined number of check bits can be applied.
- when each channel 130 is built into a memory chip 110 and a permanent failure occurs in that memory chip 110, which is an example of the memory unit, all 144 bits output from the corresponding channel (128 data bits and 16 check bits) are faulty. That is, it is necessary to handle an error with a longer byte length than in the conventional DIMM 400, and detecting and correcting errors with a long byte length requires more check bits. Therefore, the permanent failure countermeasure technique used in the conventional DIMM, the cross-interleaved Reed-Solomon code described in Patent Document 1, the SEC-DED-SbED code described in Non-Patent Document 1, and the like cannot be applied.
- the first problem is that the bit widths of data bits and check bits output from each memory unit are determined in advance, and new bits cannot be added as in the conventional DIMM 400.
- the second problem is that when a permanent failure occurs, the number of erroneous bits is larger than that of the conventional DIMM 400 or the like.
- This proposal was devised in view of the above problems, and discloses a configuration and method for detecting and correcting an error caused by a permanent failure of a memory unit or a channel in a memory device.
- the present invention employs the following configuration, for example.
- An error correction device for reading data from a memory device and correcting an error in the read data, wherein the memory device holds first data encoded with a product code of a first type code and a second type code. In the first data, the data length over which an arbitrary first type codeword encoded by the first type code overlaps an arbitrary second type codeword encoded by the second type code is less than or equal to the byte length.
- the error correction device reads the first data from the memory device and performs a decoding process on the read first data. When the second decoding processing unit detects an uncorrectable error in a first codeword of the second type acquired from the memory device, an error flag indicating that this codeword contains an error is set.
- the first decoding processing unit corrects an error in a first byte, based on the error flag, in a first codeword of the first type. This codeword includes the first byte, which is also included in the first codeword of the second type, and is composed of bytes stored in different memory units in the memory device.
- erroneous data output from a failure location can be detected with high accuracy when a memory unit or channel failure occurs in the memory device.
- Example 1 It is a block diagram which shows the structural example of the laminated memory of 8 layers 8 channels. It is a figure which shows the example of the channel format of HBM. It is a figure which shows the example of the channel format of HMC. It is a figure which shows the structural example of x4 DIMM.
- Example 1 it is a block diagram which shows the structural example of an error correction system.
- Example 1 it is a block diagram which shows the structural example of the memory chip by which 1 channel is arrange
- Example 1 it is a figure which shows the 1st example of the byte division
- Example 1 it is a block diagram which shows the structural example of an error check code
- Example 1 it is a block diagram which shows the structural example of a CODE_H decoding process part.
- Example 1 it is a block diagram which shows the structural example of a CODE_V decoding process part.
- Example 1 it is a flowchart which shows the 1st example of a decoding process.
- Example 1 it is a flowchart which shows the 2nd example of a decoding process.
- Example 1 it is a figure which shows the example of the 2nd error pattern when the memory chip in a laminated memory fails.
- Example 1 it is a figure which shows the example of the 3rd error pattern when the memory chip in a laminated memory fails. In Example 1, it is a figure which shows the example of the error pattern in the data of 2 cycles output from the same channel. In Example 1, it is a figure which shows the 2nd example of the byte division
- FIG. 3 is a block diagram illustrating a configuration example of a memory chip in which two channels are arranged in the first embodiment. In Example 1, it is a figure which shows the 1st example of the error control code applied when 2 channels are arrange
- Example 1 it is a figure which shows the example of application of an error control code when a bank failure is assumed.
- Example 2 it is a figure which shows the example of an error control code application using four laminated memories.
- Example 3 it is a figure which shows the channel structural example of HMC.
- Example 3 it is a figure which shows the example of the error pattern by TSV failure.
- Example 3 it is a figure which shows the example of the error pattern before and behind the data rearrangement which fixes the bit which becomes an error by TSV failure to a specific byte.
- FIG. 5 shows a configuration example of the error control system of this embodiment.
- the error control system includes a stacked memory 100 and a processor chip 700 connected to the stacked memory 100.
- the stacked memory 100 is an example of a memory device, and has a configuration similar to that of FIG.
- the processor chip 700 includes a memory controller 710, a plurality of processors 720, and a DMA control unit 730.
- the memory controller 710 performs data error control and read / write control to the memory from the processor 720 and the DMA control unit 730.
- the processor 720 operates in accordance with a program, inputs / outputs data, reads / writes data, and executes each program to be described later.
- the DMA control unit 730 controls communication in DMA transfer.
- the memory controller 710 includes, for example, a memory interface 711, a write control unit 712, an error check code encoding unit 713, a read control unit 714, and an error check code decoding unit 715.
- the memory interface 711 is an interface that inputs and outputs data and the like from the stacked memory 100.
- the write control unit 712 / read control unit 714 is a program and controls writing / reading of data to / from the stacked memory 100 from the processor 720 and the DMA control unit 730.
- the error check code encoding unit 713 includes a program and performs an encoding process on data written to the stacked memory 100.
- the error check code decoding unit 715 includes a program, performs a decoding process on data read from the stacked memory 100, and performs error detection and error correction.
- the error control system of the present embodiment is not limited to the configuration of FIG. 5. For example, the memory controller 710 may be configured in the control chip 120 in the stacked memory 100.
- the program is executed by the processor 720 to perform a predetermined process using the storage device and the memory interface 711. Therefore, in the present embodiment and other embodiments, the description with the program as the subject may be the description with the processor 720 as the subject. Alternatively, the process executed by the program is a process performed by a computer and a computer system on which the program operates.
- the processor 720 operates as a functional unit that realizes a predetermined function by operating according to a program.
- the processor 720 functions as a write control unit by operating according to the write control unit 712, and functions as a read control unit by operating according to the read control unit 714.
- the processor 720 also operates as a functional unit that implements each of a plurality of processes executed by each program.
- a computer and a computer system are an apparatus and a system including these functional units.
- the program can be installed in each computer by a program distribution server or a computer-readable non-transitory storage medium, and can be stored in a nonvolatile storage device of each computer.
- FIG. 6 shows a configuration example of channels when one channel is arranged in one memory chip.
- the stacked memory 100 is composed of eight layers of memory chips and incorporates a total of eight channels. That is, there is a one-to-one correspondence between memory chips and channels.
- FIG. 6 shows the case where the memory chip 1 (111) fails and all 144 bits (128 data bits and 16 check bits) input or output when accessing the channel 1 (131) become erroneous.
- the error means that the data read / written to / from the memory has a value different from that originally expected.
- An error control code that detects and corrects a 144-bit error by adding a 16-bit check bit to 128-bit data is not known.
- FIG. 7 shows an example of an error control code applied in this embodiment.
- FIG. 8 shows a first example of the byte division format of the channel.
- the system disclosed in the present embodiment applies a product code based on two error control codes to a data set obtained by collecting a plurality of channels, which is output from a plurality of stacked memories as shown in FIG.
- the error check code encoding unit 713 first divides the total of 144 data bits and check bits of each channel into a plurality of bytes (B0 to B9 and C0), for example as shown in FIG. 8. Of the divided bytes, the 10 bytes B0 to B9 each contain 13 or 11 data bits, for a total of 128 data bits. Each byte from B0 to B9 also contains one check bit. The error check code encoding unit 713 configures the first code CODE_V using these check bits.
- Each channel includes a byte C0 including 5 check bits in the second code CODE_H and 1 check bit for applying the code CODE_V to the 5 check bits.
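- The byte division described above can be tabulated to confirm that it accounts for all 144 bits of a channel. The sketch below assumes that B0 to B8 carry 13 data bits each and B9 carries 11, which is inferred from the stated byte sizes and the 128-bit data total. The names `LAYOUT`, `totals`, and `byte_lengths` are hypothetical.

```python
# Per-byte contents of one channel word:
# (data bits, CODE_V check bits, CODE_H check bits)
LAYOUT = {f"B{i}": (13, 1, 0) for i in range(9)}  # B0..B8 (assumed 13 data bits each)
LAYOUT["B9"] = (11, 1, 0)                         # shorter last data byte
LAYOUT["C0"] = (0, 1, 5)                          # check-bit byte


def totals(layout):
    """Return (total data bits, total check bits) for one channel word."""
    data = sum(d for d, _, _ in layout.values())
    check = sum(v + h for _, v, h in layout.values())
    return data, check


def byte_lengths(layout):
    """Return the bit length of each byte, including its check bits."""
    return {name: sum(parts) for name, parts in layout.items()}


if __name__ == "__main__":
    print(totals(LAYOUT))
    print(byte_lengths(LAYOUT))
```

- With this layout the longest byte is 14 bits, which matches the byte length that the S14ED capability of CODE_V is designed to detect.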
- the error check code encoding unit 713 collects the corresponding divided bytes from each of the 16 channels CH0 to CH15 included in two stacked memories (stacked memory 0 (100) and stacked memory 1 (101)).
- the error check code encoding unit 713 applies, for example, a SEC-DED-S14ED code as the first code CODE_V to each group of collected bytes.
- the error check code encoding unit 713 collects 16 bytes B0 from channel CH0 to channel CH15.
- the error check code encoding unit 713 similarly applies the (224, 208) SEC-DED-S14ED code to B1 to B9. Note that since the bit length of the data bit to be encoded is not 13 in B9, the error check code encoding unit 713 applies a shortened code to B9.
- each memory chip is a memory unit.
- the error check code encoding unit 713 collects two cycles (cycle 0 and cycle 1) of data from one channel and applies the second code CODE_H to it, using a total of 10 check bits (5 bits × 2) from the two cycles of C0 and a total of 276 data bits from the two cycles of B0 to B9.
- the error check code encoding unit 713 applies, for example, a (286,276) SEC-DED code as CODE_H.
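- The code parameters quoted in this embodiment follow from the channel format by simple counting. The sketch below reproduces the (224, 208) and (286, 276) figures. The constant and function names are hypothetical, and the counting assumes the full-length 14-bit bytes B0 to B8.

```python
CHANNELS = 16                 # CH0..CH15 across two stacked memories
CYCLES = 2                    # cycle 0 and cycle 1
BYTE_LEN = 14                 # 13 data bits + 1 CODE_V check bit
CHANNEL_BITS = 144            # data bits + check bits per channel per cycle
C0_BITS = 6                   # 5 CODE_H check bits + 1 CODE_V check bit
CODE_H_CHECK_PER_CYCLE = 5


def code_v_params():
    """(n, k) of CODE_V over one byte position collected from all channels."""
    n = CHANNELS * BYTE_LEN
    k = n - CHANNELS          # one CODE_V check bit per channel byte
    return n, k


def code_h_params():
    """(n, k) of CODE_H over two cycles of one channel."""
    k = CYCLES * (CHANNEL_BITS - C0_BITS)   # bytes B0..B9 of both cycles
    n = k + CYCLES * CODE_H_CHECK_PER_CYCLE
    return n, k


if __name__ == "__main__":
    print("CODE_V:", code_v_params())
    print("CODE_H:", code_h_params())
```

- The two C0 bits that belong to CODE_V (1 bit × 2) are not part of the CODE_H codeword, which is why its length is 286 rather than 288.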
- the SEC-DED code is a code that can correct an arbitrary 1-bit error in a codeword and detect an arbitrary 2-bit error. Furthermore, the SEC-DED code can probabilistically detect errors of 3 bits or more.
- the SEC-DED-S14ED code is a code that, in addition to capability equivalent to that of the SEC-DED code described above, can detect an arbitrary 1-byte error in a codeword when the byte length is 14 bits.
- FIG. 9 shows a comparative example in which a code is applied to one cycle of data.
- the example of FIG. 9 is different from the example of FIG. 7 in that CODE_H is applied to 2-channel data in one cycle.
- the data of two channels is collected in order to secure the number of check bits for applying the SEC-DED code as CODE_H.
- CODE_H is applied to two cycles of data output from the same memory chip.
- an overlapping portion between an arbitrary codeword encoded by CODE_H and an arbitrary codeword encoded by CODE_V is equal to or less than the byte error detection length (1 byte) in CODE_V. Therefore, when an uncorrectable error due to CODE_H is detected, the channel in which the error has occurred can be uniquely determined by decoding in CODE_V.
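- The overlap property can be checked by enumerating bit positions. The sketch below is a simplified model, not the patent's implementation: it treats a CODE_H codeword as all bits of one channel over two cycles (it does not exclude the two C0 bits that belong only to CODE_V) and a CODE_V codeword as one byte position collected from all 16 channels in one cycle. All names are hypothetical.

```python
from itertools import product

# Bit length of each byte of a channel word (B0..B8: 14, B9: 12, C0: 6).
BYTES = {f"B{i}": 14 for i in range(9)}
BYTES["B9"] = 12
BYTES["C0"] = 6
CHANNELS = range(16)
CYCLES = range(2)


def code_h_word(channel):
    """Bit positions of one channel over both cycles (simplified CODE_H word)."""
    return {
        (cycle, channel, name, bit)
        for cycle in CYCLES
        for name, length in BYTES.items()
        for bit in range(length)
    }


def code_v_word(cycle, name):
    """Bit positions of byte `name` from every channel in one cycle."""
    return {(cycle, ch, name, bit) for ch in CHANNELS for bit in range(BYTES[name])}


def max_overlap():
    """Largest intersection between any CODE_H word and any CODE_V word."""
    return max(
        len(code_h_word(ch) & code_v_word(cycle, name))
        for ch, cycle, name in product(CHANNELS, CYCLES, BYTES)
    )


if __name__ == "__main__":
    print("largest overlap in bits:", max_overlap())
```

- In this model no pair of codewords shares more than one 14-bit byte, so a failed channel corrupts at most one byte of each CODE_V codeword.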
- FIG. 10 shows a configuration example of the error check code decoding unit 715.
- the error check code decoding unit 715 includes a CODE_H decoding processing unit 1210 that performs a decoding process on the code CODE_H and a CODE_V decoding processing unit 1220 that performs a decoding process on the code CODE_V.
- the input data output from each channel of the stacked memories 100 to 101 is first input to the CODE_H decoding processing unit 1210, where the SEC-DED code (CODE_H) is decoded.
- the CODE_H decoding processing unit 1210 may perform CODE_H decoding processing on input data output from each channel in parallel.
- FIG. 11 shows a configuration example of the CODE_H decoding processing unit 1210.
- the CODE_H decoding processing unit 1210 includes a buffer 1211 and a syndrome generation unit 1212, an error correction unit 1213, and a syndrome decoding unit 1214, which are programs.
- since the code CODE_H is applied to two cycles of data output from the same channel, the buffer 1211 holds the data from one cycle earlier. Note that the buffer 1211 may be included in the read control unit 714.
- the syndrome generation unit 1212 generates a syndrome in CODE_H for the input data output from the stacked memory.
- the linear code has a matrix called a check matrix that defines each code, and the syndrome is a vector value calculated as a product of the check matrix and the code word.
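- A syndrome computed as the product of a check matrix and a word can be illustrated with a very small code. The sketch below uses a toy (8, 4) SEC-DED code with odd-weight columns rather than the (286, 276) or (224, 208) codes of the embodiment, whose check matrices are not given in this text. The matrix `H` and the helper names are hypothetical.

```python
# Check matrix H = [P | I] of a toy (8, 4) SEC-DED code (all columns odd weight).
H = [
    [1, 1, 1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 1, 0, 0],
    [1, 0, 1, 1, 0, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 0, 1],
]
DATA_BITS = 4


def encode(data):
    """Append check bits so that H times the codeword is zero (mod 2)."""
    checks = [sum(row[j] * data[j] for j in range(DATA_BITS)) % 2 for row in H]
    return list(data) + checks


def syndrome(word):
    """Syndrome vector: product of the check matrix and the word, mod 2."""
    return [sum(h * w for h, w in zip(row, word)) % 2 for row in H]


if __name__ == "__main__":
    codeword = encode([1, 0, 1, 1])
    print(codeword, syndrome(codeword))
    corrupted = list(codeword)
    corrupted[2] ^= 1                      # inject a 1-bit error
    print(corrupted, syndrome(corrupted))
```

- For an error-free codeword the syndrome is all zeros, and for a 1-bit error it equals the column of `H` at the flipped position.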
- the syndrome decoding unit 1214 determines the presence / absence of an error and the location where the error occurred based on the value of the syndrome generated by the syndrome generation unit 1212.
- the syndrome decoding unit 1214 transmits an error occurrence flag signal to the CODE_V decoding processing unit 1220 when it is determined that an uncorrectable error has occurred due to the decoding processing by CODE_H.
- the error correction unit 1213 corrects the error in which the syndrome decoding unit 1214 specifies the occurrence location.
- FIG. 12 shows a configuration example of the CODE_V decoding processing unit 1220.
- the CODE_V decoding processing unit 1220 includes a syndrome generation unit 1221, an error correction unit 1222, a syndrome decoding unit 1223, and an error occurrence flag check unit 1224 which are programs.
- the syndrome generation unit 1221 generates a syndrome in CODE_V for the intermediate data output from the CODE_H decoding processing unit 1210.
- the syndrome decoding unit 1223 determines the presence / absence of an error and the location where the error occurred based on the value of the syndrome generated by the syndrome generation unit 1221.
- the error correction unit 1222 corrects the error whose location has been identified by the syndrome decoding unit 1223.
- the error occurrence flag checking unit 1224 uses the error occurrence flag signal received from the CODE_H decoding processing unit 1210 and the determination result by the syndrome decoding unit 1223 to determine the presence / absence of an error and the location where the error has occurred.
- the error control system of this embodiment detects 100% of 144-bit errors that occur when the memory chip of the stacked memory fails due to the S14ED capability of code CODE_V.
- the 144 bits output from each memory chip are divided into 11 bytes from B0 to C0 as shown in FIG. 8, and are distributed to different CODE_V codewords.
- B0 to B8 are each composed of 14 bits
- B9 is composed of 12 bits
- C0 is composed of 6 bits.
- Each codeword of CODE_V includes data of B0 to C0 in each memory chip. Therefore, the SEC-DED-S14ED code used as CODE_V, which can detect a 1-byte error with a byte length of 14, makes it possible to detect the 144-bit error that occurs when a memory chip fails.
- the SEC-DED-S14ED code can detect a 1-byte error with a byte length of 14 but cannot correct the error. For this reason, the SEC-DED-S14ED code alone cannot correct a 144-bit error that occurs when a memory chip fails.
- however, if it is known separately which of the bytes constituting the codeword is in error, the SEC-DED-S14ED code can identify the error positions within that byte from the generated syndrome, and the error in the byte can be corrected.
- for example, when the SEC-DED-S14ED code is applied to the bytes B0 of 16 channels, it can detect that one of the 16 bytes B0 is in error. If it is also known separately that the error is in B0 of channel 1, the error positions within B0 of channel 1 can be identified.
- the error control system uses the error occurrence flag that can be set as a result of the decoding process of CODE_H, which is the second code, to specify the memory chip in which the failure has occurred. Further, the error control system corrects the bytes output from the memory chip included in each CODE_V code word according to the generated syndrome.
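- The idea that a detected byte error becomes correctable once the failing byte is known from another source can be sketched as follows. This is a toy illustration under stated assumptions, not the (224, 208) code of the embodiment: it uses an (8, 4) odd-weight-column SEC-DED code split into 2-bit bytes, and the names `COLUMNS`, `syndrome`, and `correct_flagged_byte` are hypothetical.

```python
from itertools import product

BYTE_LEN = 2
# Columns of the check matrix, one 4-bit integer per codeword bit (toy code).
COLUMNS = [0b0111, 0b1011, 0b1101, 0b1110, 0b0001, 0b0010, 0b0100, 0b1000]


def syndrome(word):
    """XOR of the check-matrix columns at the positions where the word has a 1."""
    s = 0
    for bit, col in zip(word, COLUMNS):
        if bit:
            s ^= col
    return s


def correct_flagged_byte(word, flagged_byte):
    """Correct any error confined to `flagged_byte`, or return None.

    Tries every error pattern inside the flagged byte and keeps the one whose
    syndrome matches the observed syndrome.
    """
    observed = syndrome(word)
    start = flagged_byte * BYTE_LEN
    cols = COLUMNS[start:start + BYTE_LEN]
    for pattern in product([0, 1], repeat=BYTE_LEN):
        s = 0
        for bit, col in zip(pattern, cols):
            if bit:
                s ^= col
        if s == observed:
            fixed = list(word)
            for offset, bit in enumerate(pattern):
                fixed[start + offset] ^= bit
            return fixed
    return None


if __name__ == "__main__":
    codeword = [1, 0, 1, 1, 0, 0, 1, 0]
    corrupted = [1, 0, 0, 0, 0, 0, 1, 0]   # both bits of byte 1 flipped
    print(correct_flagged_byte(corrupted, flagged_byte=1) == codeword)
```

- Without the flag, this 2-bit error would only be detected. With the flag, the columns inside the flagged byte are linearly independent, so the error pattern is determined uniquely by the syndrome.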
- FIG. 13 shows a first example of decoding processing by the error check code decoding unit 715.
- for the input data of each channel, which combines the first-cycle data held in the buffer 1211 and the second-cycle data received from the stacked memory 100, the syndrome generation unit 1212 generates a syndrome in CODE_H for each CODE_H codeword, that is, for the data excluding the 2 bits (1 bit × 2) of C0 that are not encoded by CODE_H.
- the syndrome generation unit 1212 transmits the generated syndrome to the syndrome decoding unit 1214 (S1101).
- the syndrome decoding unit 1214 determines the presence / absence of an error and the location where the error occurred in the channel based on the received syndrome value, and transmits the determination result to the error correction unit 1213 (S1102). In the SEC-DED code, the syndrome decoding unit 1214 determines that there is no error in the codeword when the value of the syndrome is 0. Further, when the syndrome value matches any value of the column vector in the parity check matrix, the syndrome decoding unit 1214 determines that a 1-bit error has occurred. If the syndrome value is not 0 and does not match any column vector in the parity check matrix, the syndrome decoding unit 1214 determines that an uncorrectable 2-bit error has occurred.
- the SEC-DED code can probabilistically detect an error of 3 bits or more. That is, even for an error of 3 bits or more, it is probable that the syndrome is not 0 and does not match any column vector of the check matrix. In this case, the syndrome decoding unit 1214 determines that an uncorrectable error has occurred, similarly to the 2-bit error.
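- The three-way decision made from the syndrome value can be sketched on a toy code. The example below assumes an (8, 4) SEC-DED code with odd-weight columns. It is not the (286, 276) code of the embodiment, and the names `COLUMNS`, `syndrome`, and `classify` are hypothetical.

```python
# Columns of the check matrix, one 4-bit integer per codeword bit (toy code).
COLUMNS = [0b0111, 0b1011, 0b1101, 0b1110, 0b0001, 0b0010, 0b0100, 0b1000]


def syndrome(word):
    """XOR of the check-matrix columns at the positions where the word has a 1."""
    s = 0
    for bit, col in zip(word, COLUMNS):
        if bit:
            s ^= col
    return s


def classify(s):
    """Map a syndrome to ('no_error' | 'one_bit' | 'uncorrectable', bit position)."""
    if s == 0:
        return ("no_error", None)
    if s in COLUMNS:
        return ("one_bit", COLUMNS.index(s))   # syndrome equals a column
    return ("uncorrectable", None)             # nonzero, matches no column


def flip(word, positions):
    """Return a copy of `word` with the given bit positions inverted."""
    out = list(word)
    for p in positions:
        out[p] ^= 1
    return out


if __name__ == "__main__":
    zero = [0] * 8                              # the all-zero word is a codeword
    for errors in ([], [2], [0, 1], [0, 1, 2]):
        print(errors, classify(syndrome(flip(zero, errors))))
```

- In this toy code the 3-bit error at positions 0, 1, and 2 produces the same syndrome as a 1-bit error at position 4, which illustrates why errors of 3 bits or more are detected only probabilistically.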
- when the syndrome decoding unit 1214 does not detect an error (S1102: no error), the error correction unit 1213 does not correct the input data in the channel and transmits the data as it is to the CODE_V decoding processing unit 1220 as intermediate data in the channel. The process then proceeds to step S1105 described later.
- when the syndrome decoding unit 1214 detects a 1-bit error (S1102: 1-bit error), the error correction unit 1213 corrects the 1-bit error.
- Each syndrome calculated as the product of a check matrix of a code having 1-bit error correction capability and a code word of 1-bit error satisfies the property that all syndrome values are different for all 1-bit error patterns. Accordingly, the syndrome decoding unit 1214 can uniquely determine the position of the error occurrence bit for the 1-bit error based on the syndrome value.
- the error correction unit 1213 corrects the data by inverting the bit at the uniquely determined bit position of the input data in the channel (S1103). At this time, the error correction unit 1213 transmits the corrected data to the CODE_V decoding processing unit as intermediate data in the channel.
- for an error of 2 bits or more, the syndrome decoding unit 1214 cannot uniquely identify the erroneous bit positions from the syndrome value. Therefore, the error correction unit 1213 cannot correct the error.
- when the syndrome decoding unit 1214 detects an error of 2 bits or more (S1102: uncorrectable error detection), it sets an error occurrence flag indicating that an error of 2 bits or more has occurred in the channel, and transmits the error occurrence flag signal for the channel to the CODE_V decoding processing unit 1220 (S1104).
- in this case, the error correction unit 1213 does not correct the input data in the channel but transmits the data as it is to the CODE_V decoding processing unit 1220 as intermediate data in the channel.
- the CODE_H decoding processing unit 1210 performs the processing in steps S1101 to S1104 for all channels, that is, for all CODE_H codewords.
- the above is the decoding processing by the CODE_H decoding processing unit 1210.
- next, the decoding processing by the CODE_V decoding processing unit 1220 will be described.
- the code CODE_V is applied to 16 channels of data including two stacked memories.
- the syndrome generation unit 1221 generates a syndrome in CODE_V for data obtained by collecting the bytes of intermediate data for 16 channels output from the CODE_H decoding processing unit 1210 (S1105).
- the syndrome generation unit 1221 transmits the generated syndrome to the syndrome decoding unit 1223.
- the syndrome decoding unit 1223 performs error detection and correction on the bytes using the syndrome received from the syndrome generation unit 1221 (S1106). At this time, the error occurrence flag checking unit 1224 detects which error occurrence flags are set based on the error occurrence flag signal received from the CODE_H decoding processing unit 1210.
- for the SEC-DED-S14ED code, the syndrome decoding unit 1223 uses the generated syndrome to determine whether an error has occurred, to correct 1-bit errors, and to detect 2-bit errors, in the same way as the syndrome decoding unit 1214 does for the SEC-DED code. In addition, the syndrome decoding unit 1223 can detect a byte error with a byte length of 14.
- when the syndrome decoding unit 1223 detects no error, the error check code decoding unit 715 normally ends the decoding process on the data in which the bytes are collected (S1110).
- when the syndrome decoding unit 1223 detects a 1-bit error, the error correction unit 1222 corrects the 1-bit error (S1107). Subsequently, the error check code decoding unit 715 normally ends the decoding process on the data in which the bytes are collected (S1110).
- when the syndrome decoding unit 1223 detects an error that it cannot correct, the error occurrence flag checking unit 1224 checks, based on the error occurrence flag signal received from the syndrome decoding unit 1214, whether exactly one error occurrence flag is set (S1108).
- when the error occurrence flag checking unit 1224 determines that exactly one error occurrence flag is set (S1108: YES), it determines that a byte error has occurred in the channel for which the flag is set. At this time, the error occurrence flag checking unit 1224 transmits an error occurrence channel signal including information on the channel to the error correction unit 1222. The error correction unit 1222 corrects the byte value output from the channel indicated by the error occurrence channel signal based on the syndrome value (S1109). Subsequently, the error check code decoding unit 715 normally ends the decoding process (S1110).
- when the error occurrence flag checking unit 1224 determines that the number of set error occurrence flags is not exactly one (S1108: No), errors may have occurred in a plurality of channels. The error occurrence flag checking unit 1224 then cannot uniquely identify which channel is faulty, so the error correction unit 1222 cannot correct the error. In this case, the error occurrence flag checking unit 1224 sets an uncorrectable error detection signal indicating that an uncorrectable error has been detected, and ends error detection in the data in which the relevant bytes are collected (S1111).
- the CODE_V decoding processing unit 1220 performs the processing in steps S1105 to S1111 for all codewords in CODE_V. The above is the decoding processing by the CODE_V decoding processing unit 1220.
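- The branch structure of steps S1106 to S1111 can be summarized as a small decision function. This is a minimal sketch of the control flow only: the status strings, the return values, and the function name `code_v_decide` are hypothetical, and the actual syndrome computation and byte correction are left out.

```python
def code_v_decide(v_status, flagged_channels):
    """Choose the CODE_V action for one codeword.

    v_status: 'none', 'one_bit', or 'uncorrectable' (result of CODE_V syndrome
              decoding; 'uncorrectable' covers detected byte errors).
    flagged_channels: set of channels whose CODE_H error occurrence flag is set.
    """
    if v_status == "none":
        return ("done", None)                   # S1110: normal end
    if v_status == "one_bit":
        return ("correct_bit", None)            # S1107, then S1110
    if len(flagged_channels) == 1:              # S1108: exactly one flag?
        (channel,) = flagged_channels
        return ("correct_byte", channel)        # S1109, then S1110
    return ("uncorrectable", None)              # S1111: report the error


if __name__ == "__main__":
    print(code_v_decide("none", set()))
    print(code_v_decide("one_bit", set()))
    print(code_v_decide("uncorrectable", {6}))
    print(code_v_decide("uncorrectable", {2, 6}))
    print(code_v_decide("uncorrectable", set()))
```

- Byte correction is attempted only when exactly one channel is flagged. With zero or several flags the failing channel is ambiguous, so the error is reported as uncorrectable.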
- because CODE_H detects errors of 3 bits or more only probabilistically, the CODE_H decoding processing unit 1210 may fail to detect an error even if one has occurred. In that case, no error occurrence flag is set, and even if the CODE_V decoding processing unit 1220 detects a byte error, the error cannot be corrected. In step S1108, the error occurrence flag checking unit 1224 may then set an uncorrectable error detection signal.
- the uncorrectable error detection signal is sent to the memory controller 710, for example.
- the memory controller 710 may leave a record of errors, for example, by setting a value in a register indicating that an uncorrectable error has occurred.
- the operating system may take measures such as restarting the system using error recording or excluding the memory address from the page allocation target.
- the error occurrence flag checking unit 1224 may notify not only that an uncorrectable error has occurred but also the address information on which the uncorrectable error has occurred, for example, to the memory controller 710 or the like.
- the processing when the occurrence of these uncorrectable errors is detected can be determined by individual system design and is not limited to the processing described above.
- FIG. 14 shows a second example of the decoding process performed by the error check code decoding unit 715. Only the differences between FIG. 14 and FIG. 13 will be described.
- The error occurrence flag checking unit 1224 checks whether no error occurrence flag has been set (S2812). That is, the check in step S2812 is performed when the syndrome decoding unit 1223 finds no error in CODE_V.
- If no error occurrence flag is set, the error check code decoding unit 715 normally ends the decoding process on the byte (S1110).
- If an error occurrence flag is set, the error occurrence flag checking unit 1224 sets an uncorrectable error detection signal indicating that an uncorrectable error has been detected, and ends error detection for the data in which the bytes are collected (S1111).
- In this way, the CODE_V decoding processing unit 1220 can detect the occurrence of an error from the state of the error occurrence flags even when it misses an error that CODE_V can detect only probabilistically.
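The decision taken around steps S1108 to S1111, together with the additional check S2812 of FIG. 14, can be sketched as follows. This is a simplified illustration under stated assumptions, not the circuit of the embodiment: the names `decide`, `Outcome` and `check_flags_when_clean` are hypothetical, and only the byte-error branch is modelled (any single-bit correction that CODE_V performs by itself is left out).

```python
from enum import Enum


class Outcome(Enum):
    OK = "normal end (S1110)"
    CORRECTED = "byte corrected (S1109), then normal end (S1110)"
    UNCORRECTABLE = "uncorrectable error signal set (S1111)"


def decide(code_v_byte_error, flagged_channels, check_flags_when_clean):
    """Return (outcome, channel to correct or None).

    code_v_byte_error      -- True if CODE_V reported a byte error
    flagged_channels       -- set of channels whose CODE_H flag is set
    check_flags_when_clean -- True models FIG. 14 (step S2812),
                              False models FIG. 13
    """
    if not code_v_byte_error:
        # S2812 (FIG. 14 only): a set flag without a CODE_V error means
        # CODE_V missed an error that it detects only probabilistically.
        if check_flags_when_clean and flagged_channels:
            return Outcome.UNCORRECTABLE, None
        return Outcome.OK, None
    # S1108: correction is possible only if exactly one channel is flagged.
    if len(flagged_channels) == 1:
        (channel,) = flagged_channels
        return Outcome.CORRECTED, channel
    return Outcome.UNCORRECTABLE, None


if __name__ == "__main__":
    print(decide(True, {6}, False))      # one flagged channel: correctable
    print(decide(True, {2, 6}, False))   # two flagged channels: ambiguous
    print(decide(False, {2}, True))      # FIG. 14 catches the missed error
```

With `check_flags_when_clean=False` the third call would end normally, which is the gap that step S2812 closes.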
- FIG. 15 shows a first example of an error pattern when a memory chip in the stacked memory fails.
- The memory chip on which channel 6 (CH6) of the stacked memory is mounted has failed, and all bytes of channel 6 are erroneous.
- The syndrome decoding unit 1214 transmits an error occurrence flag signal indicating that channel 6 is in error to the CODE_V decoding processing unit 1220.
- In step S1108, the error occurrence flag checking unit 1224 determines that a byte error has occurred in channel 6, and transmits an error occurrence channel signal to the error correction unit 1222.
- In step S1109, the error correction unit 1222 corrects, in each data, the byte value output from the channel indicated by the error occurrence channel signal, based on the syndrome value.
- The CODE_H decoding processing unit 1210 again detects the byte error only probabilistically.
- The CODE_V decoding processing unit 1220 may include a storage element that holds information indicating that each chip has failed.
- The CODE_V decoding processing unit 1220 writes to the storage element, for example, a value indicating that the chip is defective at the time of byte error correction.
- When a second or subsequent byte error is detected after the byte error correction, the error occurrence flag checking unit 1224 checks the error occurrence flags by combining the failed chip information with the error occurrence flag signal output by the CODE_H decoding processing unit 1210.
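A minimal sketch of such a storage element is given below. The text does not say how the failed chip information and the flag signal are combined; the sketch assumes a simple union of the two sets, and the class and method names (`FailedChipMemory`, `record_correction`, `effective_flags`) are hypothetical.

```python
class FailedChipMemory:
    """Remembers which channels (chips) were already corrected as failed."""

    def __init__(self):
        self.failed = set()

    def record_correction(self, channel):
        # Written at the time of byte error correction.
        self.failed.add(channel)

    def effective_flags(self, code_h_flags):
        # Assumption: the stored failure information is OR-ed with the
        # flag signal coming from the CODE_H decoding processing unit.
        return set(code_h_flags) | self.failed


if __name__ == "__main__":
    memory = FailedChipMemory()
    memory.record_correction(6)
    # CODE_H misses the error this time, yet channel 6 is still identified.
    print(memory.effective_flags(set()))
    # A new flag on another channel makes the situation ambiguous.
    print(memory.effective_flags({2}))
```

Under this assumption a known failed chip stays identifiable even when CODE_H misses it, while an additional flag on another channel still leads to the uncorrectable case because two candidates remain.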
- FIG. 16 shows a second example of an error pattern when a memory chip in the stacked memory fails.
- In this example, the CODE_V decoding processing unit 1220 cannot uniquely identify the chip in which the error occurred, and thus cannot correct the error caused by the memory chip failure.
- The memory chip 6 on which channel 6 is mounted is faulty, and at the same time, two or more bits (B1 and B5) of the data in channel 2 are erroneous.
- Since the CODE_H decoding processing unit 1210 detects an error of 2 bits or more in each of channel 2 and channel 6, it sets error occurrence flags indicating that both channel 2 and channel 6 are in error.
- Because both channel 2 and channel 6 may be in error, the error occurrence flag checking unit 1224 cannot uniquely identify which channel has failed, and the error cannot be corrected. In this case, in step S1111, the error occurrence flag checking unit 1224 sets an uncorrectable error detection signal indicating that an uncorrectable error has been detected.
- FIG. 17 shows a third example of an error pattern when a memory chip in the stacked memory fails.
- B1 of channel 2 and channel 6 is incorrect.
- the crosses in FIG. 17 represent byte errors.
- the CODE_H decoding processing unit 1210 detects an error of at least one of the channel 2 and the channel 6 using the SEC-DED code and sets an error occurrence flag in the channel where the error is detected.
- Since B1 contains a 2-byte error, the CODE_V decoding processing unit 1220 can detect the error with the SEC-DED-S14ED code only probabilistically. Therefore, the CODE_V decoding processing unit 1220 may not be able to detect the error.
- When the error occurrence flag checking unit 1224 also confirms the error occurrence flags, that is, performs the process of step S2812, the error detection capability can be improved.
- When a byte error is detected with the code CODE_V, the error control system of the present embodiment can detect and correct the byte error by using the code CODE_H to estimate which memory chip is likely to have failed. As a result, the error control system of this embodiment can detect and correct a long bit error due to a failure of a memory unit or the like.
- The error control system of the present embodiment performs error control using a product code configured such that the overlap between any CODE_H codeword and any CODE_V codeword is no longer than the byte-error detection length of CODE_V.
- By performing error control using this code, when the CODE_H decoding processing unit 1210 detects an uncorrectable error, the CODE_V decoding processing unit 1220 can uniquely identify the channel in which the error occurred.
- FIG. 18 shows an example of an error pattern in 2-cycle data output from the same channel.
- CODE_H is applied to 2-cycle data output from the same channel.
- The number of bit errors in the two cycles of data is even both when the number of bit errors in the channel 130 is even (CASE 1) and when it is odd (CASE 2).
- A code called an odd-weight SEC-DED code, in which the weights of all the column vectors of the check matrix are odd, can detect an even number of errors in the data with high probability.
- The weight of a vector is the number of its non-zero elements. Therefore, as in this embodiment, when a code is applied to data that passes through the same data path, such as the same channel, for an even number of cycles, errors can be detected with high probability by applying an odd-weight code.
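The property can be checked on a toy example. The sketch below is not the CODE_H of the embodiment: it assumes a small check matrix with 4 check bits whose 8 columns are all the 4-bit vectors of odd weight, and the names `COLUMNS`, `syndrome` and `classify` are hypothetical. Adding an even number of odd-weight columns always gives an even-weight syndrome, so such an error can never be mistaken for a single-bit error; it goes unnoticed only if the columns cancel to zero.

```python
from itertools import combinations

R = 4  # number of check bits in this toy code (assumption)

# Columns of the check matrix, encoded as integers: every 4-bit vector of
# odd weight (four of weight 1 and four of weight 3).
COLUMNS = [v for v in range(1, 1 << R) if bin(v).count("1") % 2 == 1]


def syndrome(error_positions):
    """XOR of the check-matrix columns at the erroneous bit positions."""
    s = 0
    for p in error_positions:
        s ^= COLUMNS[p]
    return s


def classify(error_positions):
    s = syndrome(error_positions)
    if s == 0:
        return "no error seen"
    if bin(s).count("1") % 2 == 1:
        return "treated as single-bit error"
    return "detected, uncorrectable"


if __name__ == "__main__":
    n = len(COLUMNS)
    doubles = list(combinations(range(n), 2))
    caught = sum(classify(p) == "detected, uncorrectable" for p in doubles)
    print(f"double-bit errors detected: {caught} of {len(doubles)}")
    quads = list(combinations(range(n), 4))
    missed = sum(syndrome(p) == 0 for p in quads)
    print(f"four-bit errors with zero syndrome: {missed} of {len(quads)}")
```

In this toy code every double-bit error is detected, while some four-bit patterns cancel to a zero syndrome, which is why the detection of larger even-weight errors is only probabilistic.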
- It suffices that CODE_H has an error detection function and that CODE_V is a code that can identify, using the detection result of CODE_H, that the error occurred in a specific memory chip (specific channel).
- CODE_H may be a single parity check code.
- the single parity check code is a code that uses a value calculated by XOR of bits constituting data as one check bit. That is, the single parity check code can detect an odd number of errors in the codeword.
- the syndrome decoding unit 1214 sets an error occurrence flag of the channel when an error is detected. That is, when CODE_H is a single parity check code, a method similar to the correction method when CODE_H is a SEC-DED code can be applied.
- In this case, the error check code encoding unit 713 does not need to collect two cycles of data to secure the check bits; CODE_H can be applied to one cycle of data from one channel.
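A single parity check code is small enough to show in full. The sketch below is a generic illustration, not the encoder of the embodiment, and the function names are hypothetical; it confirms that an odd number of bit flips is detected while an even number is not.

```python
from functools import reduce
from operator import xor


def parity_bit(bits):
    """XOR of all bits (0 for an empty sequence)."""
    return reduce(xor, bits, 0)


def encode(data_bits):
    """Append one check bit so that the whole codeword has even parity."""
    return list(data_bits) + [parity_bit(data_bits)]


def error_detected(codeword):
    """A non-zero overall parity means an odd number of bits flipped."""
    return parity_bit(codeword) != 0


def flip(codeword, positions):
    out = list(codeword)
    for p in positions:
        out[p] ^= 1
    return out


if __name__ == "__main__":
    cw = encode([1, 0, 1, 1])
    print(cw, error_detected(cw))
    print(error_detected(flip(cw, [2])))        # one flipped bit
    print(error_detected(flip(cw, [0, 3])))     # two flipped bits
```

Because a flagged channel is all that the CODE_V side needs from CODE_H, even this weak code can drive the same correction flow, at the cost of missing even-weight errors.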
- CODE_H may be, for example, a checksum or a cyclic redundancy check code (CRC).
- CODE_V may be a code that can detect an error in units such as a memory chip, a memory unit corresponding to a channel, or a bank.
- CODE_V may be, for example, a SEC-DED-SbED code (b is an arbitrary positive integer).
- the SEC-DED-SbED code is a code that has the function of the SEC-DED code and can detect a one-byte error when the byte length is b bits.
- FIG. 19 shows a second example of the byte division format of the channel.
- B0 to B3 are 28 bits of data and 2 bits of check bits
- B4 is 28 bits of data bits and 2 bits of check bits
- C0 is a check bit of 5 bits of CODE_H.
- 1 check bit
- Note that as the byte length b in the SEC-DED-SbED code is shorter, the error detection and correction capability in the entire memory chip is improved, and the scale of the encoding and decoding circuits is reduced.
- FIG. 20 shows an example of a channel configuration in which two channels are arranged in each memory chip.
- Channel 0 (130) and channel 1 (131) are arranged in the memory chip 0 (110).
- When a failure affecting the entire memory chip 0 (110) occurs, such as a power supply or clock failure, both channels fail.
- FIG. 21 shows an example of code application in a memory chip in which two channels are arranged.
- the product code of CODE_H and CODE_V is applied to the four stacked memories 0 to 3 (100 to 103).
- One conceivable method is to limit the output from each memory chip included in a CODE_V codeword to 1 byte.
- CODE_V is applied to data in memory units corresponding to CH_0, CH_2, CH_4,..., CH_30.
- CODE_V is applied to each byte set obtained by dividing the channel in FIG.
- CODE_H is applied to 2-cycle data in the memory unit corresponding to the same channel. If CODE_V and CODE_H are configured as shown in FIG. 21, the number of stacked memories required to apply CODE_V and CODE_H is increased, but the memory access granularity is the same as the code illustrated in FIG.
- FIG. 22 shows an example of code application in a stacked memory including a four-layer memory chip in which two channels are arranged.
- CODE_H and CODE_V are applied to the stacked memory 0 (100) and the stacked memory 1 (101).
- CODE_V is applied to data in memory units corresponding to CH_0, CH_2,..., CH_14, CH_1, CH_3,.
- CODE_H is applied to two cycles of data in the memory unit corresponding to the same channel.
- a memory unit including bytes encoded by CODE_V includes memory units corresponding to all channels configured in the same memory chip.
- The method of this embodiment is also applicable to error control for failures that do not affect the entire memory chip, as power supply or clock failures do, but are limited to one channel in the memory chip, for example an address decoder failure.
- FIG. 23 shows an example of code application assuming that a permanent failure of a memory chip occurs in a specific bank in a channel. For example, a sense amplifier failure in the bank is applicable in this case.
- the same method can be configured by applying the configuration applied to the channel in this embodiment to the bank. That is, when the channel is composed of 8 banks, the same method can be realized by applying the code CODE_V to, for example, a data set obtained from 16 banks of 2 channels as shown in FIG. That is, CODE_V is applied with each bank as a memory unit. At this time, CODE_H is applied to the data of the same channel and the same bank for two cycles.
- The method of applying codes to data collected in units of banks, as shown in FIG. 23, does not occupy the channels and can therefore make effective use of channel-level parallelism. The code application example shown in FIG. 23 is thus suitable for a system that requires channel-level parallelism.
- the memory controller 710 further includes a storage mechanism.
- the storage mechanism may be included in the error check code decoding unit 715 or may be included in the read control unit 714, for example.
- A read request from the processor 720 or the DMA control unit 730 may be smaller than 4,096 bits.
- For example, the cache line size on many processors is 512 bits (64 bytes) or 1,024 bits (128 bytes), so one memory access that occurs on a cache miss may be smaller than 4,096 bits.
- In that case, the read control unit 714 first reads out the 4,096-bit encoded data that includes, for example, the 512-bit (or 1,024-bit) data requested by the processor 720.
- The error check code decoding unit 715 performs a decoding process on the read data, and the storage mechanism then temporarily caches the portion that was not originally requested.
- When the read control unit 714 receives a read request from the processor 720 or the DMA control unit 730, it first checks whether the requested data is stored in the cache. If the requested data is stored in the cache, the data is transmitted from the cache. In general, memory reads often have temporal and spatial locality, so holding data in the cache improves read performance.
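The read path can be sketched as a small cache in front of a memory that is only accessible in whole codewords. This is an illustrative model under stated assumptions, not the read control unit 714 itself: the class `ReadControl`, its fields, and the 512-bit line size are hypothetical choices, decoding is represented only by a comment, and no eviction policy is modelled.

```python
CODEWORD_BITS = 4096
LINE_BITS = 512  # assumed request size (one cache line)
LINES_PER_CODEWORD = CODEWORD_BITS // LINE_BITS  # 8 lines per codeword


class ReadControl:
    def __init__(self, memory):
        # memory maps a codeword index to its list of decoded lines.
        self.memory = memory
        self.cache = {}        # line address -> line data
        self.memory_reads = 0  # number of full codeword reads issued

    def read(self, line_address):
        # First check whether the requested line is already cached.
        if line_address in self.cache:
            return self.cache[line_address]
        # Otherwise read the whole codeword that contains the line.
        codeword = line_address // LINES_PER_CODEWORD
        self.memory_reads += 1
        lines = self.memory[codeword]  # stands in for read plus decoding
        base = codeword * LINES_PER_CODEWORD
        # Cache the part of the codeword that was not requested.
        for offset, line in enumerate(lines):
            if base + offset != line_address:
                self.cache[base + offset] = line
        return lines[line_address - base]


if __name__ == "__main__":
    memory = {0: [f"line{i}" for i in range(LINES_PER_CODEWORD)]}
    rc = ReadControl(memory)
    print(rc.read(3), rc.memory_reads)  # miss: one codeword read
    print(rc.read(4), rc.memory_reads)  # hit: served from the cache
```

A neighbouring line requested shortly afterwards is served without a second 4,096-bit access, which is where the locality argument pays off.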
- When data smaller than 4,096 bits is written, the read control unit 714 needs to read the remaining bits constituting the code. For example, when the processor chip 700 does not have a cache and the most recently updated data is stored only in the stacked memory 100, the read control unit 714 reads the other part of the data from the stacked memory 100, and the error check code encoding unit 713 applies the code to the written data together with that other part.
- the write control unit 712 may have a function of sending a read request to the read control unit 714 and may include a buffer that temporarily stores received data.
- When the source of a data update request to the stacked memory 100 includes a cache and the most recently updated data is stored in a memory other than the stacked memory 100, the write control unit 712 has a function of reading data from the memory that holds the most recently updated data. After the other data necessary for encoding is read from that memory, it is encoded together with the data to be written.
- writing to the memory has locality in time and space as well as reading.
- the write control unit 712 does not immediately write data to the stacked memory 100 when a write request is accepted, but temporarily buffers the data, thereby improving the performance during writing.
- In the first embodiment, so that the CODE_V decoding processing unit 1220 can uniquely identify in which memory chip an error occurred when the CODE_H decoding processing unit 1210 determines that the error cannot be corrected, the code was applied to two cycles of data output from the same memory chip.
- FIG. 24 shows an example of code application using four stacked memories.
- In the first embodiment, the data amount is expanded in the time direction in order to secure the check bits necessary for CODE_H, and the code is applied to two cycles of data.
- The error control system of this embodiment implements the same error control as the first embodiment by extending the data amount in the spatial direction to obtain the required number of check bits, using a system in which stacked memories 0 to 3 (100 to 103) are mounted.
- CODE_V is applied to a 16-channel data set output from two stacked memories, and CODE_H is applied to data that combines one of the 16 channels covered by CODE_V with a channel of another stacked memory.
- CODE_V is applied to the 16 channels of the stacked memories 0 and 1 (100 and 101). Further, CODE_H is applied to data combining the channels of the stacked memory 0 (100) and the stacked memory 2 (102), and to data combining the channels of the stacked memory 1 (101) and the stacked memory 3 (103).
- the error control method in this embodiment can perform error detection and error correction by the same flow as in the first embodiment.
- When CODE_V is applied to a memory unit corresponding to a channel as shown in FIGS. 21 and 22, a code can be applied to one cycle of data from four stacked memories in the same manner as in FIG. . Also, as shown in FIG. 23, when CODE_V is applied to a bank, a code can be applied to one cycle of data from four memory chips in the same manner as in FIG.
- FIG. 25 shows a channel configuration example in the HMC.
- channels 530 are distributed in a plurality of memory chips 510, and output bits from each memory chip 510 use the same TSV set in a time division manner.
- an HMC in which four memory chips 510 are stacked outputs a total of 36 bits, 32 bits of data bits and 4 bits of check bits, from each memory chip 510, and includes 128 bits of data bits and 16 bits of check bits for the four layers.
- FIG. 26 shows an example of an error pattern when a TSV failure occurs in the HMC.
- A TSV connecting the memory chips 510 may fail due to wear or the like.
- Since the output bits from each memory chip 510 share the TSV, a plurality of bits passing through the faulty TSV may be erroneous.
- The error check code encoding unit 713 and the error check code decoding unit 715 therefore rearrange the data so that the bits passing through each TSV are grouped together.
- FIG. 27 is an example showing error patterns before and after the rearrangement. By this rearrangement, the error check code encoding unit 713 can handle four bit errors scattered at intervals as a single 4-bit byte error.
- The error check code encoding unit 713 applies, for example, an S4EC-D4ED (Single 4-bit byte Error Correction-Double 4-bit byte Error Detection) code with a byte length of 4, which is used for 4-bit byte error correction in conventional x4 DIMMs and the like, giving 16 check bits to 128-bit data.
- By performing this rearrangement, the error control system can detect and correct the 4-bit error that occurs when one TSV fails and can detect the 8-bit error that occurs when two TSVs fail.
- The error control system can apply this method not only to TSVs but also to codes for any block of data that uses the same hardware resource in a time-division manner.
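The effect of the rearrangement can be illustrated with a small index calculation. The sketch assumes that four memory chips each send 36 bits over the same 36 TSVs in successive time slots and that bit t of every chip uses TSV t; this mapping and the function names are assumptions made for illustration only.

```python
CHIPS = 4   # memory chips sharing the TSV set in time division
TSVS = 36   # bits output per chip (32 data bits + 4 check bits)


def rearrange(bits):
    """Group the bits that travelled over the same TSV next to each other."""
    assert len(bits) == CHIPS * TSVS
    return [bits[c * TSVS + t] for t in range(TSVS) for c in range(CHIPS)]


def restore(bits):
    """Inverse of rearrange: back to chip-by-chip order."""
    assert len(bits) == CHIPS * TSVS
    return [bits[t * CHIPS + c] for c in range(CHIPS) for t in range(TSVS)]


def touched_bytes(bits, byte_len=4):
    """Indices of the 4-bit bytes that contain at least one set bit."""
    return {i // byte_len for i, b in enumerate(bits) if b}


if __name__ == "__main__":
    faulty_tsv = 5
    # Error mask in arrival order: the same bit position fails in every chip.
    mask = [1 if i % TSVS == faulty_tsv else 0 for i in range(CHIPS * TSVS)]
    print(sorted(touched_bytes(mask)))             # four different bytes
    print(sorted(touched_bytes(rearrange(mask))))  # one single byte
```

Before the rearrangement one faulty TSV damages four separate 4-bit bytes, which exceeds what a single-byte-correcting code can repair; afterwards the same fault is confined to one byte.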
- The present invention is not limited to the above-described embodiments and includes various modifications.
- For example, the above-described embodiments have been described in detail for easy understanding of the present invention, and the invention is not necessarily limited to embodiments having all the configurations described.
- a part of the configuration of a certain embodiment can be replaced with the configuration of another embodiment, and the configuration of another embodiment can be added to the configuration of a certain embodiment.
- each of the above-described configurations, functions, processing units, processing means, and the like may be realized by hardware by designing a part or all of them with, for example, an integrated circuit.
- Each of the above-described configurations, functions, and the like may be realized by software by interpreting and executing a program that realizes each function by the processor.
- Information such as programs, tables, and files that realize each function can be stored in a memory, a hard disk, a recording device such as an SSD (Solid State Drive), or a recording medium such as an IC card, an SD card, or a DVD.
- The control lines and information lines shown are those considered necessary for the explanation, and not all the control lines and information lines of a product are necessarily shown. In practice, almost all the components may be considered to be connected to each other.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Techniques For Improving Reliability Of Storages (AREA)
Abstract
The present invention relates to an error correction device that reads, from a memory device, first data that has been encoded by an encoding method using a product code comprising a first-type code and a second-type code, in which the length of the data overlap between a first-type codeword encoded using the first-type code and a second-type codeword encoded using the second-type code is less than or equal to the byte length of the second-type code. The error correction device then performs decoding on the first data using the second-type code and, if an uncorrectable error is detected in a first second-type codeword, sets an error flag. It performs decoding, using the first-type code, on the first data on which decoding using the second-type code has been performed, and corrects an error in a first byte of the first second-type codeword on the basis of the error flag, using a first first-type codeword that includes the first byte of the first second-type codeword and that consists of a plurality of bytes, each stored in a different memory unit in the memory device.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2014/073760 WO2016038673A1 (fr) | 2014-09-09 | 2014-09-09 | Dispositif de correction d'erreurs, procédé de correction d'erreurs et système de correction d'erreurs |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2014/073760 WO2016038673A1 (fr) | 2014-09-09 | 2014-09-09 | Dispositif de correction d'erreurs, procédé de correction d'erreurs et système de correction d'erreurs |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2016038673A1 true WO2016038673A1 (fr) | 2016-03-17 |
Family
ID=55458466
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2014/073760 WO2016038673A1 (fr) | 2014-09-09 | 2014-09-09 | Dispositif de correction d'erreurs, procédé de correction d'erreurs et système de correction d'erreurs |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2016038673A1 (fr) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP7488989B2 (ja) | 2020-05-29 | 2024-05-23 | 公立大学法人会津大学 | 複数のtsvを含むtsvグループが層間を接続するオンチップの3次元システム |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2012022422A (ja) * | 2010-07-13 | 2012-02-02 | Panasonic Corp | 半導体記録再生装置 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 14901793 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 14901793 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: JP |