WO2016038673A1

WO2016038673A1 - Error correction device, error correction method, and error correction system

Info

Publication number: WO2016038673A1
Application number: PCT/JP2014/073760
Authority: WO
Inventors: 忠幸松村; 田中　剛
Original assignee: Hitachi, Ltd. (株式会社日立製作所)
Priority date: 2014-09-09
Filing date: 2014-09-09
Publication date: 2016-03-17

Abstract

This error correction device reads, from a memory device, first data which has been encoded by an encoding process using a product code including a code of a first type and a code of a second type, and in which the length of the data overlap between a codeword of the first type that has been encoded using the code of the first type and a codeword of the second type that has been encoded using the code of the second type is equal to or less than the byte length of the code of the second type. The error correction device then performs decoding on the first data using the code of the second type, and if an uncorrectable error is detected in a first codeword of the second type, the error correction device sets an error flag, performs decoding, using the code of the first type, on the first data on which the decoding using the code of the second type has been performed, and corrects an error in a first byte of the first codeword of the second type on the basis of the error flag using a first codeword of the first type that includes the first byte of the first codeword of the second type and that consists of a plurality of bytes, each stored in a different memory unit in the memory device.

Description

Error correction apparatus, error correction method, and error correction system

The present invention relates to an error correction device, an error correction method, and an error correction system.

FIG. 1 shows a configuration example of an 8-layer, 8-channel stacked memory. In recent years, to further widen memory bandwidth, stacked memories 100 in which a plurality of memory chips 110 are interconnected by TSVs (through-silicon vias) 140, as shown in FIG. 1, have attracted attention. For example, HBM (High Bandwidth Memory), standardized by JEDEC, and HMC (Hybrid Memory Cube), specified by the HMC Consortium, are expected to be widely used as stacked memories that provide higher bandwidth than conventional DDR3 memory.

The stacked memory 100 may include not only the memory chips 110 but also a control chip 120, as shown in FIG. 1. An interface between the stacked memory 100 and the outside is called a channel 130, and the stacked memory 100 may include a plurality of channels 130. FIG. 1 illustrates an example in which eight channels 130 are mounted on a stacked memory 100 comprising eight memory chips 110 and one control chip 120.

On the other hand, in large-scale computer systems equipped with many memories, such as those in the High Performance Computing (HPC) field and in data centers, memory failures occur at a frequency that cannot be ignored. Memory fault tolerance technology is therefore essential for applications that require high system reliability.

There are the following two types of memory failures. The first type of failure is a transient failure in which data in the memory is temporarily destroyed when, for example, neutrons or α rays collide and pass through the memory chip. The second type of failure is a permanent failure in which the circuit cannot satisfy a desired function due to, for example, circuit wear or the like, and data is permanently destroyed after the failure occurs. Transient faults are also called soft errors.

In many cases, a soft error corrupts a single bit. For this reason, soft errors have conventionally been countered with codes that add redundant check bits to the data so that any 1-bit error in the data can be detected and corrected. Error correction codes and error detection codes are collectively referred to as error control codes.

For example, the 1-bit error correction, 2-bit error detection code (SEC-DED code: Single Error Correction-Double Error Detection code) is widely known. By adding 8 check bits to 64 data bits, it detects and corrects an error in any 1 bit of the total 72 bits of data and check bits, and at the same time detects an error in any 2 bits of those 72 bits.

FIG. 2 shows an example of the HBM channel format. In the HBM specification, a channel, which is the interface to a memory, is configured so that 16 check bits 220 can be added to 128 data bits 210. FIG. 3 shows an example of the HMC channel format. The HMC combines four sets of 32 data bits 310 and 4 check bits 320, for a total of 128 data bits and 16 check bits. Therefore, in both HBM and HMC, the 128 data bits and 16 check bits can be treated as two sets of 64 data bits and 8 check bits, and a method similar to the conventional one can be applied.

Permanent failures, on the other hand, differ from soft errors in that multiple bits often fail simultaneously. For example, when part of a row address decoder fails, all data read or written in a memory access to the failed row may be erroneous. For this reason, byte error control codes, which can detect and correct errors even when a plurality of bits in one block are erroneous at the same time, are used as a countermeasure against permanent failures. Here, a byte is a unit composed of a plurality of consecutive bits, and the number of bits constituting a byte is called the byte length.

FIG. 4 shows a configuration example of an x4 DIMM. A DIMM 400 (Dual Inline Memory Module), for example, combines output bits from a plurality of memory chips 410 mounted on it, as shown in FIG. 4, to form the desired data width. However, when a failure affecting an entire memory chip 410 occurs, for example a failure of the row address decoder or of the power supply circuit, the multiple bits output from the failed memory chip all become erroneous.

For example, as shown in FIG. 4, a DIMM 400 (x4-DIMM) that outputs 4 bits from each memory chip 410 constitutes 64-bit data by collecting output bits from 16 memory chips 410. In this case, when a failure occurs in the memory chip 410, a 4-bit block output from the failed memory chip out of 64-bit data is erroneous.

Therefore, byte error control codes capable of detecting and correcting such byte errors are applied against permanent failures of memory chips. For example, for a byte length of 4, the single 4-bit byte error correction, double 4-bit byte error detection code (S4EC-D4ED code: Single 4-bit Error Correction-Double 4-bit Error Detection code) is known. By adding 16 check bits to 128 data bits, it corrects any 1-byte error and detects any 2-byte error within the total of 144 bits of data and check bits.

Further, the following error control codes are known. Patent Document 1 describes the cross-interleaved Reed-Solomon code (CIRC), which combines two codes to obtain a higher error control capability than either code applied independently. Non-Patent Document 1 describes a specific construction method for the SEC-DED-SbED code, which has error control capability equivalent to that of the SEC-DED code and can additionally detect a byte error with a byte length of b bits.

Patent Document 1: U.S. Pat. No. 4,413,340

For example, in stacked memories 100 used in high-reliability application fields such as HPC and data centers, it is desirable to apply countermeasures against permanent failures of a memory unit or channel. However, a stacked memory 100 such as HBM or HMC is provided as a module in which a plurality of memory chips 110 are stacked using TSVs 140, so the computer system designer cannot add a separate memory chip for outputting check bits.

In addition, the bit widths of the data bits and check bits in the channel 130, which is the interface to the stacked memory 100, are defined in advance by the specification, and the computer system designer cannot increase the number of check bits in the channel. Therefore, when an error control code is applied as a countermeasure against permanent failures of a memory unit or channel in a stacked memory 100 such as HBM or HMC, only codes that can be constructed with the predetermined number of check bits can be applied.

On the other hand, when each channel 130 is implemented in a single memory chip 110 and a permanent failure occurs in that memory chip 110 (an example of a memory unit), all 144 bits of the corresponding channel, that is, the 128 data bits and the 16 check bits, become erroneous. In this case, an error with a much longer byte length than in the conventional DIMM 400 must be handled, and detecting and correcting errors with a long byte length requires more check bits. Therefore, the permanent failure countermeasures used in conventional DIMMs, the cross-interleaved Reed-Solomon code described in Patent Document 1, the SEC-DED-SbED code described in Non-Patent Document 1, and the like cannot be applied.

As described above, realizing fault tolerance against permanent failures of a memory unit or channel in a memory device involves the following problems. The first problem is that the bit widths of the data bits and check bits output from each memory unit are fixed in advance, so new bits cannot be added as they can in the conventional DIMM 400. The second problem is that when a permanent failure occurs, the number of erroneous bits is larger than in the conventional DIMM 400 and similar modules.

This proposal was devised in view of the above problems, and discloses a configuration and method for detecting and correcting an error caused by a permanent failure of a memory unit or a channel in a memory device.

In order to solve the above problems, the present invention adopts, for example, the following configuration. An error correction device reads data from a memory device and corrects errors in the read data. The memory device stores first data encoded by an encoding process using a product code of a first type code and a second type code. In the first data, the length of the data overlapping between any first type codeword encoded with the first type code and any second type codeword encoded with the second type code is equal to or less than the byte length. The error correction device includes a second decoding processing unit that reads the first data from the memory device and decodes the read first data with the second type code, and a first decoding processing unit that decodes, with the first type code, the first data already decoded by the second decoding processing unit. When the second decoding processing unit detects an uncorrectable error in a first codeword of the second type acquired from the memory device, it sets an error flag indicating that this codeword contains an error. The first decoding processing unit then corrects, based on the error flag, an error in a first byte included in the first codeword of the second type, using a first codeword of the first type that includes the first byte and consists of a plurality of bytes each stored in a different memory unit of the memory device.

According to one aspect of the present invention, erroneous data output from a failure location can be detected with high accuracy when a memory unit or channel failure occurs in the memory device. As a result, it is possible to detect with high accuracy that the output of the program executed by the information processing apparatus including the memory device is incorrect. Moreover, it becomes possible to increase the average continuous operation time of the program by correcting the error.

FIG. 1 is a block diagram showing a configuration example of an 8-layer, 8-channel stacked memory.
FIG. 2 is a diagram showing an example of the HBM channel format.
FIG. 3 is a diagram showing an example of the HMC channel format.
FIG. 4 is a diagram showing a configuration example of an x4 DIMM.
FIG. 5 is a block diagram showing a configuration example of the error correction system in Example 1.
FIG. 6 is a block diagram showing a configuration example of memory chips in each of which one channel is arranged, in Example 1.
FIG. 7 is an explanatory diagram showing an example of the error control code in Example 1.
FIG. 8 is a diagram showing a first example of the byte division format of a channel in Example 1.
FIG. 9 is a diagram showing a comparative example of an error control code.
FIG. 10 is a block diagram showing a configuration example of the error check code decoding unit in Example 1.
FIG. 11 is a block diagram showing a configuration example of the CODE_H decoding processing unit in Example 1.
FIG. 12 is a block diagram showing a configuration example of the CODE_V decoding processing unit in Example 1.
FIG. 13 is a flowchart showing a first example of the decoding process in Example 1.
FIG. 14 is a flowchart showing a second example of the decoding process in Example 1.
FIG. 15 is a diagram showing a first example of the error pattern when a memory chip in a stacked memory fails, in Example 1.
FIG. 16 is a diagram showing a second example of the error pattern when a memory chip in a stacked memory fails, in Example 1.
FIG. 17 is a diagram showing a third example of the error pattern when a memory chip in a stacked memory fails, in Example 1.
FIG. 18 is a diagram showing an example of the error pattern in two cycles of data output from the same channel, in Example 1.
FIG. 19 is a diagram showing a second example of the byte division format of a channel in Example 1.
FIG. 20 is a block diagram showing a configuration example of memory chips in each of which two channels are arranged, in Example 1.
FIG. 21 is a diagram showing a first example of the error control code applied when two channels are arranged in one memory chip, in Example 1.
FIG. 22 is a diagram showing a second example of the error control code applied when two channels are arranged in one memory chip, in Example 1.
FIG. 23 is a diagram showing an application example of the error control code assuming a bank failure, in Example 1.
FIG. 24 is a diagram showing an application example of the error control code using four stacked memories, in Example 2.
FIG. 25 is a diagram showing a channel configuration example of HMC in Example 3.
FIG. 26 is a diagram showing an example of the error pattern caused by a TSV failure, in Example 3.
FIG. 27 is a diagram showing examples of the error pattern before and after data rearrangement that confines the bits made erroneous by a TSV failure to a specific byte, in Example 3.

Hereinafter, embodiments of the present invention will be described with reference to the accompanying drawings. It should be noted that this embodiment is merely an example for realizing the present invention, and does not limit the technical scope of the present invention. In each figure, the same reference numerals are given to common configurations.

FIG. 5 shows a configuration example of the error control system of this embodiment. The error control system includes, for example, a stacked memory 100 and a processor chip 700 connected to the stacked memory 100. The stacked memory 100 is an example of a memory device and has a configuration similar to that of FIG. 1. The processor chip 700 includes a memory controller 710, a plurality of processors 720, and a DMA control unit 730.

The memory controller 710 performs data error control and read / write control to the memory from the processor 720 and the DMA control unit 730. The processor 720 operates in accordance with a program, inputs / outputs data, reads / writes data, and executes each program to be described later. The DMA control unit 730 controls communication in DMA transfer.

The memory controller 710 includes, for example, a memory interface 711, a write control unit 712, an error check code encoding unit 713, a read control unit 714, and an error check code decoding unit 715. The memory interface 711 inputs and outputs data and the like to and from the stacked memory 100. The write control unit 712 and the read control unit 714 are programs that control writing and reading of data to and from the stacked memory 100 on behalf of the processors 720 and the DMA control unit 730.

The error check code encoding unit 713 includes a program and performs an encoding process on data written to the stacked memory 100. The error check code decoding unit 715 includes a program and performs a decoding process on data read from the stacked memory 100 to detect and correct errors. The error control system of this embodiment is not limited to the configuration of FIG. 5; for example, the memory controller 710 may be implemented in the control chip 120 of the stacked memory 100.

The program is executed by the processor 720 to perform a predetermined process using the storage device and the memory interface 711. Therefore, in the present embodiment and other embodiments, the description with the program as the subject may be the description with the processor 720 as the subject. Alternatively, the process executed by the program is a process performed by a computer and a computer system on which the program operates.

The processor 720 operates as a functional unit that realizes a predetermined function by operating according to a program. For example, the processor 720 functions as a write control unit by operating according to the write control unit 712, and functions as a read control unit by operating according to the read control unit 714. The same applies to other programs. Further, the processor 720 also operates as a functional unit that implements each of a plurality of processes executed by each program. A computer and a computer system are an apparatus and a system including these functional units.

Note that at least a part of the program may be realized by dedicated hardware. The program can be installed in each computer by a program distribution server or a computer-readable non-transitory storage medium, and can be stored in a nonvolatile storage device of each computer.

In this embodiment, consider the case where a channel, which is the interface to the stacked memory 100, consists of 128 data bits and 16 check bits, as shown in FIG. 2. FIG. 6 shows a configuration example in which one channel is arranged in each memory chip. The stacked memory 100 consists of eight layers of memory chips and incorporates a total of eight channels; that is, memory chips and channels correspond one to one.

As shown in FIG. 6, consider the case where memory chip 1 (111) fails and all 144 bits input or output when accessing channel 1 (131), namely the 128 data bits and the 16 check bits, become erroneous. Here, an error means that data read from or written to the memory has a value different from the originally expected value. No error control code is known that detects and corrects a 144-bit error by adding 16 check bits to 128 data bits.

FIG. 7 shows an example of the error control code applied in this embodiment, and FIG. 8 shows a first example of the byte division format of a channel. As shown in FIG. 7, the system disclosed in this embodiment applies a product code based on two error control codes to a data set that collects a plurality of channels output from a plurality of stacked memories.

Specifically, the error check code encoding unit 713 first divides the 144 bits of data bits and check bits of each channel into a plurality of bytes (B0 to B9 and C0), for example as shown in FIG. 8. Of these, the 10 bytes B0 to B9 carry 13 data bits each (B0 to B8) or 11 data bits (B9), for a total of 128 data bits. Each of B0 to B9 also includes one check bit, and the error check code encoding unit 713 uses these check bits to construct the first code, CODE_V.

Each channel also includes a byte C0 containing the 5 check bits of the second code, CODE_H, and 1 check bit for applying CODE_V to those 5 check bits.
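The byte division can be checked with simple arithmetic. The following Python sketch is an illustrative model only: the list name `FORMAT` and its tuple layout are assumptions, not part of the disclosure. It tallies the per-byte widths of FIG. 8 and confirms that they account for 128 data bits, 16 check bits, and a maximum byte length of 14 bits.

```python
# Illustrative model of the FIG. 8 byte division of one 144-bit channel word.
# Each entry: (byte name, data bits, CODE_H check bits, CODE_V check bits).
FORMAT = [(f"B{i}", 13, 0, 1) for i in range(9)]  # B0..B8: 13 data bits + 1 CODE_V check bit
FORMAT.append(("B9", 11, 0, 1))                    # B9: 11 data bits + 1 CODE_V check bit
FORMAT.append(("C0", 0, 5, 1))                     # C0: 5 CODE_H check bits + 1 CODE_V check bit

# Width of each byte and totals over the whole channel word.
widths = {name: d + h + v for name, d, h, v in FORMAT}
data_bits = sum(d for _, d, _, _ in FORMAT)
check_bits = sum(h + v for _, _, h, v in FORMAT)

print(data_bits, check_bits, data_bits + check_bits)  # 128 16 144
print(widths["B0"], widths["B9"], widths["C0"])       # 14 12 6
```

Because no byte exceeds 14 bits, an error confined to one byte of one channel stays within the byte length that the SEC-DED-S14ED code used as CODE_V can detect.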

The error check code encoding unit 713 collects the bytes at the same position from each of the 16 channels CH0 to CH15 in the two stacked memories (stacked memory 0 (100) and stacked memory 1 (101)). It then applies, for example, a SEC-DED-S14ED code as the first code, CODE_V, to each group of collected bytes.

Specifically, the error check code encoding unit 713 collects the 16 bytes B0 from channels CH0 to CH15 and applies, for example, the (224, 208) SEC-DED-S14ED code disclosed in Non-Patent Document 1 to the resulting 208 data bits (= 13 bits × 16 ch) and 16 check bits (= 1 bit × 16 ch).

The error check code encoding unit 713 likewise applies the (224, 208) SEC-DED-S14ED code to B1 to B9. Because B9 carries 11 data bits rather than 13, the error check code encoding unit 713 applies a shortened code to B9.

That is, the error check code encoding unit 713 applies CODE_V to the bytes included in each of the memory units in the stacked memory 0 (100) and the stacked memory 1 (101). In the example described above, each memory chip is a memory unit.

The error check code encoding unit 713 collects the data of one channel over two cycles (cycle 0 and cycle 1). It then applies the second code, CODE_H, using the 10 check bits of C0 over the two cycles (5 bits × 2) as check bits and the 276 bits of B0 to B9 over the two cycles as data bits. For example, a (286, 276) SEC-DED code is applied as CODE_H.
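Why two cycles are needed can be sketched with the Hamming bound. Assuming CODE_H is an extended-Hamming or Hsiao-type SEC-DED code (an assumption; the text only says "SEC-DED"), r check bits suffice for k data bits when 2^(r-1) ≥ k + r. The hypothetical helper `secded_check_bits` below computes the minimum r.

```python
def secded_check_bits(k: int) -> int:
    """Minimum check bits r for an extended-Hamming-style SEC-DED code over k data bits.

    A Hamming (SEC) code with m check bits needs 2**m >= k + m + 1;
    SEC-DED adds one overall parity bit, so r = m + 1.
    """
    m = 1
    while 2 ** m < k + m + 1:
        m += 1
    return m + 1

print(secded_check_bits(64))   # 8: the classic (72, 64) SEC-DED code
print(secded_check_bits(138))  # 9: one channel, one cycle (B0-B9), but C0 has only 5 bits
print(secded_check_bits(276))  # 10: one channel, two cycles, matching 2 x 5 bits of C0
```

Under this assumption, the 5 CODE_H check bits available per cycle cannot protect the 138 bits of B0 to B9 on their own, whereas pooling two cycles gives exactly the 10 check bits a SEC-DED code over 276 bits needs.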

Subsequently, the error check code encoding unit 713 applies a shortened code of CODE_V to C0 in the same manner as to the other bytes. Specifically, it treats the 80 CODE_H check bits (= 5 bits × 16 ch) as data bits and the 16 bits not encoded by CODE_H (= 1 bit × 16 ch) as check bits, and applies the shortened CODE_V code.
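The CODE_V codeword sizes follow from collecting one byte position across the 16 channels. The sketch below (the function name `code_v_params` is hypothetical) computes (n, k) for each byte position and shows how far B9 and C0 are shortened from the (224, 208) code, assuming each channel contributes one CODE_V check bit per byte position as described above.

```python
CHANNELS = 16          # CH0..CH15 across stacked memories 0 and 1
FULL_DATA_PER_CH = 13  # data bits per channel in the unshortened (224, 208) code

def code_v_params(data_bits_per_channel: int, channels: int = CHANNELS) -> tuple[int, int]:
    """Return (n, k) of the CODE_V codeword for one byte position.

    Each channel contributes `data_bits_per_channel` bits treated as CODE_V data
    plus 1 CODE_V check bit.
    """
    k = data_bits_per_channel * channels
    n = k + channels
    return n, k

# Bits per channel that CODE_V treats as data at each byte position (FIG. 8).
positions = {**{f"B{i}": 13 for i in range(9)}, "B9": 11, "C0": 5}

_, full_k = code_v_params(FULL_DATA_PER_CH)
for name, bits in positions.items():
    n, k = code_v_params(bits)
    print(f"{name}: ({n}, {k}), shortened by {full_k - k} bits")
```

This prints (224, 208) for B0 to B8, (192, 176) for B9, and (96, 80) for C0. Shortening removes the same number of bits from n and k, so all three keep the 16 check bits of the full code.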

Here, the SEC-DED code can correct any 1-bit error in a codeword and detect any 2-bit error; it can also detect errors of 3 or more bits probabilistically. The SEC-DED-S14ED code has, in addition to capabilities equivalent to those of the SEC-DED code, the capability to detect any 1-byte error in a codeword when the byte length is 14 bits.

FIG. 9 shows a comparative example in which a code is applied to one cycle of data. The example of FIG. 9 is different from the example of FIG. 7 in that CODE_H is applied to 2-channel data in one cycle. In the example of FIG. 9, there are two bytes of overlap between data encoded by one CODE_H and data encoded by one CODE_V. Therefore, when an uncorrectable error is detected by CODE_H, CODE_V cannot identify the channel in which the error has occurred.

In the example of FIG. 7, as in FIG. 9, an amount of data equivalent to two channels is collected in order to secure enough check bits to apply the SEC-DED code as CODE_H. However, instead of applying CODE_H to one cycle of data from two channels as in FIG. 9, CODE_H is applied to two cycles of data output from the same memory chip. With this configuration, the overlap between any codeword encoded by CODE_H and any codeword encoded by CODE_V is no larger than the byte error detection length of CODE_V (1 byte). Therefore, when CODE_H detects an uncorrectable error, decoding with CODE_V can uniquely identify the channel in which the error occurred.
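The overlap condition can be checked mechanically. The sketch below is a simplified model, not the disclosed implementation: each byte is identified by a (channel, cycle, byte position) triple, a codeword is a set of such triples, and the helper names are hypothetical. It also assumes the comparative FIG. 9 layout uses the same 11-byte division per channel.

```python
from itertools import product

CHANNELS, CYCLES = 16, 2
BYTES = [f"B{i}" for i in range(10)] + ["C0"]

def code_v_words():
    # One CODE_V codeword per (byte position, cycle): that byte from all 16 channels.
    return [{(ch, cy, b) for ch in range(CHANNELS)}
            for b, cy in product(BYTES, range(CYCLES))]

def code_h_words_fig7():
    # FIG. 7: one CODE_H codeword per channel, spanning two cycles.
    return [{(ch, cy, b) for cy in range(CYCLES) for b in BYTES}
            for ch in range(CHANNELS)]

def code_h_words_fig9():
    # FIG. 9 (comparative): one CODE_H codeword per channel pair within a single cycle.
    return [{(ch, cy, b) for ch in (p, p + 1) for b in BYTES}
            for p in range(0, CHANNELS, 2) for cy in range(CYCLES)]

def max_overlap(h_words, v_words):
    # Largest number of bytes shared by any CODE_H codeword and any CODE_V codeword.
    return max(len(h & v) for h in h_words for v in v_words)

print(max_overlap(code_h_words_fig7(), code_v_words()))  # 1
print(max_overlap(code_h_words_fig9(), code_v_words()))  # 2
```

With an overlap of one byte, a CODE_V codeword sees at most one byte from a flagged CODE_H codeword, so the flag points to a single channel; with an overlap of two bytes, the flag cannot tell which of the two channels failed.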

FIG. 10 shows a configuration example of the error check code decoding unit 715. The error check code decoding unit 715 includes a CODE_H decoding processing unit 1210 that performs a decoding process on the code CODE_H and a CODE_V decoding processing unit 1220 that performs a decoding process on the code CODE_V.

Input data output from each channel of the stacked memories 100 and 101 is first input to the CODE_H decoding processing unit 1210, which decodes the SEC-DED code (CODE_H). As shown in FIG. 10, the CODE_H decoding processing unit 1210 may perform CODE_H decoding on the input data from each channel in parallel.

FIG. 11 shows a configuration example of the CODE_H decoding processing unit 1210. The CODE_H decoding processing unit 1210 includes a buffer 1211 and a syndrome generation unit 1212, an error correction unit 1213, and a syndrome decoding unit 1214, which are programs.

Because CODE_H is applied to two cycles of data output from the same channel, the buffer 1211 holds the data of the previous cycle. The buffer 1211 may instead be included in the read control unit 714. The syndrome generation unit 1212 generates a CODE_H syndrome for the input data output from the stacked memory. A linear code is defined by a matrix called a check matrix, and the syndrome is the vector computed as the product of the check matrix and the received word; it is zero for a valid codeword.
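Syndrome generation can be illustrated on a toy code. The sketch below uses an (8, 4) Hsiao-style SEC-DED check matrix chosen purely for illustration (an assumption; it is not the (286, 276) CODE_H), with hypothetical helpers `syndrome` and `encode`.

```python
# Toy (8, 4) Hsiao-style SEC-DED parity-check matrix (illustrative assumption):
# data columns 0-3 have weight 3, check columns 4-7 have weight 1.
H = [
    [1, 1, 1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 1, 0, 0],
    [1, 0, 1, 1, 0, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 0, 1],
]

def syndrome(word):
    """Return H * word^T over GF(2) as a tuple of bits."""
    return tuple(sum(h & w for h, w in zip(row, word)) % 2 for row in H)

def encode(data):
    """Append 4 check bits so that the syndrome of the codeword is all zero.

    Columns 4-7 of H form an identity matrix, so each check bit is the parity
    of the data bits selected by the corresponding row.
    """
    return list(data) + [sum(h & d for h, d in zip(row[:4], data)) % 2 for row in H]

codeword = encode([1, 0, 1, 1])
print(syndrome(codeword))   # (0, 0, 0, 0): valid codeword

received = list(codeword)
received[2] ^= 1            # single-bit error in bit 2
print(syndrome(received))   # (1, 0, 1, 1): equals column 2 of H
```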

The syndrome decoding unit 1214 determines, based on the value of the syndrome generated by the syndrome generation unit 1212, whether an error is present and where it occurred. When it determines that CODE_H decoding has found an uncorrectable error, the syndrome decoding unit 1214 transmits an error occurrence flag signal to the CODE_V decoding processing unit 1220. The error correction unit 1213 corrects the error whose location has been identified by the syndrome decoding unit 1214.

FIG. 12 shows a configuration example of the CODE_V decoding processing unit 1220. The CODE_V decoding processing unit 1220 includes a syndrome generation unit 1221, an error correction unit 1222, a syndrome decoding unit 1223, and an error occurrence flag check unit 1224 which are programs. The syndrome generation unit 1221 generates a syndrome in CODE_V for the intermediate data output from the CODE_H decoding processing unit 1210.

The syndrome decoding unit 1223 determines, based on the value of the syndrome generated by the syndrome generation unit 1221, whether an error is present and where it occurred. The error correction unit 1222 corrects the error whose location has been identified by the syndrome decoding unit 1223. The error occurrence flag check unit 1224 determines whether an error is present and where it occurred, using both the error occurrence flag signal received from the CODE_H decoding processing unit 1210 and the determination result of the syndrome decoding unit 1223.

The following describes the decoding process for data read from the memory, and how the code illustrated in FIG. 7 detects and corrects the 144-bit error that occurs when a memory chip fails.

Owing to the S14ED capability of CODE_V, the error control system of this embodiment detects 100% of the 144-bit errors that occur when a memory chip of the stacked memory fails. As shown in FIG. 8, the 144 bits output from each memory chip are divided into 11 bytes, B0 to B9 and C0, which are distributed to different CODE_V codewords: B0 to B8 consist of 14 bits, B9 of 12 bits, and C0 of 6 bits. Each CODE_V codeword contains one of these bytes from each memory chip. Therefore, the SEC-DED-S14ED code used as CODE_V, which can detect any 1-byte error with a byte length of 14, detects the 144-bit error caused by a memory chip failure.

The SEC-DED-S14ED code can detect a 1-byte error with a byte length of 14 but cannot correct the error. For this reason, the SEC-DED-S14ED code alone cannot correct a 144-bit error that occurs when a memory chip fails.

On the other hand, if it is known separately which of the bytes constituting a codeword is erroneous, the SEC-DED-S14ED code can identify the error positions within that byte from the generated syndrome and correct the error.
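The correction step can be pictured as solving a small linear system over GF(2): once the erroneous byte is known, its error pattern e must satisfy H_B e = s, where H_B holds the check-matrix columns belonging to that byte and s is the syndrome. The sketch below reuses a toy (8, 4) check matrix and treats bits 0 to 2 as one hypothetical 3-bit byte; the matrix and the function name `solve_byte_error` are illustrative assumptions, not the SEC-DED-S14ED code of Non-Patent Document 1. In the system described here, the byte position would come from the CODE_H error occurrence flag.

```python
# Toy (8, 4) Hsiao-style SEC-DED check matrix (illustrative assumption).
H = [
    [1, 1, 1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 1, 0, 0],
    [1, 0, 1, 1, 0, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 0, 1],
]

def syndrome(word):
    return tuple(sum(h & w for h, w in zip(row, word)) % 2 for row in H)

def encode(data):
    # Columns 4-7 of H are an identity, so each check bit is a row parity of the data.
    return list(data) + [sum(h & d for h, d in zip(row[:4], data)) % 2 for row in H]

def solve_byte_error(byte_cols, s):
    """Solve sum_j e[j] * H[:, byte_cols[j]] = s over GF(2) by Gauss-Jordan elimination.

    Returns the error pattern of the byte, or None if no pattern confined to
    this byte explains the syndrome.
    """
    rows, width = len(H), len(byte_cols)
    a = [[H[i][c] for c in byte_cols] + [s[i]] for i in range(rows)]  # [H_B | s]
    pivots, r = [], 0
    for col in range(width):
        p = next((i for i in range(r, rows) if a[i][col]), None)
        if p is None:
            continue
        a[r], a[p] = a[p], a[r]
        for i in range(rows):
            if i != r and a[i][col]:
                a[i] = [x ^ y for x, y in zip(a[i], a[r])]
        pivots.append(col)
        r += 1
    if any(a[i][width] for i in range(r, rows)):
        return None  # inconsistent system: the error is not confined to this byte
    if len(pivots) < width:
        raise ValueError("byte columns are linearly dependent")
    e = [0] * width
    for i, col in enumerate(pivots):
        e[col] = a[i][width]
    return e

codeword = encode([1, 0, 1, 1])
received = list(codeword)
for j in (0, 1, 2):               # a failure corrupts the whole 3-bit byte
    received[j] ^= 1
e = solve_byte_error([0, 1, 2], syndrome(received))
for j, bit in zip([0, 1, 2], e):
    received[j] ^= bit            # flip back the identified error bits
print(e, received == codeword)   # [1, 1, 1] True
```

With this toy matrix the syndrome of the 3-bit error is (1, 0, 0, 0), which equals column 4, so single-bit decoding alone would miscorrect bit 4; knowing the byte position avoids this. A unique solution exists whenever the columns of a byte are linearly independent, which is what the byte error detection property of an SbED code guarantees.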

For example, when the SEC-DED-S14ED code is applied to byte B0 of the 16 channels, it can detect an error in any one of the 16 B0 bytes. In addition, if it is separately known, for example, that the error occurred in B0 of channel 1, the SEC-DED-S14ED code can identify the error positions within B0 of channel 1.

Therefore, the error control system according to the present embodiment uses the error occurrence flag that can be set as a result of the decoding process of CODE_H, which is the second code, to specify the memory chip in which the failure has occurred. Further, the error control system corrects the bytes output from the memory chip included in each CODE_V code word according to the generated syndrome.

FIG. 13 shows a first example of the decoding process performed by the error check code decoding unit 715. First, the decoding process by the CODE_H decoding processing unit 1210 is described. For each channel, the syndrome generation unit 1212 combines the first-cycle data received from the stacked memory and held in the buffer 1211 with the second-cycle data received from the stacked memory 100. It then generates a CODE_H syndrome for each CODE_H codeword, that is, for the combined data excluding the 2 bits of C0 (1 bit × 2 cycles) that are not encoded by CODE_H, and transmits the generated syndrome to the syndrome decoding unit 1214 (S1101).

The syndrome decoding unit 1214 determines the presence / absence of an error and the location where the error occurred in the channel based on the received syndrome value, and transmits the determination result to the error correction unit 1213 (S1102). In the SEC-DED code, the syndrome decoding unit 1214 determines that there is no error in the codeword when the value of the syndrome is 0. Further, when the syndrome value matches any value of the column vector in the parity check matrix, the syndrome decoding unit 1214 determines that a 1-bit error has occurred. If the syndrome value is not 0 and does not match any column vector in the parity check matrix, the syndrome decoding unit 1214 determines that an uncorrectable 2-bit error has occurred.

Here, the SEC-DED code can detect errors of 3 or more bits probabilistically: even for such errors, the syndrome is in many cases nonzero and matches no column vector of the check matrix. In that case, the syndrome decoding unit 1214 determines that an uncorrectable error has occurred, as for a 2-bit error.

Subsequently, when the syndrome decoding unit 1214 does not detect an error (S1102: no error), the process proceeds to step S1105 described later. At this time, the error correction unit 1213 does not correct the input data in the channel, and transmits the data as it is to the CODE_V decoding processing unit 1220 as intermediate data in the channel.

When the syndrome decoding unit 1214 detects a 1-bit error (S1102: 1-bit error), the error correction unit 1213 corrects it. For a code with 1-bit error correction capability, the syndrome computed as the product of the check matrix and a word containing a 1-bit error is different for every 1-bit error pattern. The syndrome decoding unit 1214 can therefore uniquely determine the position of the erroneous bit from the syndrome value.

The error correction unit 1213 then corrects the data by inverting the bit at that position in the input data of the channel (S1103), and transmits the corrected data to the CODE_V decoding processing unit 1220 as the intermediate data of the channel.

When an error of 2 or more bits occurs, the syndrome decoding unit 1214 cannot uniquely identify the erroneous bit positions from the syndrome value, so the error correction unit 1213 cannot correct the error. When the syndrome decoding unit 1214 detects an error of 2 or more bits (S1102: uncorrectable error detected), it sets an error occurrence flag indicating that such an error has occurred in the channel, and transmits an error occurrence flag signal for the channel to the CODE_V decoding processing unit 1220 (S1104). In this case, the error correction unit 1213 leaves the input data of the channel uncorrected and passes it unchanged to the CODE_V decoding processing unit 1220 as the intermediate data of the channel.

The CODE_H decoding processing unit 1210 performs steps S1101 to S1104 for every CODE_H codeword, that is, for all channels. This completes the description of decoding by the CODE_H decoding processing unit 1210. Next, decoding by the CODE_V decoding processing unit 1220 is described.
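The CODE_H stage (S1101 to S1104) can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the patented circuit: a toy 8-bit odd-weight SEC-DED code with 4 check bits stands in for the (286,276) SEC-DED code, the check-matrix columns in `H_COLUMNS` are hypothetical, and all function names are illustrative.

```python
# Toy stand-in for CODE_H: an (8,4) odd-weight SEC-DED code (hypothetical).
# Each entry is one column of the parity check matrix H packed into an int
# (bit i = row i). Columns are distinct, nonzero and of odd weight.
H_COLUMNS = [0b0111, 0b1011, 0b1101, 0b1110,  # data bit positions 0-3
             0b0001, 0b0010, 0b0100, 0b1000]  # check bit positions 4-7


def encode(data4):
    """Append 4 check bits so that the syndrome of the result is 0."""
    p = 0
    for bit, col in zip(data4, H_COLUMNS[:4]):
        if bit:
            p ^= col
    return list(data4) + [(p >> i) & 1 for i in range(4)]


def syndrome(word):
    """S1101: XOR together the H columns at every position holding a 1."""
    s = 0
    for bit, col in zip(word, H_COLUMNS):
        if bit:
            s ^= col
    return s


def decode_code_h(word):
    """S1102-S1104 for one codeword: return (intermediate data, error flag)."""
    s = syndrome(word)
    if s == 0:                        # no error: pass data through unchanged
        return list(word), False
    if s in H_COLUMNS:                # 1-bit error: position is unique (S1103)
        fixed = list(word)
        fixed[H_COLUMNS.index(s)] ^= 1
        return fixed, False
    return list(word), True           # uncorrectable: keep data, set flag (S1104)


def code_h_stage(channel_words):
    """Run CODE_H decoding on every channel; collect data and flags for CODE_V."""
    results = [decode_code_h(w) for w in channel_words]
    return [d for d, _ in results], [f for _, f in results]
```

Flipping one bit of `encode([1, 0, 1, 1])` is corrected silently, while flipping two bits leaves the data unchanged and returns a set flag, which is the signal the CODE_V stage consumes.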

CODE_V is applied to 16 channels of data spanning two stacked memories. Specifically, the syndrome generation unit 1221 generates a CODE_V syndrome for the data formed by collecting one byte from each of the 16 channels of intermediate data output by the CODE_H decoding processing unit 1210 (S1105), and transmits the generated syndrome to the syndrome decoding unit 1223.

The syndrome decoding unit 1223 performs error detection and correction on these bytes using the syndrome generated by the syndrome generation unit 1221 (S1106). Meanwhile, the error occurrence flag checking unit 1224 determines, from the error occurrence flag signals received from the CODE_H decoding processing unit 1210, which error occurrence flags are set.

With the SEC-DED-S14ED code, the syndrome decoding unit 1223 uses the generated syndrome, in the same way as the syndrome decoding unit 1214 does with the SEC-DED code, to determine whether an error has occurred, to locate 1-bit errors for correction, and to detect 2-bit errors. In addition, it can detect a byte error with a byte length of 14 bits.

When the syndrome decoding unit 1223 detects no error (S1106: no error), the error check code decoding unit 715 ends the decoding process on the collected bytes normally (S1110).

When the syndrome decoding unit 1223 detects a 1-bit error (S1106: 1-bit error), the error correction unit 1222 corrects it (S1107), and the error check code decoding unit 715 then ends the decoding process on the collected bytes normally (S1110).

When the syndrome decoding unit 1223 detects a byte error or an error of 2 or more bits (S1106: byte error or 2-bit error detected), the error occurrence flag checking unit 1224 checks, based on the error occurrence flag signals received from the syndrome decoding unit 1214, whether exactly one error occurrence flag is set (S1108).

When the error occurrence flag checking unit 1224 determines that exactly one error occurrence flag is set (S1108: Yes), it concludes that the byte error occurred in the channel whose flag is set, and transmits an error occurrence channel signal identifying that channel to the error correction unit 1222. Based on the syndrome value, the error correction unit 1222 corrects the byte output from the channel indicated by the error occurrence channel signal (S1109). The error check code decoding unit 715 then ends the decoding process normally (S1110).
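Knowing which channel is wrong is what turns a detected byte error into a correctable one. The sketch below illustrates that idea with a deliberately simpler code than the SEC-DED-S14ED code: a single XOR parity byte across the channels (the RAID-style erasure idea, swapped in here for clarity). A parity byte alone can only detect that some byte is wrong; once the CODE_H flag names the channel, the byte can be rebuilt. Function names and the 14-bit byte values are illustrative assumptions.

```python
from functools import reduce
from operator import xor


def add_parity(data_bytes):
    """Append one parity byte equal to the XOR of all data bytes (toy CODE_V)."""
    return list(data_bytes) + [reduce(xor, data_bytes, 0)]


def parity_ok(codeword):
    """Detection only: XOR over the whole codeword is 0 when it is consistent."""
    return reduce(xor, codeword, 0) == 0


def correct_flagged_byte(codeword, flagged_channel):
    """Rebuild the byte of the single flagged channel from all the other bytes."""
    others = (b for i, b in enumerate(codeword) if i != flagged_channel)
    fixed = list(codeword)
    fixed[flagged_channel] = reduce(xor, others, 0)
    return fixed
```

The real SEC-DED-S14ED decoder derives the correction from the syndrome instead, but the role of the flag is the same: it supplies the error location that the byte code cannot determine on its own.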

When the error occurrence flag checking unit 1224 determines that the number of set error occurrence flags is not exactly one (S1108: No), errors may have occurred in several channels. The error occurrence flag checking unit 1224 then cannot uniquely identify the faulty channel, so the error correction unit 1222 cannot correct the error. In this case, the error occurrence flag checking unit 1224 sets an uncorrectable error detection signal indicating that an uncorrectable error has been detected, and ends error detection for the collected bytes (S1111). The CODE_V decoding processing unit 1220 performs steps S1105 to S1111 for every CODE_V codeword. This completes the description of decoding by the CODE_V decoding processing unit 1220.

Because the SEC-DED code detects errors of 3 or more bits only probabilistically, the CODE_H decoding processing unit 1210 may miss an error that has actually occurred. In that case no error occurrence flag is set, so even if the CODE_V decoding processing unit 1220 detects a byte error, it cannot correct it, and in step S1108 the error occurrence flag checking unit 1224 may set the uncorrectable error detection signal.

The uncorrectable error detection signal is sent, for example, to the memory controller 710. The memory controller 710 may record the error, for example by setting a value in a register indicating that an uncorrectable error has occurred. Using this record, the operating system may, for example, restart the system or exclude the affected memory address from page allocation.

The error occurrence flag checking unit 1224 may also report, for example to the memory controller 710, not only that an uncorrectable error has occurred but also the address at which it occurred. The handling of detected uncorrectable errors is a matter of individual system design and is not limited to the processing described above.

FIG. 14 shows a second example of the decoding process performed by the error check code decoding unit 715. Only the differences from FIG. 13 are described. When the syndrome decoding unit 1223 detects no error (S1106: no error), or after the error correction unit 1222 corrects a 1-bit error in step S1107, the error occurrence flag checking unit 1224 checks that no error occurrence flag is set (S2812). In other words, the check in step S2812 is performed whenever CODE_V reports either no error or only a corrected 1-bit error.

When no error occurrence flag is set (S2812: Yes), the error check code decoding unit 715 ends the decoding process on the collected bytes normally (S1110). When an error occurrence flag is set (S2812: No), the error occurrence flag checking unit 1224 sets an uncorrectable error detection signal indicating that an uncorrectable error has been detected, and ends error detection for the collected bytes (S1111).
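The flag-handling decisions of FIG. 13 (S1106 to S1111) and the extra check S2812 of FIG. 14 can be summarized as a small decision function. This is a sketch of the control flow only; the `VResult` categories, the returned strings and the `check_flags_when_clean` switch are hypothetical names, and the actual syndrome arithmetic is left out.

```python
from enum import Enum


class VResult(Enum):
    """Outcome of CODE_V syndrome decoding for one codeword (S1106)."""
    NO_ERROR = 0
    ONE_BIT = 1              # corrected in S1107
    BYTE_OR_MULTI_BIT = 2    # byte error or 2-bit error detected


def code_v_outcome(v_result, flags, check_flags_when_clean=False):
    """Return ("ok" | "byte_corrected" | "uncorrectable", channel or None).

    flags: one boolean per channel, as produced by the CODE_H stage.
    check_flags_when_clean=False follows FIG. 13; True adds step S2812 (FIG. 14).
    """
    if v_result in (VResult.NO_ERROR, VResult.ONE_BIT):
        if check_flags_when_clean and any(flags):
            return "uncorrectable", None           # S2812: No -> S1111
        return "ok", None                          # S1110
    flagged = [ch for ch, f in enumerate(flags) if f]
    if len(flagged) == 1:                          # S1108: Yes
        return "byte_corrected", flagged[0]        # S1109
    return "uncorrectable", None                   # S1108: No -> S1111
```

The flag patterns of FIGS. 15 to 17 illustrate the behaviour: a single flag on channel 6 yields `("byte_corrected", 6)`, flags on channels 2 and 6 yield `"uncorrectable"`, and with `check_flags_when_clean=True` a clean CODE_V syndrome combined with any set flag is also reported as uncorrectable.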

With step S2812, even when CODE_V misses an error that it can detect only probabilistically, the CODE_V decoding processing unit 1220 can still detect the error from the state of the error occurrence flags.

Hereinafter, a specific example of error control when a memory chip in the stacked memory fails will be described. FIG. 15 shows a first example of an error pattern when a memory chip in the stacked memory fails. In FIG. 15, the memory chip in which the channel 6 (CH6) in the stacked memory is mounted has failed, and all bytes of the channel 6 are incorrect. At this time, in step S1104, the syndrome decoding unit 1214 transmits an error occurrence flag signal indicating that the channel 6 is in error to the CODE_V decoding processing unit 1220.

In this case, only channel 6 has its error occurrence flag set. In step S1108, the error occurrence flag checking unit 1224 therefore determines that a byte error has occurred in channel 6 and transmits an error occurrence channel signal to the error correction unit 1222. In step S1109, the error correction unit 1222 corrects, in each set of collected bytes, the byte output from the channel indicated by the error occurrence channel signal, based on the syndrome value.

When a chip failure makes all bytes in the chip erroneous, as in FIG. 15, the failed chip keeps outputting erroneous data, so a byte error occurs again at the next memory access. It is undesirable to depend on the CODE_H decoding processing unit 1210 detecting that error again, because it can do so only probabilistically.

The CODE_V decoding processing unit 1220 may therefore include a storage element that holds information indicating which chips have failed. For example, when it corrects a byte error, the CODE_V decoding processing unit 1220 writes a value indicating that the corresponding chip is defective to the storage element. When a second or subsequent byte error is detected after that correction, the error occurrence flag checking unit 1224 checks the error occurrence flags by combining this failed-chip information with the error occurrence flag signals output by the CODE_H decoding processing unit 1210.
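One way to picture the storage element is a set of channels already known to be defective, consulted alongside the CODE_H flags. The sketch below is an assumption about how the two sources might be combined (here, by set union); the class and method names are hypothetical, and the description leaves the exact combination open.

```python
class FailedChipTracker:
    """Hypothetical model of the storage element holding failed-chip info."""

    def __init__(self):
        self.failed = set()   # channels recorded as defective so far

    def on_byte_error(self, flags):
        """Handle a CODE_V byte error; return ("corrected", ch) or ("uncorrectable", None)."""
        # Suspects are the channels flagged by CODE_H plus remembered failures.
        suspects = {ch for ch, f in enumerate(flags) if f} | self.failed
        if len(suspects) == 1:
            (ch,) = suspects
            self.failed.add(ch)          # remember the chip as defective
            return "corrected", ch
        return "uncorrectable", None
```

After the first correction on channel 6, a later byte error is still attributed to channel 6 even if CODE_H happens to miss it and no flag is set, which is the situation the storage element is meant to cover.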

FIG. 16 shows a second example of an error pattern when a memory chip in the stacked memory fails. When the error occurrence flag is set in a plurality of channels, the CODE_V decoding processing unit 1220 cannot uniquely identify an error occurrence chip, and thus cannot correct an error caused by a memory chip failure. In FIG. 16, the memory chip 6 on which the channel 6 is mounted is faulty, and at the same time, two or more bits (B1 and B5) of the data in the channel 2 are incorrect.

In this case, the CODE_H decoding processing unit 1210 detects an error of 2 or more bits in both channel 2 and channel 6, and therefore sets error occurrence flags for both channels.

When the syndrome decoding unit 1223 then detects an error during CODE_V decoding, the error could lie in either channel 2 or channel 6, so the error occurrence flag checking unit 1224 cannot uniquely identify the faulty channel, and the error cannot be corrected. In this case, in step S1111, the error occurrence flag checking unit 1224 sets an uncorrectable error detection signal indicating that an uncorrectable error has been detected.

FIG. 17 shows a third example of an error pattern when a memory chip in the stacked memory fails. In FIG. 17, byte B1 is erroneous in both channel 2 and channel 6; the crosses represent byte errors. Assume that the CODE_H decoding processing unit 1210 detects an error in at least one of channel 2 and channel 6 using the SEC-DED code and sets the error occurrence flag of each channel in which it detects an error. Because the CODE_V codeword for B1 then contains two erroneous bytes, the CODE_V decoding processing unit 1220 can detect the error with the SEC-DED-S14ED code only probabilistically, and may therefore miss it.

Therefore, performing step S2812, in which the error occurrence flag checking unit 1224 checks the error occurrence flags even when the syndrome decoding unit 1223 finds no error (a syndrome of 0), improves error detection capability.

When CODE_V detects a byte error, the error control system of the present embodiment can correct it by using CODE_H to identify the memory chip that is likely to have failed. As a result, the system can detect and correct errors spanning many bits that result from the failure of a memory unit or similar component.

In addition, the error control system of the present embodiment performs error control with a product code configured so that the overlap between any CODE_H codeword and any CODE_V codeword is no longer than the byte-error detection length of CODE_V. With this code, when the CODE_H decoding processing unit 1210 detects an uncorrectable error, the CODE_V decoding processing unit 1220 can uniquely identify the channel in which the error occurred.
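The overlap condition can be checked mechanically for a given layout. The sketch below models each bit by (channel, byte slot, bit) and treats a CODE_H codeword as all bits of one channel and a CODE_V codeword as one byte slot across all 16 channels. The 20 slots of 14 bits per channel are a hypothetical simplification (it ignores C0 and the exact byte division format); only the 16 channels and the 14-bit byte length come from the text.

```python
from itertools import product

CHANNELS = 16       # channels covered by one CODE_V codeword
SLOTS = 20          # hypothetical number of 14-bit bytes per channel (2 cycles)
BYTE_BITS = 14      # byte length of the SEC-DED-S14ED code


def code_h_word(ch):
    """All bit positions of one channel's 2-cycle data."""
    return {(ch, s, b) for s in range(SLOTS) for b in range(BYTE_BITS)}


def code_v_word(slot):
    """One byte slot gathered from every channel."""
    return {(ch, slot, b) for ch in range(CHANNELS) for b in range(BYTE_BITS)}


def max_overlap():
    """Largest number of bits shared by any CODE_H and any CODE_V codeword."""
    return max(len(code_h_word(ch) & code_v_word(s))
               for ch, s in product(range(CHANNELS), range(SLOTS)))
```

`max_overlap()` equals `BYTE_BITS`, and each CODE_V codeword holds 16 × 14 = 224 bits, matching the (224,208) code; hence a flagged CODE_H codeword points at exactly one byte of every CODE_V codeword.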

FIG. 18 shows an example of error patterns in 2-cycle data output from the same channel. In the error control system of this embodiment, CODE_H is applied to 2-cycle data output from one channel. Consider a permanent failure in which specific bits of the channel 130 are always erroneous. As shown in FIG. 18, the same bits are erroneous in both cycles, so the number of bit errors in the 2-cycle data is even both when the number of erroneous bits in the channel 130 is even (CASE 1) and when it is odd (CASE 2).

Among SEC-DED codes, an odd-weight SEC-DED code, in which every column vector of the check matrix has odd weight, detects an even number of errors in the data with high probability. Here, the weight of a vector is its number of nonzero elements. Therefore, when a code is applied to data that passes through the same data path, such as the same channel, over an even number of cycles, as in this embodiment, applying an odd-weight code allows errors to be detected with high probability.
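Why odd column weights help can be checked exhaustively on a small example. XORing an even number of odd-weight columns always gives an even-weight syndrome, which can never equal a single column, so an even number of errors is never mistaken for a correctable 1-bit error; it is either flagged or, if the syndrome happens to be 0, missed. The 8-column matrix below is a hypothetical toy, not the check matrix used in this embodiment.

```python
from itertools import combinations

# Hypothetical odd-weight (Hsiao-style) check matrix, one int per column.
H_COLUMNS = [0b0111, 0b1011, 0b1101, 0b1110, 0b0001, 0b0010, 0b0100, 0b1000]


def weight(v):
    """Number of nonzero elements (set bits) of a vector packed into an int."""
    return bin(v).count("1")


def syndrome_of(error_positions):
    """Syndrome of an error pattern: XOR of the H columns at those positions."""
    s = 0
    for p in error_positions:
        s ^= H_COLUMNS[p]
    return s


def classify_even_errors():
    """Count even-size error patterns that are flagged vs. missed (syndrome 0)."""
    flagged = missed = 0
    for k in range(2, len(H_COLUMNS) + 1, 2):
        for pos in combinations(range(len(H_COLUMNS)), k):
            s = syndrome_of(pos)
            # Even-weight syndrome: never equal to an (odd-weight) column.
            assert weight(s) % 2 == 0 and s not in H_COLUMNS
            if s == 0:
                missed += 1
            else:
                flagged += 1
    return flagged, missed
```

For this toy matrix `classify_even_errors()` returns `(112, 15)`: of the 127 nonempty even-size error patterns, 112 are flagged and 15 (the nonzero codewords) go undetected, and none is miscorrected.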

So far, an example has been shown in which the (286,276) SEC-DED code is applied as CODE_H and the (224,208) SEC-DED-S14ED code is applied as CODE_V; however, other codes may be applied. Examples in which CODE_H is another code are described below.

It suffices that CODE_H has an error detection function and that CODE_V is a code that, using the detection result of CODE_H, can determine that an error has occurred in a specific memory chip (specific channel). For example, CODE_H may be a single parity check code, in which case the error control system can perform error detection and correction in the same way as described above. A single parity check code uses the XOR of the bits that make up the data as its single check bit, and can therefore detect any odd number of errors in the codeword.
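A single parity check code as CODE_H is simple enough to show in full. In this minimal sketch (function names are illustrative), the check bit is the XOR of the data bits, so any odd number of flipped bits is detected and would set the channel's error occurrence flag, while an even number goes unnoticed.

```python
from functools import reduce
from operator import xor


def encode_parity(bits):
    """Append one check bit: the XOR of all data bits."""
    return list(bits) + [reduce(xor, bits, 0)]


def error_flag(codeword):
    """True (set the error occurrence flag) when the overall XOR is 1."""
    return reduce(xor, codeword, 0) == 1
```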

When CODE_H is a single parity check code, as in the case where CODE_H is a SEC-DED code, the syndrome decoding unit 1214 sets an error occurrence flag of the channel when an error is detected. That is, when CODE_H is a single parity check code, a method similar to the correction method when CODE_H is a SEC-DED code can be applied.

Also, because a single parity check code requires only one check bit, the error check code encoding unit 713 does not need to combine two cycles of data; CODE_H encoding can be applied to one cycle of data from one channel.

By applying a single parity code as CODE_H, a code can be applied to data for one cycle instead of two cycles. Therefore, the memory access granularity is reduced, and the convenience of the stacked memory is improved. Similarly, CODE_H may be, for example, a checksum or a cyclic redundancy check code (CRC).

CODE_V may be any code that can detect errors confined to one memory unit, such as a memory chip, the memory unit corresponding to a channel, or a bank. For example, CODE_V may be a SEC-DED-SbED code (b is an arbitrary positive integer), which has the functions of the SEC-DED code and can additionally detect a single byte error, where a byte is b bits long.

FIG. 19 shows a second example of the byte division format of the channel. For example, CODE_V may be a SEC-DED-S30ED code applied to data collected in bytes of b = 30 bits, as shown in FIG. 19. In this SEC-DED-S30ED code, B0 to B3 each consist of 28 data bits and 2 check bits, B4 consists of 28 data bits and 2 check bits, and C0 consists of 5 CODE_H check bits and 1 check bit. Note that the shorter the byte length b of the SEC-DED-SbED code, the greater the error detection and correction capability over the entire memory chip and the smaller the encoding and decoding circuits.

Up to this point, examples have been described in which one channel is arranged in each memory chip of an 8-layer stacked memory, as shown in FIG. 6. To comply with the same I/F specification, however, the number of channels is assumed to be constant. This embodiment can easily be applied when a stacked memory of increased capacity is configured, for example by stacking 16 memory chips. The following considers the case where only four layers are stacked to reduce capacity.

FIG. 20 shows an example of a channel configuration in which two channels are arranged in each memory chip. Channel 0 (130) and channel 1 (131) are arranged in memory chip 0 (110). With the configuration of FIG. 20, a failure affecting the whole of memory chip 0 (110), for example in its power supply or clock, causes two channels to fail.

FIG. 21 shows an example of code application to memory chips on which two channels are arranged. In FIG. 21, the product code of CODE_H and CODE_V is applied to the four stacked memories 0 to 3 (100 to 103). One conceivable method, shown in FIG. 21, is to configure CODE_V so that it includes only one output byte from each memory chip. In FIG. 21, for example, CODE_V is applied to data in the memory units corresponding to CH_0, CH_2, CH_4, ..., CH_30. Here, CODE_V is applied to each byte set obtained by dividing the channel as shown in FIG.

CODE_H is applied to 2-cycle data in the memory unit corresponding to the same channel. With CODE_V and CODE_H configured as in FIG. 21, more stacked memories are needed to apply them, but the memory access granularity is the same as with the code illustrated in FIG.

FIG. 22 shows an example of code application in a stacked memory including a four-layer memory chip in which two channels are arranged. In FIG. 22, CODE_H and CODE_V are applied to the stacked memory 0 (100) and the stacked memory 1 (101). For example, CODE_V is applied to data in memory units corresponding to CH_0, CH_2,..., CH_14, CH_1, CH_3,. Also, CODE_H is applied to two cycles of data in the memory unit corresponding to the same channel.

That is, in FIG. 22, the memory units containing the bytes encoded by one CODE_V codeword include the memory units corresponding to all channels configured in the same memory chip. With CODE_V configured as in FIG. 22, the method of this embodiment applies to memory chip failures that are confined to one channel in the chip, such as an address decoder failure, rather than failures that affect the entire chip, such as power supply or clock failures.
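The difference between the FIG. 21 and FIG. 22 groupings can be seen by counting how many bytes of one CODE_V codeword a whole-chip failure would corrupt. The sketch assumes, hypothetically, that channels 2k and 2k+1 sit on memory chip k (as channel 0 and channel 1 share memory chip 0 in FIG. 20), takes the FIG. 21 group from the channel list in the text, and models the FIG. 22 group as the 16 channels of stacked memories 0 and 1 (4 chips × 2 channels each).

```python
from collections import Counter


def chip_of(channel):
    """Hypothetical mapping: channels 2k and 2k+1 are on memory chip k."""
    return channel // 2


def worst_chip_failure(group):
    """Max number of channels in a CODE_V group that share one memory chip."""
    return max(Counter(chip_of(ch) for ch in group).values())


fig21_group = list(range(0, 32, 2))   # CH_0, CH_2, ..., CH_30 (one per chip)
fig22_group = list(range(16))         # all 16 channels of stacks 0 and 1
```

`worst_chip_failure(fig21_group)` is 1, so a chip-wide failure corrupts at most one byte per CODE_V codeword, while `worst_chip_failure(fig22_group)` is 2, which is why FIG. 22 targets failures confined to one channel.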

Heretofore, examples have been shown in which CODE_V is applied to a data set combining 16 channels, on the assumption that a memory chip fails in units of channels. FIG. 23 shows an example of code application assuming that a permanent failure of a memory chip occurs in a specific bank within a channel, such as a sense amplifier failure in that bank.

In this case, the same method can be realized by applying to banks the configuration that this embodiment applies to channels. That is, when a channel consists of 8 banks, CODE_V can be applied, for example, to a data set obtained from the 16 banks of 2 channels, as shown in FIG. 23, with each bank treated as a memory unit. CODE_H is then applied to two cycles of data from the same bank of the same channel.

Unlike grouping by channel, as in FIG. 7 and elsewhere, applying codes to data collected in units of banks, as shown in FIG. 23, does not occupy entire channels and therefore makes effective use of channel-level parallelism. The code application example of FIG. 23 is thus suited to systems that require channel-level parallelism.

As described for the decoding process of the present embodiment, the code of FIG. 7 is applied to a 4,096-bit (512-byte) data set gathered from 16 channels × 2 cycles of 128 data bits each. Therefore, when the memory access granularity of the processor 720 or the DMA control unit 730 that reads and writes the memory is not 4,096 bits, the memory controller 710 further includes a storage mechanism, for example. The storage mechanism may be included, for example, in the error check code decoding unit 715 or in the read control unit 714.

Consider a case where a read request from the processor 720 or the DMA control unit 730 is smaller than 4,096 bits. For example, the cache line size of many processors is 512 bits (64 bytes) or 1,024 bits (128 bytes), so a single memory access triggered by a cache miss may be smaller than 4,096 bits.

In this case, the read control unit 714 first reads the 4,096-bit encoded data that contains, for example, the 512-bit (or 1,024-bit) data requested by the processor 720. After the error check code decoding unit 715 decodes the read data, the storage mechanism temporarily caches the portion that was not requested.

When the read control unit 714 receives a read request from the processor 720 or the DMA control unit 730, it first checks whether the requested data is held in the cache and, if so, returns the data from the cache. Because memory reads generally exhibit temporal and spatial locality, holding data in the cache improves read performance.
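The read path described above amounts to fetching and decoding a whole 4,096-bit (512-byte) block once and serving smaller requests from the decoded copy. In this sketch, `fetch_and_decode` is a hypothetical callback standing in for the read control unit 714 plus the error check code decoding unit 715, blocks are indexed by byte address, and eviction is omitted for brevity.

```python
BLOCK_BYTES = 4096 // 8   # one product-code block: 512 bytes


class DecodedBlockCache:
    """Sketch of the storage mechanism that keeps decoded, unrequested data."""

    def __init__(self, fetch_and_decode):
        self._fetch = fetch_and_decode   # hypothetical: block index -> bytes
        self._blocks = {}                # block index -> decoded 512-byte block

    def read(self, addr, length):
        block, offset = divmod(addr, BLOCK_BYTES)
        if offset + length > BLOCK_BYTES:
            raise ValueError("request crosses a block boundary")
        if block not in self._blocks:    # miss: read and decode the whole block
            self._blocks[block] = self._fetch(block)
        return self._blocks[block][offset:offset + length]
```

Two consecutive 64-byte cache-line reads from the same block trigger only one fetch and decode.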

When a write request from the processor 720 or the DMA control unit 730 is smaller than 4,096 bits, the remaining bits of the codeword must be read. For example, when the processor chip 700 has no cache and the most recently updated data is stored only in the stacked memory 100, the read control unit 714 reads the other parts of the data from the stacked memory 100, and the error check code encoding unit 713 encodes the new data together with those parts.

The write control unit 712 may therefore have a function of sending a read request to the read control unit 714, and may include a buffer that temporarily stores the received data. When the source of data updates to the stacked memory 100 includes a cache, so that the most recently updated data may be stored in a memory other than the stacked memory 100, the write control unit 712 has a function of reading data from the memory holding the most recently updated data; after reading the other data needed for encoding from that memory, it encodes them together with the new data.
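A write smaller than one block therefore becomes a read-modify-write. The sketch below shows the three steps; `read_block`, `encode` and `write_block` are hypothetical callbacks standing in for the read control unit 714, the error check code encoding unit 713 and the stacked memory write, and the block size assumes the 512-byte product-code block described above.

```python
BLOCK_BYTES = 512   # one 4,096-bit product-code block


def partial_write(read_block, encode, write_block, addr, payload):
    """Merge `payload` into its block, re-encode the whole block, write it back."""
    block, offset = divmod(addr, BLOCK_BYTES)
    if offset + len(payload) > BLOCK_BYTES:
        raise ValueError("write crosses a block boundary")
    data = bytearray(read_block(block))              # 1. read the rest of the codeword
    data[offset:offset + len(payload)] = payload     # 2. merge the new data
    write_block(block, encode(bytes(data)))          # 3. encode and store the block
```

Keeping the three callbacks separate makes it easy to substitute the cache-aware read path described in the text, where the latest data may come from a memory other than the stacked memory 100.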

Like reads, writes to memory exhibit temporal and spatial locality. To exploit this, the write control unit 712 does not write data to the stacked memory 100 immediately on accepting a write request, but buffers it temporarily, which improves write performance.

In the error control system of the first embodiment, CODE_H was applied to two cycles of data output from the same memory chip, so that when the CODE_H decoding processing unit 1210 determines that an error cannot be corrected, the CODE_V decoding processing unit 1220 can uniquely identify the memory chip in which the error occurred.

In this way, by combining the codes so that the data covered by any CODE_V codeword and the data covered by any CODE_H codeword share only one byte, byte errors detected by CODE_V could be corrected using the error detection information of CODE_H. The error control system of the present embodiment realizes the same code application using four stacked memories.

FIG. 24 shows an example of code application using four stacked memories. The error control system of the first embodiment expanded the amount of data in the time direction, applying the code to two cycles of data, to obtain the check bits required for CODE_H. The error control system of this embodiment instead expands the amount of data in the spatial direction, using a system on which stacked memories 0 to 3 (100 to 103) are mounted, to obtain the required number of check bits, and thereby performs the same error control as the first embodiment.

That is, as shown in FIG. 24, CODE_V is applied to a 16-channel data set output from two stacked memories, whereas CODE_H is applied to data that combines each of those 16 channels with a channel of another stacked memory.

In FIG. 24, CODE_V is applied to the 16 channels of stacked memories 0 and 1 (100 and 101). CODE_H is applied to data combining channels of stacked memory 0 (100) and stacked memory 2 (102), and to data combining channels of stacked memory 1 (101) and stacked memory 3 (103).

With the code application method illustrated in FIG. 24, when a memory chip fails, the channel indicated by the error occurrence flag set on detection of an uncorrectable error in CODE_H decoding is uniquely determined within each CODE_V codeword. The error control method of this embodiment can therefore perform error detection and correction with the same flow as the first embodiment.
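The FIG. 24 arrangement can be checked in the same spirit as the two-cycle layout: every CODE_H codeword must share exactly one memory unit with the CODE_V codeword. The sketch labels a channel as (stack, channel), assumes 8 channels per stacked memory as in the 8-layer, 8-channel memory of FIG. 1, and assumes, hypothetically, that CODE_H pairs channel c of stack 0 with channel c of stack 2 (and likewise stacks 1 and 3).

```python
STACK_CHANNELS = 8   # channels per stacked memory (8-layer, 8-channel example)

# CODE_V: the 16 channels of stacked memories 0 and 1.
code_v_units = {(s, c) for s in (0, 1) for c in range(STACK_CHANNELS)}

# CODE_H: hypothetical pairing of channel c across stacks 0/2 and 1/3.
code_h_words = ([frozenset({(0, c), (2, c)}) for c in range(STACK_CHANNELS)] +
                [frozenset({(1, c), (3, c)}) for c in range(STACK_CHANNELS)])


def flagged_unit(h_word):
    """Memory unit of the CODE_V codeword implicated by a flagged CODE_H word."""
    shared = h_word & code_v_units
    assert len(shared) == 1, "layout violates the one-unit overlap rule"
    return next(iter(shared))
```

Every CODE_H word maps to a distinct CODE_V memory unit, so a single set flag always names exactly one channel to correct.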

When CODE_V is applied to memory units corresponding to channels, as in FIGS. 21 and 22, the code can be applied to one cycle of data from four stacked memories in the same manner as in FIG. Likewise, when CODE_V is applied to banks, as in FIG. 23, the code can be applied to one cycle of data from four memory chips in the same manner as in FIG.

FIG. 25 shows a channel configuration example in the HMC. In the HMC, each channel 530 is distributed across a plurality of memory chips 510, and the output bits from the memory chips 510 share the same set of TSVs in a time-division manner. For example, in an HMC with four stacked memory chips 510, each memory chip 510 outputs 36 bits, 32 data bits and 4 check bits, and the four layers together form 128 data bits and 16 check bits.

FIG. 26 shows an example of an error pattern when a TSV failure occurs in the HMC. Here, consider a case where the TSV connecting the memory chips 510 fails due to wear or the like. In this case, since the output bits from each memory chip 510 share the TSV, a plurality of bits passing through the faulty TSV may be erroneous.

If the 36 bits from each memory chip 510 are simply arranged in order, a TSV failure produces 4 bit errors scattered across the 144 bits, as shown in FIG. These 4 erroneous bits do not form a single contiguous group. Therefore, even the S4EC-D4ED code used in conventional x4 DIMMs, which uses 16 check bits for 128 data bits to correct a single 4-bit group error, cannot detect and correct this error.

Therefore, in this embodiment, the error check code encoding unit 713 and the error check code decoding unit 715 rearrange the data by the TSV through which each bit passes, instead of by the memory chip 510 that outputs it. FIG. 27 shows example error patterns before and after the rearrangement. With this rearrangement, the error check code encoding unit 713 can handle the 4 scattered bit errors as a single 4-bit byte error.
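The rearrangement is a fixed bit permutation. The sketch assumes, for illustration, that bit j of every memory chip passes through TSV j, with 4 chips of 36 bits each as in the HMC example; under that assumption, moving bit (chip, tsv) from position chip × 36 + tsv to position tsv × 4 + chip turns the 4 scattered errors of one failed TSV into one aligned 4-bit byte.

```python
CHIPS, BITS_PER_CHIP = 4, 36   # 4 stacked chips, 36 output bits each (144 total)


def chip_order(chip, tsv):
    """Position when the data is arranged chip by chip."""
    return chip * BITS_PER_CHIP + tsv


def tsv_order(chip, tsv):
    """Position after rearrangement: the bits sharing one TSV are adjacent."""
    return tsv * CHIPS + chip


def rearrange(word):
    """Permute a 144-bit word from chip order into TSV order."""
    out = [0] * (CHIPS * BITS_PER_CHIP)
    for chip in range(CHIPS):
        for tsv in range(BITS_PER_CHIP):
            out[tsv_order(chip, tsv)] = word[chip_order(chip, tsv)]
    return out
```

An error vector with 1s at positions 5, 41, 77 and 113 (TSV 5 failing on all four chips) becomes 1s at positions 20 to 23, that is, 4-bit byte 5, which a 4-bit byte-error correcting code can handle.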

In this case, the error check code encoding unit 713 can apply an S4EC-D4ED (Single 4-bit Error Correction, Double 4-bit Error Detection) code, for example by using the error correction code with 4-bit bytes and 16 check bits for 128 data bits that is used for 4-bit byte error correction in conventional x4 DIMMs and the like.

That is, with this method, a byte-error control code can be applied to a group of data that shares TSVs in a time-division manner, as shown in FIG. By performing the rearrangement shown in FIG., the error control system can detect and correct the 4-bit error caused by one TSV failure and can detect the 8-bit error caused by two TSV failures. More generally, the method applies not only to TSVs but to encoding any group of data that uses the same hardware resource in a time-division manner.

In addition, this invention is not limited to the above-described embodiments and includes various modifications. For example, the above-described embodiments have been described in detail for easy understanding of the present invention, and are not necessarily limited to those having all the configurations described. Further, part of the configuration of one embodiment can be replaced with the configuration of another embodiment, and the configuration of another embodiment can be added to the configuration of one embodiment. Further, other configurations can be added to, deleted from, or substituted for part of the configuration of each embodiment.

In addition, each of the above-described configurations, functions, processing units, processing means, and the like may be realized by hardware by designing a part or all of them with, for example, an integrated circuit. Each of the above-described configurations, functions, and the like may be realized by software by interpreting and executing a program that realizes each function by the processor. Information such as programs, tables, and files that realize each function can be stored in a memory, a hard disk, a recording device such as an SSD (Solid State Drive), or a recording medium such as an IC card, an SD card, or a DVD.

The control lines and information lines shown are those considered necessary for the explanation; not all control lines and information lines of a product are necessarily shown. In practice, almost all components may be considered to be interconnected.

Claims

An error correction device that reads data from a memory device and corrects an error in the read data,
The memory device holds first data encoded by an encoding process using a product code of a first type code and a second type code,
In the first data, the length of the overlap between any first type codeword encoded with the first type code and any second type codeword encoded with the second type code is less than or equal to the byte length of the second type code,
The error correction device includes:
A read control unit for reading the first data from the memory device;
A second decoding processing unit for performing decoding processing in the second type code on the read first data;
A first decoding processing unit that performs decoding with the first type code on the first data that has been decoded by the second decoding processing unit;
When the second decoding processing unit detects an uncorrectable error in a first codeword of the second type acquired from the memory device, it sets an error flag indicating that the first codeword of the second type includes an error, and
The first decoding processing unit corrects an error in a first byte included in the first codeword of the second type, based on the error flag, using a first codeword of the first type that includes the first byte and consists of bytes stored in different memory units in the memory device.
The error correction device according to claim 1,
wherein the memory device further includes a plurality of memory chips as the memory units,
the first type codeword includes bytes stored in the respective memory chips, and
the second type codeword includes data from a plurality of cycles of one memory chip.
The error correction device according to claim 1,
wherein the memory device further includes a plurality of channels through which data of the memory units is input and output,
the first type codeword includes bytes input and output through the respective channels, and
the second type codeword includes data from a plurality of cycles of the memory unit corresponding to one channel.
The error correction device according to claim 1,
wherein the memory device further includes a plurality of banks as the memory units,
the first type codeword includes bytes stored in the respective banks, and
the second type codeword includes data from a plurality of cycles of one bank.
The error correction device according to any one of claims 1 to 4,
wherein the first type code has a bit error detection capability, and
the second type code has a byte error detection capability.
The error correction device according to claim 5,
wherein the first type code is a SEC-DED-SbED code (b being a positive integer), and
the second type code is a SEC-DED code.
The error correction device according to any one of claims 2 to 4,
wherein the plurality of cycles is an even number of cycles, and
the second type code is an odd-weight code.
The error correction device according to claim 1,
wherein the memory device further includes a plurality of memory chip groups,
the first type codeword includes bytes stored in the respective memory chips of one memory chip group,
the second type codeword includes data of memory chips belonging to different memory chip groups, and
the memory units are the memory chips included in the one memory chip group.
The error correction device according to claim 1,
wherein, when the first decoding processing unit finds no error in the first codeword of the first type by decoding with the first type code while the error flag is set, the error correction device determines that an uncorrectable error exists in the first codeword of the first type.
The error correction device according to claim 1,
wherein, when the read control unit receives a read request for a part of the first data, the first data decoded by the first decoding processing unit and the second decoding processing unit is cached, and
the read control unit reads the cached first data in response to a read request for another part of the first data.
An error correction method for reading data from a memory device and correcting an error in the read data,
wherein the memory device holds first data encoded with a product code composed of a first type code and a second type code;
in the first data, the length of the data shared by any first type codeword encoded with the first type code and any second type codeword encoded with the second type code is less than or equal to the byte length of the second type code;
the error correction method comprises:
a first step of reading the first data from the memory device;
a second step of decoding the read first data with the second type code; and
a third step of decoding, with the first type code, the first data decoded in the second step;
the second step includes, when an uncorrectable error is detected in a first codeword of the second type obtained from the memory device, setting an error flag indicating that the first codeword of the second type contains an error; and
the third step includes correcting, on the basis of the error flag, an error in a first byte included in the first codeword of the second type, using a first codeword of the first type that includes the first byte and consists of bytes each stored in a different memory unit of the memory device.
The error correction method according to claim 11,
wherein the memory device further includes a plurality of memory chips,
the first type codeword consists of bytes stored in the respective memory chips,
the second type codeword consists of data from a plurality of cycles of one memory chip, and
the memory units are the respective memory chips.
The error correction method according to claim 11,
wherein the memory device further includes a plurality of channels serving as interfaces to the memory device,
the first type codeword consists of bytes input and output through the respective channels,
the second type codeword consists of data from a plurality of cycles of one channel, and
the memory units are the respective channels.
The error correction method according to claim 11,
wherein the memory device further includes a plurality of banks,
the first type codeword consists of bytes stored in the respective banks,
the second type codeword consists of data from a plurality of cycles of one bank, and
the memory units are the respective banks.
An error correction system comprising a memory device and an error correction device that reads data from the memory device and corrects an error in the read data,
wherein the memory device holds first data encoded with a product code composed of a first type code and a second type code;
in the first data, the length of the data shared by any first type codeword encoded with the first type code and any second type codeword encoded with the second type code is less than or equal to the byte length of the second type code;
the error correction device comprises:
a read control unit that reads the first data from the memory device;
a second decoding processing unit that decodes the read first data with the second type code; and
a first decoding processing unit that decodes, with the first type code, the first data decoded by the second decoding processing unit;
when the second decoding processing unit detects an uncorrectable error in a first codeword of the second type acquired from the memory device, the second decoding processing unit sets an error flag indicating that the first codeword of the second type contains an error; and
the first decoding processing unit corrects, on the basis of the error flag, an error in a first byte included in the first codeword of the second type, using a first codeword of the first type that includes the first byte and consists of bytes each stored in a different memory unit of the memory device.
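The decoding order recited in the independent claims can be illustrated with a toy model. The order is: detect errors with the second type code, set an error flag, then use that flag as the error position when decoding with the first type code.

The following Python sketch is a hedged illustration only. It rests on assumptions that do not come from the source:

* The first type code is replaced by a single byte-wise XOR parity chip, which can rebuild one byte whose position is known.
* The second type code is replaced by a detection-only CRC-32 per chip, computed with the standard-library `zlib.crc32`.
* The CRC values themselves are assumed to be stored without error.
* The function names `encode` and `decode` and the layout `data_chips[k][c]` are hypothetical.

The sketch does not implement the SEC-DED-SbED or SEC-DED codes of the claims. As in the memory-chip variant, each first type codeword is one byte from every chip in one cycle, and each second type codeword is one chip's bytes over several cycles. Any two codewords of different types therefore overlap in exactly one byte.

```python
import zlib
from functools import reduce
from operator import xor


def encode(data_chips):
    """Toy product-code encoder (hypothetical layout, not the patent's codes).

    data_chips[k][c] is the byte that data chip k holds in cycle c.
    Adds one parity chip whose byte in each cycle is the XOR of the data
    chips' bytes (stand-in for the first type code: one codeword per cycle,
    across chips), and a CRC-32 per chip over all of its cycles (stand-in
    for the second type code: one codeword per chip).
    """
    n_cycles = len(data_chips[0])
    parity = [reduce(xor, (chip[c] for chip in data_chips)) for c in range(n_cycles)]
    chips = [bytearray(chip) for chip in data_chips] + [bytearray(parity)]
    checks = [zlib.crc32(chip) for chip in chips]
    return chips, checks


def decode(chips, checks):
    """Flag-assisted decoding: second type code first, then first type code.

    Assumes the CRC values are stored error-free (a simplification).
    Returns (chips, status) with status "ok", "corrected", or "uncorrectable".
    """
    chips = [bytearray(chip) for chip in chips]  # work on a copy
    # Second type decoding: a CRC mismatch sets the error flag for that chip.
    flags = [zlib.crc32(chip) != check for chip, check in zip(chips, checks)]
    flagged = [k for k, flag in enumerate(flags) if flag]
    if not flagged:
        return chips, "ok"
    if len(flagged) > 1:
        # A single XOR parity can rebuild only one erased byte per codeword.
        return chips, "uncorrectable"
    bad = flagged[0]
    # First type decoding: the flag identifies the wrong byte in every cycle,
    # so that byte is recomputed as the XOR of the codeword's other bytes.
    for c in range(len(chips[bad])):
        chips[bad][c] = reduce(xor, (chip[c] for k, chip in enumerate(chips) if k != bad))
    # Re-check the repaired chip with the second type code.
    if zlib.crc32(chips[bad]) != checks[bad]:
        return chips, "uncorrectable"
    return chips, "corrected"


if __name__ == "__main__":
    data = [[0x12, 0x34, 0x56, 0x78], [0x9A, 0xBC, 0xDE, 0xF0], [0x0F, 0x1E, 0x2D, 0x3C]]
    chips, checks = encode(data)
    damaged = [bytearray(chip) for chip in chips]
    damaged[1][0] ^= 0xFF  # multi-bit error in chip 1, cycle 0
    damaged[1][3] ^= 0x01  # single-bit error in chip 1, cycle 3
    fixed, status = decode(damaged, checks)
    print(status, fixed == chips)  # corrected True
```

The parity code has no correction capability of its own. It can still repair a whole failed chip here, because the error flag supplies the position of the wrong byte in every first type codeword. More generally, a code that detects every error confined to one b-bit byte can also correct such an error once its position is known. This holds because distinct error patterns within one byte then produce distinct syndromes. The toy code's limit is equally visible: if two chips are flagged, `decode` reports an uncorrectable error.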