WO2015014301A1 - Memory access method and device for a message-type memory module - Google Patents

Memory access method and device for a message-type memory module

Info

Publication number
WO2015014301A1
Authority
WO
WIPO (PCT)
Prior art keywords
scbc
error
memory
read
dram
Application number
PCT/CN2014/083464
Other languages
English (en)
French (fr)
Inventor
高翔
李冰
单书畅
胡瑜
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority to EP14832194.6A (EP3015986B1)
Priority to KR1020167003393A (KR101837318B1)
Publication of WO2015014301A1 (Chinese)
Priority to US15/010,326 (US9811416B2)


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08 Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F11/1008 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices
    • G06F11/1068 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices in sector programmable memories, e.g. flash disk
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08 Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F11/1008 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices
    • G06F11/1044 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices with specific ECC/EDC distribution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08 Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F11/1076 Parity data used in redundant arrays of independent storages, e.g. in RAID systems
    • G06F11/108 Parity data distribution in semiconductor storages, e.g. in SSD
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C29/00 Checking stores for correct operation; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/52 Protection of memory contents; Detection of errors in memory contents

Definitions

  • The present invention relates to the field of communications technologies, and in particular to a memory access method and device for a message-type memory module.
  • Memory reliability plays a decisive role in a computer system.
  • The failure rate of memory systems is increasing exponentially: memory errors are becoming both more likely and more numerous.
  • One existing approach is ECC memory, that is, a memory module with ECC check bits.
  • Its basic idea is to protect data using the bit width of the memory module as the basic unit.
  • Taking a memory module with a 64-bit width as an example, each time 64 bits of data are written, 8 check bits are calculated for the data at the same time.
  • The 8 check bits are stored in a separate ECC chip, and the 64 data bits combined with the 8 check bits form a 72-bit ECC word.
  • This encoding can correct any single-bit error in a 72-bit ECC word, but a double-bit error can only be detected, not corrected, and errors in more bits are beyond its capability.
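To make the single-error-correct, double-error-detect (SEC-DED) behavior described above concrete, here is a minimal Python sketch of a textbook extended Hamming code over 64 data bits. It uses 7 Hamming check bits plus one overall parity bit, for 72 bits in total. This construction is an illustrative assumption, not necessarily the code matrix used by any particular ECC DIMM, and `encode`/`decode` are hypothetical names.

```python
from typing import List, Optional, Tuple

CODE_LEN = 72  # bit 0: overall parity; bits 1..71: Hamming code positions
# Non-power-of-two positions carry data; positions 1, 2, 4, ..., 64 carry check bits.
DATA_POS = [i for i in range(1, CODE_LEN) if i & (i - 1)]
assert len(DATA_POS) == 64


def _syndrome(word: List[int]) -> int:
    """XOR of the indices (1..71) of all set bits."""
    s = 0
    for i in range(1, CODE_LEN):
        if word[i]:
            s ^= i
    return s


def encode(data: int) -> List[int]:
    """Encode a 64-bit integer into a 72-bit SEC-DED word (list of 0/1)."""
    word = [0] * CODE_LEN
    for k, pos in enumerate(DATA_POS):
        word[pos] = (data >> k) & 1
    s = _syndrome(word)
    p = 1
    while p < CODE_LEN:
        word[p] = 1 if s & p else 0  # check bits cancel the data syndrome
        p <<= 1
    word[0] = sum(word[1:]) % 2  # overall parity: total number of 1s is even
    return word


def decode(word: List[int]) -> Tuple[Optional[int], str]:
    """Return (data, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    word = list(word)
    s = _syndrome(word)
    parity = sum(word) % 2
    if s == 0 and parity == 0:
        status = "ok"
    elif parity == 1 and s < CODE_LEN:
        word[s] ^= 1  # single-bit error at index s (s == 0: the parity bit)
        status = "corrected"
    else:
        return None, "uncorrectable"  # e.g. any two-bit error
    data = 0
    for k, pos in enumerate(DATA_POS):
        data |= word[pos] << k
    return data, status
```

Flipping any one of the 72 bits decodes as `"corrected"` with the original data, and flipping two distinct bits yields `"uncorrectable"`. With three or more flipped bits the decoder can be fooled, which is the limitation noted above.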
  • Chipkill technology can tolerate the failure of any single DRAM chip on any DIMM.
  • Chipkill achieves this high reliability through a wider memory controller (MC) data width and coarser-grained data encoding.
  • However, the technology works only with 4-bit-wide (x4) DRAM chips, which limits flexibility, and its coarse-grained data encoding makes each access read far more data than the memory request actually needs, wasting a large amount of power.
  • Embodiments of the present invention provide a memory access method and apparatus for a message-type memory module, offering a low-power, highly reliable memory access fault-tolerance solution with variable access granularity.
  • A first aspect of the present invention provides a memory access device for a message-type memory module, where the memory module includes (M+2) dynamic random access memory (DRAM) chips, M = 2^m, and m is a positive integer;
  • the data stored in each DRAM that can be accessed in one read/write cycle is called a single-chip burst cluster (SCBC), and the set of data stored in all DRAMs that can be accessed in the same read/write cycle forms a memory row;
  • the device includes:
  • a read/write module, configured to store each SCBC to be stored in the current read/write cycle into the corresponding DRAM within the current memory row, where the DRAMs used for storing SCBCs do not include the (M+2)th DRAM;
  • a processing module, configured to calculate a separate set of error detection codes for each SCBC in a memory row, and one set of error correction codes over all SCBCs in the memory row;
  • the read/write module is further configured to store the error detection code calculated for a memory row in the (M+2)th DRAM of that row, and to store the error correction code calculated for the row in the Zth DRAM of that row, where Z is a positive integer with 1 ≤ Z ≤ (M+1), and the error correction codes of any (M+1) consecutive memory rows are stored in different DRAMs.
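The text fixes two things about placement. P0 sits in DRAM (M+1), and the error correction codes of any (M+1) consecutive rows land in different DRAMs. It does not fix the rotation order. The minimal sketch below assumes a RAID-5-style rotation in which P moves one DRAM to the left per row. `parity_dram` and `row_layout` are hypothetical names.

```python
from typing import List


def parity_dram(row: int, m: int) -> int:
    """1-based index of the DRAM holding error correction code P in `row`.

    Assumption: row 0 puts P in DRAM (M+1), and P moves one DRAM to the
    left per row, wrapping every (M+1) rows.
    """
    return (m + 1) - (row % (m + 1))


def row_layout(row: int, m: int) -> List[str]:
    """Label the (M+2) DRAMs of one memory row as D0..D(M-1), P, and C."""
    z = parity_dram(row, m)
    labels: List[str] = []
    d = 0
    for dram in range(1, m + 2):  # DRAMs 1..(M+1) hold the data SCBCs and P
        if dram == z:
            labels.append("P")
        else:
            labels.append(f"D{d}")
            d += 1
    labels.append("C")  # DRAM (M+2) always holds the error detection code
    return labels


if __name__ == "__main__":
    for r in range(3):
        print(r, row_layout(r, 8))
```

With M = 8, rows 0 to 8 place P in DRAMs 9, 8, ..., 1. Any 9 consecutive rows therefore use 9 different DRAMs, which satisfies the constraint on Z.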
  • the processing module is further configured to: when a read access request is received, instruct the read/write module to read the required SCBCs and the corresponding error detection code from the current memory row; check the read SCBCs against the error detection code to determine whether any SCBC is in error; when an SCBC error is found, obtain the number of erroneous SCBCs; and if exactly one SCBC is in error, instruct the read/write module to read all data of the memory row and recover the erroneous SCBC from the error correction code in the row and the other, error-free SCBCs.
  • the processing module is further configured to: when a write access request is received, first determine whether the number X of second SCBCs to be written is less than or equal to M/2; if X ≤ M/2, instruct the read/write module to read the first error detection code and first error correction code stored in the current memory row, together with the X first SCBCs stored in the X DRAMs into which the second SCBCs are to be written; determine from the first error detection code whether any first SCBC is in error; if none is, calculate a second error detection code for the X second SCBCs, calculate a second error correction code from the first error correction code, the X first SCBCs, and the X second SCBCs, and instruct the read/write module to write the X second SCBCs, the second error correction code, and the second error detection code into the corresponding DRAMs; if X is greater than M/2, instruct …
  • the processing module is further configured to: when a first SCBC error is found, obtain the number of erroneous first SCBCs; if exactly one first SCBC is in error, instruct the read/write module to read all data of the memory row and recover the erroneous first SCBC from the error correction code in the row and the other, error-free SCBCs.
  • A second aspect of the present invention provides a memory access method for a message-type memory module, where the memory module includes (M+2) dynamic random access memory (DRAM) chips, M = 2^m, and m is a positive integer; the data stored in each DRAM that can be accessed in one read/write cycle is called a single-chip burst cluster (SCBC), and the set of data stored in all DRAMs that can be accessed in the same read/write cycle forms a memory row;
  • the method includes:
  • storing each SCBC to be stored in the current read/write cycle into the corresponding DRAM within the current memory row, where the DRAMs used for storing SCBCs do not include the (M+2)th DRAM;
  • storing the error detection code calculated for a memory row in the (M+2)th DRAM of that row, and storing the error correction code calculated for the row in the Zth DRAM of that row, where Z is a positive integer with 1 ≤ Z ≤ (M+1), and the error correction codes of any (M+1) consecutive memory rows are stored in different DRAMs.
  • The method further includes: when a read access request is received, reading the required SCBCs and the corresponding error detection code from the current memory row and checking the read SCBCs against the error detection code to determine whether any SCBC is in error; when an SCBC error is found, obtaining the number of erroneous SCBCs; and if exactly one SCBC is in error, reading all data of the memory row and recovering the erroneous SCBC from the error correction code in the row and the other, error-free SCBCs.
  • The method further includes: when a write access request is received, determining whether the number X of second SCBCs to be written is less than or equal to M/2; if X ≤ M/2, reading the first error detection code and first error correction code stored in the current memory row, together with the X first SCBCs stored in the X DRAMs into which the second SCBCs are to be written; determining from the first error detection code whether any first SCBC is in error; if none is, calculating a second error detection code for the X second SCBCs, calculating a second error correction code from the first error correction code, the X first SCBCs, and the X second SCBCs, and writing the X second SCBCs, the second error correction code, and the second error detection code into the corresponding DRAMs; if X is greater than M/2, reading the first error detection code and the first error correction code stored in the …
  • The method further includes: obtaining the number of erroneous first SCBCs; if exactly one first SCBC is in error, reading all data of the memory row and recovering the erroneous first SCBC from the error correction code in the row and the other, error-free SCBCs.
  • The technical solution of the embodiments performs fine-grained coding protection using the SCBC as the basic read/write unit, but places no particular restriction on the SCBC size. That is, the bit width and burst length of the DRAM chips are not restricted, so variable-granularity memory access can be supported. Storing the error detection codes in a dedicated DRAM enables error detection at different access granularities. The error detection code combined with the error correction code can correct any multi-bit error within a single DRAM, making this a low-power, highly reliable solution.
  • FIG. 1 is a schematic diagram of a memory access device of a message-type memory module according to an embodiment of the present invention;
  • FIG. 2 is a schematic diagram of the data encoding structure of the memory module according to an embodiment of the present invention;
  • FIG. 3 is a flowchart of a memory access method for a message-type memory module according to an embodiment of the present invention;
  • FIG. 4a and FIG. 4b are schematic diagrams of a memory control system according to an embodiment of the present invention.
  • Embodiments of the present invention provide a memory access method for a message-type memory module, offering a low-power, highly reliable memory access fault-tolerance solution with variable granularity. Embodiments of the present invention also provide corresponding devices. Each is described in detail below.
  • Embodiment 1
  • An embodiment of the present invention provides a memory access device for a message-type memory module.
  • The device may be deployed in the peripheral control circuit of the memory module or in the memory controller; the memory controller may be integrated into the central processing unit (CPU) or on the computer motherboard.
  • CPU: central processing unit
  • Specifically, the memory module may be a DIMM containing multiple DRAM chips.
  • The bit width of each DRAM is N bits, where N = 2^n and n is a positive integer; the burst length (BL) of each DRAM is Q, where Q is a positive integer, and preferably Q is equal to 2 times.
  • (M × N) is the bit width of the entire memory module; M is a power of 2 because the bit width (M × N) of a computer memory module is generally a power of 2.
  • The data in each DRAM that can be accessed in one read/write cycle is called a single-chip burst cluster (SCBC), and each SCBC contains (N × Q) bits of data.
  • SCBC: Single Chip Burst Cluster
  • Each DRAM can also be regarded as a memory column: a memory row corresponds to one read/write cycle, and a memory column corresponds to one DRAM chip.
  • The memory access device 100 includes a read/write module 110 and a processing module 120;
  • the read/write module 110 is configured to store each SCBC to be stored in the current read/write cycle into the corresponding DRAM within the current memory row, where the DRAMs used for storing SCBCs do not include the (M+2)th DRAM; the number of SCBCs stored in one read/write cycle does not exceed M;
  • the processing module 120 is configured to calculate a separate set of error detection codes for each SCBC in a memory row, and one set of error correction codes over all SCBCs in the memory row;
  • the read/write module 110 is further configured to store the error detection code calculated for a memory row in the (M+2)th DRAM of that row, and to store the error correction code calculated for the row in the Zth DRAM of that row, where Z is a positive integer with 1 ≤ Z ≤ (M+1), and the error correction codes of any (M+1) consecutive memory rows are stored in different DRAMs.
  • the processing module 120 is further configured to: when a read access request is received, instruct the read/write module to read the required SCBCs and the corresponding error detection code from the current memory row, and check the read SCBCs against the error detection code to determine whether any SCBC is in error; when an SCBC error is found, obtain the number of erroneous SCBCs; if exactly one SCBC is in error, instruct the read/write module to read all data of the memory row and recover the erroneous SCBC from the error correction code in the row and the other, error-free SCBCs. If two or more SCBCs are in error, the processing module 120 reports an unrecoverable error to the master device, such as the memory controller.
  • the processing module 120 is further configured to: when a write access request is received, first determine whether the number X of second SCBCs to be written is less than or equal to M/2; if X ≤ M/2, instruct the read/write module to read the first error detection code and first error correction code stored in the current memory row, together with the X first SCBCs stored in the X DRAMs into which the second SCBCs are to be written, and determine from the first error detection code whether any first SCBC is in error; if none is, calculate a second error detection code for the X second SCBCs, calculate a second error correction code from the first error correction code, the X first SCBCs, and the X second SCBCs, and instruct the read/write module to write the X second SCBCs, the second error correction code, and the second error detection code into the corresponding DRAMs; if X is greater than M/2, instruct the read/write module to read the first error detection code and the first error correction code stored in …
  • the processing module 120 is further configured to: when a first SCBC error is found, obtain the number of erroneous first SCBCs; if exactly one first SCBC is in error, instruct the read/write module to read all data of the memory row and recover the erroneous first SCBC from the error correction code in the row and the other, error-free SCBCs. If the SCBCs in two or more DRAMs are in error, the processing module 120 reports an unrecoverable error to the master device, such as the memory controller.
  • Here, the first SCBC, first error correction code, and first error detection code are the currently stored SCBC, error correction code, and error detection code. The second SCBC, second error correction code, and second error detection code are the SCBC, error correction code, and error detection code to be written; they overwrite the first SCBC, first error correction code, and first error detection code.
  • The memory module of a computer system is usually 64 bits wide. If DRAM chips with a bit width N of 8 are selected, the number of DRAMs storing SCBCs is 64 / 8 = 8. The device of this embodiment is further described below, taking M = N = 8 as an example:
  • The data encoding structure of the memory module is shown in FIG. 2, where the 10 columns represent the first through (M+2)th DRAMs.
  • The (M+2)th DRAM, the last column in the figure, stores the error detection codes;
  • in each memory row, one of the first through (M+1)th DRAMs stores the error correction code;
  • the other eight DRAMs store the SCBCs.
  • Data (D) denotes memory access data,
  • Parity (P) denotes the error correction code, and
  • Checksum (C) denotes the error detection code.
  • The SCBCs stored in one memory row are denoted D0 to D7,
  • the error correction codes stored in (M+1) consecutive memory rows are denoted P0 to P8, and
  • the error detection codes stored in (M+1) consecutive memory rows are denoted C0 to C8.
  • Strictly speaking, memory access data, error correction codes, and error detection codes are all SCBCs, but in this document "SCBC" refers specifically to memory access data (D).
  • The fault-tolerant coding in this embodiment consists of two parts: an error detection code and an error correction code.
  • The error detection code is calculated as follows. A checksum is computed for each of the 8 SCBCs in the same memory row, using either 8-bit modulo-2 addition or an 8-bit one's-complement sum. Each SCBC yields an 8-bit check code, so the 8 SCBCs in a row yield a 64-bit check code. This code is stored in the corresponding memory row of the dedicated (M+2)th DRAM chip as the error detection code;
  • The error correction code is calculated by XOR-ing the corresponding bits of the eight SCBCs in the same memory row, producing a 64-bit parity code. For example, XOR-ing the corresponding bits of D0 to D7 yields the error correction code P0, which is stored in the same row of the (M+1)th DRAM. Note that the error correction codes of consecutive memory rows are stored in different DRAMs, i.e., in a striped layout.
  • The error correction codes are striped for the following reason. If the error correction codes P of all memory rows were stored on the same DRAM chip, every write would have to update P at the corresponding position on that chip. During consecutive writes, this chip would become an access hotspot, increasing the delay between two write operations and degrading update performance. Moreover, writing P requires pre-reading all the data stored in the entire memory row, which further degrades performance.
  • Striping the error correction codes solves the above problems.
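A minimal sketch of the row encoding just described, assuming M = N = 8 and a burst length of 8. Under that assumption each SCBC is 8 bytes, which matches 3 SCBCs being 24 bytes in the read example. SCBCs are modeled as Python `bytes` objects, and `checksum8`, `xor_bytes`, and `encode_row` are hypothetical names. The text does not say whether the one's-complement variant is inverted at the end, so it is left uninverted here.

```python
from functools import reduce
from typing import List, Tuple

SCBC_BYTES = 8  # N x Q = 8 x 8 = 64 bits per SCBC in the M = N = 8 example


def checksum8(scbc: bytes, ones_complement: bool = False) -> int:
    """8-bit check code of one SCBC.

    Default: 8-bit modulo-2 addition (bytewise XOR). With ones_complement=True,
    an 8-bit one's-complement sum with end-around carry is used instead.
    """
    if not ones_complement:
        return reduce(lambda a, b: a ^ b, scbc, 0)
    s = 0
    for b in scbc:
        s += b
        s = (s & 0xFF) + (s >> 8)  # fold the carry back into the low byte
    return s


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bitwise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode_row(data: List[bytes], ones_complement: bool = False) -> Tuple[bytes, bytes]:
    """Return (C, P) for one memory row holding M SCBCs.

    C: one check byte per SCBC (byte k checks D_k); 8 SCBCs give 64 bits.
    P: bitwise XOR of all SCBCs in the row (64-bit parity).
    """
    if any(len(d) != SCBC_BYTES for d in data):
        raise ValueError("every SCBC must be SCBC_BYTES long")
    c = bytes(checksum8(d, ones_complement) for d in data)
    p = reduce(xor_bytes, data, bytes(SCBC_BYTES))
    return c, p
```

C and P come out as 8 bytes (64 bits) each, matching the sizes in the text. XOR-ing P with all eight SCBCs of the row gives zero, which is the property that error recovery relies on.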
  • The device of this embodiment performs three processing flows, described in detail below: read access processing, write access processing, and error recovery processing.
  • Read access processing: the requested SCBC data is read from the DRAM chips together with the error detection code of the current memory row, and the read SCBC data is checked. If no SCBC is in error, the read completes and the data is returned. If an SCBC error occurs on exactly one DRAM chip, error recovery processing is performed. If SCBC errors occur on multiple DRAM chips, an unrecoverable error is reported to the master device.
  • Here, the access granularity is the number of SCBCs to be accessed.
  • For example, the read access request RD(D0, D1, D2) reads D0 to D2, 3 SCBCs or 24 bytes of data in total. It activates 4 DRAM chips, reads D0 to D2 and C0, and checks D0 to D2 against the first 3 bytes of C0.
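The read flow can be sketched as follows, under the same assumptions as the text's example: 8-byte SCBCs, one check byte per SCBC in C, and a bytewise-XOR checksum (one of the two options the text allows). Only the requested SCBCs and C are touched, so RD(D0, D1, D2) activates 4 chips. `read_scbcs` and `chips_activated` are hypothetical helpers.

```python
from typing import Dict, List, Tuple


def checksum8(scbc: bytes) -> int:
    """8-bit modulo-2 sum (bytewise XOR) of one SCBC."""
    v = 0
    for b in scbc:
        v ^= b
    return v


def read_scbcs(row: List[bytes], c: bytes,
               wanted: List[int]) -> Tuple[str, Dict[int, bytes], List[int]]:
    """Read the requested SCBCs plus C and check each against its C byte.

    Returns (status, data, bad). The status is "ok", "recover" (exactly one
    SCBC failed: run error recovery) or "unrecoverable" (two or more failed).
    """
    data = {k: row[k] for k in wanted}
    bad = [k for k in wanted if checksum8(data[k]) != c[k]]
    if not bad:
        return "ok", data, bad
    return ("recover" if len(bad) == 1 else "unrecoverable"), data, bad


def chips_activated(wanted: List[int]) -> int:
    """Data DRAMs actually read, plus the dedicated error-detection DRAM."""
    return len(wanted) + 1
```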
  • Write access processing: when a write access request arrives, the processing module first determines the access granularity of the request, that is, the number of SCBCs to be written.
  • If the access granularity, i.e., the number X of SCBCs to be written, is less than or equal to M/2:
  • the original SCBC data at the positions to be written is read together with the row's original error correction code and original error detection code, and the original SCBC data is verified with the original error detection code. If exactly one SCBC is in error, error recovery processing is performed. If there is no error, the original error correction code is XORed with the original SCBC data and the new SCBC data to obtain the updated error correction code P', and the new error detection code C' is calculated for the new SCBC data.
  • The read/write module writes the new SCBCs, the new error correction code, and the new error detection code to the corresponding positions in the corresponding DRAMs.
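  • For example, for a write of D0', D1', and D2', the device computes $P_0' = P_0 \oplus D_0 \oplus D_1 \oplus D_2 \oplus D_0' \oplus D_1' \oplus D_2'$, calculates the checksums of D0', D1', and D2' to update C0 to C0', and finally writes D0', D1', D2', P0', and C0'.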
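For X ≤ M/2 the new parity relies on XOR cancellation: XOR-ing the old SCBCs into P removes them, and XOR-ing the new ones adds them. Only the X old SCBCs, P, and C need to be read. Below is a minimal sketch assuming 8-byte SCBCs and the bytewise-XOR checksum. `small_write` is a hypothetical name, and error recovery is reduced to raising an exception.

```python
from typing import Dict, Tuple


def checksum8(scbc: bytes) -> int:
    """8-bit modulo-2 sum (bytewise XOR) of one SCBC."""
    v = 0
    for b in scbc:
        v ^= b
    return v


def xor_bytes(*blocks: bytes) -> bytes:
    """Bitwise XOR of equally sized byte blocks."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, b in enumerate(blk):
            out[i] ^= b
    return bytes(out)


def small_write(old: Dict[int, bytes], p: bytes, c: bytes,
                new: Dict[int, bytes]) -> Tuple[bytes, bytes]:
    """Read-modify-write for X <= M/2; returns (P', C').

    old: the X first SCBCs pre-read from the positions being overwritten
    new: the X second SCBCs, keyed by the same positions
    """
    for k, d in old.items():  # verify the pre-read SCBCs against C first
        if checksum8(d) != c[k]:
            raise ValueError(f"D{k} failed its check; run error recovery first")
    # P' = P xor (old SCBCs) xor (new SCBCs)
    new_p = xor_bytes(p, *old.values(), *new.values())
    new_c = bytearray(c)
    for k, d in new.items():  # only the C bytes of rewritten SCBCs change
        new_c[k] = checksum8(d)
    return new_p, bytes(new_c)
```

The resulting P' equals the XOR of the whole updated row, even though the untouched SCBCs were never read.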
  • If X is greater than M/2, the SCBC data in the memory row that the write will not modify is read together with the row's error correction code P and error detection code C, and the data read is checked against C; if there is an error, error recovery processing is performed. Otherwise, the data to be written and the unmodified data just read are XORed to obtain the updated error correction code P'. The error detection code is calculated for the data to be written, and the new data, P', and the new error detection code are written into the corresponding positions of the corresponding DRAMs.
  • For example, a write access request WR writes 6 SCBCs (D0 to D5), 48 bytes of data in total.
  • A write granularity of 6 SCBCs exceeds half of the 8 SCBCs in the row. The device therefore first reads D6 and D7, which the write does not change, together with the check codes P0 and C0 of the same row, and checks D6 and D7 against C0.
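When X > M/2 it is cheaper to pre-read the (M − X) SCBCs that stay unchanged and rebuild P' from scratch. The sketch below also shows how the positions to pre-read are chosen; for the 6-SCBC write in the example it selects D6 and D7. The same assumptions apply: hypothetical names, 8-byte SCBCs, and a bytewise-XOR checksum. Error recovery is again reduced to raising an exception.

```python
from typing import Dict, List, Set, Tuple


def checksum8(scbc: bytes) -> int:
    """8-bit modulo-2 sum (bytewise XOR) of one SCBC."""
    v = 0
    for b in scbc:
        v ^= b
    return v


def xor_bytes(*blocks: bytes) -> bytes:
    """Bitwise XOR of equally sized byte blocks."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, b in enumerate(blk):
            out[i] ^= b
    return bytes(out)


def preread_indices(m: int, written: Set[int]) -> List[int]:
    """Positions to pre-read: the written ones if X <= M/2, else the untouched ones."""
    if len(written) <= m // 2:
        return sorted(written)
    return [k for k in range(m) if k not in written]


def large_write(unchanged: Dict[int, bytes], c: bytes,
                new: Dict[int, bytes]) -> Tuple[bytes, bytes]:
    """Write path for X > M/2; returns (P', C')."""
    for k, d in unchanged.items():  # verify the pre-read SCBCs against C
        if checksum8(d) != c[k]:
            raise ValueError(f"D{k} failed its check; run error recovery first")
    # Every SCBC of the row is now known, so P' is recomputed from scratch.
    new_p = xor_bytes(*unchanged.values(), *new.values())
    new_c = bytearray(c)
    for k, d in new.items():
        new_c[k] = checksum8(d)
    return new_p, bytes(new_c)
```

Taken together with the X ≤ M/2 path, the pre-read cost is at most M/2 SCBCs per write, plus P and C.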
  • Here, the original SCBC data, original error correction code, and original error detection code are those originally stored in the memory module, that is, the first SCBC data, first error correction code, and first error detection code described above. The new SCBC data, new error correction code, and new error detection code are the updated values, that is, the second SCBC data, second error correction code, and second error detection code described above.
  • When an error is detected, the device of this embodiment performs error recovery processing. It first reads all the data of the memory row containing the erroneous SCBC, including the error-free SCBCs, the error detection code C, and the error correction code P. The other SCBCs in the row, verified as correct by C, are XORed with P to obtain the correct data of the failed block. The corrected data is then written back.
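Error recovery can be sketched as follows, assuming the same 8-byte SCBCs and bytewise-XOR checksum. Because P is the XOR of all SCBCs in the row, XOR-ing P with every SCBC except the failed one leaves exactly the failed SCBC. `recover_row` is a hypothetical helper; writing the result back to DRAM is left to the caller.

```python
from typing import List


def checksum8(scbc: bytes) -> int:
    """8-bit modulo-2 sum (bytewise XOR) of one SCBC."""
    v = 0
    for b in scbc:
        v ^= b
    return v


def xor_bytes(*blocks: bytes) -> bytes:
    """Bitwise XOR of equally sized byte blocks."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, b in enumerate(blk):
            out[i] ^= b
    return bytes(out)


def recover_row(row: List[bytes], p: bytes, c: bytes) -> List[bytes]:
    """Check every SCBC of a row against C and rebuild a single failed one.

    Raises RuntimeError when SCBCs in two or more DRAMs fail their checks.
    """
    bad = [k for k, d in enumerate(row) if checksum8(d) != c[k]]
    if not bad:
        return list(row)
    if len(bad) > 1:
        raise RuntimeError("errors in two or more DRAMs: unrecoverable")
    k = bad[0]
    fixed = list(row)
    # D_k = P xor (all other SCBCs of the row)
    fixed[k] = xor_bytes(p, *(d for i, d in enumerate(row) if i != k))
    return fixed  # the caller writes fixed[k] back to its DRAM
```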
  • The specific configuration of the technical solution of the embodiments is flexible. The coding granularity can be set for different architecture designs by adjusting the bit width of the chips used, the burst length of the chips, and the number of burst-length clusters.
  • For example, an x4 bit width with a burst length of 4 reduces the SCBC size to 16 bits.
  • An x16 bit width can be used with two bursts of length 8, reading at least 256 bits of data at a time over two read/write cycles.
  • An x8 bit width with a burst length of 4 reduces the SCBC size to 32 bits, an x16 bit width with a burst length of 8 gives an SCBC size of 128 bits, and other options are possible.
  • The bit width and burst length can also be combined freely in other ways, with the bit width generally being a power of two.
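The granularity options above reduce to simple arithmetic: an SCBC holds N × Q bits, multiplied by the number of read/write cycles grouped together. The sketch also derives M for a 64-bit-wide module, as in the M = N = 8 example. Two points are interpretations rather than statements from the text: applying the 64-bit assumption to other chip widths, and reading the x16 two-burst case as 16 × 8 × 2 = 256 bits. The function names are hypothetical.

```python
from typing import List, Tuple

MODULE_WIDTH = 64  # assumed data width of the module in bits, as in the M = N = 8 example


def scbc_bits(n: int, q: int, cycles: int = 1) -> int:
    """SCBC size in bits: chip width N x burst length Q x grouped read/write cycles."""
    return n * q * cycles


def data_chips(n: int) -> int:
    """M = module width / N data DRAMs; the module then needs M + 2 DRAMs in total."""
    return MODULE_WIDTH // n


CONFIGS: List[Tuple[int, int, int]] = [(4, 4, 1), (8, 4, 1), (8, 8, 1), (16, 8, 1), (16, 8, 2)]

if __name__ == "__main__":
    for n, q, cycles in CONFIGS:
        print(f"x{n}, BL{q}, {cycles} cycle(s): SCBC = {scbc_bits(n, q, cycles)} bits, "
              f"M = {data_chips(n)}, {data_chips(n) + 2} DRAMs in total")
```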
  • The embodiment of the present invention provides a memory access device for a message-type memory module.
  • The device performs fine-grained coding protection using the SCBC as the basic read/write unit, but places no particular restriction on the SCBC size. That is, the bit width and burst length of the DRAM chips are not restricted, so variable-granularity access and a variable number of accessed DRAMs can be supported. Storing the error detection codes in a dedicated DRAM enables error detection at different access granularities. The error detection code combined with the error correction code can correct any multi-bit error within a single DRAM, making this a low-power, highly reliable solution.
  • The technical solution of this embodiment can tolerate any multi-bit error within a single chip.
  • The number of activated chips decreases by 44% to 83% (16 to 30 out of 36), and fewer activated chips means a corresponding reduction in dynamic power consumption.
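As a quick check of the quoted range, the figures can be read as 16 to 30 fewer activated chips out of 36. The baseline of 36 is presumably the Chipkill configuration, which is an assumption here:

$$
\frac{16}{36} \approx 0.444 \approx 44\%, \qquad \frac{30}{36} \approx 0.833 \approx 83\%
$$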
  • This patent achieves Chipkill-level fault-tolerant protection at lower power consumption. Unlike Chipkill, it is not restricted to a particular chip bit width, which makes it easier to scale.
  • In addition, this patent builds on the message-type memory design to achieve fine-grained protection and variable fault-tolerance granularity, leaving more room for optimization in upper-level architecture design.
  • An embodiment of the present invention further provides a memory access method for a message-type memory module.
  • The method is executed by the peripheral control circuit of the memory module or by the memory controller; specifically, the memory access device described in Embodiment 1 is deployed in the peripheral control circuit or the memory controller.
  • The memory module includes (M+2) dynamic random access memory (DRAM) chips, where M = 2^m and m is a positive integer. The data stored in each DRAM that can be accessed in one read/write cycle is called a single-chip burst cluster (SCBC), and the set of data stored in all DRAMs that can be accessed in the same read/write cycle forms a memory row.
  • the method includes:
  • The method further includes:
  • when a read access request is received, reading the required SCBCs and the corresponding error detection code from the current memory row, and checking the read SCBCs against the error detection code to determine whether any SCBC is in error;
  • when an SCBC error is found, obtaining the number of erroneous SCBCs; if exactly one SCBC is in error, reading all data of the memory row and recovering the erroneous SCBC from the error correction code in the row and the other, error-free SCBCs.
  • The method further includes:
  • when a write access request is received, determining whether the number X of second SCBCs to be written is less than or equal to M/2;
  • if X ≤ M/2, reading the first error detection code and first error correction code stored in the current memory row, together with the X first SCBCs stored in the X DRAMs into which the second SCBCs are to be written; determining from the first error detection code whether any first SCBC is in error; if none is, calculating a second error detection code for the X second SCBCs, calculating a second error correction code from the first error correction code, the X first SCBCs, and the X second SCBCs, and writing the X second SCBCs, the second error correction code, and the second error detection code into the corresponding DRAMs;
  • if X is greater than M/2, reading the first error detection code and first error correction code stored in the current memory row, together with the first SCBCs stored in the (M−X) DRAMs into which no second SCBC is to be written; determining from the first error detection code whether any first SCBC is in error; if none is, calculating a second error detection code for the X second SCBCs, calculating a second error correction code from the (M−X) first SCBCs and the X second SCBCs, and writing the X second SCBCs, the second error correction code, and the second error detection code into the corresponding DRAMs.
  • the method further includes:
  • The embodiment of the present invention provides a memory access method for a message-type memory module.
  • The method performs fine-grained coding protection using the SCBC as the basic read/write unit, but places no particular restriction on the SCBC size. That is, the bit width and burst length of the DRAM chips are not restricted, so variable-granularity access and a variable number of accessed DRAMs can be supported. Storing the error detection codes in a dedicated DRAM enables error detection at different access granularities. The error detection code combined with the error correction code can correct any multi-bit error within a single DRAM, making this a low-power, highly reliable solution.
  • the embodiment of the invention also provides a memory control system.
  • the system includes a message-type memory module 310, and the memory module 310 includes a peripheral control circuit 3101 and (M+2) DRAMs 3102;
  • M is equal to 2 to the power of m
  • m is a positive integer
  • the data in each DRAM that can be accessed in one read/write cycle is called a single-chip burst cluster (SCBC), and the set of data in all DRAMs that can be accessed in the same read/write cycle forms a memory row.
  • the memory controller can be integrated on a computer motherboard or integrated into a CPU of a computer.
  • the peripheral control circuit performs the following steps: storing the SCBCs to be stored in the current read/write cycle into the corresponding DRAMs in the current memory row, where the DRAMs used for storing SCBCs do not include the (M+2)th DRAM; calculating a set of error detection codes for each SCBC in a memory row, and a set of error correction codes for all SCBCs in the memory row; storing the error detection codes calculated for a memory row in the (M+2)th DRAM of that row, and the error correction code calculated for a memory row in the Zth DRAM of that row, where Z is a positive integer and 1 ≤ Z ≤ (M+1), the error correction codes of (M+1) consecutive memory rows being stored in different DRAMs.
  • the peripheral control circuit further performs the following steps:
  • the required SCBCs and the corresponding error detection code are read from the current memory row, the read SCBCs are checked, and whether there is an SCBC error is determined according to the error detection code; if there is, the number of erroneous SCBCs is obtained, and if exactly one SCBC is in error, all data of the memory row is read and the erroneous SCBC is recovered from the error correction code of that row and the other, error-free SCBCs.
  • the peripheral control circuit further performs the following steps:
  • the number of erroneous first SCBCs is also obtained; if exactly one first SCBC is in error, all data of the memory row is read, and the erroneous first SCBC is recovered from the error correction code of that row and the other, error-free SCBCs.
  • the system includes a message-type memory module 310 and a memory controller 320, and the memory module 310 includes (M+2) DRAMs 3102;
  • M is equal to 2 to the power of m
  • m is a positive integer
  • the data in each DRAM that can be accessed in one read/write cycle is called a single-chip burst cluster (SCBC), and the set of data in all DRAMs that can be accessed in the same read/write cycle forms a memory row.
  • the memory controller can be integrated on a computer motherboard or integrated into a CPU of a computer.
  • the memory controller performs the following steps:
  • storing the SCBCs to be stored in the current read/write cycle into the corresponding DRAMs in the current memory row, where the DRAMs used for storing SCBCs do not include the (M+2)th DRAM; calculating a set of error detection codes for each SCBC in a memory row, and a set of error correction codes for all SCBCs in the memory row; storing the error detection codes calculated for a memory row in the (M+2)th DRAM of that row, and the error correction code calculated for a memory row in the Zth DRAM of that row, where Z is a positive integer and 1 ≤ Z ≤ (M+1), the error correction codes of (M+1) consecutive memory rows being stored in different DRAMs.
  • the memory controller further performs the following steps:
  • the required SCBCs and the corresponding error detection code are read from the current memory row, the read SCBCs are checked, and whether there is an SCBC error is determined according to the error detection code; if there is, the number of erroneous SCBCs is obtained, and if exactly one SCBC is in error, all data of the memory row is read and the erroneous SCBC is recovered from the error correction code of that row and the other, error-free SCBCs.
  • the memory controller further performs the following steps:
  • the number of erroneous first SCBCs is also obtained; if exactly one first SCBC is in error, all data of the memory row is read, and the erroneous first SCBC is recovered from the error correction code of that row and the other, error-free SCBCs.
  • the embodiment of the present invention provides a memory control system in which the memory module uses the SCBC as the basic unit of reading and writing for fine-grained coding protection, but places no particular limit on the size of the SCBC, that is, on the bit width and burst length of the DRAM memory chips; it can therefore support variable-granularity memory access and a variable number of DRAM accesses. Storing the error detection codes in a dedicated DRAM enables error detection at different access granularities, and the error detection code together with the error correction code can correct any multi-bit error within a single DRAM, making this a low-power, high-reliability solution.
  • the disclosed system, apparatus, and method may be implemented in other manners.
  • the device embodiments described above are merely illustrative; for example, the division into modules or units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented.
  • in addition, the mutual couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices or units, and may be electrical, mechanical or in other forms.
  • components shown as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the above integrated unit can be implemented in the form of hardware or in the form of a software function unit.
  • if the integrated unit is implemented as a software functional unit and sold or used as a standalone product, it may be stored in a computer-readable storage medium; the computer software product includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform all or some of the steps of the methods described in the embodiments of the present application.
  • the foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
  • the storage medium may include a read-only memory, a random access memory, a magnetic disk, an optical disc, or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Techniques For Improving Reliability Of Storages (AREA)
  • Detection And Correction Of Errors (AREA)

Abstract

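The present invention discloses a memory access apparatus for a message-type memory module, including: a read/write module, configured to store the SCBCs to be stored in the current read/write cycle into the corresponding DRAMs; and a processing module, configured to calculate a set of error detection codes for each SCBC in a memory row and a set of error correction codes for all SCBCs in the memory row. The read/write module is further configured to store the error detection codes in the (M+2)th DRAM of that memory row and the error correction code in the Zth DRAM of that memory row, where Z is a positive integer and 1 ≤ Z ≤ (M+1), the error correction codes of (M+1) consecutive memory rows being stored in different DRAMs. Embodiments of the present invention also provide a corresponding method. The technical solution uses the SCBC as the basic unit of reading and writing for fine-grained coding protection, supports variable-granularity memory access, and can correct any multi-bit error in a single DRAM.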

Description

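Memory access method and apparatus for a message-type memory module

This application claims priority to Chinese Patent Application No. 201310330220.6, filed with the Chinese Patent Office on July 31, 2013 and entitled "Memory access method and apparatus for message-type memory module", which is incorporated herein by reference in its entirety.
Technical Field
The present invention relates to the field of communications technologies, and in particular, to a memory access method and apparatus for a message-type memory module.
Background
Memory reliability plays a crucial role in the operation of a computer system. On the one hand, as systems are configured with more and more memory, the failure rate of the memory system rises exponentially; on the other hand, with the introduction of low-voltage operating modes, memory becomes more likely to err and the number of errors grows.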
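Error Checking and Correcting (ECC) memory is a commonly used memory reliability solution. ECC memory, that is, a memory module with ECC check bits, protects data in units of the module bit width. For a 64-bit-wide module, each time 64 bits of data are written, 8 check bits are calculated for that data and stored in a separate ECC chip; the 64 data bits and the 8 check bits together form a 72-bit ECC word. This code can correct any single-bit error in the 72-bit ECC word, but a two-bit error can only be detected, not corrected, and errors in more bits are beyond its capability.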
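On the basis of ECC memory, IBM proposed the Chipkill memory technology. Chipkill is designed on the observation that memory errors tend to cluster on the same dynamic random access memory (DRAM) chip, and it can tolerate the failure of any one DRAM chip. The memory controller (MC) of Chipkill memory must control four dual inline memory modules (DIMMs) with ECC working together; the MC width consists of four 72-bit ECC words, and each ECC word can detect and correct a single-bit error. The DRAM chips on each DIMM must be 4 bits wide, and the design carefully maps the 4 I/O bits of each DRAM chip to 4 different ECC words, so that even if the data on all 4 pins of one DRAM chip are wrong, the 4 ECC words can still recover it; in other words, Chipkill tolerates the failure of any one DRAM chip on any DIMM. Chipkill achieves higher reliability through a wider MC data path and coarser-grained data encoding. However, in theory it applies only to 4-bit-wide DRAM chips, which limits its flexibility, and its coarse encoding granularity means each access reads far more data than the request actually needs, wasting a large amount of power.
Summary
Embodiments of the present invention provide a memory access method and apparatus for a message-type memory module, so as to provide a low-power, high-reliability, variable-granularity fault-tolerance solution for memory access.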
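A first aspect of the present invention provides a memory access apparatus for a message-type memory module. The memory module includes (M+2) dynamic random access memories (DRAMs), where M is equal to 2 to the power of m and m is a positive integer; the data in each DRAM that can be accessed in one read/write cycle is called a single-chip burst cluster (SCBC), and the set of data in all DRAMs that can be accessed in the same read/write cycle forms a memory row;
The apparatus includes:
a read/write module, configured to store the SCBCs to be stored in the current read/write cycle into the corresponding DRAMs in the current memory row, where the DRAMs used for storing SCBCs do not include the (M+2)th DRAM;
a processing module, configured to calculate a set of error detection codes for each SCBC in a memory row, and a set of error correction codes for all SCBCs in a memory row;
the read/write module is further configured to store the error detection codes calculated for a memory row in the (M+2)th DRAM of that memory row, and to store the error correction code calculated for a memory row in the Zth DRAM of that memory row, where Z is a positive integer and 1 ≤ Z ≤ (M+1), the error correction codes of (M+1) consecutive memory rows being stored in different DRAMs.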
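In a first possible implementation, the processing module is further configured to: upon receiving a read request, instruct the read/write module to read the required SCBCs and the corresponding error detection codes from the current memory row; check the read SCBCs according to the error detection codes to determine whether there is an SCBC error; when there is an SCBC error, obtain the number of erroneous SCBCs; and, if exactly one SCBC is in error, instruct the read/write module to read all data of the memory row and recover the erroneous SCBC from the error correction code of the memory row and the other, error-free SCBCs.
With reference to the first aspect or the first possible implementation of the first aspect, in a second possible implementation, the processing module is further configured to, upon receiving a write request, first determine whether the number X of second SCBCs to be written is less than or equal to M/2. If X ≤ M/2, the processing module instructs the read/write module to read the first error detection code and the first error correction code stored in the current memory row, together with the X first SCBCs stored in the X DRAMs to which the second SCBCs are to be written; determines, according to the first error detection code, whether there is a first SCBC error; and, when there is none, calculates a second error detection code for the X second SCBCs, calculates a second error correction code according to the first error correction code, the X first SCBCs and the X second SCBCs, and instructs the read/write module to write the X second SCBCs, the second error correction code and the second error detection code into the corresponding DRAMs. If X > M/2, the processing module instructs the read/write module to read the first error detection code and the first error correction code stored in the current memory row, together with the first SCBCs stored in the (M-X) DRAMs that are not to be written with second SCBCs; determines, according to the first error detection code, whether there is a first SCBC error; and, when there is none, calculates a second error detection code for the X second SCBCs, calculates a second error correction code according to the (M-X) first SCBCs and the X second SCBCs, and instructs the read/write module to write the X second SCBCs, the second error correction code and the second error detection code into the corresponding DRAMs.
With reference to the second possible implementation of the first aspect, in a third possible implementation, the processing module is further configured to: when determining that there is a first SCBC error, obtain the number of erroneous first SCBCs; and, if exactly one first SCBC is in error, instruct the read/write module to read all data of the memory row and recover the erroneous first SCBC from the error correction code of the memory row and the other, error-free SCBCs.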
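A second aspect of the present invention provides a memory access method for a message-type memory module. The memory module includes (M+2) dynamic random access memories (DRAMs), where M is equal to 2 to the power of m and m is a positive integer; the data in each DRAM that can be accessed in one read/write cycle is called a single-chip burst cluster (SCBC), and the set of data in all DRAMs that can be accessed in the same read/write cycle forms a memory row;
The method includes:
storing the SCBCs to be stored in the current read/write cycle into the corresponding DRAMs in the current memory row, where the DRAMs used for storing SCBCs do not include the (M+2)th DRAM;
calculating a set of error detection codes for each SCBC in a memory row, and a set of error correction codes for all SCBCs in a memory row;
storing the error detection codes calculated for a memory row in the (M+2)th DRAM of that memory row, and storing the error correction code calculated for a memory row in the Zth DRAM of that memory row, where Z is a positive integer and 1 ≤ Z ≤ (M+1), the error correction codes of (M+1) consecutive memory rows being stored in different DRAMs.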
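In a first possible implementation, the method further includes: upon receiving a read request, reading the required SCBCs and the corresponding error detection codes from the current memory row; checking the read SCBCs according to the error detection codes to determine whether there is an SCBC error; when there is an SCBC error, obtaining the number of erroneous SCBCs; and, if exactly one SCBC is in error, reading all data of the memory row and recovering the erroneous SCBC from the error correction code of the memory row and the other, error-free SCBCs.
With reference to the second aspect or the first possible implementation of the second aspect, in a second possible implementation, the method further includes: upon receiving a write request, determining whether the number X of second SCBCs to be written is less than or equal to M/2; if X ≤ M/2, reading the first error detection code and the first error correction code stored in the current memory row, together with the X first SCBCs stored in the X DRAMs to which the second SCBCs are to be written, determining, according to the first error detection code, whether there is a first SCBC error, and, when there is none, calculating a second error detection code for the X second SCBCs, calculating a second error correction code according to the first error correction code, the X first SCBCs and the X second SCBCs, and writing the X second SCBCs, the second error correction code and the second error detection code into the corresponding DRAMs; if X > M/2, reading the first error detection code and the first error correction code stored in the current memory row, together with the first SCBCs stored in the (M-X) DRAMs that are not to be written with second SCBCs, determining, according to the first error detection code, whether there is a first SCBC error, and, when there is none, calculating a second error detection code for the X second SCBCs, calculating a second error correction code according to the (M-X) first SCBCs and the X second SCBCs, and writing the X second SCBCs, the second error correction code and the second error detection code into the corresponding DRAMs.
With reference to the second possible implementation of the second aspect, in a third possible implementation, after the determining whether there is an SCBC error, the method further includes: obtaining the number of erroneous first SCBCs; and, if exactly one first SCBC is in error, reading all data of the memory row and recovering the erroneous first SCBC from the error correction code of the memory row and the other, error-free SCBCs.
The technical solutions of the embodiments of the present invention use the SCBC as the basic unit of reading and writing for fine-grained coding protection but place no particular limit on the size of the SCBC, that is, on the bit width and burst length of the DRAM memory chips, and can therefore support variable-granularity memory access. Moreover, storing the error detection codes in a dedicated DRAM enables error detection at different access granularities, and the error detection code together with the error correction code can correct any multi-bit error within a single DRAM, making this a low-power, high-reliability solution.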
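Brief Description of Drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the following briefly introduces the accompanying drawings required for describing the embodiments or the prior art. Apparently, the accompanying drawings in the following description show merely some embodiments of the present invention, and a person of ordinary skill in the art may still derive other drawings from these accompanying drawings without creative efforts.
FIG. 1 is a schematic diagram of a memory access apparatus for a message-type memory module according to an embodiment of the present invention;
FIG. 2 is a principle diagram of a memory access apparatus for a message-type memory module according to an embodiment of the present invention;
FIG. 3 is a flowchart of a memory access method for a message-type memory module according to an embodiment of the present invention;
FIG. 4a and FIG. 4b are schematic diagrams of memory control systems according to embodiments of the present invention.
Detailed Description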
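Embodiments of the present invention provide a memory access method for a message-type memory module, so as to provide a low-power, high-reliability, variable-granularity fault-tolerance solution for memory access. Embodiments of the present invention further provide a corresponding apparatus. Detailed descriptions are given below.
Embodiment 1
Referring to FIG. 1, an embodiment of the present invention provides a memory access apparatus for a message-type memory module. The apparatus may be deployed in the peripheral control circuit of the memory module or in the memory controller; the memory controller in turn may be integrated in the central processing unit (CPU) or on the computer motherboard.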
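The memory module may specifically be a DIMM that includes multiple DRAMs. Herein it is assumed that the DRAM bit width is N bits, where N is equal to 2 to the power of n and n is a positive integer, and that the DRAM burst length (BL) is Q, where Q is a positive integer, preferably a power of 2 such as 4 or 8. The DIMM is assumed to include (M+2) DRAMs, and (M × N) is the bit width of the entire memory module. Because the bit width (M × N) of a computer memory module is generally a power of 2, typically 32 or 64, M is also a power of 2 and can be written as M equal to 2 to the power of m, where m is a positive integer. Once the bit width of the memory module is fixed, the number of DRAMs required can be determined from the bit width of a single DRAM.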
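The relationship between module width, chip width N and the chip count can be sketched as follows. This is a minimal illustration assuming the module width equals M × N with M = 2 to the power of m (m ≥ 1); the function name `module_geometry` is hypothetical and not taken from the patent.

```python
def module_geometry(module_width_bits: int, chip_width_bits: int) -> dict:
    """Derive M, m and the chip count of the (M+2)-chip layout.

    Hypothetical helper for illustration: module width = M x N, M = 2**m, m >= 1.
    """
    if module_width_bits % chip_width_bits:
        raise ValueError("module width must be a multiple of the chip width N")
    m_data = module_width_bits // chip_width_bits  # M: data SCBCs per memory row
    if m_data < 2 or m_data & (m_data - 1):
        raise ValueError("M must equal 2**m for a positive integer m")
    return {
        "M": m_data,
        "m": m_data.bit_length() - 1,
        # (M+1) chips share data and the rotating parity; 1 chip holds checksums
        "total_chips": m_data + 2,
    }


print(module_geometry(64, 8))  # {'M': 8, 'm': 3, 'total_chips': 10}
```

For the 64-bit module built from x8 chips used later in this embodiment, this gives M = 8 and ten DRAMs in total.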
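In this embodiment, the data in each DRAM that can be accessed in one read/write cycle is called a single-chip burst cluster (SCBC), and each SCBC contains (N × Q) bits of data. The set of data in all DRAMs that can be accessed in the same read/write cycle, that is, the set of SCBCs, forms a memory row. Each DRAM can also be regarded as a memory column. In other words, a memory row corresponds to one read/write cycle, and a memory column corresponds to one DRAM.
The memory access apparatus 100 includes a read/write module 110 and a processing module 120, where:
the read/write module 110 is configured to store the SCBCs to be stored in the current read/write cycle into the corresponding DRAMs in the current memory row, where the DRAMs used for storing SCBCs do not include the (M+2)th DRAM; the number of SCBCs to be stored in one read/write cycle does not exceed M;
the processing module 120 is configured to calculate a set of error detection codes for each SCBC in a memory row, and a set of error correction codes for all SCBCs in a memory row;
the read/write module 110 is further configured to store the error detection codes calculated for a memory row in the (M+2)th DRAM of that memory row, and to store the error correction code calculated for a memory row in the Zth DRAM of that memory row, where Z is a positive integer and 1 ≤ Z ≤ (M+1), the error correction codes of (M+1) consecutive memory rows being stored in different DRAMs.
Optionally, the processing module 120 is further configured to: upon receiving a read request, instruct the read/write module to read the required SCBCs and the corresponding error detection codes from the current memory row; check the read SCBCs according to the error detection codes to determine whether there is an SCBC error; when there is an SCBC error, obtain the number of erroneous SCBCs; and, if exactly one SCBC is in error, instruct the read/write module to read all data of the memory row and recover the erroneous SCBC from the error correction code of the memory row and the other, error-free SCBCs. If two or more SCBCs are in error, the processing module 120 reports an uncorrectable error to the host device, for example, the memory controller.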
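Optionally, the processing module 120 is further configured to first determine whether the number X of second SCBCs to be written is less than or equal to M/2. If X ≤ M/2, it instructs the read/write module to read the first error detection code and the first error correction code stored in the current memory row, together with the X first SCBCs stored in the X DRAMs to which the second SCBCs are to be written; determines, according to the first error detection code, whether there is a first SCBC error; and, when there is none, calculates a second error detection code for the X second SCBCs, calculates a second error correction code according to the first error correction code, the X first SCBCs and the X second SCBCs, and instructs the read/write module to write the X second SCBCs, the second error correction code and the second error detection code into the corresponding DRAMs. If X > M/2, it instructs the read/write module to read the first error detection code and the first error correction code stored in the current memory row, together with the first SCBCs stored in the (M-X) DRAMs that are not to be written with second SCBCs; determines, according to the first error detection code, whether there is a first SCBC error; and, when there is none, calculates a second error detection code for the X second SCBCs, calculates a second error correction code according to the (M-X) first SCBCs and the X second SCBCs, and instructs the read/write module to write the X second SCBCs, the second error correction code and the second error detection code into the corresponding DRAMs.
Further, the processing module 120 is configured to: when determining that there is a first SCBC error, obtain the number of erroneous first SCBCs; and, if exactly one first SCBC is in error, instruct the read/write module to read all data of the memory row and recover the erroneous first SCBC from the error correction code of the memory row and the other, error-free SCBCs. When SCBCs in two or more DRAMs are in error, the processing module 120 reports an uncorrectable error to the host device, for example, the memory controller.
The first SCBC, first error correction code and first error detection code above can be understood as the currently stored SCBC, error correction code and error detection code; the second SCBC, second error correction code and second error detection code are the SCBC, error correction code and error detection code about to be written, which will overwrite the first SCBC, first error correction code and first error detection code.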
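In current computer systems the memory module is usually 64 bits wide. If DRAMs with a bit width N of 8 are used, the number of DRAMs used for storing SCBCs should be 64 ÷ 8 = 8. The apparatus of this embodiment is described in further detail below, taking M = N = 8 as an example:
When M and N are both 8, the data encoding structure of the memory module is shown in FIG. 2, where the 10 columns represent the first to the (M+2)th DRAMs. The (M+2)th DRAM, the last column in the figure, stores the error detection codes; in each row, one of the first to (M+1)th DRAMs stores the error correction code and the other eight DRAMs store SCBCs. Assuming that the burst length of each DRAM is also 8, each SCBC contains 8 bits × 8 = 64 bits of data. In the figure, Data (D) denotes the accessed data, Parity (P) the error correction code, and Checksum (C) the error detection code. The SCBCs stored in the same memory row are denoted D0 to D7, the error correction codes stored in (M+1) consecutive memory rows are denoted P0 to P8, and the error detection codes stored in (M+1) consecutive memory rows are denoted C0 to C8. Broadly speaking, the accessed data, the error correction codes and the error detection codes are all SCBCs, but herein the term SCBC refers specifically to the accessed data (D).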
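The fault-tolerant code in this embodiment consists of two parts: an error detection code and an error correction code. Calculating the error detection code: a checksum is calculated for each of the 8 SCBCs in the same memory row in turn, using either 8-bit modulo-2 addition or an 8-bit ones' complement sum, so each SCBC yields an 8-bit checksum and the 8 SCBCs in a row yield a 64-bit checksum, which is stored as the error detection code at the corresponding position of the first memory row in the dedicated (M+2)th DRAM chip. Calculating the error correction code: the corresponding bits of the 8 SCBCs in the same memory row are XORed, producing a 64-bit parity code, namely the error correction code P; for example, the bitwise XOR of D0 to D7 gives the error correction code P0, which is stored in the same row of the (M+1)th DRAM. Note that the error correction codes of consecutive memory rows are stored in different DRAMs, striped across the data areas of the first to (M+1)th DRAMs, for use in error correction.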
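The two codes can be sketched in Python as follows, assuming 64-bit SCBCs (M = N = Q = 8) represented as 8-byte `bytes` objects. All function names are hypothetical; the per-SCBC checksum may be either of the two variants the text names, and the ones' complement variant here returns the plain sum without a final inversion.

```python
from functools import reduce


def checksum_mod2(scbc: bytes) -> int:
    """8-bit modulo-2 sum: XOR of all bytes of one SCBC."""
    return reduce(lambda a, b: a ^ b, scbc, 0)


def checksum_ones_complement(scbc: bytes) -> int:
    """8-bit ones' complement sum: add bytes, folding each carry back in."""
    total = 0
    for byte in scbc:
        total += byte
        total = (total & 0xFF) + (total >> 8)  # end-around carry
    return total


def row_checksum(scbcs: list, checksum=checksum_mod2) -> bytes:
    """Error detection code C: one checksum byte per SCBC (8 SCBCs -> 64 bits)."""
    return bytes(checksum(d) for d in scbcs)


def row_parity(scbcs: list) -> bytes:
    """Error correction code P: XOR of the corresponding bits of D0..D7."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*scbcs))


# One illustrative memory row D0..D7 (arbitrary test data).
row = [bytes([(17 * i + j) & 0xFF for j in range(8)]) for i in range(8)]
C0, P0 = row_checksum(row), row_parity(row)
```

Because P0 is a plain XOR of D0 to D7, XORing P0 with all eight SCBCs yields zero, which is the property both the write and recovery paths rely on.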
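The error correction codes are striped for the following reasons. If the error correction codes P of all memory rows were kept on the same DRAM chip, every consecutive write would need to update P at the corresponding position on that chip, making it an access hotspot, increasing the delay between two write operations and degrading update performance. In addition, writing P requires pre-reading data stored in the memory row, which degrades performance further. Striping the error correction codes solves these problems.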
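The text fixes only two properties of the striping: P0 of the first row sits in the (M+1)th DRAM, and (M+1) consecutive rows keep their parity on different DRAMs. The sketch below assumes one simple rotation that satisfies both; the formula and the name `row_layout` are hypothetical, not the patent's mapping.

```python
def row_layout(row: int, m_data: int = 8) -> dict:
    """Assign the chips (1-indexed) of one memory row.

    Assumed rotation (hypothetical): row r keeps its parity in chip
    (M+1) - (r mod (M+1)), so row 0 uses chip M+1 as P0 does in FIG. 2 and
    any M+1 consecutive rows put their parity on M+1 different chips.
    """
    n_rot = m_data + 1                      # chips 1..M+1 hold data and parity
    parity_chip = n_rot - (row % n_rot)
    data_chips = [c for c in range(1, n_rot + 1) if c != parity_chip]
    return {"data": data_chips, "parity": parity_chip, "checksum": m_data + 2}


for r in range(3):
    print(r, row_layout(r)["parity"])      # parity moves from chip 9 to 8 to 7
```

Any permutation with the same "different chip in every window of M+1 rows" property would serve equally well.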
The detailed configuration parameters of the fault-tolerant code are shown in Table 1.
表 1
[Table 1 image: detailed configuration parameters of the fault-tolerant code]
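For different memory access requests, the apparatus of this embodiment performs three processing flows: read processing, write processing and error recovery processing. Each is described in detail below:
A. Read processing
When a read request arrives, the processing module activates the corresponding DRAM chips according to the read granularity and reads the SCBC data, together with the error detection code of the current memory row, and checks the read SCBC data. If there is no SCBC error, reading continues and the data read is returned; if SCBC errors occur on one and only one DRAM chip, error recovery processing is performed; if SCBC errors occur on multiple DRAM chips, an uncorrectable error is reported to the host device. The access granularity refers to the number of SCBCs to be accessed.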
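For example, the read request RD(D0, D1, D2) requests 3 SCBCs, D0 to D2, 24 bytes of data in total. It needs to activate 4 DRAM chips to read D0 to D2 and C0, and D0 to D2 are checked against the error detection codes in the first 3 bytes of C0. If any SCBC is in error, error recovery processing is entered.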
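The three read outcomes can be sketched as a small classifier, assuming the 8-bit modulo-2 checksum variant and 8-byte SCBCs; `check_read` and its return labels are hypothetical names chosen for this illustration.

```python
from functools import reduce


def checksum_mod2(scbc: bytes) -> int:
    """Assumed per-SCBC checksum: XOR of the SCBC's bytes."""
    return reduce(lambda a, b: a ^ b, scbc, 0)


def check_read(requested: dict, c_row: bytes) -> tuple:
    """Classify a read. `requested` maps SCBC index -> data; c_row[i] checks SCBC i."""
    bad = [i for i, d in sorted(requested.items()) if checksum_mod2(d) != c_row[i]]
    if not bad:
        return ("ok", bad)             # return the data to the requester
    if len(bad) == 1:
        return ("recover", bad)        # read the whole row and rebuild from P
    return ("uncorrectable", bad)      # report to the host, e.g. the memory controller


# RD(D0, D1, D2): only D0..D2 and C0 are read, i.e. 4 chips are activated.
row = [bytes([i + 1]) + bytes(7) for i in range(8)]
C0 = bytes(checksum_mod2(d) for d in row)
print(check_read({0: row[0], 1: row[1], 2: row[2]}, C0))   # ('ok', [])
```

Only the checksum bytes for the requested SCBCs are consulted, which is what lets a fine-grained read activate just the data chips it needs plus the checksum chip.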
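B. Write processing
When a write request arrives, the processing module first determines the access granularity of the write request, that is, the number of SCBCs to be accessed.
If the access granularity of the write request, that is, the number X of SCBCs to be written, is less than or equal to M/2, the original SCBC data at the positions to be written is read, together with the row's original error correction code and original error detection code, and the original SCBC data read is checked with the original error detection code. If exactly one SCBC is in error, error recovery processing is performed; if there is no error, the original error correction code P is XORed with the original SCBC data read and the new SCBC data to be written, giving the updated error correction code P', while a new error detection code C' is calculated for the new SCBC data, and the read/write module is instructed to write the new SCBCs, the new error correction code and the new error detection code into the corresponding positions of the corresponding DRAMs.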
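For example, on receiving the write request WR(D0', D1', D2'), which requests writing D0', D1' and D2' (3 SCBCs, 24 bytes in total) at the positions of D0, D1 and D2, the processing module determines that the write granularity of 3 SCBCs is less than half of the 8 SCBCs in the row. It therefore first reads the original data D0, D1 and D2 together with the check codes P0 and C0 of the same row and checks them against C0. If they are correct, it calculates the new P0' = P0 ⊕ D0 ⊕ D1 ⊕ D2 ⊕ D0' ⊕ D1' ⊕ D2', calculates the checksums of D0', D1' and D2' to update C0 to C0', and finally writes D0', D1', D2', P0' and C0' together.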
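The small-write parity update P0' = P0 ⊕ D0 ⊕ D1 ⊕ D2 ⊕ D0' ⊕ D1' ⊕ D2' can be checked numerically. This sketch assumes 8-byte SCBCs and uses hypothetical helper names; it confirms that the incremental update equals recomputing the parity over the whole updated row.

```python
from functools import reduce


def xor_bytes(*blocks: bytes) -> bytes:
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def small_write_parity(p_old: bytes, old_scbcs: list, new_scbcs: list) -> bytes:
    """X <= M/2: P' = P ^ (old SCBCs being overwritten) ^ (new SCBCs)."""
    return xor_bytes(p_old, *old_scbcs, *new_scbcs)


row = [bytes([(37 * i + j) & 0xFF for j in range(8)]) for i in range(8)]
P0 = xor_bytes(*row)
new = [bytes([0xA0 + i] * 8) for i in range(3)]          # D0', D1', D2'
P0_new = small_write_parity(P0, row[:3], new)
assert P0_new == xor_bytes(*new, *row[3:])                 # same as recomputing the row
```

XORing the old value out and the new value in works because every SCBC appears in P exactly once, so D3 to D7 never need to be read.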
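In the other case, if the access granularity of the write request, that is, the number X of SCBCs to be written, is greater than M/2, the SCBCs of the memory row that should not be modified by this write are read, together with the row's error correction code P and error detection code C. The original data read is checked with C; if there is an error, error recovery processing is entered. If not, the data to be written is XORed with the unmodified data just read from the row to obtain the updated error correction code P', an error detection code is calculated for the data being written, and both are written into the corresponding positions of the corresponding DRAMs.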
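For example, on receiving the write request WR(D0', D1', D2', D3', D4', D5'), which requests writing D0' to D5' (6 SCBCs, 48 bytes in total) at the positions of D0 to D5, the processing module determines that the write granularity of 6 SCBCs is greater than half of the 8 SCBCs in the row. It therefore first reads D6 and D7, which are unchanged by the write, together with the check codes P0 and C0 of the same row, and checks D6 and D7 against C0. If they are correct, it calculates the new P0' = D0' ⊕ D1' ⊕ D2' ⊕ D3' ⊕ D4' ⊕ D5' ⊕ D6 ⊕ D7, calculates the checksums of D0' to D5' to update C0 to C0', and finally writes D0' to D5', P0' and C0' together.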
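For the large-write case, the new parity is rebuilt from the six new SCBCs and the two unchanged ones. A sketch under the same 8-byte-SCBC assumption (helper names hypothetical): the old P0 does not enter this calculation, although the text still reads it together with C0, which the error-recovery path would need if D6 or D7 failed its check.

```python
from functools import reduce


def xor_bytes(*blocks: bytes) -> bytes:
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def large_write_parity(new_scbcs: list, unchanged_scbcs: list) -> bytes:
    """X > M/2: P' is rebuilt from the X new and the M-X unchanged SCBCs only."""
    return xor_bytes(*new_scbcs, *unchanged_scbcs)


row = [bytes([(37 * i + j) & 0xFF for j in range(8)]) for i in range(8)]
new = [bytes([0xB0 + i] * 8) for i in range(6)]          # D0' .. D5'
P0_new = large_write_parity(new, row[6:])                  # uses D6, D7 only
```

The result matches what the small-write formula would produce from the old P0, but only two old SCBCs have to be read instead of six.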
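In summary, each write is preceded by one read of at most half the row's data width, after which the error correction code and error detection code are updated and everything is written back together. Here, the original SCBC data, original error correction code and original error detection code are those previously stored in the memory module, that is, the first SCBC data, first error correction code and first error detection code described above; the new SCBC data, new error correction code and new error detection code are the updated ones, that is, the second SCBC data, second error correction code and second error detection code described above.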
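The choice between the two write paths, and the "at most half the row" bound on the pre-read, can be sketched as below. The function `plan_write` and its strategy labels are hypothetical; P and C are read in both cases and are not listed.

```python
def plan_write(targets, m_data: int = 8) -> tuple:
    """Choose which stored SCBCs to pre-read before a write (P and C are always read).

    Returns the strategy and the SCBC indices to read; either way at most M/2.
    """
    targets = sorted(set(targets))
    if len(targets) <= m_data // 2:
        return ("update-parity", targets)          # read the X old SCBCs
    untouched = [i for i in range(m_data) if i not in targets]
    return ("rebuild-parity", untouched)           # read the M-X unchanged SCBCs


print(plan_write([0, 1, 2]))          # ('update-parity', [0, 1, 2])
print(plan_write(range(6)))           # ('rebuild-parity', [6, 7])
```

A full-row write (X = M) needs no data pre-read at all, since the new parity depends only on the new SCBCs.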
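C. Error recovery processing
When the error detection code reveals a data error, the number of erroneous SCBCs is also obtained. If the multiple erroneous bits all fall within a single SCBC, that is, only a single DRAM chip is in error at a time, the apparatus of this embodiment performs error recovery processing. First, all data of the memory row containing the erroneous SCBC is read, including the error-free SCBCs, the error detection code C and the error correction code P. The other data in the row is checked with C; if it is correct, all the other correct SCBCs are XORed with P to recover the complete correct data block, and finally the correct data is written back.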
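For example, if D0 is in error and needs to be recovered, D1 to D7, P0 and C0 of the same memory row are read first, and D1 to D7 are checked with C0. If any other SCBC is also in error, an uncorrectable error is reported to the device; if D1 to D7 are correct, D0' = D1 ⊕ D2 ⊕ D3 ⊕ D4 ⊕ D5 ⊕ D6 ⊕ D7 ⊕ P0 is calculated to obtain the correct D0', and finally the correct data D0' is written, which completes the recovery.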
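The recovery step D0' = D1 ⊕ ... ⊕ D7 ⊕ P0 can be sketched as follows, assuming 8-byte SCBCs and a hypothetical `recover_scbc` helper. It is only valid after C has confirmed that every surviving SCBC is error-free.

```python
from functools import reduce


def xor_bytes(*blocks: bytes) -> bytes:
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def recover_scbc(parity: bytes, survivors: list) -> bytes:
    """Rebuild the one failed SCBC as P XOR (the other M-1 SCBCs).

    Only valid once C has confirmed that every survivor is error-free.
    """
    return xor_bytes(parity, *survivors)


row = [bytes([(53 * i + 3 * j) & 0xFF for j in range(8)]) for i in range(8)]
P0 = xor_bytes(*row)
D0_fixed = recover_scbc(P0, row[1:])        # D0' = D1 ^ ... ^ D7 ^ P0
assert D0_fixed == row[0]
```

Because the whole SCBC is rebuilt from parity, any number of bit errors within that one chip's burst is corrected, which is the single-chip multi-bit tolerance the text claims.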
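The specific configuration of the technical solutions in the embodiments of the present invention is flexible: the encoding granularity can be set for different architectures by adjusting the chip bit width, the chip burst length and the number of burst-length groups. For a finer-grained fault-tolerant code, x4 chips with a burst length of 4 can be used, reducing the SCBC size to 16 bits. For a coarser-grained code, x16 chips can be used with two groups of burst length 8, reading at least 256 bits at a time over two read/write cycles. In some implementations, x8 chips with a burst length of 4 can be used, reducing the SCBC size to 32 bits, or x16 chips with a burst length of 8 can be used, giving an SCBC size of 128 bits, or other configurations. In other implementations, any other combination of bit width and burst length may be chosen, where the bit width is generally a power of 2.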
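The SCBC sizes quoted above follow from N × Q (times the number of burst groups). A short check, using a hypothetical `scbc_bits` helper and the configurations named in the text plus the x8, BL8 case of FIG. 2:

```python
def scbc_bits(chip_width: int, burst_length: int, burst_groups: int = 1) -> int:
    """SCBC size = chip bit width N x burst length Q (x number of burst groups)."""
    return chip_width * burst_length * burst_groups


CONFIGS = {                      # configurations named in the text
    "x4, BL4": (4, 4),
    "x8, BL4": (8, 4),
    "x8, BL8": (8, 8),
    "x16, BL8": (16, 8),
    "x16, 2 groups of BL8": (16, 8, 2),
}
for name, cfg in CONFIGS.items():
    print(f"{name}: {scbc_bits(*cfg)} bits")
```

This prints 16, 32, 64, 128 and 256 bits respectively.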
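In summary, this embodiment of the present invention provides a memory access apparatus for a message-type memory module. The apparatus uses the SCBC as the basic unit of reading and writing for fine-grained coding protection but places no particular limit on the size of the SCBC, that is, on the bit width and burst length of the DRAM memory chips, so it can support variable-granularity memory access and a variable number of DRAM accesses. Moreover, storing the error detection codes in a dedicated DRAM enables error detection at different access granularities, and the error detection code together with the error correction code can correct any multi-bit error within a single DRAM, making this a low-power, high-reliability solution.
The technical solution of this embodiment can tolerate any multiple errors within a single chip. Compared with Chipkill, it reduces the number of chips activated per read or write by 44% to 83% (16 to 30 out of 36), and fewer activated chips means a proportional reduction in dynamic power. This patent thus achieves Chipkill-level fault tolerance with lower power overhead, without Chipkill's restriction on chip bit width, which makes capacity expansion easier. In addition, being based on a message-type memory design, this patent provides fine-grained protection with variable fault-tolerance granularity, leaving more room for optimization in higher-level architecture design.
Embodiment 2
Referring to FIG. 3, an embodiment of the present invention further provides a memory access method for a message-type memory module. The method is executed by the peripheral control circuit of the memory module or by the memory controller, specifically by the memory access apparatus of Embodiment 1 deployed in the peripheral control circuit or the memory controller. The memory module includes (M+2) dynamic random access memories (DRAMs), where M is equal to 2 to the power of m and m is a positive integer; the data in each DRAM that can be accessed in one read/write cycle is called a single-chip burst cluster (SCBC), and the set of data in all DRAMs that can be accessed in the same read/write cycle forms a memory row.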
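The method includes:
210. Store the SCBCs to be stored in the current read/write cycle into the corresponding DRAMs in the current memory row, where the DRAMs used for storing SCBCs do not include the (M+2)th DRAM.
220. Calculate a set of error detection codes for each SCBC in a memory row, and a set of error correction codes for all SCBCs in a memory row.
230. Store the error detection codes calculated for a memory row in the (M+2)th DRAM of that memory row, and store the error correction code calculated for a memory row in the Zth DRAM of that memory row, where Z is a positive integer and 1 ≤ Z ≤ (M+1), the error correction codes of (M+1) consecutive memory rows being stored in different DRAMs.
Optionally, the method further includes:
upon receiving a read request, reading the required SCBCs and the corresponding error detection codes from the current memory row, checking the read SCBCs, and determining, according to the error detection codes, whether there is an SCBC error; when there is an SCBC error, obtaining the number of erroneous SCBCs; and, if exactly one SCBC is in error, reading all data of the memory row and recovering the erroneous SCBC from the error correction code of the memory row and the other, error-free SCBCs.
Optionally, the method further includes:
upon receiving a write request, determining whether the number X of second SCBCs to be written is less than or equal to M/2;
if X ≤ M/2, reading the first error detection code and the first error correction code stored in the current memory row, together with the X first SCBCs stored in the X DRAMs to which the second SCBCs are to be written; determining, according to the first error detection code, whether there is a first SCBC error; and, when there is none, calculating a second error detection code for the X second SCBCs, calculating a second error correction code according to the first error correction code, the X first SCBCs and the X second SCBCs, and writing the X second SCBCs, the second error correction code and the second error detection code into the corresponding DRAMs;
if X > M/2, reading the first error detection code and the first error correction code stored in the current memory row, together with the first SCBCs stored in the (M-X) DRAMs that are not to be written with second SCBCs; determining, according to the first error detection code, whether there is a first SCBC error; and, when there is none, calculating a second error detection code for the X second SCBCs, calculating a second error correction code according to the (M-X) first SCBCs and the X second SCBCs, and writing the X second SCBCs, the second error correction code and the second error detection code into the corresponding DRAMs.
Optionally, when there is a first SCBC error, the method further includes:
obtaining the number of erroneous first SCBCs, and, if exactly one first SCBC is in error, reading all data of the memory row and recovering the erroneous first SCBC from the error correction code of the memory row and the other, error-free SCBCs.
In summary, this embodiment of the present invention provides a memory access method for a message-type memory module. The method uses the SCBC as the basic unit of reading and writing for fine-grained coding protection but places no particular limit on the size of the SCBC, that is, on the bit width and burst length of the DRAM memory chips, so it can support variable-granularity memory access and a variable number of DRAM accesses. Moreover, storing the error detection codes in a dedicated DRAM enables error detection at different access granularities, and the error detection code together with the error correction code can correct any multi-bit error within a single DRAM, making this a low-power, high-reliability solution.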
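Embodiment 3
An embodiment of the present invention further provides a memory control system.
In one implementation, as shown in FIG. 4a, the system includes a message-type memory module 310, and the memory module 310 includes a peripheral control circuit 3101 and (M+2) DRAMs 3102;
where M is equal to 2 to the power of m and m is a positive integer; the data in each DRAM that can be accessed in one read/write cycle is called a single-chip burst cluster (SCBC), and the set of data in all DRAMs that can be accessed in the same read/write cycle forms a memory row. The memory controller may be integrated on the computer motherboard or in the CPU of the computer.
The peripheral control circuit performs the following steps: storing the SCBCs to be stored in the current read/write cycle into the corresponding DRAMs in the current memory row, where the DRAMs used for storing SCBCs do not include the (M+2)th DRAM; calculating a set of error detection codes for each SCBC in a memory row, and a set of error correction codes for all SCBCs in a memory row; and storing the error detection codes calculated for a memory row in the (M+2)th DRAM of that memory row and the error correction code calculated for a memory row in the Zth DRAM of that memory row, where Z is a positive integer and 1 ≤ Z ≤ (M+1), the error correction codes of (M+1) consecutive memory rows being stored in different DRAMs.
Optionally, the peripheral control circuit further performs the following steps:
upon receiving a read request, reading the required SCBCs and the corresponding error detection codes from the current memory row, checking the read SCBCs, and determining, according to the error detection codes, whether there is an SCBC error; when there is an SCBC error, obtaining the number of erroneous SCBCs; and, if exactly one SCBC is in error, reading all data of the memory row and recovering the erroneous SCBC from the error correction code of the memory row and the other, error-free SCBCs.
Optionally, the peripheral control circuit further performs the following steps:
upon receiving a write request, determining whether the number X of second SCBCs to be written is less than or equal to M/2; if X ≤ M/2, reading the first error detection code and the first error correction code stored in the current memory row, together with the X first SCBCs stored in the X DRAMs to which the second SCBCs are to be written, determining, according to the first error detection code, whether there is a first SCBC error, and, when there is none, calculating a second error detection code for the X second SCBCs, calculating a second error correction code according to the first error correction code, the X first SCBCs and the X second SCBCs, and writing the X second SCBCs, the second error correction code and the second error detection code into the corresponding DRAMs; if X > M/2, reading the first error detection code and the first error correction code stored in the current memory row, together with the first SCBCs stored in the (M-X) DRAMs that are not to be written with second SCBCs, determining, according to the first error detection code, whether there is a first SCBC error, and, when there is none, calculating a second error detection code for the X second SCBCs, calculating a second error correction code according to the (M-X) first SCBCs and the X second SCBCs, and writing the X second SCBCs, the second error correction code and the second error detection code into the corresponding DRAMs.
When there is a first SCBC error, the number of erroneous first SCBCs is also obtained; if exactly one first SCBC is in error, all data of the memory row is read, and the erroneous first SCBC is recovered from the error correction code of the memory row and the other, error-free SCBCs.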
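In another implementation, as shown in FIG. 4b, the system includes a message-type memory module 310 and a memory controller 320, and the memory module 310 includes (M+2) DRAMs 3102;
where M is equal to 2 to the power of m and m is a positive integer; the data in each DRAM that can be accessed in one read/write cycle is called a single-chip burst cluster (SCBC), and the set of data in all DRAMs that can be accessed in the same read/write cycle forms a memory row. The memory controller may be integrated on the computer motherboard or in the CPU of the computer.
The memory controller performs the following steps:
storing the SCBCs to be stored in the current read/write cycle into the corresponding DRAMs in the current memory row, where the DRAMs used for storing SCBCs do not include the (M+2)th DRAM; calculating a set of error detection codes for each SCBC in a memory row, and a set of error correction codes for all SCBCs in a memory row; and storing the error detection codes calculated for a memory row in the (M+2)th DRAM of that memory row and the error correction code calculated for a memory row in the Zth DRAM of that memory row, where Z is a positive integer and 1 ≤ Z ≤ (M+1), the error correction codes of (M+1) consecutive memory rows being stored in different DRAMs.
Optionally, the memory controller further performs the following steps:
upon receiving a read request, reading the required SCBCs and the corresponding error detection codes from the current memory row, checking the read SCBCs, and determining, according to the error detection codes, whether there is an SCBC error; when there is an SCBC error, obtaining the number of erroneous SCBCs; and, if exactly one SCBC is in error, reading all data of the memory row and recovering the erroneous SCBC from the error correction code of the memory row and the other, error-free SCBCs.
Optionally, the memory controller further performs the following steps:
upon receiving a write request, determining whether the number X of second SCBCs to be written is less than or equal to M/2; if X ≤ M/2, reading the first error detection code and the first error correction code stored in the current memory row, together with the X first SCBCs stored in the X DRAMs to which the second SCBCs are to be written, determining, according to the first error detection code, whether there is a first SCBC error, and, when there is none, calculating a second error detection code for the X second SCBCs, calculating a second error correction code according to the first error correction code, the X first SCBCs and the X second SCBCs, and writing the X second SCBCs, the second error correction code and the second error detection code into the corresponding DRAMs; if X > M/2, reading the first error detection code and the first error correction code stored in the current memory row, together with the first SCBCs stored in the (M-X) DRAMs that are not to be written with second SCBCs, determining, according to the first error detection code, whether there is a first SCBC error, and, when there is none, calculating a second error detection code for the X second SCBCs, calculating a second error correction code according to the (M-X) first SCBCs and the X second SCBCs, and writing the X second SCBCs, the second error correction code and the second error detection code into the corresponding DRAMs.
When there is a first SCBC error, the number of erroneous first SCBCs is also obtained; if exactly one first SCBC is in error, all data of the memory row is read, and the erroneous first SCBC is recovered from the error correction code of the memory row and the other, error-free SCBCs.
In summary, this embodiment of the present invention provides a memory control system whose memory module uses the SCBC as the basic unit of reading and writing for fine-grained coding protection but places no particular limit on the size of the SCBC, that is, on the bit width and burst length of the DRAM memory chips, so it can support variable-granularity memory access and a variable number of DRAM accesses. Moreover, storing the error detection codes in a dedicated DRAM enables error detection at different access granularities, and the error detection code together with the error correction code can correct any multi-bit error within a single DRAM, making this a low-power, high-reliability solution.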
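A person skilled in the art may clearly understand that, for convenience and brevity of description, only the division into the functional modules above is used as an example; in practical applications, the functions may be allocated to different functional modules as needed, that is, the internal structure of the apparatus may be divided into different functional modules to complete all or some of the functions described above. For the specific working processes of the system, apparatus and units described above, refer to the corresponding processes in the foregoing method embodiments; details are not repeated here.
In the embodiments provided in this application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative; the division into modules or units is merely a division by logical function, and other divisions are possible in an actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses or units, and may be electrical, mechanical or in other forms. Components shown as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of this application essentially, or the part contributing to the prior art, or all or part of the technical solutions, may be implemented in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for instructing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform all or some of the steps of the methods described in the embodiments of this application. The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
A person of ordinary skill in the art may understand that all or some of the steps of the methods in the foregoing embodiments may be completed by hardware, or by hardware instructed by a program. The program may be stored in a computer-readable storage medium, and the storage medium may include a read-only memory, a random access memory, a magnetic disk, an optical disc, or the like.
The memory access method and apparatus for a message-type memory module provided in the embodiments of the present invention are described in detail above. However, the descriptions of the foregoing embodiments are merely intended to help understand the method and core idea of the present invention and shall not be construed as a limitation on the present invention. Any variation or replacement readily figured out by a person skilled in the art within the technical scope disclosed in the present invention shall fall within the protection scope of the present invention.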

Claims

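1. A memory access apparatus for a message-type memory module, wherein:
the memory module includes (M+2) dynamic random access memories (DRAMs), where M is equal to 2 to the power of m and m is a positive integer; the data in each DRAM that can be accessed in one read/write cycle is called a single-chip burst cluster (SCBC), and the set of data in all DRAMs that can be accessed in the same read/write cycle forms a memory row;
the apparatus includes:
a read/write module, configured to store the SCBCs to be stored in the current read/write cycle into the corresponding DRAMs in the current memory row, where the DRAMs used for storing SCBCs do not include the (M+2)th DRAM;
a processing module, configured to calculate a set of error detection codes for each SCBC in a memory row, and a set of error correction codes for all SCBCs in a memory row;
the read/write module is further configured to store the error detection codes calculated for a memory row in the (M+2)th DRAM of that memory row, and to store the error correction code calculated for a memory row in the Zth DRAM of that memory row, where Z is a positive integer and 1 ≤ Z ≤ (M+1), the error correction codes of (M+1) consecutive memory rows being stored in different DRAMs.
2. The apparatus according to claim 1, wherein:
the processing module is further configured to: upon receiving a read request, instruct the read/write module to read the required SCBCs and the corresponding error detection codes from the current memory row; check the read SCBCs according to the error detection codes to determine whether there is an SCBC error; when there is an SCBC error, obtain the number of erroneous SCBCs; and, if exactly one SCBC is in error, instruct the read/write module to read all data of the memory row and recover the erroneous SCBC from the error correction code of the memory row and the other, error-free SCBCs.
3. The apparatus according to claim 1, wherein:
the processing module is further configured to, upon receiving a write request, first determine whether the number X of second SCBCs to be written is less than or equal to M/2;
if X ≤ M/2, instruct the read/write module to read the first error detection code and the first error correction code stored in the current memory row, together with the X first SCBCs stored in the X DRAMs to which the second SCBCs are to be written; determine, according to the first error detection code, whether there is a first SCBC error; and, when there is none, calculate a second error detection code for the X second SCBCs, calculate a second error correction code according to the first error correction code, the X first SCBCs and the X second SCBCs, and instruct the read/write module to write the X second SCBCs, the second error correction code and the second error detection code into the corresponding DRAMs;
if X > M/2, instruct the read/write module to read the first error detection code and the first error correction code stored in the current memory row, together with the first SCBCs stored in the (M-X) DRAMs that are not to be written with second SCBCs; determine, according to the first error detection code, whether there is a first SCBC error; and, when there is none, calculate a second error detection code for the X second SCBCs, calculate a second error correction code according to the (M-X) first SCBCs and the X second SCBCs, and instruct the read/write module to write the X second SCBCs, the second error correction code and the second error detection code into the corresponding DRAMs.
4. The apparatus according to claim 3, wherein:
the processing module is further configured to: when determining that there is a first SCBC error, obtain the number of erroneous first SCBCs; and, if exactly one first SCBC is in error, instruct the read/write module to read all data of the memory row and recover the erroneous first SCBC from the error correction code of the memory row and the other, error-free SCBCs.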
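5. A memory access method for a message-type memory module, wherein:
the memory module includes (M+2) dynamic random access memories (DRAMs), where M is equal to 2 to the power of m and m is a positive integer; the data in each DRAM that can be accessed in one read/write cycle is called a single-chip burst cluster (SCBC), and the set of data in all DRAMs that can be accessed in the same read/write cycle forms a memory row;
the method includes:
storing the SCBCs to be stored in the current read/write cycle into the corresponding DRAMs in the current memory row, where the DRAMs used for storing SCBCs do not include the (M+2)th DRAM;
calculating a set of error detection codes for each SCBC in a memory row, and a set of error correction codes for all SCBCs in a memory row;
storing the error detection codes calculated for a memory row in the (M+2)th DRAM of that memory row, and storing the error correction code calculated for a memory row in the Zth DRAM of that memory row, where Z is a positive integer and 1 ≤ Z ≤ (M+1), the error correction codes of (M+1) consecutive memory rows being stored in different DRAMs.
6. The method according to claim 5, further including:
upon receiving a read request, reading the required SCBCs and the corresponding error detection codes from the current memory row; checking the read SCBCs according to the error detection codes to determine whether there is an SCBC error; when there is an SCBC error, obtaining the number of erroneous SCBCs; and, if exactly one SCBC is in error, reading all data of the memory row and recovering the erroneous SCBC from the error correction code of the memory row and the other, error-free SCBCs.
7. The method according to claim 5, further including:
upon receiving a write request, determining whether the number X of second SCBCs to be written is less than or equal to M/2; if X ≤ M/2, reading the first error detection code and the first error correction code stored in the current memory row, together with the X first SCBCs stored in the X DRAMs to which the second SCBCs are to be written; determining, according to the first error detection code, whether there is a first SCBC error; and, when there is none, calculating a second error detection code for the X second SCBCs, calculating a second error correction code according to the first error correction code, the X first SCBCs and the X second SCBCs, and writing the X second SCBCs, the second error correction code and the second error detection code into the corresponding DRAMs;
if X > M/2, reading the first error detection code and the first error correction code stored in the current memory row, together with the first SCBCs stored in the (M-X) DRAMs that are not to be written with second SCBCs; determining, according to the first error detection code, whether there is a first SCBC error; and, when there is none, calculating a second error detection code for the X second SCBCs, calculating a second error correction code according to the (M-X) first SCBCs and the X second SCBCs, and writing the X second SCBCs, the second error correction code and the second error detection code into the corresponding DRAMs.
8. The method according to claim 7, wherein after the determining whether there is an SCBC error, the method further includes:
obtaining the number of erroneous first SCBCs, and, if exactly one first SCBC is in error, reading all data of the memory row and recovering the erroneous first SCBC from the error correction code of the memory row and the other, error-free SCBCs.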
PCT/CN2014/083464 2013-07-31 2014-07-31 Memory access method and apparatus for message-type memory module WO2015014301A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP14832194.6A EP3015986B1 (en) 2013-07-31 2014-07-31 Access method and device for message-type memory module
KR1020167003393A KR101837318B1 (ko) 2013-07-31 2014-07-31 Access method and device for message-type memory module
US15/010,326 US9811416B2 (en) 2013-07-31 2016-01-29 Memory access method and apparatus for message-type memory module

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201310330220.6A CN104347122B (zh) 2013-07-31 2013-07-31 Memory access method and apparatus for message-type memory module
CN201310330220.6 2013-07-31

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US15/010,326 Continuation US9811416B2 (en) 2013-07-31 2016-01-29 Memory access method and apparatus for message-type memory module

Publications (1)

Publication Number Publication Date
WO2015014301A1 true WO2015014301A1 (zh) 2015-02-05

Family

ID=52431017

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/083464 WO2015014301A1 (zh) 2013-07-31 2014-07-31 Memory access method and apparatus for message-type memory module

Country Status (5)

Country Link
US (1) US9811416B2 (zh)
EP (1) EP3015986B1 (zh)
KR (1) KR101837318B1 (zh)
CN (1) CN104347122B (zh)
WO (1) WO2015014301A1 (zh)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
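CN102203740A (zh) * 2011-05-27 2011-09-28 华为技术有限公司 Data processing method, apparatus and system
CN102456394A (zh) * 2010-10-20 2012-05-16 三星电子株式会社 Memory circuits, systems and modules for performing DRAM refresh operations, and methods of operating the same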
WO2012081732A1 (en) * 2010-12-15 2012-06-21 Kabushiki Kaisha Toshiba Semiconductor storage device and method of controlling the same

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6785785B2 (en) 2000-01-25 2004-08-31 Hewlett-Packard Development Company, L.P. Method for supporting multi-level stripping of non-homogeneous memory to maximize concurrency
US6996766B2 (en) * 2002-06-28 2006-02-07 Sun Microsystems, Inc. Error detection/correction code which detects and corrects a first failing component and optionally a second failing component
US6973613B2 (en) * 2002-06-28 2005-12-06 Sun Microsystems, Inc. Error detection/correction code which detects and corrects component failure and which provides single bit error correction subsequent to component failure
US8041990B2 (en) * 2007-06-28 2011-10-18 International Business Machines Corporation System and method for error correction and detection in a memory system
US8041989B2 (en) * 2007-06-28 2011-10-18 International Business Machines Corporation System and method for providing a high fault tolerant memory system
US20100017650A1 (en) * 2008-07-19 2010-01-21 Nanostar Corporation, U.S.A Non-volatile memory data storage system with reliability management
US8234520B2 (en) * 2009-09-16 2012-07-31 International Business Machines Corporation Wear leveling of solid state disks based on usage information of data and parity received from a raid controller
WO2011044515A2 (en) 2009-10-09 2011-04-14 Violin Memory, Inc. Memory system with multiple striping of raid groups and method for performing the same
CN102812518B (zh) * 2010-01-28 2015-10-21 惠普发展公司,有限责任合伙企业 存储器存取方法和装置
US9063836B2 (en) * 2010-07-26 2015-06-23 Intel Corporation Methods and apparatus to protect segments of memory
US8775868B2 (en) * 2010-09-28 2014-07-08 Pure Storage, Inc. Adaptive RAID for an SSD environment
US9535804B2 (en) * 2012-05-21 2017-01-03 Cray Inc. Resiliency to memory failures in computer systems

Patent Citations (3)

Publication number Priority date Publication date Assignee Title
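CN102456394A (zh) * 2010-10-20 2012-05-16 Samsung Electronics Co., Ltd. Memory circuits, systems and modules for performing DRAM refresh operations, and methods of operating the same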
WO2012081732A1 (en) * 2010-12-15 2012-06-21 Kabushiki Kaisha Toshiba Semiconductor storage device and method of controlling the same
CN102203740A (zh) * 2011-05-27 2011-09-28 Huawei Technologies Co., Ltd. Data processing method, apparatus and system

Non-Patent Citations (1)

Title
See also references of EP3015986A4 *

Also Published As

Publication number Publication date
EP3015986A4 (en) 2016-07-27
CN104347122B (zh) 2017-08-04
US20160147600A1 (en) 2016-05-26
CN104347122A (zh) 2015-02-11
KR101837318B1 (ko) 2018-03-09
EP3015986A1 (en) 2016-05-04
US9811416B2 (en) 2017-11-07
EP3015986B1 (en) 2017-10-04
KR20160030978A (ko) 2016-03-21

Similar Documents

Publication Publication Date Title
WO2015014301A1 (zh) Memory access method and apparatus for a message-type memory module
JP6882115B2 (ja) DRAM-assisted error correction method for a DDR SDRAM interface
US10459793B2 (en) Data reliability information in a non-volatile memory device
US8086783B2 (en) High availability memory system
KR102190683B1 (ko) Method for correcting memory data errors
KR102198611B1 (ko) Method for correcting errors in memory
US20140245097A1 (en) Codewords that span pages of memory
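TWI451257B (zh) Apparatus and method for protecting the integrity of cached data in a direct-attached storage (DAS) system
KR20140140632A (ko) Local error detection and global error correction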
US20130304970A1 (en) Systems and methods for providing high performance redundant array of independent disks in a solid-state device
US20170288705A1 (en) Shared memory with enhanced error correction
US9063869B2 (en) Method and system for storing and rebuilding data
US10606690B2 (en) Memory controller error checking process using internal memory device codes
TW201237622A (en) Semiconductor storage device and method of controlling the same
KR20170042433A (ko) RAID controller device and storage device configured to recover data having an uncorrectable ECC error
US20210359704A1 Memory-mapped two-dimensional error correction code for multi-bit error tolerance in DRAM
Kwon et al. Understanding DDR4 in pursuit of In-DRAM ECC
US7873895B2 (en) Memory subsystems with fault isolation
TW200828330A (en) Allowable bit errors per sector in memory devices
US20160139988A1 (en) Memory unit
WO2016122515A1 (en) Erasure multi-checksum error correction code
US20150200685A1 (en) Recording and reproducing device, error correction method, and control device
KR102414202B1 (ko) Error correction using a hierarchical decoder
EP3499376B1 (en) Memory system varying operation of memory controller according to internal status of memory device
US9043655B2 (en) Apparatus and control method

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 14832194

Country of ref document: EP

Kind code of ref document: A1

REEP Request for entry into the european phase

Ref document number: 2014832194

Country of ref document: EP

WWE WIPO information: entry into national phase

Ref document number: 2014832194

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 20167003393

Country of ref document: KR

Kind code of ref document: A