CN112181703B - CAM supporting a soft error retransmission mechanism between a processor and a memory board, and application method

Info

Publication number
CN112181703B
CN112181703B CN202011043590.8A CN202011043590A CN112181703B CN 112181703 B CN112181703 B CN 112181703B CN 202011043590 A CN202011043590 A CN 202011043590A CN 112181703 B CN112181703 B CN 112181703B
Authority
CN
China
Prior art keywords
read
write
record
modify
request
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011043590.8A
Other languages
Chinese (zh)
Other versions
CN112181703A (en)
Inventor
张英
王永文
王蕾
周宏伟
邓让钰
杨乾明
励楠
冯权友
曾坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN202011043590.8A
Publication of CN112181703A
Application granted
Publication of CN112181703B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793 Remedial or corrective actions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Techniques For Improving Reliability Of Storages (AREA)

Abstract

The invention discloses a CAM supporting a soft error retransmission mechanism between a processor and a memory board, and an application method. The CAM comprises: a record storage unit holding n records, each containing a valid flag, a read address, a read-modify-write dependency flag, and a retransmission count; a comparator array of n comparators, each with two inputs, one being the write address and the other the read address in the corresponding record, with the valid flag of that record as the comparator's control signal; and a read-modify-write dependency judgment module that determines from the outputs of the n comparators whether a read-modify-write dependency exists. The CAM supports a read request retransmission mechanism based on a dependency CAM, which tolerates inter-board soft errors between the main processor and the off-chip memory with the highest possible probability while affecting system performance as little as possible. The design is simple in structure, easy to implement, and non-blocking, and combines high performance with efficient tolerance of inter-board soft errors.

Description

CAM supporting a soft error retransmission mechanism between a processor and a memory board, and application method
Technical Field
The invention relates to fault-tolerant communication between a processor and memory, and in particular to a CAM (content-addressable memory) supporting a soft error retransmission mechanism between a processor and a memory board, and an application method thereof.
Background
In mainstream high-performance processor designs, the processor and the off-chip memory communicate through the mainboard. Most commonly, the processor issues memory read and write commands containing memory addresses, together with write data, according to the DDR protocol, and these are transmitted to the off-chip memory through the mainboard; the off-chip memory likewise follows the DDR protocol and returns read response data to the processor through the mainboard. The DDR protocol uses techniques such as ODT (on-die termination) to suppress crosstalk and reflection between signals. As process technology advances, however, lower supply voltages and higher clock frequencies aggravate the effect of noise sources such as particle strikes and crosstalk, which can cause transient errors, i.e. soft errors, in transmitted data. When the system operates in a harsh environment, the probability of soft errors in inter-board communication increases greatly.
To tolerate soft errors between the processor and the memory board, a read request retransmission mechanism is needed. One of the main design difficulties of such a mechanism is how to retransmit a request without violating the dependencies between read and write requests. Specifically, the data stored at the target address of the read request to be retransmitted must not have been modified by a write request between the initial read and the first retransmitted read, or between the previous retransmission and the current one. Among the memory access requests sent to memory, if a write request has the same target address as an in-flight read request, the content of that address is modified once the write request has been issued. If the data returned for that read request is then corrupted by an inter-board soft error, the read request cannot be retransmitted from a correctness standpoint, because the data read back would be the modified data rather than the intended data. Write requests can therefore make some inter-board soft errors impossible to tolerate. How to retransmit requests without violating the dependencies between read and write requests, and thereby improve the reliability of inter-board soft error tolerance, has thus become a key technical problem to be solved.
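The hazard described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the dictionary `memory`, the helper `read`, the address `0x40`, and the use of `None` to stand for an uncorrectable response are all hypothetical and are not part of the patent.

```python
# Toy model: off-chip memory as a dictionary (hypothetical, for illustration only).
memory = {0x40: "old"}

def read(addr, corrupt=False):
    """Return the stored value, or None to model an uncorrectable soft error."""
    return None if corrupt else memory[addr]

expected = memory[0x40]            # value the original read was meant to observe
first = read(0x40, corrupt=True)   # response corrupted in transit between boards
memory[0x40] = "new"               # a write to the same address is issued meanwhile
retry = read(0x40)                 # naive retransmission of the read

# The retried read succeeds, but it observes the modified data.
hazard = retry != expected
```

Here the retransmitted read returns `"new"` although the original read should have observed `"old"`, which is why a write to the same address must suppress retransmission.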
Disclosure of Invention
The technical problem to be solved by the invention is to provide a CAM supporting a soft error retransmission mechanism between a processor and a memory board, and an application method. The CAM supports a read request retransmission mechanism based on a dependency CAM, which tolerates inter-board soft errors between the main processor and the off-chip memory with the highest possible probability while affecting system performance as little as possible. It is simple in structure, easy to implement, and non-blocking, and combines high performance with efficient soft error tolerance.
To solve the above technical problem, the invention adopts the following technical solution:
A CAM supporting a soft error retransmission mechanism between a processor and a memory board, comprising:
a record storage unit comprising n records, each record containing a valid flag, a read address, a read-modify-write dependency flag, and a retransmission count;
a comparator array comprising n comparators in one-to-one correspondence with the records in the record storage unit, each comparator having two inputs, one being the write address and the other being the read address in the corresponding record, the control signal of the comparator being the valid flag in the corresponding record;
and a read-modify-write dependency judgment module for determining, from the outputs of the n comparators, whether a read-modify-write dependency exists.
Optionally, the record storage unit contains 64 records in total.
Optionally, the valid flag and the read-modify-write dependency flag are each 1 bit wide.
Optionally, the retransmission count is 8 bits wide.
In addition, the invention provides an application method of the CAM supporting a soft error retransmission mechanism between a processor and a memory board, comprising the following steps:
1) Wait for a memory access request. If the received request is a read request, go to step 2); if it is a write request, go to step 3); if it is read response data, go to step 4).
2) Create a new record in the record storage unit, fill the access address of the read request into the read address of the new record, set the valid flag of the new record to 1, its read-modify-write dependency flag to 0, and its retransmission count to 0, then go to step 1).
3) For the write address of the write request, use the comparator array and the read-modify-write dependency judgment module to determine whether the write request has a read-modify-write dependency with any record in the record storage unit whose valid flag is 1; if so, set the read-modify-write dependency flag of the corresponding record to 1. Then go to step 1).
4) Check the received read response data. If the check finds no uncorrectable error, return the correct read response data, set the valid flag of the record corresponding to the read address of the read response data to 0, and go to step 1); otherwise, go to step 5).
5) For the read request record corresponding to the read response data, determine whether the read-modify-write dependency flag is 0 and the retransmission count is smaller than a preset threshold. If so, send a retransmission request, add 1 to the retransmission count of the record corresponding to the read address of the read response data, and go to step 1); otherwise, end and exit.
Compared with the prior art, the invention has the following advantages: the CAM supports a read request retransmission mechanism based on a dependency CAM, which tolerates inter-board soft errors between the main processor and the off-chip memory with the highest possible probability while affecting system performance as little as possible. It is simple in structure, easy to implement, and non-blocking, and combines high performance with efficient tolerance of inter-board soft errors.
Drawings
FIG. 1 is a schematic diagram of a CAM according to an embodiment of the present invention.
Detailed Description
One of the main design difficulties of a read request retransmission mechanism is how to retransmit a request without violating the dependencies between read and write requests. Specifically, the data stored at the target address of the read request to be retransmitted must not have been modified by a write request between the initial read and the first retransmitted read, or between the previous retransmission and the current one. To ensure that retransmitted read requests return correct data, this embodiment provides a CAM, supporting a soft error retransmission mechanism between a processor and a memory board, that is used to detect such dependencies.
As shown in FIG. 1, the CAM (Content-Addressable Memory) of this embodiment, which supports a soft error retransmission mechanism between a processor and a memory board, comprises:
a record storage unit comprising n records, each record containing a valid flag (labeled V in FIG. 1), a read address (labeled Address in FIG. 1), a read-modify-write dependency flag (labeled WAR in FIG. 1), and a retransmission count (labeled Retry Cnt in FIG. 1);
a comparator array comprising n comparators in one-to-one correspondence with the records in the record storage unit, each comparator having two inputs, one being the write address and the other being the read address in the corresponding record, the control signal of the comparator being the valid flag in the corresponding record;
and a read-modify-write dependency judgment module for determining, from the outputs of the n comparators, whether a read-modify-write dependency exists.
As an optional implementation, the record storage unit in this embodiment contains 64 records in total, so that at most 64 read requests can be recorded at a time.
As an optional implementation, the valid flag and the read-modify-write dependency flag in this embodiment are each 1 bit wide.
As an optional implementation, the retransmission count in this embodiment is 8 bits wide.
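The structure described above can be modeled in software roughly as follows. This is a behavioral sketch, not the patented circuit: the names `Record`, `compare_all`, and `has_dependency` are hypothetical, and reducing the n comparator outputs with a logical OR is an assumption about how the judgment module combines them.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Record:
    valid: bool = False   # valid flag (V in FIG. 1)
    addr: int = 0         # read address
    war: bool = False     # read-modify-write dependency flag (WAR in FIG. 1)
    retry_cnt: int = 0    # retransmission count (Retry Cnt in FIG. 1)

def compare_all(records: List[Record], write_addr: int) -> List[bool]:
    """Comparator array: one comparison per record, gated by the valid flag."""
    return [r.valid and r.addr == write_addr for r in records]

def has_dependency(hits: List[bool]) -> bool:
    """Judgment module (assumed OR-reduction of the n comparator outputs)."""
    return any(hits)

# Four records instead of 64 to keep the example short.
records = [Record() for _ in range(4)]
records[1] = Record(valid=True, addr=0x1000)
records[2] = Record(valid=False, addr=0x2000)  # invalid entry must never match

hits = compare_all(records, 0x1000)
```

Gating each comparison with the valid flag prevents a stale address left in a freed record from raising a false dependency.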
In this embodiment, the content of each record is shown in Table 1.
Table 1: Contents of a record in the record storage unit.
Content | Purpose
Valid flag | Indicates that the record is valid, i.e. that the information stored in the record is valid.
Read address | Stores the address of the read request; it is the item used to associate the read request, write requests, and read response data.
Read-modify-write dependency flag | Stores whether a read-modify-write dependency exists, i.e. whether the contents of this address were modified by another write request before the read response data returned, that write request having already been sent to the off-chip memory.
Retransmission count | Stores the number of retransmissions.
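To make the storage cost of one record concrete, the fields of Table 1 can be packed into a single word. This is a sketch under assumptions: the source gives only the 1-bit flag widths and the 8-bit retransmission count, so the 40-bit address width (`ADDR_BITS`) and the field order chosen here are hypothetical.

```python
ADDR_BITS = 40    # assumption: the source does not state the address width
RETRY_BITS = 8    # optional width given in the text
RECORD_BITS = 1 + ADDR_BITS + 1 + RETRY_BITS  # valid + address + WAR + count

def pack_record(valid: bool, addr: int, war: bool, retry_cnt: int) -> int:
    """Pack the fields as [valid | addr | war | retry_cnt], low bits last."""
    assert 0 <= addr < (1 << ADDR_BITS)
    assert 0 <= retry_cnt < (1 << RETRY_BITS)
    word = retry_cnt
    word |= int(war) << RETRY_BITS
    word |= addr << (RETRY_BITS + 1)
    word |= int(valid) << (RETRY_BITS + 1 + ADDR_BITS)
    return word

def unpack_record(word: int):
    """Inverse of pack_record; returns (valid, addr, war, retry_cnt)."""
    retry_cnt = word & ((1 << RETRY_BITS) - 1)
    war = bool((word >> RETRY_BITS) & 1)
    addr = (word >> (RETRY_BITS + 1)) & ((1 << ADDR_BITS) - 1)
    valid = bool((word >> (RETRY_BITS + 1 + ADDR_BITS)) & 1)
    return valid, addr, war, retry_cnt

word = pack_record(True, 0x12345678, False, 3)
fields = unpack_record(word)
```

Under the assumed 40-bit address, each record occupies 50 bits, so a 64-record storage unit would hold 3200 bits of state.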
To tolerate soft errors between the processor and the memory, this embodiment further provides a soft-error-tolerant read request retransmission mechanism (a retransmission mechanism that does not block writes). It checks the received read response data (typically with an ECC that corrects single-bit errors and detects double-bit errors) and, if the received read data contains an uncorrectable error, retransmits the read request until error-free data is received or the maximum number of attempts is reached. Because inter-board soft errors are transient, this read request retransmission mechanism can tolerate them with high probability. The read request retransmission mechanism is the application method of the CAM supporting a soft error retransmission mechanism between a processor and a memory board, and comprises the following steps:
1) Wait for a memory access request. If the received request is a read request, go to step 2); if it is a write request, go to step 3); if it is read response data, go to step 4).
2) Create a new record in the record storage unit, fill the access address of the read request into the read address of the new record, set the valid flag of the new record to 1, its read-modify-write dependency flag to 0, and its retransmission count to 0, then go to step 1).
3) For the write address of the write request, use the comparator array and the read-modify-write dependency judgment module to determine whether the write request has a read-modify-write dependency with any record in the record storage unit whose valid flag is 1; if so, set the read-modify-write dependency flag of the corresponding record to 1. Then go to step 1).
4) Check the received read response data. If the check finds no uncorrectable error, return the correct read response data, set the valid flag of the record corresponding to the read address of the read response data to 0, and go to step 1); otherwise, go to step 5).
5) For the read request record corresponding to the read response data, determine whether the read-modify-write dependency flag is 0 and the retransmission count is smaller than a preset threshold. If so, send a retransmission request, add 1 to the retransmission count of the record corresponding to the read address of the read response data, and go to step 1); otherwise, end and exit.
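Steps 1) to 5) can be sketched as an event-driven model. This is a behavioral illustration under assumptions, not the hardware implementation: the class `RetryCam`, its method names, the return strings, the `max_retries` threshold value, and the behavior when no free record exists are all hypothetical, since the source does not specify them.

```python
from dataclasses import dataclass

@dataclass
class Record:
    valid: bool = False
    addr: int = 0
    war: bool = False      # read-modify-write dependency flag
    retry_cnt: int = 0

class RetryCam:
    def __init__(self, n: int = 64, max_retries: int = 3):
        self.records = [Record() for _ in range(n)]
        self.max_retries = max_retries  # hypothetical preset threshold

    def _find(self, addr: int) -> Record:
        for r in self.records:
            if r.valid and r.addr == addr:
                return r
        raise KeyError(hex(addr))

    def on_read_request(self, addr: int) -> int:
        """Step 2: allocate a record with valid=1, WAR=0, count=0."""
        for i, r in enumerate(self.records):
            if not r.valid:
                self.records[i] = Record(True, addr, False, 0)
                return i
        raise RuntimeError("no free record")  # not covered by the source

    def on_write_request(self, addr: int) -> None:
        """Step 3: mark every valid record whose read address matches."""
        for r in self.records:
            if r.valid and r.addr == addr:
                r.war = True

    def on_read_response(self, addr: int, uncorrectable: bool) -> str:
        """Steps 4 and 5: deliver, retransmit, or give up."""
        r = self._find(addr)
        if not uncorrectable:
            r.valid = False            # step 4: release the record
            return "deliver"
        if not r.war and r.retry_cnt < self.max_retries:
            r.retry_cnt += 1           # step 5: count the retransmission
            return "retransmit"
        return "fail"                  # step 5: end and exit

cam = RetryCam(n=4, max_retries=2)
cam.on_read_request(0x100)
cam.on_read_request(0x200)
cam.on_write_request(0x200)            # dependency on the second read
results = [cam.on_read_response(0x100, True) for _ in range(3)]
blocked = cam.on_read_response(0x200, True)
```

Note that `on_write_request` never stalls the write; it only records the dependency, which matches the non-blocking property the text emphasizes.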
Combining steps 1) to 5) above, the update timing of the CAM information is shown in Table 2.
Table 2: Update timing of the CAM information.
Content | Set-to-1 / update timing | Set-to-0 timing |
Valid flag | Read request reception | Read response return stage |
Read address | Read request reception | None |
Read-modify-write dependency flag | Write request & dependency | Read request reception stage | Retransmission decision stage
Retransmission count | +1 after retransmission decision | Read request reception stage | Retransmission decision stage
The retransmission condition in step 5) is that the read-modify-write dependency flag is 0 and the retransmission count is smaller than the preset threshold. In terms of the fields in Table 1, this can be expressed as: retransmit when WAR is 0 and Retry Cnt has not exceeded the limit. The specific cases are shown in Table 3.
Table 3: Retransmission decision table.
Case | Read-modify-write dependency flag | Retry Cnt over limit | Decision
Case 1 | 1 | / | No retransmission
Case 2 | 0 | 1 | No retransmission
Case 3 | 0 | 0 | Retransmission
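The decision in Table 3 reduces to a single Boolean expression. The sketch below is illustrative only: the function name `should_retransmit` and the value of `THRESHOLD` are hypothetical, and "over limit" is interpreted as the count no longer being smaller than the preset threshold, following step 5).

```python
def should_retransmit(war: bool, retry_cnt: int, threshold: int) -> bool:
    """Retransmit only if WAR is 0 and the count is still below the threshold."""
    over_limit = retry_cnt >= threshold
    return (not war) and (not over_limit)

THRESHOLD = 3  # hypothetical preset threshold

cases = {
    "Case 1": should_retransmit(True, 0, THRESHOLD),           # WAR set
    "Case 2": should_retransmit(False, THRESHOLD, THRESHOLD),  # count over limit
    "Case 3": should_retransmit(False, 0, THRESHOLD),          # neither
}
```

In Case 1 the count is a don't-care, which is why the table marks it with "/".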
The application method of the CAM supporting a soft error retransmission mechanism between a processor and a memory board in this embodiment is a read request retransmission mechanism based on a dependency CAM. It tolerates inter-board soft errors between the main processor and the off-chip memory with the highest possible probability while affecting system performance as little as possible.
In terms of performance, the application method of this embodiment does not block write requests from the request source, so the fault-tolerance mechanism has little impact on system performance. The cost is that a read request cannot be retransmitted if a write to the same target address occurs between the first request and the retransmitted request, or between two adjacent retransmitted read requests, and in that case the inter-board soft error cannot be tolerated. However, write-after-read dependencies between adjacent requests are rare. Modern high-performance processors typically have multiple cache levels such as L1, L2, and even L3, which exploit the temporal locality, i.e. the dependencies, between data as far as possible. As a result, among the memory access requests actually sent to memory, the probability that a read request cannot be retransmitted because of a write to the same target address is also low. In summary, the application method of this embodiment combines high performance with efficient tolerance of inter-board soft errors.
The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above embodiments, and all technical solutions belonging to the idea of the present invention belong to the protection scope of the present invention. It should be noted that modifications and embellishments within the scope of the invention may occur to those skilled in the art without departing from the principle of the invention, and are considered to be within the scope of the invention.

Claims (4)

1. A CAM for supporting a soft error retransmission mechanism between a processor and a memory board, comprising:
a record storage unit comprising n records, each record containing a valid flag, a read address, a read-modify-write dependency flag, and a retransmission count;
a comparator array comprising n comparators in one-to-one correspondence with the records in the record storage unit, each comparator having two inputs, one being the write address and the other being the read address in the corresponding record, the control signal of the comparator being the valid flag in the corresponding record;
a read-modify-write dependency judgment module for determining, from the outputs of the n comparators, whether a read-modify-write dependency exists;
the method for applying the CAM supporting the soft error retransmission mechanism between the processor and the memory board comprises the following steps:
1) waiting for a memory access request; if the received request is a read request, executing step 2); if it is a write request, executing step 3); if it is read response data, executing step 4);
2) creating a new record in the record storage unit, filling the access address of the read request into the read address of the new record, setting the valid flag of the new record to 1, its read-modify-write dependency flag to 0, and its retransmission count to 0, and executing step 1);
3) for the write address of the write request, using the comparator array and the read-modify-write dependency judgment module to determine whether the write request has a read-modify-write dependency with a record in the record storage unit whose valid flag is 1; if so, setting the read-modify-write dependency flag of the corresponding record to 1; and executing step 1);
4) checking the received read response data; if the check finds no uncorrectable error, returning the correct read response data, setting the valid flag of the record corresponding to the read address of the read response data to 0, and executing step 1); otherwise, executing step 5);
5) for the read request record corresponding to the read response data, determining whether the read-modify-write dependency flag is 0 and the retransmission count is smaller than a preset threshold; if so, sending a retransmission request, adding 1 to the retransmission count of the record corresponding to the read address of the read response data, and executing step 1); otherwise, ending and exiting.
2. The CAM supporting a soft error retransmission mechanism between a processor and a memory board according to claim 1, wherein the record storage unit comprises 64 records in total.
3. The CAM supporting a soft error retransmission mechanism between a processor and a memory board according to claim 1, wherein the valid flag and the read-modify-write dependency flag are each 1 bit wide.
4. The CAM supporting a soft error retransmission mechanism between a processor and a memory board according to claim 1, wherein the retransmission count is 8 bits wide.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011043590.8A CN112181703B (en) 2020-09-28 2020-09-28 CAM supporting soft error retransmission mechanism between capacity processor and memory board and application method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011043590.8A CN112181703B (en) 2020-09-28 2020-09-28 CAM supporting soft error retransmission mechanism between capacity processor and memory board and application method

Publications (2)

Publication Number Publication Date
CN112181703A CN112181703A (en) 2021-01-05
CN112181703B true CN112181703B (en) 2022-10-28

Family

ID=73946331

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011043590.8A Active CN112181703B (en) 2020-09-28 2020-09-28 CAM supporting soft error retransmission mechanism between capacity processor and memory board and application method

Country Status (1)

Country Link
CN (1) CN112181703B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113806108A (en) * 2021-08-25 2021-12-17 海光信息技术股份有限公司 Retransmission method, memory controller, processor system and electronic equipment

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0707268A2 (en) * 1994-10-14 1996-04-17 Compaq Computer Corporation Easily programmable memory controller which can access different speed memory devices on different cycles
US5634112A (en) * 1994-10-14 1997-05-27 Compaq Computer Corporation Memory controller having precharge prediction based on processor and PCI bus cycles
CN1571951A (en) * 2001-08-23 2005-01-26 集成装置技术公司 FIFO memory devices having single data rate (sdr) and dual data rate (ddr) capability
CN102063340A (en) * 2011-01-19 2011-05-18 西安交通大学 Method for improving fault-tolerant capability of high-speed cache of magnetoresistance RAM (Random Access Memory)
CN102103558A (en) * 2009-12-18 2011-06-22 上海华虹集成电路有限责任公司 Multi-channel NANDflash controller with write retransmission function
CN105740168A (en) * 2016-01-23 2016-07-06 中国人民解放军国防科学技术大学 Fault-tolerant directory cache controller
CN107888512A (en) * 2017-10-20 2018-04-06 深圳市楠菲微电子有限公司 Dynamic shared buffer memory and interchanger
CN110727530A (en) * 2019-09-12 2020-01-24 无锡江南计算技术研究所 Error access memory request retransmission system and method based on window

Also Published As

Publication number Publication date
CN112181703A (en) 2021-01-05

Similar Documents

Publication Publication Date Title
US11714750B2 (en) Data storage method and system with persistent memory and non-volatile memory
US8898408B2 (en) Memory controller-independent memory mirroring
US7937641B2 (en) Memory modules with error detection and correction
US8694857B2 (en) Systems and methods for error detection and correction in a memory module which includes a memory buffer
US7770077B2 (en) Using cache that is embedded in a memory hub to replace failed memory cells in a memory subsystem
US8103900B2 (en) Implementing enhanced memory reliability using memory scrub operations
US9454422B2 (en) Error feedback and logging with memory on-chip error checking and correcting (ECC)
US7587658B1 (en) ECC encoding for uncorrectable errors
US9063902B2 (en) Implementing enhanced hardware assisted DRAM repair using a data register for DRAM repair selectively provided in a DRAM module
US7676729B2 (en) Data corruption avoidance in DRAM chip sparing
CN109213693B (en) Storage management method, storage system and computer program product
US11609817B2 (en) Low latency availability in degraded redundant array of independent memory
CN112181703B (en) CAM supporting soft error retransmission mechanism between capacity processor and memory board and application method
US11520659B2 (en) Refresh-hiding memory system staggered refresh
WO2023024594A1 (en) Retransmission method, memory controller, processor system and electronic device
US8352786B2 (en) Compressed replay buffer
CN112181871B (en) Write-blocking communication control method, component, device and medium between processor and memory
US9251054B2 (en) Implementing enhanced reliability of systems utilizing dual port DRAM
US20150154073A1 (en) Storage control apparatus and storage control method
CN117080779B (en) Memory bar plugging device, method for adapting memory controller to memory bar plugging device and working method
US10312943B2 (en) Error correction code in memory
US20240338124A1 (en) Storage device using host memory buffer and method of operating the same
CN110597656B (en) Check list error processing method of secondary cache tag array
CN115705906A (en) Data storage device with data verification circuitry
GB2454597A (en) Packet receiving buffer where packet sub-blocks are stored as linked list with sequence numbers and start/end flags to detect read out errors

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant