CN114138544A - Data reading and writing method and device and soft error processing system - Google Patents

Data reading and writing method and device and soft error processing system

Info

Publication number
CN114138544A
Authority
CN
China
Prior art keywords
data
error
check
bits
row
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111465396.3A
Other languages
Chinese (zh)
Inventor
陶昱良
潘于
代开勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Haiguang Information Technology Co Ltd
Original Assignee
Haiguang Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Haiguang Information Technology Co Ltd
Priority to CN202111465396.3A
Publication of CN114138544A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08 Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F11/1008 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices
    • G06F11/1012 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices using codes or arrangements adapted for a specific type of error
    • G06F11/102 Error in check bits
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08 Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F11/1008 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices
    • G06F11/1012 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices using codes or arrangements adapted for a specific type of error
    • G06F11/1032 Simple parity
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation
    • G06F12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806 Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/0811 Multiuser, multiprocessor or multiprocessing cache systems with multilevel cache hierarchies

Abstract

A data reading method and device, a data writing method and device, and a soft error processing system are provided. The data reading method includes: performing a first check on m bits of data read from a target row of the n rows of a storage array by using an error checking method to obtain a first check result; in response to the first check result indicating that the error occurring in the target row cannot be completely corrected, performing a second check on the data of each of the m columns read from the storage array by using a parity check method to obtain a parity check result; combining the first check result and the parity check result to obtain a second check result corresponding to the target row; and taking the second check result as the read result. The data reading method can correct multi-bit soft errors in the data memory in a timely manner with small hardware and latency overhead, and the number of bits it can correct is greater than the maximum number of bits that the error checking method alone can correct.

Description

Data reading and writing method and device and soft error processing system
Technical Field
Embodiments of the present disclosure relate to a data reading method, a data writing method, a data reading apparatus, a data writing apparatus, and a soft error handling system.
Background
Whether a chip can work continuously and stably depends on the stability of each component of the system on chip. Static Random-Access Memory (SRAM) is used extensively in current chip designs, and the data stored in an SRAM is retained as long as the SRAM remains powered. SRAM is typically used to store critical data, reduce latency, and improve chip performance. For example, SRAM may serve as a cache in a CPU (central processing unit) or a GPU (graphics processing unit), as a data buffer on a data path, as a first-in first-out (FIFO) queue, and the like. Compared with Dynamic Random-Access Memory (DRAM), SRAM has lower latency and higher speed and does not need to be refreshed, which improves chip performance and reduces the power consumed in accessing data.
Disclosure of Invention
At least one embodiment of the present disclosure provides a data reading method applied to a data memory, where n data are stored in the data memory and the data width of each data is m bits, the n data are sequentially arranged to form a data array of n × m bits, and the data array is correspondingly stored as a storage array in the data memory. The data reading method includes: performing a first check on m bits of data read from a target row of the n rows of the storage array by using an error checking method to obtain a first check result; in response to the first check result indicating that the error occurring in the target row cannot be completely corrected, performing a second check on the data of each of the m columns read from the storage array by using a parity check method to obtain a parity check result; combining the first check result and the parity check result to obtain a second check result corresponding to the target row, wherein the number of bits corrected in the second check result is greater than the maximum number of bits that the error checking method can correct; and taking the second check result as the read result, wherein m and n are both positive integers.
For example, in a data reading method provided by at least one embodiment of the present disclosure, the data storage device provides a first check array for the storage array, where the first check array includes n error check storage rows corresponding to the n data in a one-to-one manner, each of the n error check storage rows includes p error check bits, and the p error check bits in the ith row of the n error check storage rows are used to store error check data corresponding to data in the ith row of the data array; the data memory is provided with a parity storage row for the storage array and the first check array, wherein the parity storage row comprises m bits corresponding to m columns of the storage array and p bits corresponding to p columns of the first check array, the m bits and the p bits are used for storing reference parity check vectors with m + p bits corresponding to m columns of the data array and p columns of the first check array, respectively, and both p and i are positive integers.
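As an illustrative, non-limiting sketch of the layout described above, each row can be modeled as one (m + p)-bit word, and the reference parity vector as the bitwise XOR of all n rows, which yields one even-parity bit per column. The values of `M` and `P`, the helper names, and the check-bit values below are assumptions chosen for illustration only; the check bits are placeholders and are not produced by any real error checking code.

```python
from functools import reduce
from operator import xor

M = 8   # data bits per row (m); illustrative value
P = 5   # error check bits per row (p); illustrative value

def pack_row(data, check):
    """Concatenate m data bits and p check bits into one (m + p)-bit word."""
    assert 0 <= data < (1 << M) and 0 <= check < (1 << P)
    return (data << P) | check

def reference_parity(rows):
    """Column-wise even parity over all rows, computed as the XOR of the rows."""
    return reduce(xor, rows, 0)

# Three example rows; the check values are arbitrary placeholders.
rows = [
    pack_row(0b10110010, 0b00101),
    pack_row(0b01100001, 0b11000),
    pack_row(0b11111111, 0b00011),
]
parity_row = reference_parity(rows)   # contents of the parity storage row
```

Under this model the parity storage row costs only m + p extra bits for the whole array, regardless of n.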
For example, in a data reading method provided by at least one embodiment of the present disclosure, performing a first check on m bits of data read from a target row of n rows of the memory array by using an error checking method to obtain a first check result, including: reading m bits of data stored in the target row and p bits of error check data corresponding to the target row to obtain a first data row; and performing the first check on the first data line by using the error check method to obtain the first check result.
For example, in a data reading method provided by at least one embodiment of the present disclosure, performing a second check on data of each of m columns read from the storage array by using a parity check method to obtain a parity check result, including: performing parity operation on data bits of each of the m columns read from the storage array and the p columns read from the first check array by using a parity method to obtain a detection parity vector with m + p bits; comparing the detection parity vector to the reference parity vector bit-wise, determining a plurality of difference bits between the detection parity vector and the reference parity vector; and obtaining the parity check result according to the plurality of difference bits.
For example, in a data reading method provided in at least one embodiment of the present disclosure, comparing the detection parity vector with the reference parity vector bit by bit includes: performing a bitwise XOR of the detection parity vector and the reference parity vector.
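The comparison described above can be sketched as follows, assuming each row is held as an integer and column 0 is the least significant bit (both are modeling assumptions, as is the function name `difference_bits`). The detection parity vector is recomputed from the rows as read, XORed with the stored reference parity vector, and the set bits of the result give the positions of the difference bits.

```python
from functools import reduce
from operator import xor

def difference_bits(rows, reference_parity, width):
    """Return the column positions where recomputed parity disagrees with the stored one."""
    detection_parity = reduce(xor, rows, 0)        # detection parity vector
    diff = detection_parity ^ reference_parity     # bitwise XOR comparison
    return [c for c in range(width) if (diff >> c) & 1]

width = 8
rows = [0b10110010, 0b01100001, 0b11111111]
ref = reduce(xor, rows, 0)            # reference parity vector stored at write time

corrupted = list(rows)
corrupted[1] ^= 0b00010100            # soft errors flip columns 2 and 4 of row 1
positions = difference_bits(corrupted, ref, width)
```

Column parity only identifies which columns changed, not which row, which is why the result is combined with the row-level first check.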
For example, in a data reading method provided by at least one embodiment of the present disclosure, when the first check result indicates that the error occurring in the target row cannot be completely corrected, the first check result includes the first data row, and combining the first check result and the parity check result to obtain the second check result corresponding to the target row includes: determining e potential error bits in the first data row according to the parity check result, wherein the positions of the e potential error bits are the positions of the plurality of difference bits, and e is an integer; and, in response to e being greater than a-1 and less than or equal to a preset correction threshold, performing a trial-and-error combination test using the e potential error bits, the error checking method, and the first data row to obtain the second check result, wherein a is the maximum number of error bits that the error checking method can detect.
For example, in a data reading method provided in at least one embodiment of the present disclosure, performing the trial-and-error combination test using the e potential error bits, the error checking method, and the first data row includes: determining at least one correction combination, each consisting of a of the e potential error bits, and performing the trial-and-error combination test on the at least one correction combination; wherein the trial-and-error combination test includes performing a trial-and-error test on each selected correction combination in sequence, the trial-and-error test including: flipping, in the first data row, the a data bits corresponding to the a potential error bits included in the selected correction combination to obtain an intermediate test data row corresponding to the first data row; performing the first check on the intermediate test data row by using the error checking method; in response to at most a-1 bits of the intermediate test data row being in error, processing the intermediate test data row to obtain the second check result and stopping the trial-and-error combination test; and in response to a bits of the intermediate test data row being in error, performing the trial-and-error test on the next correction combination.
For example, in the data reading method provided by at least one embodiment of the present disclosure, the trial-and-error test is performed on the at least one correction combination in order of increasing data bit distance, and the data bit distance of each correction combination is determined according to the distance between the a potential error bits included in that correction combination.
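One possible way to produce that ordering is sketched below. The text does not fix how the distance is measured for a group of a bits, so the sketch assumes the span between the lowest and highest position in the combination; the function name `ordered_combinations` is likewise an assumption.

```python
from itertools import combinations

def ordered_combinations(potential_bits, a):
    """List all a-bit correction combinations, smallest data bit distance first."""
    combos = combinations(sorted(potential_bits), a)
    # Assumed distance metric: span between the lowest and highest bit position.
    return sorted(combos, key=lambda combo: combo[-1] - combo[0])

order = ordered_combinations([3, 4, 9, 20], a=2)
```

For the four potential error bits 3, 4, 9 and 20 with a = 2, the adjacent pair (3, 4) is tried first and the most widely separated pair (3, 20) last.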
For example, in a data reading method provided by at least one embodiment of the present disclosure, in response to at most a-1 bits of the intermediate test data row being in error, processing the intermediate test data row to obtain the second check result and stopping the trial-and-error combination test includes: in response to the intermediate test data row having no error, taking the intermediate test data row as the second check result and stopping the trial-and-error combination test; and in response to b bits of the intermediate test data row being in error, correcting the b bits by using the error checking method, taking the correction result as the second check result, and stopping the trial-and-error combination test, wherein b is a positive integer less than or equal to a-1.
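The trial-and-error combination test can be illustrated with a small self-contained sketch. It assumes a = 2 and uses an extended Hamming code as one concrete single-error-correction double-error-detection checker; that code, the list-of-bits row representation, and all function names are assumptions made for illustration and are not prescribed by the text.

```python
from itertools import combinations

def syndrome_of(code):
    """XOR of the indices of all set bits; zero for a valid codeword."""
    s = 0
    for i, bit in enumerate(code):
        if bit:
            s ^= i
    return s

def secded_encode(data_bits):
    """Extended Hamming encode: index 0 is overall parity, powers of two are check bits."""
    code, pos, remaining = [0], 1, list(data_bits)
    while remaining:
        is_check = (pos & (pos - 1)) == 0
        code.append(0 if is_check else remaining.pop(0))
        pos += 1
    s, check = syndrome_of(code), 1
    while check < len(code):
        code[check] = 1 if s & check else 0
        check <<= 1
    code[0] = sum(code) % 2          # make the overall parity even
    return code

def secded_check(code):
    """Return (status, row): 'ok', 'corrected' (one bit fixed) or 'uncorrectable'."""
    s, odd = syndrome_of(code), sum(code) % 2
    if s == 0 and not odd:
        return "ok", list(code)
    if odd and s < len(code):
        fixed = list(code)
        fixed[s] ^= 1                # single error at index s (0 is the parity bit)
        return "corrected", fixed
    return "uncorrectable", list(code)

def trial_and_error(first_row, potential_bits, a=2):
    """Flip each a-bit combination in turn; stop at the first row the checker accepts."""
    for combo in combinations(potential_bits, a):
        trial = list(first_row)
        for pos in combo:
            trial[pos] ^= 1
        status, result = secded_check(trial)
        if status != "uncorrectable":
            return result            # second check result
    return None                      # no combination passed

stored = secded_encode([1, 0, 1, 1, 0, 0, 1, 0])
corrupted = list(stored)
corrupted[3] ^= 1
corrupted[10] ^= 1                   # two soft errors in the target row
recovered = trial_and_error(corrupted, potential_bits=[3, 6, 10])
```

In this example bit 6 stands for a difference bit caused by an error in some other row. Flipping an incorrect combination can in principle produce a different valid codeword, which is one reason to bound e by a correction threshold.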
For example, in a data reading method provided by at least one embodiment of the present disclosure, combining the first check result and the parity check result to obtain a second check result corresponding to the target row, includes: determining a plurality of potential error bits in which an error exists among m bits of data read from the target row using the parity result; and in response to that the number of the plurality of potential error bits is equal to the maximum error detection bit number a which can be detected by the error checking method, processing m bits of the data read from the target row according to the plurality of potential error bits to obtain the second checking result.
For example, the data reading method provided by at least one embodiment of the present disclosure further includes: in response to the number of the plurality of potential error bits being within a preset correction range, constructing at least one correction combination based on the plurality of potential error bits, wherein each correction combination consists of a selected potential error bits; and performing a trial-and-error combination test on the at least one correction combination; wherein the trial-and-error combination test includes performing a trial-and-error test on each selected correction combination in sequence, the trial-and-error test including: flipping, in the m bits of data read from the target row, the a data bits corresponding to the a potential error bits included in the selected correction combination to obtain an intermediate test data row; performing the first check on the intermediate test data row by using the error checking method to obtain a first intermediate check result; in response to the first intermediate check result indicating a pass, obtaining the second check result based on the first intermediate check result and stopping the trial-and-error combination test; and in response to the first intermediate check result indicating a failure, performing the trial-and-error test on the next correction combination.
For example, in the data reading method provided in at least one embodiment of the present disclosure, the error checking method is a single-error-correction double-error-detection method.
For example, the data reading method provided by at least one embodiment of the present disclosure further includes: in response to the first check result indicating that the error occurring in the target row can be completely corrected, taking the first check result as the read result.
At least one embodiment of the present disclosure provides a data writing method for writing data into a data memory, where the data memory is configured to be able to store n data, each of the n data has a data width of m bits, the n data are sequentially arranged to form a data array of n × m bits, and the data array of n × m bits is correspondingly stored as a storage array of n × m bits in the data memory. The data writing method includes: generating, based on an error checking method, first check data for target data to be written into a target row of the n rows of the storage array, wherein the first check data is used for checking the target row by using the error checking method; and obtaining a reference parity vector by using a parity check method based on the target data, wherein the reference parity vector is used for performing a parity check on each of the m columns.
For example, in a data writing method provided by at least one embodiment of the present disclosure, the data storage device provides a first check array for the storage array, the first check array includes n error check storage rows corresponding to the n data in a one-to-one manner, the data storage device provides a parity check storage row for the storage array and the first check array, and the data writing method further includes: writing the target data to a target row in the storage array; writing the first check data into an error check storage row corresponding to the target row in the first check array; writing the reference parity vector to the parity storage row.
For example, in a data writing method provided in at least one embodiment of the present disclosure, obtaining the reference parity vector by the parity check method based on the target data includes: reading the current reference parity vector stored in the parity storage row, and performing a bitwise XOR operation on the current reference parity vector, the target data, and the first check data to obtain the reference parity vector.
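A minimal sketch of this update is given below, with rows modeled as integers whose upper m bits hold the data and lower p bits hold the check data (a modeling assumption, as are the names used). The sketch also assumes that the target row held all zeros before the write; if it held older contents, those would presumably have to be XORed out of the parity vector as well, a step this paragraph does not spell out.

```python
def updated_reference_parity(current_parity, target_data, check_data, p):
    """Fold a newly written row into the stored reference parity vector."""
    new_row = (target_data << p) | check_data   # m data bits followed by p check bits
    return current_parity ^ new_row             # bitwise XOR, one parity bit per column

P = 5                      # illustrative number of check bits per row
parity = 0                 # parity storage row of an empty (all-zero) array
writes = [(0b10110010, 0b00101), (0b01100001, 0b11000)]   # placeholder check values
for data, check in writes:
    parity = updated_reference_parity(parity, data, check, P)
```

Updating the parity storage row incrementally avoids re-reading all n rows on every write.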
At least one embodiment of the present disclosure provides a soft error processing system, including a data storage and a control circuit, where n data are stored in the data storage, and a data width of each data is m bits, the n data are sequentially arranged to form a data array with n × m bits, and the data array is correspondingly stored in the data storage as a storage array, the control circuit includes a controller and an error checker, and the error checker is configured to perform a first check on m bits of data read from a target row of n rows of the storage array by using an error checking method, so as to obtain a first check result; the controller is configured to: in response to the first check result indicating that the error occurring in the target row cannot be completely corrected, performing a second check on the data of each of the m columns read from the storage array by using a parity check method to obtain a parity check result; combining the first check result and the parity check result to obtain a second check result corresponding to the target row, wherein the number of correction bits of the second check result is greater than the maximum number of error correction bits that can be corrected by the error check method; and outputting the second check result as a reading result, wherein m and n are both positive integers.
For example, in a soft error handling system provided in at least one embodiment of the present disclosure, the data storage device provides the storage array with a first check array, the first check array includes n error check storage rows corresponding to the n data in a one-to-one manner, each of the n error check storage rows includes p error check bits, and the p error check bits in the ith row of the n error check storage rows are used for storing error check data corresponding to the data in the ith row of the data array; the data storage device provides the storage array and the first check array with a parity storage row, wherein the parity storage row includes m bits corresponding one-to-one to the m columns of the storage array and p bits corresponding one-to-one to the p columns of the first check array, the m bits and the p bits are used for storing a reference parity vector of m + p bits corresponding to the m columns of the data array and the p columns of the first check array, respectively, and p and i are both positive integers.
For example, in a soft error handling system provided by at least one embodiment of the present disclosure, when the error checker performs the first check on the m bits of data read from the target row of the n rows of the storage array by using the error checking method to obtain the first check result, the following operations are performed: receiving the m bits of data stored in the target row and the p bits of error check data corresponding to the target row, both read from the storage array, to obtain a first data row; determining, by using the error checking method, whether a bits among the m + p bits of the first data row are in error, wherein a is the maximum number of error bits that the error checking method can detect; in response to a bits of the first data row being in error, outputting the first data row to the controller, wherein the first check result includes the first data row; and in response to at most a-1 bits of the first data row being in error, performing correction processing on the first data row and outputting a correction result to the controller, wherein the first check result includes the correction result.
For example, in a soft error handling system provided by at least one embodiment of the present disclosure, when the controller performs a second check on data of each of m columns read from the storage array by using a parity check method to obtain a parity check result, the following operations are performed: performing parity operation on data bits of each of the m columns read from the storage array and the p columns read from the first check array by using a parity method to obtain a detection parity vector with m + p bits; reading the reference parity vector from the parity storage row; comparing the detection parity vector to the reference parity vector bit-wise, determining a plurality of difference bits between the detection parity vector and the reference parity vector; and obtaining the parity check result according to the plurality of difference bits.
For example, in the soft error processing system provided by at least one embodiment of the present disclosure, when the controller performs combining the first check result and the parity check result to obtain the second check result corresponding to the target row, the following operations are performed: determining e potential error bits with errors in the first data row according to the parity check result, wherein the positions of the e potential error bits are the positions of the plurality of difference bits, and e is an integer; and in response to the fact that the e is larger than a-1 and smaller than or equal to a preset correction threshold value, combining the e potential error positions, the error checking method and the first data row, and executing a trial-and-error combination test to obtain the second checking result.
For example, in the soft error handling system provided by at least one embodiment of the present disclosure, when the controller performs the trial-and-error combination test using the e potential error bits, the error checking method, and the first data row to obtain the second check result, the following operations are performed: determining at least one correction combination, each consisting of a of the e potential error bits, and performing the trial-and-error combination test on the at least one correction combination; wherein the trial-and-error combination test includes performing a trial-and-error test on each selected correction combination in sequence, the trial-and-error test including: flipping, in the first data row, the a data bits corresponding to the a potential error bits included in the selected correction combination to obtain an intermediate test data row corresponding to the first data row; sending the intermediate test data row to the error checker; in response to receiving the first flag signal sent by the error checker, stopping the trial-and-error combination test and outputting the second check result sent by the error checker; and in response to receiving the second flag signal sent by the error checker, performing the trial-and-error test on the next correction combination.
For example, in a soft error handling system provided in at least one embodiment of the present disclosure, the error checker is further configured to: performing the first check on the intermediate test data row received from the controller by using the error checking method, in response to no error in the intermediate test data row, using the intermediate test data row as the second check result, and sending the second check result and the first flag signal to the controller, in response to b bits in the intermediate test data row having an error, correcting the b bits by using the error checking method, using the correction result as the second check result, and sending the second check result and the first flag signal to the controller, wherein b is a positive integer and is less than or equal to a-1, and in response to a bits still having an error in the intermediate test data row, outputting the second flag signal to the controller.
For example, in the soft error handling system provided in at least one embodiment of the present disclosure, the control circuit further includes an error check code generator configured to generate first check data for target data to be written to a target row of n rows of the memory array based on an error check method, where the first check data is used to check the target row by the error check method; the controller is further configured to obtain a reference parity vector by a parity check method according to the target data, wherein the reference parity vector is used for performing parity check on each of the m columns.
For example, in the soft error processing system provided in at least one embodiment of the present disclosure, the control circuit further includes an enable selector, an address selector, a read data selector, and a write data selector, and the enable selector is configured to input, under the control of the controller, an enable signal determined based on a data write request or a read request, or an enable signal generated by the controller, to an enable port of the data memory; the address selector is configured to input an address determined based on a data write request or a read request, or an address generated by the controller, to an address port of the data memory under the control of the controller; the read data selector is configured to input data received from a read data port of the data memory or an intermediate test data row generated by the controller to the error checker under the control of the controller; the write data selector is configured to input the target data and the first check data, or the reference parity vector generated by the controller, to a write data port of the data memory under the control of the controller.
At least one embodiment of the present disclosure provides a data reading apparatus applied to a data memory, where n data are stored in the data memory, and a data width of each data is m bits, the n data are sequentially arranged to form a data array with n × m bits, and the data array is correspondingly stored as a storage array in the data memory, the data reading apparatus includes: the first checking unit is configured to perform first checking on m bits of data read from a target row in n rows of the storage array by using an error checking method to obtain a first checking result; a second checking unit, configured to perform a second check on the data of each of the m columns read from the storage array by using a parity checking method in response to the first checking result indicating that the error occurring in the target row cannot be completely corrected, so as to obtain a parity checking result; a correcting unit configured to combine the first check result and the parity check result to obtain a second check result corresponding to the target row, wherein the number of correction bits of the second check result is greater than the maximum number of error correction bits that can be corrected by the error checking method; and the output unit is configured to output the second check result as a reading result, wherein m and n are both positive integers.
At least one embodiment of the present disclosure provides a data writing device, configured to write data into a data storage, where the data storage is configured to store n data, a data width of each of the n data is m bits, the n data are sequentially arranged to form a data array with n × m bits, and the data array with n × m bits is correspondingly stored as a storage array with n × m bits in the data storage, and the data writing device includes: a first verification data generation unit configured to generate first verification data for data to be written to a target row of n rows of the storage array based on an error verification method, wherein the first verification data is used for verifying the target row by using the error verification method; and a second parity data generation unit configured to obtain a reference parity vector by a parity check method according to the data to be written into the target row, wherein the reference parity vector is used for performing parity check on each of the m columns.
Drawings
To more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings of the embodiments will be briefly introduced below, and it is apparent that the drawings in the following description relate only to some embodiments of the present disclosure and are not limiting to the present disclosure.
FIG. 1 is a schematic block diagram of a multi-core chip system;
FIG. 2A is a schematic circuit diagram of an SRAM;
FIG. 2B is a schematic diagram of an equivalent memory array of the SRAM;
FIG. 3 illustrates the process flow after a soft error is detected in the SRAM;
FIGS. 4A-4D are schematic diagrams of the SRAM memory structure;
fig. 5 is a schematic flow chart of a data reading method according to at least one embodiment of the disclosure;
FIG. 6 is a schematic block diagram of a data store provided in at least one embodiment of the present disclosure;
FIG. 7 is a schematic diagram of a potential misalignment provided by at least one embodiment of the present disclosure;
fig. 8 is a flowchart of a data reading and writing method according to at least one embodiment of the disclosure;
fig. 9 is a schematic flow chart of a data writing method according to at least one embodiment of the disclosure;
fig. 10 is a flowchart of a data writing method according to at least one embodiment of the disclosure;
FIG. 11 is a schematic block diagram of a soft error handling system provided in at least one embodiment of the present disclosure;
FIG. 12 is a block diagram of a soft error system provided in at least one embodiment of the present disclosure;
FIGS. 13A-13E are schematic diagrams of soft errors provided by at least one embodiment of the present disclosure;
FIG. 14 is a schematic block diagram of a data reading apparatus provided in at least one embodiment of the present disclosure;
FIG. 15 is a schematic block diagram of a data writing device according to at least one embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described below clearly and completely with reference to the accompanying drawings of the embodiments of the present disclosure. It is to be understood that the described embodiments are only a few embodiments of the present disclosure, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the described embodiments of the disclosure without any inventive step, are within the scope of protection of the disclosure.
Unless otherwise defined, technical or scientific terms used herein shall have the ordinary meaning as understood by one of ordinary skill in the art to which this disclosure belongs. The use of "first," "second," and similar terms in this disclosure is not intended to indicate any order, quantity, or importance, but rather is used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that the element or item listed before the word covers the element or item listed after the word and its equivalents, but does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", and the like are used merely to indicate relative positional relationships, and when the absolute position of the object being described is changed, the relative positional relationships may also be changed accordingly.
To keep the following description of the embodiments of the present disclosure clear and concise, detailed descriptions of some known functions and components have been omitted.
FIG. 1 is a schematic block diagram of a multi-core chip system. As shown in FIG. 1, the system is a typical 4-core system on a chip and includes 4 cores, three levels of cache corresponding to the cores (L1 cache, L2 cache, and L3 cache), an on-chip interconnect network, dynamic random access memory, and other intellectual property (IP) cores. I-L1$ is the private L1 instruction cache of each core and D-L1$ is the private L1 data cache of each core; every two cores share an L2 cache, and the four cores share an L3 cache. The L3 cache and the other IP cores (e.g., direct memory access, video, display) access the dynamic random access memory through the on-chip interconnect network.
On this typical multi-core system-on-a-chip, the L1, L2, and L3 caches contain large amounts of static random access memory, and in addition, there are also large amounts of data caches made up of static random access memory within cores, within other intellectual property cores, and within the on-chip interconnect network.
FIG. 2A is a schematic circuit diagram of an SRAM, and FIG. 2B is a schematic memory array diagram of an SRAM equivalent to that of FIG. 2A.
As shown in FIG. 2A, the SRAM circuit mainly includes a row address decoder, a column address decoder, bit line selection, a column multiplexer, and a memory array. The read/write data width of the static random access memory is m bits and the read/write address width is N bits, of which the column address is k bits wide and the row address is N-k bits wide. After the row address is decoded, a word line selects one row of the memory array; one row of the memory array contains $2^k$ groups of m-bit data. The column address is decoded to obtain a column strobe address, and the column multiplexer selects one group of m-bit data from the $2^k$ groups in the row selected by the word line for reading or writing.
As shown in FIG. 2B, the static random access memory is equivalent to a two-dimensional array of n rows with m bits per row, where, for example, $n = 2^N$. Each block in FIG. 2B is one memory cell, i.e., one bit, of the static random access memory.
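The address split described above can be sketched as follows. This is only an illustration: the text does not say which address bits form the column address, so the sketch assumes the low k bits select the column and the upper N-k bits select the row. The function name `split_address` is hypothetical.

```python
def split_address(addr: int, n_bits: int, k_bits: int) -> tuple[int, int]:
    """Split an N-bit address into (row, column).

    Assumption: the column address is the low k bits; the source does not
    specify the bit assignment.
    """
    if not 0 <= addr < (1 << n_bits):
        raise ValueError("address out of range")
    row = addr >> k_bits                # upper N-k bits pick the word line
    col = addr & ((1 << k_bits) - 1)    # low k bits drive the column multiplexer
    return row, col


# N = 6, k = 2: 16 physical rows, each holding 2**2 = 4 groups of m-bit data.
row, col = split_address(0b101101, n_bits=6, k_bits=2)
```

Whatever the physical split, the equivalent array of FIG. 2B still has $2^N$ logical rows of m bits each.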
For various reasons, data errors may occur in the SRAM, i.e., the data read from an address is not equal to the data previously written to that address. SRAM errors fall mainly into two types:
The first type is hard errors, i.e., part of the memory array circuitry inside the SRAM is permanently damaged, causing irreversible errors in that part of the data. Such errors are mainly caused by defects in the chip manufacturing process and by circuit aging. An MBIST (Memory Built-In Self-Test) circuit can be used to detect the memory region in which the error occurs, and that region must be avoided while the chip is in use.
The second type is soft errors, which are not permanent and disappear after writing new data to the same address.
The main causes of soft errors are twofold: penetration by radioactive particles flips memory cells of the SRAM, and dynamic voltage noise flips memory cells while data is being read or written. As transistor sizes shrink, the operating voltage of the SRAM keeps decreasing and adjacent memory cells move closer together, which makes soft errors increasingly common. Because a soft error is a dynamic error that may occur at any time and at any position during reads and writes in normal chip operation, it cannot be detected and avoided in advance like a hard error and must be handled dynamically.
Both causes of soft errors are local in nature, i.e., the errors usually concentrate in a local memory area, for the following reasons:
(a) Soft errors caused by radioactive particle penetration arise mainly because the particles pass through the semiconductor material and disturb the voltage of the memory cell latches. Depending on the angle and intensity of penetration, one or more memory cells may flip. The penetration path is generally a straight line, so the errors may occur in a local area along one of three directions (horizontal, vertical, or diagonal). The probability of a radiation event is low, and such an event typically causes only a local 1- to 2-bit error.
(b) Soft errors caused by dynamic voltage noise arise mainly from local memory cells being flipped by power supply noise while the static random access memory is read or written. For example, while the main clock of the chip transitions from on (clock signal present) to off (no clock signal), many other transistors switch, causing dynamic noise on the power lines. If memory cells are read or written at that moment, some weaker memory cells may be disturbed. Noise may also come from off-chip events, such as dynamic fluctuations and noise in the board-level voltage. Dynamic power noise affects only the weak memory cells, which result from variation in the manufacturing process and are local in character, so these soft errors also generally occur on 1 to 2 local bits. As with radioactive particle penetration, the impact of dynamic voltage noise grows as the process advances and the chip operating voltage decreases.
In summary, soft errors of the SRAM generally occur only in a local area, with a low probability of occurrence, and generally affect only 1 to 2 bits. However, an unrecovered soft error may cause a serious system problem, so soft error recovery must be considered at chip design time.
Soft errors are typically detected when data is read. The current process flow for handling soft errors in the SRAM is shown in FIG. 3.
Specifically, when a soft error is detected in the SRAM, it is first determined whether the data can be corrected directly; if so, the corrected data is returned directly. For example, when a 1-bit error occurs, a SECDED (single-error correction and double-error detection) circuit can correct it.
If the error cannot be corrected, for example because only a parity check circuit is used and there is no correction circuit, it is then determined whether backup data exists:
if backup data exists, the backup data is read and returned. For example, if the L1 cache has an error, the backup data in the L2 cache may be read; if the L2 cache has an error, the backup data in the L3 cache may be read; and if the L3 cache has an error, the correct data may be obtained from the dynamic random access memory. If a soft error is also detected in the backup data, the above steps are repeated, that is, the next level of backup data is read until correct data is obtained or no further backup data exists;
if no backup data exists, an interrupt is generated and reported to the CPU so that software can recover the data at the application layer. Software generally takes different actions depending on the severity of the error: for an error in ordinary data, only a software-level retransmission is needed, whereas a serious system error may require resetting the whole chip system or may even bring the system down.
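The flow of FIG. 3 described above can be sketched in software as follows. This is a behavioral illustration only, not hardware from the disclosure; the names `handle_soft_error`, `try_correct`, `backup_readers`, and `report_interrupt` are hypothetical, and each reader is assumed to return `None` when its data is unusable.

```python
from typing import Callable, Optional, Sequence

Reader = Callable[[], Optional[int]]  # returns data, or None if unusable


def handle_soft_error(try_correct: Reader,
                      backup_readers: Sequence[Reader],
                      report_interrupt: Callable[[], None]) -> Optional[int]:
    """Try direct correction, then each backup level, then raise an interrupt."""
    corrected = try_correct()
    if corrected is not None:
        return corrected                 # e.g. SECDED fixed a 1-bit error
    for read_backup in backup_readers:   # e.g. L2, then L3, then DRAM
        data = read_backup()
        if data is not None:
            return data
    report_interrupt()                   # no backup left: software must recover
    return None


interrupts = []
result = handle_soft_error(
    try_correct=lambda: None,                       # error is not correctable
    backup_readers=[lambda: None, lambda: 0xBEEF],  # first backup is also bad
    report_interrupt=lambda: interrupts.append("soft-error"),
)
```

In this usage the second backup level supplies the data, so no interrupt is raised.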
The soft error processing of the static random access memory in the current chip design is mainly divided into two categories, the first category is only error detection and no correction, the second category is error detection and correction by using an ECC algorithm, and the implementation modes and the existing problems of the two schemes are specifically described below.
For the first category, a typical application is to add parity bits to the SRAM; the SRAM structure of this scheme is shown in FIG. 4A. The advantage of this scheme is that the hardware overhead is minimal; the disadvantage is that soft errors can only be detected, not corrected. When a soft error is detected in data read from the static random access memory, obtaining the correct data requires reading the next level of backup data or reporting an interrupt to the CPU so that software can recover the data.
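A minimal sketch of why a single parity bit detects but cannot correct errors. The even-parity convention and the helper names `parity_bit` and `has_error` are assumptions for illustration; the text does not specify the parity polarity.

```python
def parity_bit(word: int) -> int:
    """Even-parity bit: 1 when the word contains an odd number of 1 bits."""
    return bin(word).count("1") & 1


def has_error(word: int, stored_parity: int) -> bool:
    """Detect a mismatch; the position of the flipped bit is not recoverable."""
    return parity_bit(word) != stored_parity


data = 0b1011_0010
stored = parity_bit(data)

single_flip = data ^ 0b0000_0100   # one cell flipped
double_flip = data ^ 0b0001_0100   # two cells flipped

single_detected = has_error(single_flip, stored)
double_detected = has_error(double_flip, stored)
```

A single flip changes the parity and is detected, while a double flip leaves the parity unchanged and goes unnoticed. In neither case does the parity bit say which bit flipped.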
For the case of reading the next level of backup data, the time delay is large, for example, if the L1 cache has an error and needs to read the data in the L2 cache, the process takes about 10 cycles; when the L2 cache has errors, the data of the L3 cache needs to be read, and the process takes 30-50 cycles; while an error in the L3 cache memory requires reading the data of the dram, which takes about 200 cycles. Therefore, the data delay brought by the solution is large, and the performance of the chip is affected. In addition, accessing the next level of backup data typically results in additional data transfers, such as accessing the L3 cache or DRAM data requiring initiation of transfers over the on-chip interconnect network, which significantly increases the power consumption of the chip.
If there is no backup data, the only option is to generate an interrupt and report it to the CPU so that software can recover the data at the application layer. In this situation, software generally takes different actions according to the severity of the error: for an error in ordinary data, only a software-level retransmission is needed, whereas a serious system error may require resetting the whole chip system or may even bring the system down. The delay of this approach is larger than that of reading backup data (it reaches the millisecond level, i.e., millions of cycles) and seriously affects the stability and usability of the chip.
For the second category, as follows from the causes of soft errors described above, the probability of a soft error occurring in the SRAM is relatively small and an error usually affects only 1 to 2 bits, so a typical solution is to add SECDED check bits to the SRAM, as shown in FIG. 4B. The advantage of this scheme is that erroneous data can be corrected directly after the data is read from the static random access memory, avoiding the performance impact of extra delay. Its disadvantage is that only 1 erroneous bit can be corrected. Some chips also use the DECTED (double-error correction and triple-error detection) method for soft error correction, as shown in FIG. 4C. Although this method can correct 2-bit soft errors, it requires more storage overhead and a more complex hardware implementation, and it also introduces additional delay and power consumption.
Table 1 compares the extra hardware memory cell overhead incurred by error correction using SECDED and DECTED circuits. The data bits denote the total number of bits m of the data to be checked, and the check bits denote the total number of check bits used to check that data, which is determined by the error detection algorithm. As can be seen from Table 1, the overhead of the DECTED circuit is almost twice that of the SECDED circuit. When a soft error affects more than 2 bits, the next level of backup data must be read or an interrupt must be reported to the CPU so that software can recover the data. Although stronger algorithms can correct more erroneous bits, the additional hardware overhead grows with the number of correctable bits, which significantly increases chip area and power consumption.
TABLE 1
[Table 1 (image in the original publication): data bits and check bits for the SECDED and DECTED circuits]
As shown in FIG. 4D, to reduce the extra storage overhead of the static random access memory and save chip area, the error check bits (not limited to those of the SECDED or DECTED algorithm) can be stored in a region of the dynamic random access memory, while the static random access memory still uses the scheme shown in FIG. 4A, i.e., adds a parity bit. This scheme uses the space of the dynamic random access memory already present on the chip and adds no storage cells to the static random access memory, so the chip area does not increase. However, in this scheme every data read or write also requires reading or writing the dynamic random access memory through the on-chip interconnect network, which has a very large delay, generally about 200 clock cycles, and greatly affects chip performance. Moreover, the large number of accesses to the dynamic random access memory also greatly increases the power consumption of the chip.
Some high-performance chips have high requirements on RAS (Reliability, Availability, and Serviceability) and need to correct soft errors occurring in the SRAM in time to ensure the stability and performance of the chip. The solutions described above mainly detect and correct 1-bit soft errors. However, as transistor sizes and voltages decrease with advances in semiconductor technology, 2-bit soft errors in the SRAM are becoming more and more common, and the 2-bit soft error correction schemes described above cannot meet the usage requirements because they introduce either a large delay or a large storage overhead.
At least one embodiment of the present disclosure provides a data reading method, a data reading apparatus, a data writing method, a data writing apparatus, and a soft error handling system. The data reading method comprises the following steps: performing first verification on m bits of data read from a target row in n rows of the storage array by using an error verification method to obtain a first verification result; in response to the first check result indicating that the error occurring in the target row cannot be completely corrected, performing a second check on the data of each of the m columns read from the storage array by using a parity check method to obtain a parity check result; combining the first check result and the parity check result to obtain a second check result corresponding to the target row, wherein the number of correction bits of the second check result is greater than the maximum number of error correction bits which can be corrected by the error check method; and taking the second check result as a reading result.
The data reading method performs the second check on each column of data to obtain the parity check result and analyzes the parity check result together with the first check result. It can thereby correct multi-bit soft errors in the data memory in time with small hardware overhead and delay overhead, and the number of bits it can correct is greater than the maximum number of error correction bits that the error check method can correct. For example, when the error check method is the SECDED method, the data reading method can correct 2-bit soft errors occurring in the data memory in time with small hardware overhead and delay overhead, ensuring the stability and performance of the chip system.
Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings, but the present disclosure is not limited to these specific embodiments.
Fig. 5 is a schematic flow chart of a data reading method according to at least one embodiment of the present disclosure.
For example, the data reading method is applied to a data memory, for example, n data are stored in the data memory, the data width of each data is m bits, the n data are sequentially arranged to form a data array with n × m bits, and the data array is correspondingly stored in the data memory as a storage array. For example, the data storage is a static random access memory, and the storage array in the data storage can refer to the equivalent schematic diagram of the storage array shown in fig. 2B, where m and n are both positive integers.
For example, as shown in fig. 5, the data reading method provided by the embodiment of the present disclosure includes steps S10 to S40.
First, in step S10, a first verification is performed on m bits of data read from a target row of n rows of the memory array by using an error checking method, so as to obtain a first verification result.
In step S20, in response to the first check result indicating that the error occurred in the target row cannot be completely corrected, a second check is performed on the data of each of the m columns read from the storage array using the parity method, resulting in a parity result.
In step S30, the first check result and the parity check result are combined to obtain a second check result corresponding to the target row.
For example, the number of correction bits of the second check result is larger than the maximum number of error correction bits that the error check method can correct.
In step S40, the second check result is taken as the read result.
For example, the error check method may include an Error Checking and Correcting (ECC) algorithm, and the ECC algorithm may include a single-error-correcting and double-error-detecting method (SECDED method for short), a double-error-correcting and triple-error-detecting method (DECTED method for short), and the like.
Generally, when the maximum error detection bit number that the error checking method can detect is a bits, the maximum error correction bit number that the error checking method can correct is a-1 bits, that is, the error checking method can detect that soft errors occur in the a bits in the target row at most, can correct the a-1 bits at most, and cannot correct the soft errors of the a bits. In the present disclosure, a represents the maximum number of error detection bits that can be detected by the error check method, and for example, a is 2 for the SECDED method and 3 for the DECTED method.
For example, in the data reading method provided in at least one embodiment of the present disclosure, after the first check result is obtained in step S10, if the first check result indicates that the error occurring in the target row cannot be completely corrected, that is, soft errors have occurred in a bits of the target row, a second check result is obtained according to steps S20-S40, where the second check result corrects the soft errors in the a bits and is used as the read result. If the first check result indicates that the error in the target row can be completely corrected, that is, soft errors have occurred in at most a-1 bits of the target row, the error check method can correct these bits directly, and the first check result can be used as the read result.
For example, the data storage device is provided with a first check array for the storage array, the first check array comprises n error check storage rows corresponding to n data in a one-to-one manner, each of the n error check storage rows comprises p error check bits, and the p error check bits of the ith row in the n error check storage rows are used for storing error check data corresponding to data of the ith row of the data array, wherein p and i are both positive integers.
The data memory is provided with a parity storage row for the storage array and the first check array, e.g., the parity storage row comprises m bits corresponding one-to-one to m columns of the storage array and p bits corresponding one-to-one to p columns of the first check array, the m bits and the p bits being for storing reference parity vectors having m + p bits corresponding to the m columns of the data array and the p columns of the first check array, respectively.
Fig. 6 is a schematic structural diagram of a data storage according to at least one embodiment of the present disclosure. As shown in fig. 6, the data memory is provided with a memory array (shown by a dashed line frame in fig. 6) formed by sequentially arranging n m-bit data, for example, the memory array may be an equivalent memory array shown in fig. 2B, and the actual circuit logic thereof may refer to the form of fig. 2A, which is not described again here.
For example, the data storage device is further provided with a first check array (shown as a dotted-dashed box in fig. 6), where the first check array includes n error check storage rows, each error check storage row includes p error check bits, and the p error check bits are used for storing error check data corresponding to m bits of data located in the same row.
For example, the relationship of m and p satisfies the following formula:
$p = q + 1, \quad m \le 2^q - q - 1$ (equation 1)
Equation 1 is derived from the standard SECDED method: for example, m = 16 gives p = 6, m = 32 gives p = 7, m = 64 gives p = 8, and so on. Of course, when other error check methods are used, m, p, and the relationship between them may be adjusted accordingly, which is not limited by the present disclosure.
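Equation 1 can be evaluated with a short sketch. It assumes q is the smallest integer satisfying the inequality, which is consistent with the three examples given; the function name `secded_check_bits` is hypothetical.

```python
def secded_check_bits(m: int) -> int:
    """Return p = q + 1, where q is the smallest integer with m <= 2**q - q - 1."""
    if m < 1:
        raise ValueError("m must be a positive integer")
    q = 1
    while (1 << q) - q - 1 < m:
        q += 1
    return q + 1


# Data widths mentioned in the text.
check_bits = {m: secded_check_bits(m) for m in (16, 32, 64)}
```

The extra bit beyond q is the overall parity bit that extends single-error correction to double-error detection.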
For example, the data memory is also provided with a parity storage row (shown as a dotted box in fig. 6) comprising m + p column parity bits for storing a reference parity vector. For example, the ith column parity bit in the parity storage row is used to store parity data corresponding to the corresponding ith column n-bit data.
It should be noted that the parity storage row shown in fig. 6 is located below the storage array, but the embodiments of the present disclosure are not limited thereto, and the parity storage row may also be located in the middle of the storage array or above the storage array. For example, if the address number corresponding to the first row of the memory array is 0 and the address number corresponding to the second row of the memory array is 1, the address number corresponding to the parity memory row may be n, and this setting method does not modify the arrangement order of the address numbers, and the peripheral read-write address logic of the data memory is not adjusted, so that the compatibility is stronger.
For example, the size of the storage space of the first check array is n × p bits, the size of the storage space of the parity storage row is m + p bits, and the total size of the storage space provided by the data memory is (n + 1) × (m + p) bits.
Therefore, in the data reading method provided by at least one embodiment of the present disclosure, the percentage of extra storage units (the first check array and the parity storage row) added to the data memory relative to the original data area (i.e., the storage array) is calculated as follows:
$\text{overhead} = \dfrac{n \times p + (m + p)}{n \times m} \times 100\%$
For example, when the address bit width of the data memory is N = 6, $n = 2^N = 64$; when the number of data bits is m = 32, p = 7 according to equation 1. The storage space of the storage array is then 64 × 32 = 2048 bits, that of the first check array is 64 × 7 = 448 bits, and that of the parity storage row is 32 + 7 = 39 bits, so the total overhead added to the data memory beyond the storage array is (448 + 39)/2048 = 23.78%.
For example, relative to the scheme shown in FIG. 4B, the only additional overhead of the data memory in the present disclosure is the parity storage row, so the added overhead is 39/(2048 + 448) = 1.56%.
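The overhead arithmetic above can be reproduced with a few lines. The function names are hypothetical; the formulas are the ones implied by the worked numbers (extra bits n × p + (m + p) over a storage array of n × m bits, and the parity storage row over the SECDED-protected array).

```python
def extra_bits(n: int, m: int, p: int) -> int:
    """Bits added beyond the storage array: first check array plus parity row."""
    return n * p + (m + p)


def overhead_vs_storage_array(n: int, m: int, p: int) -> float:
    return extra_bits(n, m, p) / (n * m)


def overhead_vs_secded(n: int, m: int, p: int) -> float:
    """Parity storage row relative to the FIG. 4B layout; simplifies to 1 / n."""
    return (m + p) / (n * (m + p))


n, m, p = 64, 32, 7
total = overhead_vs_storage_array(n, m, p)   # (448 + 39) / 2048
delta = overhead_vs_secded(n, m, p)          # 39 / (2048 + 448)
print(f"{total:.2%} {delta:.2%}")
```

Because the second ratio reduces to 1/n, the cost of the parity storage row relative to a SECDED-only memory shrinks as the number of rows grows.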
Table 2 below compares the scheme provided by the present disclosure with SECDED methods (SECDED in the table below), DECTED methods (DECTED in the table below) at different memory array sizes (e.g., different n and m) for additional memory cell overhead.
TABLE 2
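[Table 2 (image in the original publication): extra memory cell overhead of the scheme of the present disclosure, the SECDED method, and the DECTED method for different n and m]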
As can be seen from Table 2, the scheme provided by the present disclosure adds only a small number of memory cells compared with the SECDED method (shown in FIG. 4B) and has a clear advantage in storage overhead over the DECTED method (shown in FIG. 4C), especially when n is large and m is small.
The following describes in detail an implementation procedure of a data reading method provided by at least one embodiment of the present disclosure under the data storage structure shown in fig. 6 with reference to the drawings.
For example, step S10 may include: reading m bits of data stored in a target row and p bits of error check data corresponding to the target row to obtain a first data row; and carrying out first verification on the first data row by using an error verification method to obtain a first verification result.
For example, an N-bit input address may be determined according to the data read request, a target row to be read is determined according to the N-bit input address, then m bits of data stored in the target row and p bits of error check data corresponding to the target row are read from the memory array and the first check array, and a first data row composed of m bits of data and p bits of error check data is obtained.
And then, carrying out first verification on the first data row by using an error verification method to obtain a first verification result. For example, the first check may determine whether a bits of the m + p bits of the first data row have errors by using an error checking method, where a is the maximum number of error detection bits that can be detected by the error checking method. In response to the a bit in the first data row having an error, the first data row is taken as a first check result, and of course, the first check result may further include an error indication signal; and in response to the error of at most a-1 bits in the first data line, performing correction processing on the first data line, and taking a corrected correct result as a first check result, wherein the first check result can also comprise a valid indication signal.
For example, in response to the first check result indicating that the error occurred in the target row cannot be completely corrected, e.g., the first check result indicates that the a bit in the first data row has an error, the steps S20-S40 are continuously performed.
For example, step S20 may include: performing parity check operation on data bits of each of the m columns read from the storage array and the p columns read from the first check array by using a parity check method to obtain a detection parity check vector with m + p bits; comparing the detection parity check vector with the reference parity check vector bit-by-bit to determine a plurality of difference bits between the detection parity check vector and the reference parity check vector; a parity result is obtained from the plurality of difference bits.
For example, the detection parity vector $P_d$ is calculated as follows:
$P_i = P_{i-1} \oplus R_i$, where $P_{-1} = R_{RdAddr}$, $i = 0, 1, \ldots, n-1$ and $i \ne RdAddr$ (formula 3)
Here, $R_{RdAddr}$ denotes the first data row and RdAddr denotes the address number of the target row; i takes the values 0, 1, 2, ..., n-1 in turn but skips the address number of the target row, and the $P_i$ obtained when i reaches n-1 is the detection parity vector $P_d$. For example, the m + p bits formed by the m data bits and the p error check bits in the same row in FIG. 6 are called a data row, and a cumulative bitwise XOR over the n data rows in the data memory yields the detection parity vector $P_d$.
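The accumulation in formula 3 can be sketched as follows. Each data row is modeled as a Python integer holding its m + p bits, which is an assumption of this illustration; the function name `detection_parity_vector` is hypothetical.

```python
def detection_parity_vector(rows: list[int], rd_addr: int) -> int:
    """Cumulative bitwise XOR of all data rows, seeded with the target row.

    rows[i] holds the m + p bits of data row i packed into an integer.
    """
    p = rows[rd_addr]              # P_{-1} = R_RdAddr, the row already read
    for i, row in enumerate(rows):
        if i != rd_addr:           # skip the target row, as in formula 3
            p ^= row
    return p


rows = [0b1010, 0b0110, 0b1111]    # three hypothetical 4-bit data rows
p_d = detection_parity_vector(rows, rd_addr=1)
```

Seeding with the target row and skipping it in the loop yields the same value as XOR-ing all n rows, so the result does not depend on which row is the target.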
Thereafter, the currently stored reference parity vector $P_a$ is read from the parity storage row, and the detection parity vector $P_d$ is compared bit by bit with the reference parity vector $P_a$. For example, comparing the detection parity vector with the reference parity vector bit by bit may include: performing a bitwise XOR of the detection parity vector and the reference parity vector. For example, a bit equal to 1 in the XOR result marks an error position in the detection parity vector, i.e., one of the plurality of difference bits, and a difference bit indicates that a data bit or error check bit in the column corresponding to that difference bit is in error. For example, the parity result includes the positions of the plurality of difference bits, the number of the plurality of difference bits, and the like.
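A minimal sketch of the bitwise comparison described above. It assumes both vectors are packed into integers and that column c corresponds to bit c counted from the least significant bit; that indexing and the name `difference_bits` are illustrative choices, not taken from the text.

```python
def difference_bits(p_detect: int, p_ref: int, width: int) -> list[int]:
    """Return the column indices where the two parity vectors disagree."""
    diff = p_detect ^ p_ref        # a 1 marks a column whose parity changed
    return [col for col in range(width) if (diff >> col) & 1]


positions = difference_bits(0b0011, 0b1001, width=4)
count = len(positions)             # number of difference bits (e in step S30)
```

The returned positions and their count correspond to the contents of the parity result described in the text.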
For example, step S30 may include: determining, according to the parity check result, e potential error positions at which errors may exist in the first data row, wherein the e potential error positions are the positions of the plurality of difference bits and e is an integer; and in response to e being greater than a-1 and less than or equal to a preset correction threshold E, performing a trial-and-error combination test based on the e potential error positions, the error check method, and the first data row to obtain the second check result, wherein a is the maximum number of error detection bits that the error check method can detect.
For example, the preset correction threshold E represents the maximum traversable parity check bit number, and may be set as needed, and the larger the preset correction threshold E, the longer the time required for correction may be. Setting the maximum preset correction threshold E can control the correction time to prevent useless correction calculations due to errors for other reasons.
For example, performing a trial-and-error combination test in conjunction with the e potential error bits, the error checking method, and the first data row may include: and determining at least one correction combination consisting of each a potential error bit in the e potential error bits, and performing trial-and-error combination test on the at least one correction combination. For example, the correction combination number X is obtained according to the following formula:
$$X = C_e^a = \frac{e!}{a!\,(e-a)!}$$
Here, C denotes the combination formula and "!" denotes the factorial operation.
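As a quick check of the combination count, the following sketch enumerates the correction combinations for a small case. It assumes, as the text states, that each correction combination is a choice of a potential error bits out of e; the helper name `correction_combinations` and the sample bit indexes are illustrative only.

```python
from itertools import combinations
from math import comb

def correction_combinations(potential_error_bits, a):
    """List every way of choosing a bits out of the potential error bits."""
    return list(combinations(potential_error_bits, a))

potential_error_bits = [5, 6, 8]     # e = 3 (sample values)
a = 2                                # e.g. SECDED detects up to 2 error bits
combos = correction_combinations(potential_error_bits, a)
X = comb(len(potential_error_bits), a)
print(X, combos)
```

The number of enumerated combinations always equals `comb(e, a)`, which is why a larger threshold E lengthens the worst-case correction time.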
For example, the trial-and-error combination test includes performing a trial-and-error test on each selected correction combination in turn, where the trial-and-error test includes: flipping, in the first data row, the a data bits corresponding to the a potential error bits of the selected correction combination to obtain an intermediate test data row corresponding to the first data row; performing the first check on the intermediate test data row by using the error checking method; in response to the intermediate test data row containing at most a-1 bit errors, processing the intermediate test data row to obtain the second check result and stopping the trial-and-error combination test; and in response to the intermediate test data row still containing a-bit errors, performing the trial-and-error test on the next correction combination.
For example, in response to the intermediate test data row containing at most a-1 bit errors, processing the intermediate test data row to obtain the second check result and stopping the trial-and-error combination test may include: in response to the intermediate test data row containing no error, taking the intermediate test data row as the second check result and stopping the trial-and-error combination test; and in response to the intermediate test data row containing b bit errors, correcting the b bits by using the error checking method, taking the correction result as the second check result, and stopping the trial-and-error combination test, where b is a positive integer less than or equal to a-1.
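The trial-and-error test described above might be sketched as follows. This is only an illustration under stated assumptions: rows are packed into Python integers with bit index 0 as the least significant bit, and the error checking method is replaced by a hypothetical stand-in, `make_oracle_checker`, which compares against a known-good row instead of decoding real check bits. A real SECDED or similar decoder would take its place in hardware.

```python
def flip_bits(row, positions):
    """Invert the bits of row at the given bit indexes."""
    for pos in positions:
        row ^= 1 << pos
    return row

def make_oracle_checker(golden_row, a):
    """Hypothetical stand-in for the error checking method (not a real decoder)."""
    def check(row):
        errors = bin(row ^ golden_row).count("1")
        if errors == 0:
            return "ok", row                 # no error: row is used as is
        if errors <= a - 1:
            return "corrected", golden_row   # b <= a-1 bit errors: corrected
        return "uncorrectable", row          # errors remain: try the next combination
    return check

def trial_and_error(first_row, combos, check):
    """Test each correction combination in turn; stop at the first that passes."""
    for combo in combos:
        candidate = flip_bits(first_row, combo)   # intermediate test data row
        status, result = check(candidate)
        if status != "uncorrectable":
            return result, combo
    return None, None                             # caller raises the error indication

golden = 0b0001_0110_1001
first_row = flip_bits(golden, [5, 6])             # 2-bit soft error at indexes 5 and 6
check = make_oracle_checker(golden, a=2)
result, used = trial_and_error(first_row, [(6, 8), (5, 8), (5, 6)], check)
print(used, result == golden)
```

When no combination passes, the sketch returns `None`, which corresponds to the case where the second check result carries an error indication signal.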
For example, if e is greater than the preset correction threshold E, the data cannot be corrected. In this case, the second check result may include an error indication signal, after which the next-level backup data is read or an interrupt is generated and reported to software for handling; for the specific process, refer to the related content of fig. 3.
If e < a, the a-bit error in the first data row cannot be located. In this case, too, the second check result may include an error indication signal, after which the next-level backup data is read or an interrupt is generated and reported to software for handling; for the specific process, refer to the related content of fig. 3.
If a-1 < e ≤ E, the trial-and-error combination test is performed. Specifically, a potential error bits are selected from the e potential error bits in every possible way, yielding X correction combinations, and the trial-and-error test is performed on each selected correction combination in turn.
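The three cases above amount to a small decision on e. The following sketch shows one way to express it; the function name `plan_second_check`, the returned labels, and the sample threshold E = 6 are assumptions made for illustration.

```python
def plan_second_check(e, a, E):
    """Choose the next action from the number of potential error bits e."""
    if e < a:
        return "error_indication"   # too few difference bits to locate an a-bit error
    if e > E:
        return "error_indication"   # more candidates than the preset threshold allows
    return "trial_and_error"        # a-1 < e <= E

for e in (1, 3, 9):
    print(e, plan_second_check(e, a=2, E=6))
```

Both error branches lead to the same handling (read the next-level backup data or report an interrupt); they are kept separate here only to mirror the text.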
Fig. 7 is a schematic diagram of potential error bits provided by at least one embodiment of the present disclosure. As shown in fig. 7, the first data row contains two soft error bits, and a bitwise XOR of the detection parity vector and the reference parity vector yields 3 difference bits, whose positions are the 3 potential error bits in fig. 7.
For example, for the correction combination consisting of the potential error bits with bit indexes 5 and 6, the data bits with bit indexes 5 and 6 in the first data row are flipped. In the flip, a data bit holding a first value (for example, 0) is changed to a second value (for example, 1), and a data bit holding the second value is changed to the first value, which yields the intermediate test data row. The first check is then performed on the intermediate test data row by using the error checking method, for example the SECDED method. Because the soft errors in the original first data row occurred at the data bits with bit indexes 5 and 6, the flipped intermediate test data row is correct; it is taken as the second check result, and the trial-and-error combination test is stopped.
As mentioned above, soft errors in the SRAM are local: soft errors are more likely to occur in adjacent memory cells, so a distance-first traversal locates the real soft error bits among the correction combinations fastest. For example, the X correction combinations undergo the trial-and-error test in ascending order of data bit distance, where the data bit distance of each correction combination is determined by the distance between the a potential error bits it contains. Taking fig. 7 as an example, when the potential error bits are in columns 5, 6, and 8 and the error checking method is the SECDED method, there are three correction combinations: correction combination 1 with bit indexes 5 and 6, correction combination 2 with bit indexes 6 and 8, and correction combination 3 with bit indexes 5 and 8. The correction combination with the smallest data bit distance may be selected first. For example, the data bit distance of correction combination 1 is 1, that of correction combination 2 is 2, and that of correction combination 3 is 3, so correction combination 1 is tested first; if its check fails, correction combination 2 is tested; and if that check also fails, correction combination 3 is tested.
As described above, because soft errors in the static random access memory are local, correction combination 1 may already yield the correct result, in which case the correction takes only 1 clock cycle. The distance-first traversal can therefore shorten the correction period and speed up the correction process.
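The distance-first ordering can be illustrated with the fig. 7 example. The sketch below assumes that the data bit distance of a combination is the difference between its largest and smallest bit index, which matches the a = 2 case in the text; for larger a the text does not fix the metric, so this choice is an assumption. The function names are hypothetical.

```python
from itertools import combinations

def data_bit_distance(combo):
    """Distance between the outermost bits of a combination (assumed metric)."""
    return max(combo) - min(combo)

def distance_first(potential_error_bits, a):
    """Order the correction combinations by ascending data bit distance."""
    return sorted(combinations(potential_error_bits, a), key=data_bit_distance)

order = distance_first([5, 6, 8], a=2)
for combo in order:
    print(combo, data_bit_distance(combo))
```

The resulting order is combination (5, 6), then (6, 8), then (5, 8), i.e., correction combinations 1, 2, and 3 of the example.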
For example, if the trial-and-error test fails for every correction combination, including the last one, the second check result may include an error indication signal, after which the next-level backup data is read or an interrupt is generated and reported to software for handling; for the specific process, refer to the related content of fig. 3.
Finally, in step S40, the second check result is taken as the read result. For example, if the intermediate test data row of any correction combination passes the check, the read result is the second check result, which includes the correction result or the intermediate test data row and may further include a valid flag signal. If every correction combination fails the check, or the number e of potential error bits is greater than E or less than a, the second check result may include an error indication signal and the first data row; after detecting the error indication signal, the chip system reads the next-level backup data or generates an interrupt and reports it to software for handling, for which refer to the related content of fig. 3.
The data reading method provided by at least one embodiment of the present disclosure can timely locate and correct an a-bit soft error occurring in the sram with a relatively small hardware overhead and a relatively small delay overhead by adding the parity storage row, and is compatible with the existing system design, thereby reducing the error probability of the chip system and ensuring the stability and performance of the chip system.
For example, in other embodiments, p-bit error check data may be additionally stored, that is, m-bit data is the object of correction.
For example, step S10 and step S20 are the same as the previous process at this time, and are not described here again.
For example, step S30 may include: determining a plurality of potential error bits in which an error exists among m bits of the data read from the target row using the parity result; and in response to the number of the plurality of potential error bits being equal to the maximum error detection bit number a which can be detected by the error checking method, processing the m bits of the data read from the target row according to the plurality of potential error bits to obtain a second checking result.
For example, the data reading method further includes: in response to the number of the plurality of potential error bits being within a preset correction range, constructing at least one correction combination based on the plurality of potential error bits, where each correction combination consists of a selected potential error bits; and performing a trial-and-error combination test on the at least one correction combination. The trial-and-error combination test includes performing a trial-and-error test on each selected correction combination in turn, and the trial-and-error test includes: flipping, in the m bits of the data read from the target row, the a data bits corresponding to the a potential error bits of the selected correction combination to obtain an intermediate test data row; performing the first check on the intermediate test data row by using the error checking method to obtain a first intermediate verification result; in response to the first intermediate verification result being a pass, obtaining the second check result based on the first intermediate verification result and stopping the trial-and-error combination test; and in response to the first intermediate verification result being a failure, performing the trial-and-error test on the next correction combination.
For example, the first intermediate verification result is a pass when it indicates that the intermediate test data row contains no error, or contains at most a-1 bit errors that the error checking method can correct.
The specific implementation process of the trial-and-error test may refer to the related description of step S30, and will not be described herein.
The data reading method provided by the embodiment can be used for timely positioning and correcting the a-bit soft error in the m-bit data of the static random access memory with smaller hardware overhead and delay overhead by adding the parity check storage row, is compatible with the existing system design, reduces the error probability of the chip system, and ensures the stability and the performance of the chip system.
Fig. 8 is a flowchart of a data reading and writing method according to at least one embodiment of the disclosure. For example, in this case, the error checking method is a SECDED method, that is, a is 2.
First, as shown in fig. 8, a read request is received, an address of a target row is determined, data of the target row is read, and a first check is performed on m bits of the data read by the target row, where the specific process may refer to relevant contents of step S10, and details are not repeated here.
Then, if the first verification result indicates that the error occurring in the target row can be completely corrected, for example, there is at most 1-bit error in the m-bit data (or the first data row) of the target row, the target row data or the correction result is directly output as the first verification result, and the read data is returned, for example, the read data is the m-bit data or the correction result in the target row, thereby completing one read process.
If the first check result indicates that the error occurring in the target row cannot be completely corrected, for example, a 2-bit error exists in the first data row or m-bit data, a second check is performed on the data of each of the m columns read from the storage array by using a parity check method to obtain a parity check result, and the specific process may refer to the relevant content of step S20, which is not described herein again.
Then, e potential error bits in the first data row are determined according to the parity check result. If e is greater than 1 and less than or equal to the preset correction threshold E, X correction combinations are determined, and the trial-and-error test is performed on each selected correction combination in turn.
For example, for the i-th correction combination, if the trial-and-error test passes, that is, the intermediate test data row corresponding to the i-th correction combination contains at most a 1-bit error, the second check result is output and the read process is complete. If the trial-and-error test fails, that is, the intermediate test data row corresponding to the i-th correction combination contains a 2-bit error, the trial-and-error test is performed on the (i+1)-th correction combination, and so on until all correction combinations have been tested. If correct data has still not been obtained, the erroneous data and an error indication signal are returned, after which the next-level backup data is read or an interrupt is generated and reported to software for handling, and the read process is complete. Here, i is a positive integer less than or equal to the total number of correction combinations. For example, all correction combinations may be tested in turn according to the distance-first principle; for the specific implementation of the trial-and-error test, refer to the related content of step S30, which is not repeated here.
If e is less than or equal to 1 or greater than the preset correction threshold E, the erroneous read data and the error indication signal are returned directly, after which the next-level backup data is read or an interrupt is generated and reported to software for handling, completing the read process.
Referring to the foregoing, the maximum delay RL_max consumed by the data reading method provided by at least one embodiment of the disclosure is as shown in the following formula:
$$RL_{max} = (n-1) + 1 + X = n + X$$
In this case, an a-bit soft error occurs in the first data row and steps S20 to S30 must be performed: reading the n-1 data rows in step S20 consumes n-1 clock cycles, calculating the parity result consumes 1 clock cycle, and the trial-and-error combination test consumes X clock cycles.
The minimum delay RL_min consumed by the data reading method provided by at least one embodiment of the present disclosure is as shown in the following formula:
$$RL_{min} = \begin{cases} 0, & SE \le a-1 \\ n+1, & SE = a \end{cases}$$
Here, SE (soft error) denotes the number of soft error bits in the first data row. When SE is less than or equal to a-1, the extra delay caused by the read operation is 0. When SE is equal to a, the number e of potential error bits is greater than a-1 and less than or equal to the preset correction threshold E, and the a-bit error is corrected in the first clock cycle of the trial-and-error combination test, the extra delay caused by the read operation is n+1: calculating the detection parity vector consumes n-1 clock cycles, comparing the detection parity vector with the reference parity vector bitwise consumes 1 clock cycle, and executing step S30 consumes 1 clock cycle.
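The cycle counts given above can be tallied with a short sketch. It assumes that the depth is n = 2^N for address width N, that each trial-and-error test takes one clock cycle, and that the worst case tests all X = C(e, a) combinations; the function names and the sample values N = 7, e = 3 are illustrative and are not taken from table 3.

```python
from math import comb

def max_extra_read_latency(n, e, a):
    """n-1 cycles to read the other rows, 1 to compare parity, X for all trials."""
    return (n - 1) + 1 + comb(e, a)

def min_extra_read_latency(n, soft_error_bits, a):
    """0 if the error checking method corrects directly; n+1 if the first trial succeeds."""
    if soft_error_bits <= a - 1:
        return 0
    if soft_error_bits == a:
        return (n - 1) + 1 + 1
    raise ValueError("the text defines the minimum delay only for SE <= a")

n = 2 ** 7                                   # depth for address width N = 7
print(max_extra_read_latency(n, e=3, a=2))
print(min_extra_read_latency(n, soft_error_bits=2, a=2))
```

With these sample values the worst case is 131 cycles and the best case for a 2-bit error is 129 cycles, which shows that the row reads dominate the delay.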
The extra delay of the read operation under several common values of the address bit width N and the number of potential error bits e is listed in table 3. For example, in table 3, the error checking method is the SECDED method, that is, the maximum number of error detection bits a that can be detected is 2.
As can be seen from table 3, the read operation latency depends mainly on the address width N of the static random access memory: the smaller N is, the smaller the latency overhead. When the depth of the static random access memory is less than or equal to 128 (N = 7), the read operation latency is below the latency overhead of reading data from the dynamic random access memory (200 cycles), so the scheme provided by the present disclosure is better suited to a static random access memory of modest depth, i.e., one with a small N.
TABLE 3
Figure BDA0003391202640000242
Therefore, the data reading method provided by at least one embodiment of the present disclosure may be applicable to any size of the sram, and when the depth of the sram is smaller, the extra delay time required for correcting the a-bit error is smaller, but the area overhead ratio is relatively high; the larger the depth of the sram, the smaller the area overhead ratio, but the greater the delay in correcting the a-bit error. In chip design, many static random access memories with large storage space are usually divided into multiple small static random access memories to meet the requirements of convenient layout and routing and timing convergence, for example, a 1024x256 static random access memory is divided into four 128x256 blocks (banks), each block has smaller area and higher performance, and is more friendly to layout and routing. Thus, a large block of sram may be partitioned into small sram in a chip design to balance area and latency overhead.
In summary, the data reading method provided by at least one embodiment of the present disclosure has advantages in area overhead, performance, and power consumption, and can correct 2-bit soft errors that are increasingly common at present in real time and is compatible with correction of 1-bit soft errors. When the static random access memory reads data, 2-bit soft error correction is realized, the delay and power consumption expense of data recovery caused by the error of the static random access memory and the failure of correction are reduced, and the system performance is improved. In addition, the storage overhead of the static random access memory for correcting the 2-bit soft error is smaller than that of all other current schemes, so that the chip area and the cost are saved.
Specifically, in terms of area overhead, the data reading method provided by at least one embodiment of the present disclosure is superior to the soft error correction solutions shown in figs. 4A-4D, and the larger the static random access memory, the lower the proportion of area overhead. In terms of performance, the data reading method is also better. For example, when at most a-1 bits are in error, no extra read delay is incurred. When a bits are in error, the read operation delay can be kept within 150 cycles for a static random access memory of small depth (for example, N ≤ 7), and the smaller the depth, the smaller the delay; this is a significant performance advantage over reading the next-level backup data (10 to 200 cycles) or reporting an interrupt for software handling (millions of cycles). For a static random access memory of larger depth, the read delay is larger, but performance is still clearly better than reporting an interrupt for software handling (millions of cycles). In terms of power consumption, the method is also advantageous because the a-bit error is corrected while the data is being read and no other system transfer needs to be initiated.
At least one embodiment of the present disclosure further provides a data writing method. Fig. 9 is a schematic flow chart of a data writing method according to at least one embodiment of the disclosure. As shown in fig. 9, the data writing method provided by the embodiment of the present disclosure includes steps S50 to S60.
For example, the data writing method is used to write data to a data memory. For example, the data memory is configured to be capable of storing n data, each of the n data has a data width of m bits, the n data are sequentially arranged to form a data array of n × m bits, and the data array of n × m bits is correspondingly stored as a storage array of n × m bits in the data memory. For example, the data memory is provided with a first check array including n error check storage rows in one-to-one correspondence with n data for the storage array, and a parity storage row for the storage array and the first check array. For the storage array, the first check array, and the parity storage row in the data storage, reference may be made to related contents of fig. 6, which are not described herein again.
In step S50, first verification data is generated for target data to be written to a target row of the n rows of the memory array based on an error verification method.
For example, the first check data is used to check the target row using an error checking method.
In step S60, a reference parity vector is obtained using a parity method based on the target data.
For example, a reference parity vector is used to parity each of the m columns.
For example, in step S50, when the data write request is received, the data write request is parsed, and the target data included in the data write request and the address of the target row to be written are determined, for example, the data width of the target data is m bits; the first check data of the target data is generated by using an error check method, such as a SECDED method, a DECTED method, and the like, for example, the first check data includes p error check bits, and a relationship between m and p is shown in formula 1, which is not described herein again.
For example, step S60 may include: and reading the current reference check vector stored in the parity check storage row, and carrying out bitwise XOR operation on the current reference check vector, the target data and the first check data to obtain a reference parity check vector.
For example, as shown in fig. 6, the address of the parity storage row is n, m bits of target data and p bits of first check data are spliced into a data row according to the positional relationship shown in fig. 6, and a bitwise xor operation is performed on the data row and the current reference check vector read from the parity storage row, so as to calculate an updated value of the reference parity vector.
For example, the data writing method may further include: writing target data to a target row in the memory array; writing the first check data into an error check storage row corresponding to a target row in the first check array; the updated value of the reference parity vector is written to the parity storage row.
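The write path described above might be modeled as follows. This is a behavioral sketch only: the class name `ParityProtectedMemory` is hypothetical, the bit layout (data in the upper m bits, check bits in the lower p bits) is an assumption because the layout of fig. 6 is not reproduced here, and the first check data is passed in rather than generated by a real encoder. The sketch follows the text literally by XORing the current reference vector with the spliced data row, and it therefore assumes that each target row holds all zeros before it is written.

```python
from functools import reduce
from operator import xor

class ParityProtectedMemory:
    """Storage array plus first check array, with one parity storage row."""

    def __init__(self, n, m, p):
        self.m, self.p = m, p
        self.rows = [0] * n      # each entry packs m data bits and p check bits
        self.parity_row = 0      # reference parity vector, m + p bits

    def write(self, address, data, check_bits):
        # Splice data and first check data into one row (assumed layout).
        row = (data << self.p) | check_bits
        # Update the reference parity vector as described for step S60.
        self.parity_row ^= row
        # Write target data and first check data to the target row.
        self.rows[address] = row

mem = ParityProtectedMemory(n=4, m=8, p=5)
mem.write(0, data=0xA5, check_bits=0b10101)   # check bits are sample values
mem.write(2, data=0x3C, check_bits=0b00011)
print(bin(mem.parity_row))
```

After these writes the parity storage row equals the XOR of all stored rows, which is the property that the read path relies on when it computes the detection parity vector.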
Fig. 10 is a flowchart of a data writing method according to at least one embodiment of the disclosure.
As shown in fig. 10, first, when a data write request is received, first check data is generated, and the specific process is as described in step S50, which is not described herein again.
Thereafter, the operations of reading the parity storage row, writing the target data into the target row in the storage array, and writing the first check data generated in step S50 into the error check storage row corresponding to the target row in the first check array are completed in parallel.
Then, the reference parity check vector is calculated and written into the parity storage row, and the specific process of calculating the reference parity check vector is as described in step S60, which is not described herein again, and this process needs to take 1 additional clock cycle.
Therefore, in the data writing method provided in at least one embodiment of the present disclosure, the extra delay caused by the writing operation is 1 clock cycle, which is the time taken to calculate the reference parity vector and write the reference parity vector to the parity storage row.
Table 4 compares the disclosed solution with the solutions in fig. 4A-4D in terms of storage overhead, access latency, and power consumption. For example, the present disclosure adopts a scheme in which the address width N is 6, the data bit number m is 64, and the error checking method adopts a SECDED method, that is, a is 2. As can be seen from table 4, the scheme of the present disclosure can achieve timely correction of sram 2-bit soft errors with less memory overhead and power consumption, and moderate latency.
TABLE 4
Figure BDA0003391202640000271
The parameters of table 4 are explained here. Taking the solution of fig. 4A as an example, when a 1-bit error occurs and a data backup exists (in the next-level cache or the DRAM), the access delay is 10 to 200 clock cycles; if no data backup exists, an interrupt must be reported and handled by software, which takes milliseconds or longer (millions of clock cycles). A system bus transfer must be initiated to read or recover the data, which consumes considerable power. For the solution shown in fig. 4D, a system bus transfer must be initiated to read data from the dynamic random access memory, which likewise consumes considerable power.
At least one embodiment of the present disclosure also provides a soft error handling system corresponding to the data reading method and the data writing method described above.
FIG. 11 is a schematic block diagram of a soft error handling system according to at least one embodiment of the present disclosure.
As shown in FIG. 11, the soft error handling system 100 includes a data storage 101 and a control circuit 102.
For example, data storage 101 may be a static random access memory.
For example, n data are stored in the data memory 101, each data has a data width of m bits, the n data are sequentially arranged to form a data array of n × m bits, and the data array is correspondingly stored as a storage array in the data memory. For example, the data storage 101 provides the storage array with a first check array, the first check array includes n error check storage rows corresponding to n data one-to-one, each of the n error check storage rows includes p error check bits, and the p error check bits of the ith row of the n error check storage rows are used for storing error check data corresponding to data of the ith row of the data array. For example, the data storage 101 is provided with a parity storage row for the storage array and the first check array, wherein the parity storage row comprises m bits corresponding to m columns of the storage array one to one and p bits corresponding to p columns of the first check array one to one, and the m bits and the p bits are used for storing reference parity vectors having m + p bits corresponding to the m columns of the data array and the p columns of the first check array, respectively. For the specific structure of the data storage, reference may be made to the related content of fig. 6, which is not described herein again.
For example, as shown in fig. 11, the control circuit 102 includes a controller 103 and an error checker 104.
For example, the error checker 104 is configured to perform a first check on m bits of data read from a target row of n rows of the memory array using an error checking method, resulting in a first check result.
For example, when the error checker 104 performs the first check on the m bits of data read from the target row of the n rows of the storage array by using the error checking method to obtain the first check result, it performs the following operations: receiving the m bits of data stored in the target row read from the storage array and the p bits of error check data corresponding to the target row to obtain a first data row; determining, by using the error checking method, whether a bits among the m+p bits of the first data row are in error, where a is the maximum number of error bits that the error checking method can detect; in response to a bits in the first data row being in error, outputting the first data row to the controller 103, where the first check result includes the first data row; and in response to at most a-1 bits in the first data row being in error, correcting the first data row and outputting the correction result to the controller 103, where the first check result includes the correction result.
For example, the controller 103 is configured to: in response to the first check result indicating that the error occurring in the target row cannot be completely corrected, performing a second check on the data of each of the m columns read from the storage array by using a parity check method to obtain a parity check result; combining the first check result and the parity check result to obtain a second check result corresponding to the target row, wherein the number of correction bits of the second check result is greater than the maximum number of error correction bits which can be corrected by the error check method; and taking the second check result as a reading result and outputting the reading result.
For example, when the controller 103 performs the second check on the data of each of the m columns read from the storage array by using the parity check method to obtain the parity check result, the following operations are performed: performing parity check operation on data bits of each of the m columns read from the storage array and the p columns read from the first check array by using a parity check method to obtain a detection parity check vector with m + p bits; reading a reference parity vector from the parity storage row; comparing the detection parity check vector with the reference parity check vector bit-by-bit to determine a plurality of difference bits between the detection parity check vector and the reference parity check vector; a parity result is obtained from the plurality of difference bits.
For example, when the controller 103 combines the first check result and the parity check result to obtain the second check result corresponding to the target row, it performs the following operations: determining, according to the parity check result, e potential error bits in the first data row, where the e potential error bits are the positions of the plurality of difference bits and e is an integer; and, in response to e being greater than a-1 and less than or equal to the preset correction threshold, performing a trial-and-error combination test based on the e potential error bits, the error checking method, and the first data row to obtain the second check result.
For example, when the controller 103 performs the trial-and-error combination test based on the e potential error bits, the error checking method, and the first data row to obtain the second check result, it performs the following operations: determining at least one correction combination, each consisting of a potential error bits selected from the e potential error bits, and performing the trial-and-error combination test on the at least one correction combination. The trial-and-error combination test includes performing a trial-and-error test on each selected correction combination in turn, and the trial-and-error test includes: flipping, in the first data row, the a data bits corresponding to the a potential error bits of the selected correction combination to obtain an intermediate test data row corresponding to the first data row; sending the intermediate test data row to the error checker 104; in response to receiving a first flag signal sent by the error checker 104, stopping the trial-and-error combination test and outputting the second check result sent by the error checker 104; and in response to receiving a second flag signal sent by the error checker 104, performing the trial-and-error test on the next correction combination.
For example, the error checker 104 is further configured to: performing a first check on the intermediate test data line received from the controller by using an error check method, taking the intermediate test data line as a second check result in response to no error in the intermediate test data line, and sending the second check result and a first flag signal to the controller 103, where the first flag signal may be a valid indication signal indicating that the check is successful, for example; responding to the b bit in the middle test data row with an error, correcting the b bit by using an error checking method, taking a correction result as a second checking result, and sending the second checking result and a first mark signal to the controller, wherein b is a positive integer and is less than or equal to a-1; in response to the error of the a bit still existing in the middle test data row, a second flag signal is output to the controller, for example, the second flag signal may be an error indication signal indicating a verification failure.
For example, as shown in fig. 11, the control circuit 102 further includes an error checking code generator 105, and the error checking code generator 105 is configured to generate first checking data for checking the target row by using an error checking method based on the target data to be written to the target row of the n rows of the memory array; the controller 103 is further configured to obtain a reference parity vector by a parity check method according to the target data, wherein the reference parity vector is used for performing parity check on each of the m columns.
For example, the control circuit 102 also includes an enable selector MUX1, an address selector MUX2, a read data selector MUX3, and a write data selector MUX4.
The enable selector MUX1 is configured to input an enable signal determined based on a data write request or a read request, or an enable signal generated by the controller 103, to an enable port of the data memory 101 under the control of the controller 103.
Address selector MUX2 is configured to input an address determined based on a data write request or a read request, or an address generated by controller 103, to an address port of data memory 101 under the control of controller 103.
The read data selector MUX3 is configured to input data received from the read data port of the data memory 101, or an intermediate test data line generated by the controller 103, to the error checker 104 under the control of the controller 103.
The write data selector MUX4 is configured to input the target data and the first check data, or the reference parity vector generated by the controller 103, to a write data port of the data memory 101 under the control of the controller 103.
In the soft error processing system provided by at least one embodiment of the present disclosure, a parity storage row is provided in the data memory, so that the positions of a-bit soft errors are located with little storage overhead; a peripheral hardware circuit of the data memory, namely the control circuit, processes the potential error bits by traversal, which ensures that a-bit soft errors are corrected in time, reduces the probability of system errors, and improves system stability; moreover, the soft error processing system is fully compatible with existing schemes, such as the interface and error reporting mechanisms of a static random access memory, so its interface definition does not need to be changed, the required hardware and software changes are small, and the soft error processing system is modular and reconfigurable.
Fig. 12 is a block diagram of a soft error processing system according to at least one embodiment of the present disclosure.
As shown in fig. 12, the soft error processing system includes a data memory 101 and a peripheral hardware circuit, namely a control circuit 102. The control circuit 102 controls the soft error correction process and works together with the improved data memory 101 to correct a-bit soft errors with low hardware overhead and low latency overhead.
The interface of the control circuit 102 is identical to that of the read-write control circuit of a standard SRAM, so the system can replace the SRAM in an existing device in place without modifying other circuit logic.
The control circuit 102 includes a controller 103, and the controller 103 is a core of the soft error processing system and is responsible for generating a first read address, a first write address, a first read enable signal, a first write enable signal, and the like, controlling selection of the enable selector MUX1, the address selector MUX2, the read data selector MUX3, and the write data selector MUX4, and controlling a data correction processing flow.
The control circuit 102 further comprises an error checker 104, for example, the error checker 104 is a SECDED checker, for performing single-error-correction double-error detection on the read data line or the intermediate test data line, and outputting a check result and an indication signal, for example, the indication signal comprises a first flag signal and a second flag signal.
For example, the control circuit 102 further includes an error check code generator 105 for generating first check data of the target data to be written, which is the same as a conventional flow of generating a check code of the error check method employed.
For example, the first read address, the first write address, the first read enable signal, and the first write enable signal are generated by the controller 103, and the second read address, the second write address, the second read enable signal, and the second write enable signal are determined by a read request or a write request inputted from the outside.
For example, when the parity storage row is read, the controller 103 generates a first read address (e.g., n when the data memory is structured as shown in fig. 6) and a first read enable signal, and the reference parity vector is read from the data memory; when the detection parity vector is calculated, the controller 103 generates first read addresses and first read enable signals to sequentially read, from the data memory, the n-1 data rows other than the target row together with their corresponding error check data; when the parity storage row is updated, the controller 103 generates a first write address (e.g., n) and a first write enable signal, and the reference parity vector is written to the data memory; when the target data is written, a second write address (e.g., the address of the target row) and a second write enable signal, determined from the write request, are used to write a second data row to the data memory, the second data row including the target data and the first check data corresponding to the target data.
For example, when a data read request is received, the controller 103 controls the address selector MUX2 and the enable selector MUX1 to select a second read address and a second read enable, respectively, which are determined based on the data read request. For example, when a data write request is received, the controller 103 controls the address selector MUX2 and the enable selector MUX1 to select a second write address and a second write enable, respectively, which are determined based on the data write request. For example, when the data correction related flow is performed, the controller 103 controls the address selector MUX2 and the enable selector MUX1 to select the first read address and the first read enable, respectively.
For example, when a data read request is received, read data selector MUX3 selects the read data line to be fed into error checker 104 for error checking, and when a trial and error combination test is performed, read data selector MUX3 selects the intermediate test data line to be fed into error checker 104 for error checking.
For example, when a data write request is received, the write data selector MUX4 selects the second data row to be written to the data memory, and when the parity memory row is updated, the write data selector MUX4 selects the reference parity vector row to be written to the data memory.
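The selector behavior described in the preceding paragraphs can be summarized as a lookup table. The sketch below is only an illustrative model: the `Mode` names and the `mux_select` helper are hypothetical, and an entry of `None` marks a selector whose setting is not stated in this description for that operation.

```python
from enum import Enum

class Mode(Enum):
    READ_REQUEST = "read request"
    WRITE_REQUEST = "write request"
    CORRECTION_READ = "correction-flow read"
    TRIAL_TEST = "trial-and-error test"
    PARITY_UPDATE = "parity storage row update"

EXTERNAL = "second signal (from external request)"
CONTROLLER = "first signal (generated by controller 103)"

# One row per operation; None means "not stated in the description".
MUX_TABLE = {
    Mode.READ_REQUEST:    {"MUX1": EXTERNAL,   "MUX2": EXTERNAL,
                           "MUX3": "read data port", "MUX4": None},
    Mode.WRITE_REQUEST:   {"MUX1": EXTERNAL,   "MUX2": EXTERNAL,
                           "MUX3": None, "MUX4": "second data row"},
    Mode.CORRECTION_READ: {"MUX1": CONTROLLER, "MUX2": CONTROLLER,
                           "MUX3": None, "MUX4": None},
    Mode.TRIAL_TEST:      {"MUX1": None,       "MUX2": None,
                           "MUX3": "intermediate test data row",
                           "MUX4": None},
    Mode.PARITY_UPDATE:   {"MUX1": CONTROLLER, "MUX2": CONTROLLER,
                           "MUX3": None,
                           "MUX4": "reference parity vector"},
}

def mux_select(mode):
    """Return the selector settings for one operation of the controller."""
    return MUX_TABLE[mode]

for mode in Mode:
    print(mode.value, mux_select(mode))
```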
The following describes a processing procedure of the soft error processing system according to at least one embodiment of the present disclosure with reference to fig. 12.
For example, when receiving a data read request, the controller 103 controls the address selector MUX2 and the enable selector MUX1 to select a second read address (e.g., an address of a target row) and a second read enable, respectively, to read m-bit data and p-bit error check data in the target row in the data memory, the m-bit data and the p-bit error check data forming a first data row according to the storage structure shown in fig. 6, and the controller 103 controls the read data selector MUX3 to send the first data row output from the read data port to the error checker 104 for error checking.
If the error checker 104 determines that at most a-1 bits in the first data row have errors in the error checking process, for example, the first data row has no errors, a first flag signal (for example, 0) is output to the controller 103 to indicate that no data bit error exists, the error checker 104 sends the first data row as a checking result to the controller 103, the controller 103 immediately forwards and outputs the first data row as read data, and sets the error indication signal to 0, sets the valid flag signal to 1, and ends the reading operation of the current round; for example, a 1-bit to a-1-bit error exists in the first data row, a first flag signal (e.g., 1) is output to the controller 103 to indicate that the error exists but is corrected, the error checker 104 performs a correction process on the first data row, outputs the correction result as a check result to the controller 103, and the controller 103 then forwards and outputs the correction result as read data, sets the error indication signal to 0, sets the valid flag signal to 1, and ends the reading operation of the current round.
If the error checker 104 determines during the error check that a bits in the first data row are erroneous, the error check method cannot correct the data to obtain correct data; the error checker 104 sends the first data row as the check result to the controller 103 and outputs a second flag signal (e.g., 2) to the controller 103, indicating that an a-bit data error exists and cannot be corrected, so the a-bit error correction procedure needs to be started. The specific procedure is as follows.
First, the controller 103 sequentially reads n-1 data rows in the data memory except for the target row (at this time, the first data row in the target row is input via the branch of the read data port and is stored in the controller 103, so that reading is not needed any more), each data row includes m-bit data and p-bit error check data corresponding to the m-bit data, and the controller 103 calculates the detection parity check vector according to formula 3, which is not described herein again.
Then, the controller 103 reads the reference parity vector stored in the parity storage row, for example, the first read address is n at this time, and compares the detection parity vector with the reference parity vector according to bits to obtain a parity result, and the specific process is as described above and is not described herein again.
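The column parity comparison in the two steps above can be sketched as follows. Formula 3 is not reproduced in this passage, so the sketch assumes that the detection parity vector is the bitwise XOR of all n stored rows, with the already-read first data row folded in alongside the n-1 other rows. The helper names `xor_rows` and `potential_error_bits` are hypothetical, and rows are modeled as lists of bits so that index 0 is the leftmost column.

```python
def xor_rows(rows):
    """Column-wise XOR of equally long bit lists."""
    acc = [0] * len(rows[0])
    for row in rows:
        acc = [x ^ y for x, y in zip(acc, row)]
    return acc

def potential_error_bits(stored_rows, reference_parity):
    """Columns where the detection parity differs from the reference."""
    detection = xor_rows(stored_rows)          # assumed form of formula 3
    parity_result = [d ^ r for d, r in zip(detection, reference_parity)]
    return [col for col, bit in enumerate(parity_result) if bit]

# Small 4-row, 8-column array used purely for illustration.
golden = [
    [1, 0, 1, 1, 0, 0, 1, 0],
    [0, 1, 1, 0, 1, 0, 0, 1],
    [1, 1, 0, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 1, 0, 1],
]
reference = xor_rows(golden)        # what the parity storage row holds

stored = [row[:] for row in golden]
stored[1][4] ^= 1                   # two soft errors in the same row
stored[1][5] ^= 1
print(potential_error_bits(stored, reference))
```

A column is flagged only when it holds an odd number of flipped cells, so the result marks candidate positions and not confirmed errors in the target row.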
Then, the controller 103 determines, according to the parity check result, that e potential error bits exist in the first data row. If e is less than or equal to a-1 or greater than a preset correction threshold, the data cannot be corrected; the controller 103 returns the erroneous data, sets the error indication signal to 1 and sets the valid flag signal to 0. In this case, backup data of the next level needs to be read, or an interrupt needs to be generated and reported for software handling, which is not described herein again.
If e is greater than a-1 and less than or equal to the preset correction threshold, performing a trial-and-error combination test, for example, a distance-first traversal mode may be adopted in the trial-and-error combination test process, and a correction combination with a smaller data bit distance is preferentially selected for traversal, and the detailed trial-and-error test process refers to the content described above and is not repeated here.
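A distance-first ordering of the correction combinations can be sketched as below. The function name `distance_first` is hypothetical, and the "data bit distance" of a combination is assumed to be the span between its lowest and highest bit position; for a = 2 this is simply the gap between the two bits.

```python
from itertools import combinations

def distance_first(potential_bits, a=2):
    """Order all a-bit combinations so that the closest bits come first.

    Ties on distance are broken by bit position, lowest first.
    """
    combos = combinations(sorted(potential_bits), a)
    return sorted(combos, key=lambda c: (c[-1] - c[0], c))

# Four potential error bits give six 2-bit correction combinations.
print(distance_first([2, 3, 4, 5]))
# -> [(2, 3), (3, 4), (4, 5), (2, 4), (3, 5), (2, 5)]
```

Trying adjacent bits first reflects the expectation that a single particle strike tends to disturb neighboring storage cells.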
If the intermediate test data row corresponding to any one of the correction combinations is input into the error checker 104 for checking, the indication signal output by the error checker 104 is the first flag signal (e.g., 0 or 1), which indicates that correct data is obtained, the controller 103 forwards and outputs the check result output by the error checker 104 and uses the check result as read data, sets the error indication signal to 0, sets the valid flag signal to 1, and ends the reading operation of the current round.
If, when the intermediate test data row corresponding to a correction combination is input into the error checker 104 for checking, the indication signal output by the error checker 104 is the second flag signal (e.g., 2), correct data has not been obtained, and the controller 103 continues with the trial-and-error test on the next correction combination.
If correct data has still not been obtained after all X correction combinations have been traversed, the check fails; the controller 103 returns the erroneous data, sets the error indication signal to 1, and sets the valid flag signal to 0. In this case, backup data of the next level needs to be read, or an interrupt needs to be generated and reported for software handling, which is not described herein again.
For example, when a data write request is received, first, the data write request is analyzed to obtain target data and a target row, the target data is input into the error checking code generator 105 to generate first checking data, and the error checking code generator 105 splices the first checking data and the target data according to the storage structure shown in fig. 6 to form a second data row.
Thereafter, the controller 103 controls the write data selector MUX4 to write the second data line to the target line in the data memory, the error checking code generator also sending the second data line to the controller 103; the controller 103 reads the current parity vector stored in the parity storage row, when the first read address is n.
Thereafter, the controller 103 performs an exclusive or operation on the second data line and the current parity vector to obtain a reference parity vector, and controls the write data selector MUX4 to write the reference parity vector into the parity storage line in the data memory, where the first write address is n.
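The write flow above can be sketched as follows, following the text literally. The helper name `write_row` and the list-based memory model are hypothetical, with each row held as a Python integer and the parity storage row at address n. This passage does not state whether the previous contents of the target row are folded out of the parity vector beforehand, so the example assumes that every target row is all zeros before it is written.

```python
def write_row(memory, parity_addr, target, second_row):
    """Write one second data row and update the parity storage row."""
    memory[target] = second_row               # MUX4 selects second data row
    current_parity = memory[parity_addr]      # first read address = n
    reference_parity = current_parity ^ second_row
    memory[parity_addr] = reference_parity    # first write address = n

n = 4                                         # data rows 0..3, parity at 4
memory = [0] * (n + 1)                        # assumption: array starts zeroed
write_row(memory, n, 0, 0b1010)
write_row(memory, n, 2, 0b0110)
print(bin(memory[n]))
```

Under that assumption the parity storage row always equals the XOR of all data rows, which is the property the read-side comparison depends on.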
Figs. 13A-13E are schematic diagrams of soft errors provided by at least one embodiment of the present disclosure. In these embodiments, each row of the memory array in the data memory stores 64 bits of data together with the 8 bits of error check data corresponding to those 64 bits, so each data row includes 72 bits. For example, in these embodiments the error check method is the SECDED method, that is, a is 2, although the present disclosure is not limited thereto.
The following takes the soft errors shown in fig. 13A-13E as an example, and combines with the soft error handling system shown in fig. 12 to specifically describe the processing procedure under various soft error conditions.
For example, fig. 13A shows a soft error caused by vertical radioactive particles. As shown in FIG. 13A, assuming that a soft error of 3 bits occurs in the local vertical direction, the 3 bits are distributed in the same column (column number 5) in 2-4 rows of the memory array.
As shown in fig. 13A, the address of the target row is 3, the 64-bit data and the 8-bit error check data in the address 3 are read from the data memory 101, the first check is performed on the first data row composed of the 64-bit data and the 8-bit error check data, since only one bit of the first data row has an error, the correction can be performed directly, and the correction result is sent to the controller 103 as a check result, and an indication signal of 1 indicates that correct data is obtained. When monitoring that the indication signal is 1, the controller 103 immediately outputs the verification result output from the error checker 104 as read data, sets the error indication signal to 0, sets the valid flag signal to 1, and ends the reading operation of the current round.
The reading process has no extra delay consumption, and has no extra influence on the performance and the power consumption of a chip system.
It should be noted that, in this embodiment, a description is given by taking an example that soft errors occur in three consecutive storage units in the vertical direction, and of course, the soft errors may also be extended to soft errors in storage units at any position in the vertical direction, and the processing manners are completely the same, which is not described herein again.
For example, fig. 13B shows a soft error caused by a diagonal radioactive particle. As shown in FIG. 13B, assuming a soft error of 3 bits occurs in the local diagonal direction, the 3 bits are distributed over different columns (4,5, 6) of 2-4 rows of the memory array.
As shown in fig. 13B, the address of the target row is 3, the 64-bit data and the 8-bit error check data in the address 3 are read from the data memory 101, the first check is performed on the first data row composed of the 64-bit data and the 8-bit error check data, since only one bit of the first data row has an error, the correction can be performed directly, and the correction result is sent to the controller 103 as a check result, and an indication signal of 1 indicates that correct data is obtained. When monitoring that the indication signal is 1, the controller 103 immediately outputs the verification result output from the error checker 104 as read data, sets the error indication signal to 0, sets the valid flag signal to 1, and ends the reading operation of the current round.
Although there are 3 potential error bits (4,5, 6 columns) in this embodiment, the correction process for 2-bit errors is not initiated because there are only 1-bit soft errors in the read data row. The reading process has no extra delay consumption, and has no extra influence on the performance and the power consumption of a chip system.
It should be noted that, in this embodiment, a description is given by taking an example that soft errors occur in three storage units that are continuous in a diagonal direction, and of course, the soft errors may also be extended to soft errors in storage units at any position in the diagonal direction, and the processing manners are completely the same, which is not described herein again.
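Why the vertical and diagonal cases never start the 2-bit correction flow can be checked by counting soft errors per row, as in the sketch below. The helper name `rows_needing_correction_flow` is hypothetical, and the assignment of columns 4, 5 and 6 to rows 2, 3 and 4 in the diagonal case is an assumption made for illustration, since the text gives only the sets of rows and columns.

```python
from collections import Counter

def rows_needing_correction_flow(error_cells, a=2):
    """Rows holding at least a soft errors, given (row, column) cells."""
    per_row = Counter(row for row, _ in error_cells)
    return sorted(row for row, count in per_row.items() if count >= a)

vertical = [(2, 5), (3, 5), (4, 5)]     # same column, rows 2-4
diagonal = [(2, 4), (3, 5), (4, 6)]     # assumed cell placement
horizontal = [(3, 4), (3, 5)]           # two errors in one row

print(rows_needing_correction_flow(vertical))
print(rows_needing_correction_flow(diagonal))
print(rows_needing_correction_flow(horizontal))
```

Only a pattern that places a or more errors in one row leaves that row beyond the direct reach of the error checker; with a = 2, each row in the vertical and diagonal patterns holds a single error and is corrected directly.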
For example, fig. 13C shows a soft error caused by radioactive particles in the horizontal direction. As shown in fig. 13C, assuming that a soft error of 2 bits occurs in the local horizontal direction, the 2 bits are distributed on different columns (4,5) of the same row.
As shown in fig. 13C, when the address of the target row is 3, first, 64-bit data and 8-bit error check data in the address 3 are read from the data memory 101, a first check is performed on a first data row composed of the 64-bit data and the 8-bit error check data, and since 2-bit errors occur in the first data row, the error checker 104 cannot directly perform correction at this time, and the error checker 104 transmits the first data row to the controller 103 as a check result and outputs an indication signal with a value of 2 to the controller 103.
When the controller 103 detects that the indication signal is 2, which indicates that a 2-bit soft error occurs in the first data line being read, it controls to start the 2-bit error correction process.
First, the controller 103 sequentially reads the rows 0, 1, 2,4 to 63 except for the address 3 in the data memory 101, and performs an accumulated exclusive-or operation on the read data rows to obtain a detection parity vector.
Then, the controller 103 reads the reference parity vector stored in the parity storage row, and performs an exclusive or operation on the reference parity vector and the detection parity vector to obtain a parity result of 72'b00001100...00.
Then, the controller 103 finds that two bits in the parity result are 1, that is, there are 2 potential error bits, that is, bits 4 and 5, in the first data row, indicating that there are errors in the two columns, and then inverts the bits 4 and 5 in the first data row, for example, bits 4 and 5 in the first data row are 1 and 0, respectively, and bits 4 and 5 after inversion become 0 and 1, respectively, so as to obtain an intermediate test data row.
Then, the controller 103 sends the inverted intermediate test data row to the error checker 104 for checking, and at this time, the check passes, and the indication signal output by the error checker 104 is 0.
After that, the controller 103 detects that the indication signal is 0, indicating that correct data has been obtained, immediately outputs the verification result output from the error checker 104 as read data, sets the error indication signal to 0, sets the valid flag signal to 1, and ends the present round of reading operation.
This read process introduces a latency penalty of 63 + 1 + 1 = 65 clock cycles.
It should be noted that, in this embodiment, a soft error of 2 consecutive bits in the horizontal direction is taken as an example for description, and of course, the soft error may also be extended to a soft error of 2 bits at any position in the horizontal direction, and the processing manners are completely the same, and are not described here again.
For example, fig. 13D shows a soft error caused by radioactive particles in the horizontal direction. As shown in fig. 13D, it is assumed that a local region has a soft error of 4 bits, and although this is rare, this embodiment shows that the soft error processing system provided by the present disclosure can handle similar problems.
As shown in fig. 13D, when the address of the target row is 3, first, 64-bit data and 8-bit error check data in the address 3 are read from the data memory 101, a first check is performed on a first data row composed of the 64-bit data and the 8-bit error check data, and since 2-bit errors occur in the first data row, the error checker 104 cannot directly perform correction at this time, and the error checker 104 transmits the first data row to the controller 103 as a check result and outputs an indication signal with a value of 2 to the controller 103.
When the controller 103 detects that the indication signal is 2, which indicates that a 2-bit soft error occurs in the first data line being read, it controls to start the 2-bit error correction process.
First, the controller 103 sequentially reads the rows 0, 1, 2,4 to 63 except for the address 3 in the data memory 101, and performs an accumulated exclusive-or operation on the read data rows to obtain a detection parity vector.
Then, the controller 103 reads the reference parity vector stored in the parity storage row, and performs an exclusive or operation on the reference parity vector and the detection parity vector to obtain a parity result of 72'b00111100...00.
Thereafter, the controller 103 finds that four bits of the parity result are 1, that is, there are 4 potential error bits in the first data row, namely bits 2, 3, 4 and 5, indicating that errors exist in these columns. However, according to the first check result, only 2 bits in the first data row are erroneous, so the 6 correction combinations formed by any two of bits 2, 3, 4 and 5 need to be traversed in a trial-and-error combination test. According to the distance-first traversal principle, the traversal order of the six correction combinations is (2,3), (3,4), (4,5), (2,4), (3,5) and (2,5), so the traversal takes at most 6 clock cycles. The trial-and-error combination test is described in detail below:
for the correction combination made of bits 2 and 3: turning over bits 2 and 3 in the first data line, for example, bits 2 and 3 in the first data line are respectively 1 and 0, and bits 2 and 3 after turning over are respectively 0 and 1, thereby obtaining an intermediate test data line; because the actual error bits are 2 and 4, after flipping bits 2 and 3, bit 2 becomes the correct data, but bit 3 now becomes the erroneous data, plus bit 4, which was originally erroneous, and there are still 2 erroneous bits in the middle test data line. Therefore, the controller 103 sends the intermediate test data line to the error checker 104 for checking, and the output check result signal is still 2 (indicating that there are two bit errors uncorrectable). The controller 103 monitors the indication signal from the error checker 104 as 2, indicating that the check result is not correct data, and then continues to try the next correction combination.
For the correction combination made of bits 3 and 4: similar to the above process, when there are 2 bit errors (bits 2 and 3) in the resulting intermediate test data line, the indication signal of the error checker 104 is still 2. The controller 103 monitors the indication signal from the error checker 104 as 2, indicating that the check result is not correct data, and then continues to try the next correction combination.
For the correction combination made of bits 4 and 5: similar to the above process, when there are 2 bit errors (bits 2 and 5) in the resulting intermediate test data line, the indication signal of the error checker 104 is still 2. The controller 103 monitors the indication signal from the error checker 104 as 2, indicating that the check result is not correct data, and then continues to try the next correction combination.
For the correction combination made of bits 2 and 4: similar to the above process, since the bits in the first data row where soft errors occur are exactly bits 2 and 4, the resulting intermediate test data row is correct data, and the indication signal of the error checker 104 is 0. The controller 103 monitors that the indication signal from the error checker 104 is 0, which indicates that the check result output by the error checker 104 is correct data, and then stops the calibration process and does not try any other combinations that have not been traversed.
Thereafter, the controller 103 outputs the verification result output from the error checker 104 as read data, sets the error indication signal to 0, sets the valid flag signal to 1, and ends the present round of reading operation.
This read process introduces a latency penalty of 63 + 1 + 4 = 68 clock cycles; with the distance-first traversal, the correct combination is found on the fourth of the six candidates, which saves 2 clock cycles relative to a full traversal.
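The latency figures quoted for the two examples follow one pattern, sketched below. The function name `extra_read_latency` is hypothetical, and the breakdown into one cycle per non-target row read, one cycle for the parity storage row read and one cycle per trial-and-error test is an interpretation of the sums given in the text, not a stated timing specification.

```python
def extra_read_latency(n_rows, trials):
    """Extra clock cycles spent by the a-bit error correction flow."""
    other_row_reads = n_rows - 1    # every data row except the target row
    parity_row_read = 1             # the parity storage row
    return other_row_reads + parity_row_read + trials

print(extra_read_latency(64, 1))    # one combination tried
print(extra_read_latency(64, 4))    # correct combination found on 4th try
print(extra_read_latency(64, 6))    # all six combinations tried
```

The row reads dominate the cost, so the traversal order only affects the last few cycles of the total.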
It should be noted that, in this embodiment, soft errors in non-target rows give rise to additional potential error bits; the number of soft errors in non-target rows may of course be larger still, as long as the number of potential error bits is not greater than the preset correction threshold E.
For example, FIG. 13E is an example of soft error handling system correction provided by an embodiment of the present disclosure.
As shown in fig. 13E, when the address of the target row is 3, the first data row is first read from the data memory 101 and the first check is performed, and since 2 bits in the first data row have errors, the error checker 104 cannot directly perform the correction at this time, and the error checker 104 sends the first data row as a check result to the controller 103 and outputs an indication signal with a value of 2 to the controller 103.
When the controller 103 detects that the indication signal is 2, which indicates that a 2-bit soft error occurs in the first data line being read, it controls to start the 2-bit error correction process.
First, the controller 103 sequentially reads the rows 0, 1, 2,4 to 63 except for the address 3 in the data memory 101, and performs an accumulated exclusive-or operation on the read data rows to obtain a detection parity vector.
Then, the controller 103 reads the reference parity vector stored in the parity storage row, and performs an exclusive or operation on the reference parity vector and the detection parity vector to obtain a parity result of 72'b001110111100.
Then, the controller 103 monitors that 7 bits in the parity check result are 1, that is, 7 potential error bits exist in the first data row, and if the preset correction threshold is 6, the controller 103 cannot process the data at this time, returns error data, sets the error indication signal to 1, sets the valid flag signal to 0, and ends the reading operation of the current round. Of course, if the preset correction threshold is set to 7, the soft error correction can still be performed by referring to the process described above, and the detailed process is not described again. That is to say, the preset correction threshold may be adjusted according to performance requirements, the larger the preset correction threshold is, the larger the delay spent in traversing the correction combination is, and in practice, a suitable preset correction threshold may be selected according to specific application requirements to balance the stability and delay performance of the data storage.
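The trade-off governed by the preset correction threshold can be made concrete by counting the correction combinations to traverse in the worst case. The sketch below is illustrative only: the function name `correction_plan` is hypothetical, it returns `None` when the controller would give up, and it assumes one trial-and-error test per combination.

```python
from math import comb

def correction_plan(e, a=2, threshold=6):
    """Worst-case number of trial-and-error tests, or None if not attempted."""
    if e <= a - 1 or e > threshold:
        return None                 # controller returns erroneous data
    return comb(e, a)               # choose a of the e potential error bits

print(correction_plan(4))                 # 4 potential error bits
print(correction_plan(7))                 # exceeds a threshold of 6
print(correction_plan(7, threshold=7))    # allowed with a larger threshold
```

Raising the threshold from 6 to 7 lets the seven-bit case proceed, at the price of up to 21 tests where the four-bit case needs at most 6.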
Corresponding to the data reading method, at least one embodiment of the present disclosure further provides a data reading apparatus, and fig. 14 is a schematic block diagram of the data reading apparatus provided in at least one embodiment of the present disclosure.
For example, the data reading apparatus is applied to a data storage, and for the related content of the data storage, reference may be made to the related description of the data storage in the foregoing data reading method, and repeated descriptions are omitted.
As shown in fig. 14, the data reading apparatus 200 includes: a first verification unit 201, a second verification unit 202, a correction unit 203 and an output unit 204.
A first verifying unit 201 configured to perform a first verification on m bits of data read from a target row of n rows of the memory array by using an error checking method to obtain a first verification result;
a second check unit 202, configured to perform a second check on the data of each of the m columns read from the storage array by using a parity check method in response to the first check result indicating that the error occurring in the target row cannot be completely corrected, and obtain a parity check result;
a correcting unit 203 configured to combine the first check result and the parity check result to obtain a second check result corresponding to the target row, where a number of correction bits of the second check result is greater than a maximum number of error correction bits that can be corrected by the error checking method;
an output unit 204 configured to output the second check result as a read result.
For example, the first verification unit 201, the second verification unit 202, the correction unit 203, and the output unit 204 may be dedicated hardware devices for implementing some or all of the functions of the first verification unit 201, the second verification unit 202, the correction unit 203, and the output unit 204 as described above. For example, the first verification unit 201, the second verification unit 202, the correction unit 203, and the output unit 204 may be one circuit board or a combination of a plurality of circuit boards for implementing the functions as described above. In the embodiment of the present application, the one or a combination of a plurality of circuit boards may include: (1) one or more processors; (2) one or more non-transitory memories connected to the processor; and (3) firmware stored in the memory executable by the processor.
It should be noted that the first verification unit 201 is configured to implement step S10 shown in fig. 5, the second verification unit 202 is configured to implement step S20 shown in fig. 5, the correction unit 203 is configured to implement step S30 shown in fig. 5, and the output unit 204 is configured to implement step S40 shown in fig. 5. Thus, for the specific description of the first verification unit 201, reference may be made to the description related to step S10 shown in fig. 5 in the embodiment of the data reading method; for the specific description of the second verification unit 202, reference may be made to the description related to step S20 shown in fig. 5; for the specific description of the correction unit 203, reference may be made to the description related to step S30 shown in fig. 5; and for the specific description of the output unit 204, reference may be made to the description related to step S40 shown in fig. 5. In addition, the data reading apparatus can achieve technical effects similar to those of the data reading method, which are not described herein again.
Corresponding to the data writing method, at least one embodiment of the present disclosure further provides a data writing device, and fig. 15 is a schematic block diagram of a data writing device according to at least one embodiment of the present disclosure.
For example, the data writing device is used to write data into the data memory. For details of the data memory, reference may be made to the description of the data memory in the foregoing data reading method; repeated descriptions are omitted.
For example, as shown in fig. 15, the data writing device 300 includes: a first check data generation unit 301 and a second check data generation unit 302.
The first check data generation unit 301 is configured to generate, based on an error checking method, first check data for data to be written to a target row of the n rows of the storage array, wherein the first check data is used for checking the target row by using the error checking method;
the second check data generation unit 302 is configured to obtain a reference parity check vector by using a parity check method according to the data to be written to the target row, wherein the reference parity check vector is used for performing parity check on each of the m columns.
For example, the first check data generation unit 301 and the second check data generation unit 302 may be dedicated hardware devices that implement some or all of the functions of these units described above. For example, they may be one circuit board or a combination of a plurality of circuit boards for implementing the functions described above. In the embodiments of the present disclosure, the one circuit board or the combination of a plurality of circuit boards may include: (1) one or more processors; (2) one or more non-transitory memories connected to the processors; and (3) firmware stored in the memories and executable by the processors.
It should be noted that the first check data generation unit 301 is configured to implement step S50 shown in fig. 9, and the second check data generation unit 302 is configured to implement step S60 shown in fig. 9. Thus, for a description of the first check data generation unit 301, reference may be made to the description of step S50 shown in fig. 9 in the embodiment of the data writing method described above, and for a description of the second check data generation unit 302, reference may be made to the description of step S60 shown in fig. 9 in the embodiment of the data writing method described above. In addition, the data writing device can achieve technical effects similar to those of the data writing method, which are not described herein again.
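As a rough illustration of the write path (steps S50 and S60), the following Python sketch stores a row and folds it into the reference parity check vector with a bitwise XOR, in the manner recited in claim 16. It is a minimal sketch under stated assumptions: rows are modeled as integers, the helper name `write_row` is hypothetical, the first check data is passed in as an arbitrary placeholder rather than computed by a real error checking code, and the target row is assumed to be all zeros before the write.

```python
def write_row(storage, check_array, parity_row, index, target_data, check_data, p):
    """Write one row and return the updated reference parity check vector.

    Assumption: the target row and its check bits were all zero before the
    write, so XOR-ing the new contents into the current vector is enough.
    Overwriting a non-zero row would also require XOR-ing out the old contents.
    """
    storage[index] = target_data            # target data (m bits)
    check_array[index] = check_data         # first check data (p bits)
    new_row = (target_data << p) | check_data
    return parity_row ^ new_row             # bitwise XOR update


# Empty 3-row memory with m = 4 data bits and p = 2 check bits per row.
storage, check_array, parity_row = [0] * 3, [0] * 3, 0
writes = [(0b1011, 0b01), (0b0110, 0b11), (0b1111, 0b10)]  # placeholder check data
for i, (data, check) in enumerate(writes):
    parity_row = write_row(storage, check_array, parity_row, i, data, check, p=2)

print(format(parity_row, "06b"))  # 001000
```

Each bit of the resulting vector is the even parity of one column across all rows, which is what the read path later recomputes and compares.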
For the present disclosure, there are also the following points to be explained:
(1) The drawings of the embodiments of the present disclosure relate only to the structures involved in the embodiments of the present disclosure; for other structures, reference may be made to common designs.
(2) Thicknesses and dimensions of layers or structures may be exaggerated in the drawings used to describe embodiments of the present disclosure for clarity. It will be understood that when an element such as a layer, film, region, or substrate is referred to as being "on" or "under" another element, it can be "directly on" or "under" the other element or intervening elements may be present.
(3) Without conflict, embodiments of the present disclosure and features of the embodiments may be combined with each other to arrive at new embodiments.
The above description is only for the specific embodiments of the present disclosure, but the scope of the present disclosure is not limited thereto, and the scope of the present disclosure should be subject to the scope of the claims.

Claims (27)

1. A data reading method is applied to a data memory, wherein n data are stored in the data memory, the data width of each data is m bits, the n data are sequentially arranged to form a data array with n x m bits, and the data array is correspondingly stored in the data memory as a storage array,
the data reading method comprises the following steps:
performing a first check on m bits of data read from a target row of the n rows of the storage array by using an error checking method to obtain a first check result;
in response to the first check result indicating that an error that cannot be completely corrected has occurred in the target row,
performing a second check on the data of each of the m columns read from the storage array by using a parity check method to obtain a parity check result;
combining the first check result and the parity check result to obtain a second check result corresponding to the target row, wherein the number of bits corrected in the second check result is greater than the maximum number of error bits that can be corrected by the error checking method;
taking the second check result as a read result,
wherein m and n are both positive integers.
2. A data reading method according to claim 1, wherein the data memory is provided with a first check array for the storage array, the first check array includes n error check storage rows corresponding to the n data one to one, each of the n error check storage rows includes p error check bits, and the p error check bits of the ith row of the n error check storage rows are used for storing error check data corresponding to the data of the ith row of the data array;
the data memory is provided with a parity storage row for the storage array and the first check array, wherein the parity storage row comprises m bits corresponding to m columns of the storage array and p bits corresponding to p columns of the first check array, the m bits and the p bits being used for storing reference parity check vectors having m + p bits corresponding to m columns of the data array and p columns of the first check array, respectively,
wherein p and i are both positive integers.
3. The data reading method of claim 2, wherein performing the first check on m bits of data read from the target row of the n rows of the storage array by using the error checking method to obtain the first check result comprises:
reading m bits of data stored in the target row and p bits of error check data corresponding to the target row to obtain a first data row;
and performing the first check on the first data row by using the error checking method to obtain the first check result.
4. A data reading method according to claim 3, wherein performing a second check on the data of each of the m columns read from the storage array by using a parity check method to obtain a parity check result comprises:
performing parity operation on data bits of each of the m columns read from the storage array and the p columns read from the first check array by using a parity method to obtain a detection parity vector with m + p bits;
comparing the detection parity vector with the reference parity vector bitwise to determine a plurality of difference bits between the detection parity vector and the reference parity vector;
and obtaining the parity check result according to the plurality of difference bits.
5. A data reading method according to claim 4, wherein comparing the detection parity vector with the reference parity vector bit-wise comprises:
performing a bitwise XOR of the detection parity vector and the reference parity vector.
6. A data reading method according to claim 4, wherein, when the first check result indicates that the error occurring in the target row cannot be completely corrected, the first check result comprises the first data row,
and combining the first check result and the parity check result to obtain the second check result corresponding to the target row comprises:
determining, according to the parity check result, e potential error bits in the first data row at which errors may exist, wherein the positions of the e potential error bits are the positions of the plurality of difference bits, and e is an integer;
and in response to e being greater than a-1 and less than or equal to a preset correction threshold, performing a trial-and-error combination test by combining the e potential error bits, the error checking method, and the first data row to obtain the second check result, wherein a is the maximum number of error bits that can be detected by the error checking method.
7. A data reading method according to claim 6, wherein performing the trial-and-error combination test by combining the e potential error bits, the error checking method, and the first data row comprises:
determining at least one correction combination, each consisting of a of the e potential error bits, and performing the trial-and-error combination test on the at least one correction combination;
wherein the trial-and-error combination test comprises performing a trial-and-error test on each selected correction combination in sequence, the trial-and-error test comprising:
flipping, in the first data row, the a data bits corresponding to the a potential error bits included in the selected correction combination to obtain an intermediate test data row corresponding to the first data row;
performing the first check on the intermediate test data row by using the error checking method,
in response to at most a-1 bits of the intermediate test data row being in error, processing the intermediate test data row to obtain the second check result, and stopping the trial-and-error combination test,
and in response to a bits of the intermediate test data row still being in error, performing the trial-and-error test on the next correction combination.
8. The data reading method according to claim 7, wherein the trial-and-error test is performed on the at least one correction combination in ascending order of data bit distance,
and the data bit distance of each correction combination is determined from the distance between the a potential error bits included in that correction combination.
9. The data reading method of claim 7, wherein, in response to at most a-1 bits of the intermediate test data row being in error, processing the intermediate test data row to obtain the second check result and stopping the trial-and-error combination test comprises:
in response to the intermediate test data row having no error, taking the intermediate test data row as the second check result, and stopping the trial-and-error combination test;
and in response to b bits of the intermediate test data row being in error, correcting the b bits by using the error checking method, taking the correction result as the second check result, and stopping the trial-and-error combination test, wherein b is a positive integer and is less than or equal to a-1.
10. The data reading method of claim 1, wherein combining the first check result and the parity check result to obtain a second check result corresponding to the target row comprises:
determining, by using the parity check result, a plurality of potential error bits at which errors may exist among the m bits of data read from the target row;
and in response to the number of the plurality of potential error bits being equal to a, the maximum number of error bits that can be detected by the error checking method, processing the m bits of data read from the target row according to the plurality of potential error bits to obtain the second check result.
11. A data reading method according to claim 10, further comprising:
in response to the number of the plurality of potential error bits being within a preset correction range, constructing at least one correction combination based on the plurality of potential error bits, wherein each correction combination consists of a selected potential error bits;
performing a trial-and-error combination test on the at least one correction combination;
wherein the trial-and-error combination test comprises performing a trial-and-error test on each selected correction combination in sequence, the trial-and-error test comprising:
flipping, in the m bits of data read from the target row, the a data bits corresponding to the a potential error bits included in the selected correction combination to obtain an intermediate test data row,
performing the first check on the intermediate test data row by using the error checking method to obtain a first intermediate check result,
in response to the first intermediate check result being a check pass, obtaining the second check result based on the first intermediate check result and stopping the trial-and-error combination test,
and in response to the first intermediate check result being a check failure, performing the trial-and-error test on the next correction combination.
12. A data reading method according to any one of claims 1 to 11, wherein the error checking method is a single error correction double error detection method.
13. A data reading method according to any one of claims 1 to 11, further comprising:
in response to the first check result indicating that the error occurring in the target row can be completely corrected, taking the first check result as the read result.
14. A data writing method for writing data to a data memory,
the data memory is configured to store n data, each of the n data has a data width of m bits, the n data are sequentially arranged to form a data array with n x m bits, and the data array with n x m bits is correspondingly stored as a storage array with n x m bits in the data memory,
the data writing method comprises the following steps:
generating, based on an error checking method, first check data for target data to be written into a target row of the n rows of the storage array, wherein the first check data is used for checking the target row by using the error checking method;
and obtaining a reference parity check vector by using a parity check method based on the target data, wherein the reference parity check vector is used for performing parity check on each of the m columns.
15. The data writing method according to claim 14, wherein the data memory is provided with a first check array for the storage array, the first check array including n error check storage rows in one-to-one correspondence with the n data,
the data memory is provided with a parity storage row for the storage array and the first check array,
the data writing method further comprises:
writing the target data to a target row in the storage array;
writing the first check data into an error check storage row corresponding to the target row in the first check array;
writing the reference parity vector to the parity storage row.
16. The data writing method of claim 15, wherein obtaining the reference parity check vector by using the parity check method based on the target data comprises:
reading a current reference parity check vector stored in the parity storage row,
and performing a bitwise XOR operation on the current reference parity check vector, the target data, and the first check data to obtain the reference parity check vector.
17. A soft error processing system comprising a data memory and a control circuit, wherein,
the data memory is stored with n data, the data width of each data is m bits, the n data are sequentially arranged to form a data array with n x m bits, and the data array is correspondingly stored as a storage array in the data memory,
the control circuit includes a controller and an error checker,
the error checker is configured to perform a first check on m bits of data read from a target row of the n rows of the storage array by using an error checking method to obtain a first check result;
the controller is configured to:
in response to the first check result indicating that an error that cannot be completely corrected has occurred in the target row,
performing a second check on the data of each of the m columns read from the storage array by using a parity check method to obtain a parity check result;
combining the first check result and the parity check result to obtain a second check result corresponding to the target row, wherein the number of bits corrected in the second check result is greater than the maximum number of error bits that can be corrected by the error checking method;
taking the second check result as a read result and outputting the read result,
wherein m and n are both positive integers.
18. The soft error processing system of claim 17, wherein the data memory is provided with a first check array for the storage array, the first check array comprising n error check storage rows in one-to-one correspondence with the n data, each of the n error check storage rows comprising p error check bits, the p error check bits of an ith row of the n error check storage rows being used for storing error check data corresponding to the data of an ith row of the data array,
the data memory is provided with a parity storage row for the storage array and the first check array, wherein the parity storage row comprises m bits corresponding to m columns of the storage array and p bits corresponding to p columns of the first check array, the m bits and the p bits are used for storing reference parity check vectors with m + p bits, which correspond to the m columns of the data array and the p columns of the first check array respectively, and both p and i are positive integers.
19. The soft error processing system of claim 18, wherein the error checker, when performing the first check on m bits of data read from the target row of the n rows of the storage array by using the error checking method to obtain the first check result, performs the following steps:
receiving m bits of data stored in the target row read from the storage array and p bits of error check data corresponding to the target row to obtain a first data row;
determining, by using the error checking method, whether a bits of the m + p bits of the first data row are in error, wherein a is the maximum number of error bits that can be detected by the error checking method;
in response to a bits of the first data row being in error, outputting the first data row to the controller, wherein the first check result comprises the first data row,
and in response to at most a-1 bits of the first data row being in error, correcting the first data row and outputting a correction result to the controller, wherein the first check result comprises the correction result.
20. The soft error processing system of claim 19, wherein the controller, when performing the second check on the data of each of the m columns read from the storage array by using the parity check method to obtain the parity check result, performs the following operations:
performing parity operation on data bits of each of the m columns read from the storage array and the p columns read from the first check array by using a parity method to obtain a detection parity vector with m + p bits;
reading the reference parity vector from the parity storage row;
comparing the detection parity vector with the reference parity vector bitwise to determine a plurality of difference bits between the detection parity vector and the reference parity vector;
and obtaining the parity check result according to the plurality of difference bits.
21. The soft error processing system of claim 20, wherein the controller, when combining the first check result and the parity check result to obtain the second check result corresponding to the target row, performs the following operations:
determining, according to the parity check result, e potential error bits in the first data row at which errors may exist, wherein the positions of the e potential error bits are the positions of the plurality of difference bits, and e is an integer;
and in response to e being greater than a-1 and less than or equal to a preset correction threshold, performing a trial-and-error combination test by combining the e potential error bits, the error checking method, and the first data row to obtain the second check result.
22. The soft error processing system of claim 21, wherein the controller, when performing the trial-and-error combination test by combining the e potential error bits, the error checking method, and the first data row to obtain the second check result, performs the following steps:
determining at least one correction combination, each consisting of a of the e potential error bits, and performing the trial-and-error combination test on the at least one correction combination;
wherein the trial-and-error combination test comprises performing a trial-and-error test on each selected correction combination in sequence, the trial-and-error test comprising:
flipping, in the first data row, the a data bits corresponding to the a potential error bits included in the selected correction combination to obtain an intermediate test data row corresponding to the first data row;
sending the intermediate test data row to the error checker;
in response to receiving a first flag signal sent by the error checker, stopping the trial-and-error combination test and outputting the second check result sent by the error checker,
and in response to receiving a second flag signal sent by the error checker, performing the trial-and-error test on the next correction combination.
23. The soft error processing system of claim 22, wherein the error checker is further configured to:
performing the first check on the intermediate test data row received from the controller by using the error checking method,
in response to the intermediate test data row having no error, taking the intermediate test data row as the second check result and sending the second check result and the first flag signal to the controller,
in response to b bits of the intermediate test data row being in error, correcting the b bits by using the error checking method, taking the correction result as the second check result, and sending the second check result and the first flag signal to the controller, wherein b is a positive integer and is less than or equal to a-1,
and in response to a bits of the intermediate test data row being in error, outputting the second flag signal to the controller.
24. The soft error processing system of any of claims 17-23, wherein the control circuit further comprises an error check code generator,
the error check code generator is configured to generate, based on the error checking method, first check data for target data to be written into a target row of the n rows of the storage array, wherein the first check data is used for checking the target row by using the error checking method;
the controller is further configured to obtain a reference parity vector by a parity check method according to the target data, wherein the reference parity vector is used for performing parity check on each of the m columns.
25. The soft error processing system of claim 24, wherein the control circuit further comprises an enable selector, an address selector, a read data selector, and a write data selector,
the enable selector is configured to input an enable signal determined based on a data write request or a read request or an enable signal generated by a controller to an enable port of the data memory under the control of the controller;
the address selector is configured to input an address determined based on a data write request or a read request, or an address generated by the controller, to an address port of the data memory under the control of the controller;
the read data selector is configured to input data received from a read data port of the data memory or an intermediate test data row generated by the controller to the error checker under the control of the controller;
the write data selector is configured to input the target data and the first check data, or the reference parity vector generated by the controller, to a write data port of the data memory under the control of the controller.
26. A data reading device is applied to a data memory, wherein n data are stored in the data memory, the data width of each data is m bits, the n data are sequentially arranged to form a data array with n x m bits, and the data array is correspondingly stored in the data memory as a storage array,
the data reading apparatus includes:
a first checking unit configured to perform a first check on m bits of data read from a target row of the n rows of the storage array by using an error checking method to obtain a first check result;
a second checking unit, configured to perform a second check on the data of each of the m columns read from the storage array by using a parity checking method in response to the first checking result indicating that the error occurring in the target row cannot be completely corrected, so as to obtain a parity checking result;
a correcting unit configured to combine the first check result and the parity check result to obtain a second check result corresponding to the target row, wherein the number of correction bits of the second check result is greater than the maximum number of error correction bits that can be corrected by the error checking method;
an output unit configured to output the second check result as a read result,
wherein m and n are both positive integers.
27. A data writing device for writing data to a data memory,
the data memory is configured to store n data, each of the n data has a data width of m bits, the n data are sequentially arranged to form a data array with n x m bits, and the data array with n x m bits is correspondingly stored in the data memory as a storage array with n x m bits,
the data writing device includes:
a first check data generation unit configured to generate, based on an error checking method, first check data for data to be written to a target row of the n rows of the storage array, wherein the first check data is used for checking the target row by using the error checking method;
and a second check data generation unit configured to obtain a reference parity check vector by using a parity check method according to the data to be written to the target row, wherein the reference parity check vector is used for performing parity check on each of the m columns.
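As an illustration of the trial-and-error combination test recited in claims 7 and 22, the following Python sketch flips each combination of a potential error bits and re-runs the first check until the row is no longer reported as uncorrectable. It is a sketch under stated assumptions, not the claimed implementation: the error checking method is stood in for by a small extended Hamming (single error correction, double error detection) code with a = 2, rows are modeled as lists of bits, and the names `encode`, `check`, `syndrome`, and `trial_and_error` are hypothetical.

```python
from itertools import combinations


def syndrome(code):
    """XOR of the indices of all set bits; 0 for a valid Hamming codeword."""
    s = 0
    for i, bit in enumerate(code):
        if bit:
            s ^= i
    return s


def encode(data_bits):
    """Extended Hamming code: index 0 is overall parity, powers of two are check bits."""
    code, pos, pending = [0], 1, list(data_bits)
    while pending:
        is_check = (pos & (pos - 1)) == 0
        code.append(0 if is_check else pending.pop(0))
        pos += 1
    s, k = syndrome(code), 1
    while k < len(code):
        if s & k:
            code[k] = 1              # set check bit so the syndrome becomes 0
        k <<= 1
    code[0] = sum(code) % 2          # make the overall parity even
    return code


def check(code):
    """First check: return (status, row); status is 'ok', 'corrected' or 'uncorrectable'."""
    s, odd = syndrome(code), sum(code) % 2
    if s == 0 and not odd:
        return "ok", list(code)
    if odd and s < len(code):        # single-bit error at index s (0 = overall parity bit)
        fixed = list(code)
        fixed[s] ^= 1
        return "corrected", fixed
    return "uncorrectable", list(code)


def trial_and_error(first_data_row, potential_error_bits, a=2):
    """Flip each combination of a potential error bits until the first check passes."""
    for combo in combinations(potential_error_bits, a):
        trial = list(first_data_row)
        for pos in combo:
            trial[pos] ^= 1          # build the intermediate test data row
        status, row = check(trial)
        if status != "uncorrectable":
            return row               # second check result
    return None                      # no combination could be corrected


good = encode([1, 0, 1, 1, 0, 0, 1, 0])
bad = list(good)
bad[3] ^= 1
bad[10] ^= 1                                     # two-bit error in the target row
print(check(bad)[0])                             # uncorrectable
print(trial_and_error(bad, [3, 7, 10]) == good)  # True
```

In this example the column parity is assumed to have flagged positions 3, 7, and 10, with position 7 caused by an error in a different row. The combination (3, 7) leaves two bits wrong and is rejected, while (3, 10) restores the row. The ordering by data bit distance recited in claim 8 could be added by sorting the combinations before the loop.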
CN202111465396.3A 2021-12-03 2021-12-03 Data reading and writing method and device and soft error processing system Pending CN114138544A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111465396.3A CN114138544A (en) 2021-12-03 2021-12-03 Data reading and writing method and device and soft error processing system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111465396.3A CN114138544A (en) 2021-12-03 2021-12-03 Data reading and writing method and device and soft error processing system

Publications (1)

Publication Number Publication Date
CN114138544A true CN114138544A (en) 2022-03-04

Family

ID=80387612

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111465396.3A Pending CN114138544A (en) 2021-12-03 2021-12-03 Data reading and writing method and device and soft error processing system

Country Status (1)

Country Link
CN (1) CN114138544A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115016981A (en) * 2022-06-16 2022-09-06 海光信息技术股份有限公司 Setting method of storage area, data reading and writing method and related device
CN117234792A (en) * 2023-11-09 2023-12-15 北京火山引擎科技有限公司 Data verification method, device, equipment and medium for DPU

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115016981A (en) * 2022-06-16 2022-09-06 海光信息技术股份有限公司 Setting method of storage area, data reading and writing method and related device
CN115016981B (en) * 2022-06-16 2023-05-09 海光信息技术股份有限公司 Storage area setting method, data reading and writing method and related devices
CN117234792A (en) * 2023-11-09 2023-12-15 北京火山引擎科技有限公司 Data verification method, device, equipment and medium for DPU
CN117234792B (en) * 2023-11-09 2024-02-09 北京火山引擎科技有限公司 Data verification method, device, equipment and medium for DPU

Similar Documents

Publication Publication Date Title
US11740968B2 (en) Error correction hardware with fault detection
US6781898B2 (en) Self-repairing built-in self test for linked list memories
US8756486B2 (en) Method and apparatus for repairing high capacity/high bandwidth memory devices
US8732551B2 (en) Memory controller with automatic error detection and correction
US20080201620A1 (en) Method and system for uncorrectable error detection
CN114138544A (en) Data reading and writing method and device and soft error processing system
US9092349B2 (en) Storage of codeword portions
US8566672B2 (en) Selective checkbit modification for error correction
JP3039455B2 (en) Semiconductor memory device test method and semiconductor memory device
US8402327B2 (en) Memory system with error correction and method of operation
US8365055B2 (en) High performance cache directory error correction code
JP2009295252A (en) Semiconductor memory device and its error correction method
CN114153648B (en) Data reading and writing method and device and soft error processing system
US11907062B2 (en) Error check scrub operation method and semiconductor system using the same
WO2012046343A1 (en) Memory module redundancy method, storage processing device, and data processing device
EP3882774B1 (en) Data processing device
JP2008176828A (en) Test circuit and test method of error detection correcting circuit
JPH09288619A (en) Main storage device
US11942173B2 (en) Memory apparatus and semiconductor system including the same
JP2993099B2 (en) Redundant memory device
JPS63279348A (en) Check system for memory
JPH06324950A (en) Memory control circuit
JPH01106248A (en) Storage device
JPH01222351A (en) Check system for cache memory

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination