CN116991626A - Data restoration method and related device - Google Patents

Publication number
CN116991626A
Authority
CN
China
Prior art keywords
byte
bit
width value
repair
fault
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211174901.3A
Other languages
Chinese (zh)
Inventor
强鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202211174901.3A
Publication of CN116991626A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C29/00 Checking stores for correct operation; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/04 Detection or location of defective memory elements, e.g. cell construction details, timing of test signals
    • G11C29/08 Functional testing, e.g. testing during refresh, power-on self testing [POST] or distributed testing
    • G11C29/12 Built-in arrangements for testing, e.g. built-in self testing [BIST] or interconnection details
    • G11C29/44 Indication or identification of errors, e.g. for repair
    • G11C29/4401 Indication or identification of errors, e.g. for repair for self repair
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C29/00 Checking stores for correct operation; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/70 Masking faults in memories by using spares or by reconfiguring
    • G11C29/76 Masking faults in memories by using spares or by reconfiguring using address translation or modifications

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Techniques For Improving Reliability Of Storages (AREA)

Abstract

The embodiments of the application disclose a data repair method and a related device, which are applicable to scenarios such as high-performance computing and which repair failed bits in a timely manner, thereby ensuring the correctness of subsequent HBM data reads and writes. The method comprises the following steps: acquiring first information comprising a first field, wherein the first information indicates that the repair operation identified by the first field needs to be executed on a faulty byte in a Dword in the HBM in which an operation fault has occurred; after determining, according to the first information, that the repair operation identified by the first field needs to be executed on the faulty byte in the faulty Dword, acquiring a first bit of the faulty byte, wherein the first bit is the bit in the Dword in which the fault occurred; acquiring a bit width value of a second bit of a target byte, wherein the target byte is the faulty byte or a redundant byte in the HBM; and updating the bit width value of the first bit of the faulty byte based on the bit width value of the second bit of the target byte, so as to repair the first bit.

Description

Data restoration method and related device
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to a data restoration method and a related device.
Background
High bandwidth memory (HBM) is a new generation of memory suited to applications that require high memory bandwidth. HBM can achieve a data transmission speed of 2 Gbps in a 16 nanometer (nm) process, and 3.2 Gbps or 3.6 Gbps in a 7 nm or 5 nm process. In chips that use HBM, metal interconnects are typically used to connect the individual functional modules. As process technology advances, the line width of the interconnect wires in chips configured with HBM becomes narrower, and the influence of coupling effects and interference noise between the wires increases. Moreover, according to the relevant HBM protocol, the data read-write speed of a device equipped with third-generation high bandwidth memory (HBM2E) can reach a maximum rate of 3.6 Gbps per pin. Because HBM2E transfers data on both the rising and falling edges of the clock, the actual communication clock frequency of HBM2E is at most 1.8 GHz. At such high operating frequencies, noise on the data communication link or crosstalk between the data lines can easily cause data read and write errors in the fast operating mode. In addition, because the production process of HBM DRAM is complex and HBM DRAM is packaged using 3D packaging technology, certain bits in the data words (Dwords) are easily damaged during production and packaging, which likewise causes data read and write errors in the fast operating mode.
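The relationship between the per-pin transfer rate and the clock frequency described above can be illustrated with a short calculation. This is only a sketch: the function name and its parameters are illustrative and do not come from the HBM protocol.

```python
def clock_frequency_ghz(transfer_rate_g: float, transfers_per_cycle: int = 2) -> float:
    """Clock frequency (GHz) of a pin that makes `transfers_per_cycle`
    transfers per clock cycle at `transfer_rate_g` giga-transfers per second."""
    if transfers_per_cycle < 1:
        raise ValueError("transfers_per_cycle must be at least 1")
    return transfer_rate_g / transfers_per_cycle


# Data is transferred on both clock edges, so two transfers per cycle.
print(clock_frequency_ghz(3.6))  # 1.8
```

With two transfers per clock cycle, a 3.6 G rate corresponds to a 1.8 GHz clock, which matches the figures given in the background.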
When some bits in a Dword fail or are damaged and are not repaired in time, the accuracy of subsequent data reads and writes in the HBM cannot be ensured. Therefore, a technical solution capable of repairing failed bits in the Dword is needed.
Disclosure of Invention
The embodiments of the application provide a data repair method and a related device, which can repair bits with operation faults in a timely manner and ensure the correctness of subsequent HBM data reads and writes.
In a first aspect, an embodiment of the present application provides a method for repairing data. The method comprises the following steps: acquiring first information, wherein the first information comprises a first field and indicates that the repair operation identified by the first field needs to be executed on a faulty byte in a Dword in a high bandwidth memory (HBM) in which an operation fault has occurred, the Dword comprises at least two bytes, and the faulty byte is a byte among the at least two bytes; after determining, according to the first information, that the repair operation identified by the first field needs to be executed on the faulty byte in the Dword with the operation fault, acquiring a first bit of the faulty byte, wherein the first bit is the bit in the Dword in which the operation fault occurred; acquiring a bit width value of a second bit of a target byte, wherein the second bit is a bit in the HBM in which no operation fault has occurred, and the target byte is the faulty byte or a redundant byte in the HBM; and updating the bit width value of the first bit of the faulty byte based on the bit width value of the second bit of the target byte, so as to repair the first bit. It should be noted that different values of the first field identify different repair operations. For example, where a first value indicates the hard repair operation, a first field equal to the first value means that the hard repair operation needs to be performed; similarly, where a second value indicates the soft repair operation, a first field equal to the second value means that the soft repair operation needs to be performed. The first value is different from the second value.
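As a minimal sketch of how the value of the first field might select the repair operation, the following example decodes a field value into a hard or soft repair. The concrete values 0x1 and 0x2 are hypothetical placeholders; the application only requires that the first value and the second value differ.

```python
from enum import Enum


class RepairOp(Enum):
    HARD = 0x1  # hypothetical "first value"
    SOFT = 0x2  # hypothetical "second value"


def decode_repair_op(first_field: int) -> RepairOp:
    """Map the value of the first field to the repair operation it identifies."""
    try:
        return RepairOp(first_field)
    except ValueError:
        raise ValueError(f"unknown first-field value: {first_field:#x}") from None


print(decode_repair_op(0x1).name)  # HARD
print(decode_repair_op(0x2).name)  # SOFT
```

Any other encoding would work equally well, provided the two values are distinguishable.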
In a second aspect, an embodiment of the present application provides a data repair apparatus. The data repair apparatus includes, but is not limited to, a terminal device, a server, and the like. The data repair apparatus comprises an acquisition unit and a processing unit. The acquisition unit is configured to acquire first information, wherein the first information comprises a first field and indicates that the repair operation identified by the first field needs to be executed on a faulty byte in a Dword in a high bandwidth memory (HBM) in which an operation fault has occurred, the Dword comprises at least two bytes, and the faulty byte is a byte among the at least two bytes. The acquisition unit is further configured to acquire a first bit of the faulty byte after it is determined, according to the first information, that the repair operation identified by the first field needs to be executed on the faulty byte in the Dword with the operation fault, wherein the first bit is the bit in the Dword in which the operation fault occurred. The acquisition unit is further configured to acquire a bit width value of a second bit of a target byte, wherein the second bit is a bit in the HBM in which no operation fault has occurred, and the target byte is the faulty byte or a redundant byte in the HBM. The processing unit is configured to update the bit width value of the first bit of the faulty byte based on the bit width value of the second bit of the target byte, so as to repair the first bit.
In some alternative examples, the repair operation includes a hard repair operation. The acquisition unit is further configured to: when the value of the first field is a first value, where the first value indicates the hard repair operation, acquire hard repair confirmation information before the bit width value of the first bit of the faulty byte is updated based on the bit width value of the second bit of the target byte to repair the first bit, where the hard repair confirmation information indicates at least one confirmation of whether to perform the hard repair operation on the faulty byte in the Dword in which the operation fault occurred. The processing unit is configured to determine, according to the hard repair confirmation information, to execute the hard repair operation on the faulty byte in that Dword.
In other alternative examples, the faulty byte is any one of the at least two bytes.
In other alternative examples, the target byte is a redundant byte. The processing unit is configured to: when the first bit is the data bus inversion (DBI) signal, change the bit width value of the DBI signal to the bit width value of any bit in the redundant byte.
In other alternative examples, the target byte is the faulty byte. The processing unit is configured to: when the first bit is a data mask (DM) signal or a data queue (DQ) signal, change the bit width value of the DM signal or of the DQ signal to the bit width value of the DBI signal of the faulty byte.
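The single-byte cases above can be summarized as a selection rule. The sketch below is an assumption-laden model, not the claimed implementation: the signal names ("DQ0" to "DQ7", "DM", "DBI", "RD") and the returned labels are hypothetical.

```python
from typing import Tuple


def pick_replacement(faulty_signal: str) -> Tuple[str, str]:
    """Return (target_byte, replacement_signal) for one faulty signal."""
    if faulty_signal == "DBI":
        # Faulty DBI: borrow any bit of the redundant byte (labelled RD here).
        return ("redundant_byte", "RD")
    is_dq = faulty_signal.startswith("DQ") and faulty_signal[2:].isdigit()
    if faulty_signal == "DM" or is_dq:
        # Faulty DM or DQ: reuse the DBI signal of the same (faulty) byte.
        return ("faulty_byte", "DBI")
    raise ValueError(f"unknown signal: {faulty_signal}")


print(pick_replacement("DBI"))  # ('redundant_byte', 'RD')
print(pick_replacement("DQ3"))  # ('faulty_byte', 'DBI')
```

The rule reflects the two examples given: a faulty DBI signal takes its replacement from the redundant byte, while a faulty DM or DQ signal takes over the DBI signal of its own byte.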
In other alternative examples, the faulty byte consists of a first byte and a second byte, the first byte and the second byte being two consecutive bytes among the at least two bytes.
In other alternative examples, the target byte is the faulty byte. The processing unit is configured to: if both the first bit of the first byte and the first bit of the second byte have an operation fault, change the bit width value of the first bit of the first byte to the bit width value of the DBI signal in the first byte, and change the bit width value of the first bit of the second byte to the bit width value of the DBI signal in the second byte.
In other alternative examples, the target byte is the faulty byte. The processing unit is configured to: if only one of the first bit of the first byte and the first bit of the second byte has an operation fault, change the bit width value of the first bit of the byte with the operation fault to the bit width value of the DBI signal in that byte.
In other alternative examples, the target byte is a redundant byte. The processing unit is configured to: if only one of the first bit of the first byte and the first bit of the second byte has an operation fault, change the bit width value of the first bit of the byte with the operation fault to the bit width value of any bit in the redundant byte.
In other alternative examples, the processing unit is further configured to: change the bit width value of the first bit of the byte without the operation fault to a target value, where the target value indicates that the first bit of the byte without the operation fault is in a disabled state.
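The two-byte cases above can be collected into a small planning function. This is a sketch under stated assumptions: the labels "DBI", "RD", "DISABLED" and "UNCHANGED", the byte names, and the `use_redundant` switch are all hypothetical, and marking the fault-free byte as disabled is treated here as applying whenever exactly one byte is faulty.

```python
from typing import Dict


def plan_two_byte_repair(fault0: bool, fault1: bool,
                         use_redundant: bool = False) -> Dict[str, str]:
    """Plan the repair of the first bit in two consecutive bytes.

    fault0 / fault1 say whether the first bit of byte0 / byte1 is faulty.
    use_redundant selects the redundant byte as the target byte when
    exactly one of the two bytes is faulty.
    """
    faults = {"byte0": fault0, "byte1": fault1}
    both = fault0 and fault1
    plan: Dict[str, str] = {}
    for name, faulty in faults.items():
        if faulty:
            # Both faulty: each byte falls back on its own DBI signal.
            # One faulty: own DBI signal, or a redundant-byte bit if chosen.
            plan[name] = "RD" if (use_redundant and not both) else "DBI"
        elif fault0 or fault1:
            # The healthy partner's first bit is set to the disabled state.
            plan[name] = "DISABLED"
        else:
            plan[name] = "UNCHANGED"
    return plan


print(plan_two_byte_repair(True, True))
print(plan_two_byte_repair(False, True, use_redundant=True))
```

The first call plans a DBI-based repair for both bytes; the second repairs byte1 from the redundant byte and marks byte0's first bit as disabled.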
In other alternative examples, the acquisition unit is configured to: acquire a first instruction, where the first instruction includes the first bit of the faulty byte.
In other alternative examples, the processing unit is further configured to: after updating the bit width value of the first bit of the failed byte based on the bit width value of the second bit of the target byte to repair the first bit, writing data into the repaired first bit; or, the data is read from the repaired first bit.
A third aspect of an embodiment of the present application provides a data repair apparatus, including: a processor, an input/output (I/O) interface, and a memory. The memory is used for storing program instructions. The processor is configured to execute the program instructions in the memory to perform the method for repairing data according to the embodiment of the first aspect.
A fourth aspect of the embodiments of the present application provides a computer-readable storage medium having instructions stored therein which, when run on a computer, cause the computer to execute the method corresponding to the embodiment of the first aspect described above.
A fifth aspect of the embodiments of the present application provides a computer program product comprising instructions which, when run on a computer or processor, cause the computer or processor to perform the method of the embodiment of the first aspect described above.
From the above technical solutions, the embodiment of the present application has the following advantages:
In the embodiment of the present application, the first information indicates that the repair operation identified by the first field needs to be performed on the faulty byte in the Dword in the high bandwidth memory (HBM) in which an operation fault has occurred, where the Dword includes at least two bytes and the faulty byte is a byte among the at least two bytes. Therefore, after the first information is obtained and it is determined, according to the first information, that the repair operation identified by the first field needs to be performed on the faulty byte, the first bit of the faulty byte can be obtained, where the first bit is the bit in the Dword in which the operation fault occurred. Then, with the faulty byte or a redundant byte in the HBM taken as the target byte, the bit width value of a second bit of the target byte is obtained, where the second bit is a bit in which no operation fault has occurred. Thus, whether the target byte is the faulty byte or a redundant byte in the HBM, the bit width value of the first bit of the faulty byte can be updated according to the bit width value of the second bit of the target byte, which completes the repair of the faulty first bit in the Dword. In other words, the bit width value of the faulty bit can be updated either with the bit width value of a fault-free bit in the faulty byte of the Dword or with the bit width value of a bit in a redundant byte in the HBM, so that the bit with the operation fault is repaired in a timely manner and the accuracy of subsequent HBM data reads and writes is ensured.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1A shows a schematic diagram of initiating a write operation to an HBM;
FIG. 1B shows a schematic diagram of a data format for writing data in an HBM DRAM;
FIG. 1C shows a timing diagram when writing data in HBM DRAM;
FIG. 1D shows another timing diagram when writing data in HBM DRAM;
FIG. 2A shows a schematic diagram of initiating a read operation to an HBM;
FIG. 2B shows a schematic diagram of a data format of read data in an HBM DRAM;
FIG. 2C shows a timing diagram for reading data from the HBM DRAM;
FIG. 2D shows another timing diagram when reading data from the HBM DRAM;
FIG. 3 illustrates waveforms of operation of different types of interface signals in an IEEE1500 interface;
FIG. 4 shows a schematic diagram of a system architecture provided in an embodiment of the present application;
FIG. 5 is a schematic diagram of a first flow chart of a method for repairing data according to an embodiment of the present application;
FIG. 6 is a second flow chart of a method for repairing data according to an embodiment of the present application;
FIG. 7 shows a schematic diagram of an embodiment of a data repair device provided in an embodiment of the present application;
FIG. 8 is a schematic diagram of a hardware structure of a data repair device according to an embodiment of the present application.
Detailed Description
The embodiments of the application provide a data repair method and a related device, which can repair bits with operation faults in a timely manner and ensure the correctness of subsequent HBM data reads and writes.
It will be appreciated that in the specific embodiments of the present application, related data such as user information is involved, and when the above embodiments of the present application are applied to specific products or technologies, user permissions or consents need to be obtained, and the collection, use and processing of related data need to comply with related laws and regulations and standards of related countries and regions.
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The terms "first," "second," "third," "fourth" and the like in the description and in the claims and in the above drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be capable of being practiced otherwise than as specifically illustrated and described. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The high bandwidth memory (high bandwidth memory, HBM) is a new generation of high bandwidth memory that can be adapted for applications requiring high memory bandwidth. The data writing and data reading processes in the HBM will be briefly described with reference to the accompanying drawings.
The write operation of the HBM is typically implemented in the form of a burst (burst), the initiation of which is marked by the transmission of a write (write) instruction. Fig. 1A shows a schematic diagram of initiating a write operation to an HBM. As shown in fig. 1A, the write instruction of HBM has a burst length of 2, and a write operation is initiated by sending a write instruction having a burst length of 2 once. Illustratively, the burst length of the write instruction of the HBM may also be 4, etc., which is not specifically limited in the embodiment of the present application.
When the host needs to write data into the HBM DRAM, it sends a write command to the HBM DRAM. After the HBM DRAM receives the write command, the host drives the data queue (DQ), data mask (DM), and data bus inversion (DBI) signals, together with the associated write data strobe (WDQS) signal, to the HBM DRAM, so that the HBM DRAM can use the WDQS signal to sample DQ, DM, and DBI. Fig. 1B shows a schematic diagram of a data format of writing data in HBM DRAM. As shown in fig. 1B, in the case where the burst length is 4, after the write command is received, 4 data items, such as Da, Da+1, Da+2, and Da+3, may be written into the HBM DRAM consecutively, which is not limited in the embodiment of the present application. Additionally, tDQSS (min/max) in fig. 1B may represent the minimum and maximum time ranges between the rising edge of WDQS_c and the rising edge of CK_c; alternatively, it may represent the minimum and maximum time ranges between the falling edge of WDQS_t and the falling edge of CK_t. In fig. 1B, tDQSS describes the time delay between the rising edge of WDQS and the rising edge of CK, tDQSH describes the time for which the WDQS signal stays high, and tDQSL describes the time for which the WDQS signal stays low. In addition, tDSS describes the setup time from the falling edge of the WDQS signal to the rising edge of CK, tDSH describes the hold time from the falling edge of the WDQS signal to the rising edge of CK, tDS describes the setup time between the write data and the relevant WDQS rising or falling edge, and tDH describes the hold time between the write data and the relevant WDQS rising or falling edge.
Schematically, fig. 1C shows a timing diagram when writing data in HBM DRAM. As shown in fig. 1C, when data is written into the HBM DRAM with a burst length (BL) of 2, two data items, such as Da and Da+1, can be written consecutively. Similarly, fig. 1D shows another timing diagram when writing data in HBM DRAM. If data is written into the HBM DRAM with a burst length (BL) of 4, four data items, such as Da, Da+1, Da+2, and Da+3, can be written consecutively. It should be noted that the above description takes burst lengths of 2 and 4 as examples; the burst length may take other values in practical applications, such as 6 or 8, which is not specifically limited in the embodiment of the present application.
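The naming of the consecutive data items in a burst can be sketched as follows. The helper is purely illustrative; only the labels Da, Da+1, and so on follow the figures.

```python
from typing import List


def burst_labels(burst_length: int) -> List[str]:
    """Labels of the data items transferred by one burst of the given length."""
    if burst_length < 1:
        raise ValueError("burst length must be at least 1")
    return ["Da"] + [f"Da+{i}" for i in range(1, burst_length)]


print(burst_labels(2))  # ['Da', 'Da+1']
print(burst_labels(4))  # ['Da', 'Da+1', 'Da+2', 'Da+3']
```

One write command with burst length BL thus accounts for BL consecutive data items on the data signals.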
Likewise, the read operation of the HBM is typically also implemented in the form of a burst (burst), the initiation of which is marked by the transmission of a read instruction. Figure 2A shows a schematic diagram of a read operation initiated to an HBM. As shown in fig. 2A, the HBM's read instruction has a burst length of 2, and a read operation is initiated by sending a read instruction with a burst length of 2 once. Illustratively, the burst length of the read instruction of the HBM may also be 4, etc., which is not specifically limited in the embodiment of the present application.
When the HBM DRAM receives a read command sent by the host, the HBM DRAM returns the read data DQ, DM, and DBI to the host. Fig. 2B shows a schematic diagram of a data format of read data in the HBM DRAM. As shown in fig. 2B, after the read command is received, 4 data items, such as Da, Da+1, Da+2, and Da+3, may be read consecutively from the HBM DRAM, which is not limited in the embodiment of the present application.
Schematically, fig. 2C shows a timing diagram when reading data from HBM DRAM. As shown in fig. 2C, when data is read from the HBM DRAM with a burst length of 2, two data items, such as Da and Da+1, can be read consecutively. Similarly, fig. 2D shows another timing diagram when reading data from HBM DRAM. As can be seen from fig. 2D, if data is read from the HBM DRAM with a burst length of 4, four data items, such as Da, Da+1, Da+2, and Da+3, can be read consecutively. It should be noted that the above process of reading data from HBM DRAM takes burst lengths of 2 and 4 as examples; in practical applications the burst length may take other values, such as 6 or 8, which is not specifically limited in the embodiment of the present application.
As can be seen from fig. 1A to 2D, the HBM DRAM reads and writes data at high speed, and uses DQ as a transfer channel for data reading and writing. It should be noted that, in fig. 1A to 2D, only DQ is taken as an example to describe the transmission behavior when data is read and written in the HBM DRAM, and schematically, for DM and DBI, the transmission behavior is similar to that of DQ, and in particular, it may be understood with reference to the schematic diagrams shown in fig. 1A to 2D, which are not repeated here.
The IEEE1500 interface is a set of interfaces that the HBM provides to the host for performing testing, boundary scan, and repair. The interface signals are shown in table 1 below:
TABLE 1
With respect to the contents shown in table 1 above, the operation waveforms of the different types of interface signals of the IEEE1500 interface can be understood specifically with reference to the schematic diagram shown in fig. 3.
Moreover, in the HBM DRAM, the DQ, DM, and DBI signals in the Dword mainly serve as transmission channels for data reads and writes, and the HBM DRAM also provides redundant channels for remapping the Dword. Table 2 below lists the Dword and redundant channel interface signals of a single channel in HBM DRAM.
TABLE 2
Function    Data bit width    Description
DQ          128 bits          HBM DRAM write data bus
DM          16 bits           Data Mask: write data mask signal
DBI         16 bits           Data Bus Inversion: data bus inversion signal
RD          8 bits            Redundant Data: redundant data bits
As can be seen from table 2, dword includes DQ, DM and DBI, wherein the maximum data bit width of DQ signal is 128 bits, the maximum data bit width of DM signal is 16 bits, and the maximum data bit width of DBI signal is 16 bits. And RD is used as a redundant channel for remapping Dword in HBM DRAM, and the maximum data bit width is 8 bits.
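The bit widths in table 2 can be checked with a few lines of arithmetic. The dictionary below simply restates the table; the derived ratios (one DM bit and one DBI bit per 8 DQ bits) are an observation from those numbers rather than a statement from the application.

```python
# Maximum data bit widths taken from table 2.
SIGNAL_WIDTHS = {"DQ": 128, "DM": 16, "DBI": 16, "RD": 8}

total_bits = sum(SIGNAL_WIDTHS.values())
dq_per_dm = SIGNAL_WIDTHS["DQ"] // SIGNAL_WIDTHS["DM"]
dq_per_dbi = SIGNAL_WIDTHS["DQ"] // SIGNAL_WIDTHS["DBI"]

print(total_bits)  # 168
print(dq_per_dm)   # 8
print(dq_per_dbi)  # 8
```

Under this reading, each byte of DQ has one associated DM bit and one associated DBI bit, which is consistent with a repair that redirects a faulty DQ or DM bit to the DBI signal of the same byte.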
However, in a chip using HBM DRAM, since the line width between interconnection wires is narrower and narrower, the effects of coupling effect between wires and interference noise are increasing. In addition, since the data read and write of the HBM can be transmitted on both the rising edge and the falling edge of the clock, at a higher operating frequency, if the data transmission is interfered by noise on the data communication link or crosstalk occurs between the data lines, errors in data read and write in the fast operating mode are easily caused. In addition, because the production process of the HBM DRAM is complex, and the HBM DRAM is packaged by adopting a 3D packaging technology, the problem that certain bits in data words are damaged easily in the production and packaging processes of the HBM DRAM is caused, and further, errors occur in data reading and writing in a fast working mode. When some bits in the Dword, such as the 1 st bit in the DQ signal, fail or break down, if the repair is not timely performed, the accuracy of data read/write in the subsequent HBM cannot be guaranteed.
Based on this, in order to solve the above-mentioned technical problems, a method for repairing data is provided in the embodiments of the present application. The method may be applied to the system architecture diagram shown in fig. 4. As shown in fig. 4, the system architecture includes a host and an HBM. Under the condition that the host determines that some bytes in the Dword of the HBM have running faults, the host can obtain a repair instruction and indicate that the repair operation identified by the first field needs to be executed on the faulty bytes in the Dword of the HBM, which have running faults, according to the first information carried in the repair instruction. When the values of the first fields are different, it can be identified that the repair operations to be performed are different. For example, in the case that the first value is used to indicate the hard repair operation, when the value of the first field is the first value, it is known that the hard repair operation needs to be performed through the first value; similarly, in the case that the second value is used to indicate the soft repair operation, when the value of the first field is the second value, it is known through the second value that the soft repair operation needs to be performed, where the first value is different from the second value. Thus, after determining that the repair operation identified by the first field needs to be performed on the faulty byte in the Dword where the operation fault has occurred according to the first information, the host may further obtain the first bit of the faulty byte and obtain the bit width value of the second bit of the target byte. 
Thus, whether the target byte is the faulty byte or a redundant byte in the HBM, the bit width value of the first bit of the faulty byte can be updated according to the bit width value of the second bit of the target byte. This completes the repair of the bit with the operation fault in the Dword, that is, the repair of the first bit, and ensures the accuracy of subsequent HBM data reads and writes.
It should be noted that the above mentioned data repairing method may also be applied to application scenarios such as cloud computing, artificial intelligence, data center, high performance computing, etc., and may specifically be also applied to application scenarios such as storage, etc., which is not limited in the embodiments of the present application. In addition, the host may be a terminal device or a server, which is not specifically limited in the embodiment of the present application. In addition, the terminal device may include, but is not limited to, a smart phone, desktop computer, notebook computer, tablet computer, smart speaker, vehicle-mounted device, smart watch, wearable smart device, smart voice interaction device, smart home appliance, aircraft, etc. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server or the like for providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (content delivery network, CDN), basic cloud computing services such as big data and artificial intelligent platforms, and the application is not limited in particular. In addition, the terminal device and the server may be directly connected or indirectly connected by wired communication or wireless communication, and the present application is not particularly limited.
Additionally, the described repair operations may include soft repair operations and hard repair operations. The soft repair operation does not permanently burn the repair mapping table in the HBM DRAM, and the data cannot fail after power failure. The hard repair operation is an online programming repair mode, and can permanently write the hard repair mapping table in the HBM DRAM, and the data can be invalid after power failure, so that the HBM DRAM only has one opportunity to execute the hard repair operation in the use process. The following embodiments will describe the method of data repair provided by the embodiments of the present application from the perspective of soft repair operations and hard repair operations, respectively.
(I) Performing a hard repair operation
Fig. 5 shows a first flowchart of a method for repairing data according to an embodiment of the present application. As shown in fig. 5, the method for repairing data may include the steps of:
501. Acquire first information, wherein the first information comprises a first field, the first information is used for indicating that the hard repair operation identified by the first field needs to be executed on the fault byte in the Dword with the operation fault in the HBM, the Dword comprises at least two bytes, and the fault byte is one of the at least two bytes.
In this example, before writing data into the HBM or reading data from the HBM, the host may detect whether an operation exception has occurred in a byte of a Dword in the HBM by sending an instruction or the like; alternatively, the host may detect whether a byte in the Dword is operating abnormally by means of a flag bit or the like. For example, when the value of the flag bit is "1", the byte corresponding to the flag bit still operates normally; when the value of the flag bit is "0", the byte corresponding to the flag bit is abnormal, and the fault byte in the Dword is thereby determined. At this time, having determined that a byte in the Dword has an operation fault, the host may further acquire the first information and determine, according to the indication of the first information, that the repair operation identified by the first field needs to be performed on the fault byte in the Dword in which the operation fault has occurred in the HBM.
Illustratively, since the first information is included in a repair instruction such as hard_lane_repair, the host may execute the repair instruction through a configuration instruction. After the repair instruction is parsed, the first information carried in it is obtained; the first information includes a first field, and the first field identifies the corresponding repair operation. Taking the hard_lane_repair instruction as an example, the first information can be understood with reference to the following table 3, namely:
TABLE 3
The IEEE1500 interfaces described in table 3 may be specifically understood with reference to the foregoing table 1, and will not be described herein.
In addition, the HBM DRAM includes 8 sets of channels (channels) in total, and dwords may be included in each set of channels. Moreover, WIR [11:8] described in Table 3 above may be used to control channel selection. Illustratively, the specific contents of the WIR [11:8] may be understood with reference to Table 4, namely:
TABLE 4
WIR[11:8]   Channel selection
Xh          Ignored
0h          Channel 0
1h          Channel 1
2h          Channel 2
3h          Channel 3
4h          Channel 4
5h          Channel 5
6h          Channel 6
7h          Channel 7
8h~Eh       Reserved
Fh          All channels
As can be seen from Table 4 above, in WIR [11:8], channel 0 in the HBM DRAM is identified by the identifier 0h, so after the identifier 0h is obtained, the corresponding channel 0 can be selected. Similarly, channel 1 in the HBM DRAM is identified by the identifier 1h, so after the identifier 1h is obtained, the corresponding channel 1 can be selected. The identifiers 2h, 3h, 4h, 5h, 6h, and 7h can be understood by referring to the descriptions of the identifiers 0h and 1h above, and will not be repeated here. In addition, the identifier Fh identifies all channels in the HBM DRAM, so that all channels can be selected by the identifier Fh. The identifiers 8h to Eh are reserved in the WIR instruction for the subsequent addition of other identifiers. Based on this, the host may fetch the WIR instruction to ascertain in which channel bits need to be repaired.
In addition, WIR [7:0] in Table 3 above can be understood as the first field. The first field takes a first value, for example 13h, which can be used to indicate a hard repair operation. In other words, when the value of the first field is the first value, it is known from the first value that the hard repair operation needs to be performed. Illustratively, performing a hard repair operation may also be understood as executing a hard repair instruction such as hard_lane_repair. Combining table 3 with table 4, when WIR [11:8] carries the identifier 0h, which identifies channel 0 in the HBM DRAM, the host determines the identifier 0h from the first information after obtaining it, learns that the hard_lane_repair instruction needs to be executed on channel 0, and then, under the instruction of the hard_lane_repair instruction, completes the hard repair operation on channel 0 of the HBM DRAM through the IEEE1500 interface. Similarly, the identifier 1h identifies channel 1 in the HBM DRAM; after obtaining the first information, the host determines the identifier 1h from it, learns that the hard_lane_repair instruction needs to be executed on channel 1, and then completes the hard repair operation on channel 1 of the HBM DRAM through the IEEE1500 interface. The hard repair operation on the channels corresponding to the other identifiers in table 4 can be understood with reference to the channels corresponding to the identifiers 0h and 1h, which is not described herein.
It should be noted that, the value of the first field in table 3 is a first value, and the value of the first value is 13h, which is merely an exemplary description, and other values may be used in practical applications, which is not limited in the embodiments of the present application.
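The field split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the WIR value is modelled as a plain 12-bit integer, the function name decode_wir is hypothetical, and 13h is the exemplary first value from the text rather than a fixed constant.

```python
# Exemplary first value from the text; real devices may use another value.
HARD_LANE_REPAIR = 0x13


def decode_wir(wir):
    """Split a 12-bit WIR value into (first field, list of selected channels)."""
    if not 0 <= wir <= 0xFFF:
        raise ValueError("WIR is modelled here as a 12-bit value")
    first_field = wir & 0xFF       # WIR[7:0]: identifies the repair operation
    select = (wir >> 8) & 0xF      # WIR[11:8]: channel selection (Table 4)
    if select <= 0x7:
        channels = [select]        # 0h..7h select a single channel
    elif select == 0xF:
        channels = list(range(8))  # Fh selects all eight channels
    else:
        raise ValueError("channel select %Xh is reserved" % select)
    return first_field, channels


first_field, channels = decode_wir(0x113)
print(first_field == HARD_LANE_REPAIR, channels)
```

With the value 113h, the first field equals the exemplary hard repair value and channel 1 is selected.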
In addition, since the information used by a computer comprises both instructions and data, a computer word may represent an instruction or data. If a computer word represents data to be processed, it is called a data word (Dword). Furthermore, as can be seen from the description of table 2 above, for the 8 sets of channels provided by the HBM DRAM, Dwords are included in each set of channels. Taking any set of channels as an example, the Dwords include DQ, DM, and DBI signals, where DQ, DM, and DBI add up to 160 bits. When at most 8 of the 160 bits have operation faults, hard repair of the bits with operation faults can be realized by remapping. The described hard repair is an online programming repair, and the repair is not lost after power failure. Illustratively, according to the implementation of HBM, the 160 bits may be implemented as 4 independent Dwords, and each Dword may be further implemented as 4 bytes, which may be understood with reference to the following table 5, namely:
TABLE 5
As can be seen from table 5 above, 4 bytes may be included in each Dword, and each byte may include bits of 8 DQ signals, bits of 1 DM signal, and bits of 1 DBI signal. For example, the 160 bits may be divided into 4 dwords, dword0 through Dword3. Wherein for Dword0, 4 bytes, namely Byte0 through Byte3, may be included. And, byte0 may include DQ0 through DQ7, DBI0, and DM0; byte1 may include DQ8 to DQ15, DBI1, and DM1; byte2 may include DQ16 to DQ23, DBI2, and DM2; byte3 may include DQ24 to DQ31, DBI3, and DM3. Similarly, with respect to the contents of Dword1, dword2 and Dword3, the contents of Dword0 may be referred to for understanding, which is not described in detail herein.
It should be noted that the division of the Dword domain into 4 Dwords in table 5 is merely an exemplary description. In practical applications, as the data bus is reduced, the Dword domain may also be divided into 2 Dwords, etc., which is not specifically limited in the embodiments of the present application. In addition, the number of bytes in each Dword may be determined according to the actual situation, which is not limited in the present application; each byte needs to include 8 DQ signals, 1 DM signal, and 1 DBI signal, and the bit order of the 8 DQ signals is not specifically limited.
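The layout of table 5 can be illustrated with a small lookup. The sketch below assumes the numbering pattern given for Dword0 continues unchanged through Dword1 to Dword3 (the text says they are analogous); the function name locate is hypothetical.

```python
import re

BYTES_PER_DWORD = 4
DQ_PER_BYTE = 8
TOTAL_BYTES = 16  # 4 Dwords x 4 bytes, 10 signals each = 160 bits


def locate(signal):
    """Return (dword, byte, lane) for a name such as 'DQ12', 'DM2' or 'DBI11'."""
    match = re.fullmatch(r"(DQ|DM|DBI)(\d+)", signal)
    if match is None:
        raise ValueError("unknown signal name: %r" % signal)
    kind, index = match.group(1), int(match.group(2))
    if kind == "DQ":
        # Eight DQ signals per byte: DQ0..DQ7 in Byte0, DQ8..DQ15 in Byte1, ...
        byte_index, position = divmod(index, DQ_PER_BYTE)
        lane = "DQ%d" % position
    else:
        # One DM and one DBI signal per byte, numbered like the bytes.
        byte_index, lane = index, kind
    if byte_index >= TOTAL_BYTES:
        raise ValueError("signal index out of range: %r" % signal)
    dword, byte = divmod(byte_index, BYTES_PER_DWORD)
    return dword, byte, lane


print(locate("DQ12"))   # position of DQ12 inside its Dword and byte
print(locate("DBI11"))
```

Under this assumption, DQ12 falls in Byte1 of Dword0 and DBI11 in Byte3 of Dword2, which agrees with the worked examples later in this section.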
502. After determining that the hard repair operation identified by the first field needs to be executed on the fault byte in the Dword with the operation fault according to the first information, obtaining a first bit of the fault byte, wherein the first bit is a bit with the operation fault in the Dword.
In this example, after receiving the first information and determining that the hard repair operation needs to be performed on the faulty byte in the Dword where the operation fault has occurred according to the indication of the value of the first field in the first information, the host may further obtain the first bit of the faulty byte.
For example, if the flag bit of Byte1 in Dword0 is "0", it indicates that Byte1 is a faulty Byte. At this time, if it is determined that the flag bit value of some bits in Byte1 is also "0" based on a similar manner, it may also be considered that the bit with the flag bit value of "0" has an operation failure, and at this time, the first bit may be determined from Byte1, that is, the first bit is the bit with the operation failure in Dword. For example, if the flag bit of DQ8 in the Byte1 is "0", the DQ8 may be determined as the first bit in which the operation failure occurs. It should be noted that, the process of determining the faulty byte is described only by taking the flag bit as an example, and the process of determining the faulty byte can be specifically understood with reference to the foregoing content in step 501, which is not described herein.
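The flag-bit check can be sketched as follows. The data structures are assumptions made for illustration: the text does not specify how flag bits are stored or read, so they are modelled here as plain dictionaries, and find_faults is a hypothetical name.

```python
def find_faults(byte_flags, bit_flags):
    """Return {faulty byte: [faulty bits]} from flag bits (0 = fault, 1 = normal).

    byte_flags maps a byte name to its flag bit; bit_flags maps a byte name to
    a dictionary of per-signal flag bits. Both layouts are illustrative only.
    """
    faults = {}
    for byte_name, flag in byte_flags.items():
        if flag == 0:
            signals = bit_flags.get(byte_name, {})
            faults[byte_name] = [s for s, f in signals.items() if f == 0]
    return faults


byte_flags = {"Byte0": 1, "Byte1": 0, "Byte2": 1, "Byte3": 1}
bit_flags = {"Byte1": {"DQ8": 0, "DQ9": 1, "DM1": 1, "DBI1": 1}}
print(find_faults(byte_flags, bit_flags))
```

In this example Byte1 is the fault byte and DQ8 is the first bit, as in the paragraph above.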
503. Acquire a bit width value of a second bit of the target byte, wherein the second bit is a bit in the HBM in which no operation failure has occurred, and the target byte is the fault byte or a redundant byte in the HBM.
In this example, in the case that the first bit of the failed byte fails, the bit width value of the first bit may be updated using the bit width values of the other bits in the failed byte that do not fail, thereby completing the repair of the first bit. Or, since the HBM provides a redundant channel, that is, the Redundant Data (RD) bit in the foregoing table 2, in this case, if the first bit of the failed byte has an operation failure, the bit width value of the bit in the redundant byte corresponding to the redundant channel may be used to update the bit width value of the first bit, so as to complete the failure repair of the first bit. Therefore, regardless of the manner in which the failover to the first bit is accomplished, the bit width value of the second bit of the target byte needs to be obtained. The described bit width values can be understood as data widths. The described target byte may be a faulty byte or may be a redundant byte corresponding to RD in HBM.
Illustratively, the host may obtain the bit width value of the second bit of the target byte by obtaining the target configuration information. The described target configuration information may be understood with reference to the following table 6, namely:
TABLE 6
As can be seen from table 6, taking DM0 as an example of the bit with the operation failure, DM0 is configured as "0000". In the repair process, the original DQ0 signal then replaces the original DQ1, the original DQ1 signal replaces the original DQ2, and so on, until the original DQ7 signal occupies the bit where the DBI signal is located. If the repair is performed using mode 2 mentioned in the following table 7, DM0 may be repaired by acquiring the bit width value of a bit in the redundant byte, i.e., using the bit width value of RD. If the repair is performed using mode 1 mentioned in the following table 7, the bit width value of the DBI in the corresponding byte can be obtained for the repair; the details can be understood with reference to the following step 506, which will not be described herein.
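The shift described for the DM0 example can be sketched as a remapping table. This is an illustration under assumptions not stated in the source: the lanes of a byte are taken to be ordered DM, DQ0 to DQ7, DBI, and the displaced DM signal is assumed to move onto the DQ0 lane. The name mode1_remap is hypothetical.

```python
# Assumed physical lane order inside one byte.
LANES = ["DM"] + ["DQ%d" % i for i in range(8)] + ["DBI"]


def mode1_remap(failed_lane):
    """Map each surviving signal to the lane that carries it after the repair.

    Signals located at or after the failed lane shift one lane towards DBI,
    so the DBI function of this byte is given up.
    """
    if failed_lane == "DBI":
        raise ValueError("a failed DBI lane cannot be repaired by giving up DBI")
    failed_index = LANES.index(failed_lane)
    mapping = {}
    for index, signal in enumerate(LANES[:-1]):  # the DBI signal is dropped
        mapping[signal] = LANES[index] if index < failed_index else LANES[index + 1]
    return mapping


remap = mode1_remap("DM")
print(remap["DQ0"], remap["DQ7"])
```

For a failed DM lane, the DQ0 signal moves to the DQ1 lane and the DQ7 signal ends up on the DBI lane, matching the description above; the failed lane itself is never used as a destination.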
It should be noted that the execution sequence of the step 502 and the step 503 is not specifically limited in the embodiment of the present application. For example, step 503 may be performed first, and then step 502 may be performed; alternatively, step 502 and step 503 are performed simultaneously.
504. When the value of the first field is a first value, acquire hard repair confirmation information, wherein the hard repair confirmation information represents at least one confirmation of whether to execute the hard repair operation on the fault byte in the Dword with the operation fault, and the first value is used for indicating the hard repair operation.
505. Determine, according to the hard repair confirmation information, to execute the hard repair operation on the fault byte in the Dword with the operation fault.
Because the hard repair permanently writes the hard repair mapping table into the HBM DRAM, the HBM DRAM has only one opportunity to execute the hard repair operation during its use. Therefore, to ensure that the present hard repair is not the result of an erroneous configuration, a further confirmation is necessary before the hard repair is performed on the first bit. Illustratively, before the bit width value of the first bit of the fault byte is updated according to the bit width value of the second bit of the target byte to repair the first bit, it may be determined whether the value of the first field is the first value, where the first value is used to indicate a hard repair operation. When the value of the first field is the first value, it is known that the hard repair operation needs to be executed. Therefore, when the value of the first field is the first value, the hard repair confirmation information is acquired, and the hard repair confirmation information represents at least one confirmation of whether to execute the hard repair operation on the fault byte in the Dword with the operation fault. Then, it is determined according to the hard repair confirmation information to perform the hard repair operation on the fault byte in the Dword where the operation fault has occurred. In this way, the hard repair operation can be performed without error. It should be noted that the at least one confirmation described above may be understood as a second confirmation or as multiple confirmations, which is not limited in the embodiments of the present application. The multiple confirmations may be three or more confirmations, which is not limited in the present application.
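A minimal sketch of this gating logic follows. It assumes the confirmation information can be modelled as a list of boolean confirmations, which the text does not specify; may_hard_repair and the required parameter are hypothetical names, and 13h is the exemplary first value.

```python
# Exemplary first value from the text; other values may be used in practice.
HARD_REPAIR_VALUE = 0x13


def may_hard_repair(first_field, confirmations, required=1):
    """Return True only if the first field requests a hard repair and at least
    `required` confirmations were collected, all of them positive."""
    if first_field != HARD_REPAIR_VALUE:
        return False
    return len(confirmations) >= required and all(confirmations)


print(may_hard_repair(0x13, [True]))
print(may_hard_repair(0x13, []))
```

Because a hard repair can be executed only once, the sketch refuses the operation whenever any single confirmation is missing or negative.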
506. The bit width value of the first bit of the failed byte is updated based on the bit width value of the second bit of the target byte to repair the first bit.
In this example, after the execution of the hard repair operation on the fault byte in the Dword in which the operation failure has occurred is confirmed again, or confirmed multiple times, based on the hard repair confirmation information, the hard repair operation may be executed on the failed bit in the fault byte using the obtained bit width value of the second bit, that is, the bit width value of the first bit of the fault byte is updated based on the bit width value of the second bit of the target byte.
Depending on whether the fault byte is made up of any one of the at least two bytes in the Dword or of two consecutive bytes, a different repair mode may be employed to update the bit width value of the first bit in the fault byte. Based on this, the host can acquire the hard repair configuration information corresponding to the Dword and hard repair the bits with operation faults in the fault bytes using different repair modes, according to the indication in the hard repair configuration information. Illustratively, for the configuration of the different repair modes in each Dword, reference may first be made to the following table 7, namely:
TABLE 7
As can be seen from table 7, taking Byte0 in Dword0 as an example, the corresponding hard repair configuration table describes the value of each bit in Byte 0. Where "h9" indicates that when the DBI signal in this Byte0 fails, then a repair using mode 2 is required. The described mode 2 can be specifically understood with reference to the following description of the case (1) in the mode (1) and the case (1) in the mode (2), and will not be described herein. In addition, "hE" indicates that when an operation failure occurs in a bit in any one of two consecutive bytes, the bit width value of the bit in which the operation failure does not occur needs to be set to a disabled state. "hF" indicates that hard repair of other bits that have not failed is not required, and the bit width value of the original bit is still used. It should be noted that the identifier "hE" and the like illustrated in the above table 7 are only illustrative, and other identifiers may be used in practical applications, which is not limited by the embodiment of the present application.
In addition, the hard repair configuration table for each Byte in Dword0, other Dword1, dword2, and Dword3 may also be understood by referring to the hard repair configuration table of Byte0 in Dword0, which is not described herein.
In combination with the contents shown in the above tables 6 and 7, how to hard repair the first bit will be described in detail below for the case of a faulty byte composed of different bytes.
(1) The fault byte is composed of any one byte of at least two bytes
In this example, when the fault byte is composed of only one byte, different repair modes may be employed to update the bit width value of the first bit in the fault byte, depending on the type of the target byte. For example, when the target byte is the fault byte, the bit width value of the first bit may be updated from the bit width value of a bit within the fault byte in which no operation fault has occurred; when the target byte is a redundant byte in the HBM, the bit width value of the first bit may be updated from the bit width value of a bit of the redundant byte. In the following, how to hard repair the first bit is described in detail for the different types of target bytes.
(1) The target byte is a redundant byte
Illustratively, updating the bit width value of the first bit of the failed byte based on the bit width value of the second bit of the target byte may be accomplished in the following manner: when the first bit is a DBI signal, the bit width value of the DBI signal is changed to the bit width value of any bit in the redundant byte.
In this example, since the data bit width of the redundant byte is 8 bits in total, when the fault byte is composed of only one byte, the redundant byte is sufficient to be allocated to the fault byte for hard repair. Therefore, when the first bit with the operation fault is determined to be the DBI signal in the fault byte, the bit width value of the DBI signal can be directly changed to the bit width value of any bit in the redundant byte.
(2) The target byte is the fault byte
Illustratively, updating the bit width value of the first bit of the failed byte based on the bit width value of the second bit of the target byte may be accomplished in the following manner: when the first bit is a Data Mask (DM) signal or a Data Queue (DQ) signal, the bit width value of the DM signal or the bit width value of the DQ signal is changed to the bit width value of the DBI signal of the faulty byte.
In this example, when the target byte is the fault byte, if an operation fault occurs in the DM signal or a DQ signal in the fault byte and no operation fault occurs in the DBI signal, the bit width value of the DBI signal in the current fault byte may be used directly to change the bit width value of the first bit in which the operation fault has occurred, that is, to change the bit width value of the DM signal or the DQ signal in the fault byte.
It should be noted that, after the repair of the first bit is implemented by the repair method of (1) or (2) in the above manner (1), the original DBI function in the faulty byte corresponding to the first bit is not used any more, and the normal use of the DBI function in the other bytes not performing the hard repair operation can still be continued.
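The two cases for a single fault byte can be summarised in a short selection function. This is a sketch of the decision only, with hypothetical names (choose_single_byte_repair, and the strings used for the return values); it does not model the actual register writes.

```python
DQ_LANES = ["DQ%d" % i for i in range(8)]


def choose_single_byte_repair(failed_signal):
    """Return (target byte, donor bit) for one faulty signal in a single byte."""
    if failed_signal == "DBI":
        # DBI itself has failed: borrow a bit of the redundant byte (RD).
        return ("redundant byte", "RD")
    if failed_signal == "DM" or failed_signal in DQ_LANES:
        # DM or DQ has failed while DBI is healthy: give up this byte's DBI.
        return ("fault byte", "DBI")
    raise ValueError("unknown signal: %r" % failed_signal)


print(choose_single_byte_repair("DBI"))
print(choose_single_byte_repair("DQ3"))
```

In the second case the DBI function of the repaired byte is no longer available, as noted above, while DBI in the other bytes is unaffected.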
(2) The fault byte is composed of two consecutive bytes of at least two bytes
In this example, when the fault byte is composed of a first byte and a second byte, different repair modes may be employed to update the bit width value of the first bit in the fault byte, depending on the type of the target byte. For example, when the target byte is the fault byte, the bit width value of the first bit may be updated from the bit width value of a bit within the fault byte in which no operation fault has occurred; when the target byte is a redundant byte in the HBM, the bit width value of the first bit may be updated from the bit width value of a bit of the redundant byte. It should be noted that the first byte and the second byte are two consecutive bytes of the at least two bytes. In the following, how to hard repair the first bit is described in detail for the different types of target bytes.
(1) The target byte is a redundant byte
Illustratively, updating the bit width value of the first bit of the failed byte based on the bit width value of the second bit of the target byte may be accomplished in the following manner: if the first bit of any one byte in the first bit of the first byte and the first bit of the second byte has operation faults, changing the bit width value of the first bit of the byte with operation faults into the bit width value of any bit in the redundant byte.
In this example, since the data bit width of the redundancy bytes is 8 bits in total, the 4 dwords shown in the foregoing table 5 have 16 independent bytes in total, and it is insufficient to allocate one independent redundancy byte for each of the 16 independent bytes for hard repair. At this time, two consecutive bytes may be grouped together, and two consecutive bytes within the same group may be hard repaired. For example, the two bytes can be repaired simultaneously by grouping consecutive Byte0 and Byte1 in Dword0 in table 5; and, taking continuous Byte2 and Byte3 in Dword0 as a group, and repairing at the same time. Similarly, for Dword1, byte0 and Byte1 in the Dword1 may be grouped together, and Byte2 and Byte3 may be grouped together. For Dword2, byte0 and Byte1 in the Dword2 may be grouped together, and Byte2 and Byte3 may be grouped together. For Dword3, byte0 and Byte1 in Dword3 may also be grouped together, and Byte2 and Byte3 may also be grouped together.
In the case of a fault byte consisting of two consecutive bytes, since the two consecutive bytes are allocated to the same group, each group is allocated only one redundant bit for hard repair, and one redundant bit can repair only one byte. Thus, if the first bit of only one of the first byte and the second byte has an operation failure, the allocated redundancy is sufficient for hard repair. In this way, the bit width value of the first bit of the byte that has the operation failure can be changed to the bit width value of any bit in the redundant byte. For example, if the first bit of the first byte fails and the first bit of the second byte does not fail, the allocated redundancy is sufficient to repair the first byte, which requires changing the bit width value of the first bit of the first byte to the bit width value of any bit in the redundant byte. Similarly, if the first bit of the second byte fails but the first bit of the first byte does not fail, the bit width value of the first bit of the second byte may be changed to the bit width value of any bit in the redundant byte. By using the redundant byte to repair the byte with the operation failure, the full functionality of the DBI signal in the fault byte can be preserved.
In addition, since the redundant bytes can only be allocated to one byte for repair, for the byte in the same group in which no operation failure occurs, the bit width value of the first bit of the byte in which no operation failure occurs is changed to a target value, and the first bit of the byte in which no operation failure occurs is indicated to be in a disabled state by the target value. The target values described may include, but are not limited to, "hE", where "E" may be understood as a hexadecimal value, i.e., "1110". In some examples, the target value may be another value, which is not limited in the embodiments of the present application.
For example, if it is determined that the fault Byte is formed by Byte0 and Byte1 in Dword0 in the above table 5, and DQ12 in Byte1 is the first bit of the operation fault, the corresponding configuration information may be understood with reference to table 8, that is:
TABLE 8
In combination with the foregoing table 5, it can be seen that DQ12, in which the operation failure occurs, is the 5th DQ bit of Byte1 in Dword0. In this case, mode 2 can be used to change the bit width value of DQ12 to the bit width value of any bit in the redundant byte, for example, the configuration value becomes "4'h5". Also, since Byte0 and Byte1 are two consecutive bytes within the same group, the bit width value of the bits in Byte0 of Dword0 also needs to be configured as "4'hE", i.e., each bit in Byte0 is set to a disabled state. In addition, for the bits in the other bytes in which no operation failure occurs, such as Dword3, Dword2, and Dword1, and Byte3 and Byte2 in Dword0, the configuration values of the corresponding bits may be set to "hF", i.e., the bit width values of the corresponding bits need not be changed.
Or, if it is determined that the fault Byte is formed by Byte3 in Dword2 in table 5, and DBI11 in Byte3 is the first bit of the operation fault, the corresponding configuration information may be understood with reference to table 9, that is:
TABLE 9
Bit position   Domain          Configuration value
71:56          DWORD3[15:0]    16'hFFFF
55:40          DWORD2[15:0]    16'h9EFF
39:32          Reserved        8'hFF
31:16          DWORD1[15:0]    16'hFFFF
15:0           DWORD0[15:0]    16'hFFFF
In combination with the foregoing table 5, it can be seen that DBI11, in which the operation failure occurs, is the 9th bit of Byte3 in Dword2. In this case, mode 2 can be used to change the bit width value of DBI11 to the bit width value of any bit in the redundant byte, for example, the configuration value becomes "h9". Moreover, since Byte2 and Byte3 are two consecutive bytes within the same group, the bit width value of the bits in Byte2 of Dword2 also needs to be configured as "hE", i.e., each bit in Byte2 is set to a disabled state. In addition, for the bits in the other bytes in which no operation failure occurs, such as Dword3, Dword0, and Dword1, and Byte0 and Byte1 in Dword2, the configuration values of the corresponding bits may be set to "hF", i.e., the bit width values of the corresponding bits need not be changed.
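The configuration value of table 9 can be reproduced with a short sketch. Several points are assumptions inferred from the examples in this section rather than stated rules: each byte is given one 4-bit code, with Byte0 in the lowest nibble of its Dword field; a failed lane is coded as DM = 0h, DQ0 to DQ7 = 1h to 8h, and DBI = 9h; and the reserved field is filled with FFh. The function names are hypothetical.

```python
NO_REPAIR = 0xF   # "hF": keep the original bit
DISABLED = 0xE    # "hE": partner byte of the repaired byte in the same group


def lane_code(lane):
    """Assumed 4-bit code of a failed lane: DM=0h, DQ0..DQ7=1h..8h, DBI=9h."""
    if lane == "DM":
        return 0x0
    if lane == "DBI":
        return 0x9
    if lane.startswith("DQ") and lane[2:].isdigit() and int(lane[2:]) <= 7:
        return int(lane[2:]) + 1
    raise ValueError("unknown lane: %r" % lane)


def mode2_config(dword, byte, lane):
    """Return the four 16-bit Dword fields for one mode 2 (redundant) repair."""
    nibbles = [[NO_REPAIR] * 4 for _ in range(4)]
    nibbles[dword][byte] = lane_code(lane)
    nibbles[dword][byte ^ 1] = DISABLED  # groups are (Byte0, Byte1), (Byte2, Byte3)
    return [sum(code << (4 * b) for b, code in enumerate(row)) for row in nibbles]


def pack(fields):
    """Pack the fields using the bit positions listed in table 9."""
    return (fields[0] | (fields[1] << 16) | (0xFF << 32)
            | (fields[2] << 40) | (fields[3] << 56))


fields = mode2_config(dword=2, byte=3, lane="DBI")  # DBI11 has failed
print(["%04X" % f for f in fields], "%018X" % pack(fields))
```

For the DBI11 example the Dword2 field evaluates to 9EFF and every other Dword field stays FFFF, which agrees with table 9.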
(2) The target byte is the fault byte
In some examples, updating the bit width value of the first bit of the failed byte according to the bit width value of the second bit of the target byte includes: if the first bit of the first byte and the first bit of the second byte are both in operation failure, changing the bit width value of the first bit of the first byte to the bit width value of the DBI signal in the first byte and changing the bit width value of the first bit of the second byte to the bit width value of the DBI signal in the second byte.
In this example, since the data bit width of the redundant byte is 8 bits in total, when the fault byte is composed of two consecutive bytes (i.e., the first byte and the second byte) and both the first byte and the second byte contain a failed bit, the allocated redundancy is insufficient to repair both bytes. Therefore, for each byte with an operation failure, the DBI signal inside the respective byte can be used for the repair. Illustratively, when the first bit of the first byte fails, the bit width value of the first bit of the first byte may be modified using the bit width value of the DBI signal inside the first byte. Specifically, when the first bit is a DQ signal or the DM signal in the first byte, the bit width value of the DBI signal in the first byte may be used to update the bit width value of the DQ signal or the DM signal in the first byte. Similarly, when the first bit of the second byte also fails, the bit width value of the DBI signal inside the second byte may be used to alter the bit width value of the first bit of the second byte. Specifically, when the first bit is a DQ signal or the DM signal in the second byte, the bit width value of the DBI signal in the second byte may be used to update the bit width value of the DQ signal or the DM signal in the second byte.
For example, if it is determined that the fault Byte is formed by Byte2 and Byte3 in Dword0 in the foregoing table 5, and the operation faults occur in both DM2 in Byte2 and DQ25 in Byte3 in Dword0, the corresponding configuration information can be understood with reference to table 10, namely:
TABLE 10
Bit position   Domain          Configuration value
71:56          DWORD3[15:0]    16'hFFFF
55:40          DWORD2[15:0]    16'hFFFF
39:32          Reserved        8'hFF
31:16          DWORD1[15:0]    16'hFFFF
15:0           DWORD0[15:0]    16'h20FF
By combining the foregoing table 5, it can be known that DM2 in which the operation failure occurs is 9 th bit in Byte2 in Dword0, and DQ25 is 2 nd bit in Byte3, and at this time, the bit width value of DM2 can be modified to the bit width value of DBI signal in Byte2 in mode 1. Likewise, for DQ25 in Byte3, pattern 1 may also be used to modify the bit-width value of the DQ25 to the bit-width value of the DBI signal in Byte 3. In addition, for bits in other bytes that do not fail to operate, such as: dword3, dword2, and Dword1, and Byte0 and Byte1 in Dword0, the configuration values of the corresponding bits may be set to "hF", i.e., the bit width values of the corresponding bits need not be changed.
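The Dword0 value in this example can be recomputed with a small sketch. The 4-bit codes are an assumption inferred from the examples in this section (DM = 0h, DQ0 to DQ7 = 1h to 8h, one nibble per byte with Byte0 in the lowest nibble), and mode1_dword_field is a hypothetical name.

```python
NO_REPAIR = 0xF  # "hF": keep the original bit


def mode1_code(lane):
    """Assumed code of a failed DM or DQ lane; mode 1 cannot repair DBI."""
    if lane == "DM":
        return 0x0
    if lane.startswith("DQ") and lane[2:].isdigit() and int(lane[2:]) <= 7:
        return int(lane[2:]) + 1
    raise ValueError("mode 1 cannot repair lane %r" % lane)


def mode1_dword_field(failures):
    """Build the 16-bit field of one Dword.

    `failures` maps a byte index (0..3) to the failed lane inside that byte;
    each listed byte is repaired with its own DBI signal.
    """
    field = 0
    for byte in range(4):
        code = mode1_code(failures[byte]) if byte in failures else NO_REPAIR
        field |= code << (4 * byte)
    return field


# DM2 is the DM lane of Byte2; DQ25 is DQ1 inside Byte3 of Dword0.
print("%04X" % mode1_dword_field({2: "DM", 3: "DQ1"}))
```

With DM2 and DQ25 failed, the field evaluates to 20FF, the DWORD0 value shown in table 10; a Dword without failures evaluates to FFFF.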
In other examples, updating the bit width value of the first bit of the failed byte based on the bit width value of the second bit of the target byte may also be implemented in the following manner: if any one of the first bit of the first byte and the first bit of the second byte has an operation fault, changing the bit width value of the first bit of the byte with the operation fault into the bit width value of the DBI signal in the corresponding byte. It should be noted that, in this example, how to change the bit width value of the first bit of the byte with the operation failure to the bit width value of the DBI signal in the corresponding byte may be specifically understood by referring to the content of the operation failure of the bits in both bytes, which is not described herein.
In other examples, after performing the repair on the first bit in step 506, the data may be further written into the repaired first bit; or, the data is read from the repaired first bit. It should be noted that, the specific implementation process of writing the data into the repaired first bit may be understood by referring to the foregoing contents of fig. 1A to 1D, which is not repeated here. In addition, the specific implementation process of reading the data from the repaired first bit can be understood by referring to the foregoing contents of fig. 2A to fig. 2D, which is not described herein.
(II) Performing a soft repair operation
Fig. 6 shows a second flowchart of a method for repairing data according to an embodiment of the present application. As shown in fig. 6, the method for repairing data may include the steps of:
601. Acquire first information, wherein the first information comprises a first field, the first information is used for indicating that the soft repair operation identified by the first field needs to be performed on a faulty byte in a Dword with an operation fault in the HBM, the Dword comprises at least two bytes, and the faulty byte is one byte of the at least two bytes.
In this example, the host may detect whether a running exception occurs in a Dword byte in the HBM by sending an instruction or the like before writing data into the HBM or reading data from the HBM; alternatively, the host may detect whether the byte in the Dword is abnormal in operation by means of a flag bit or the like. For example, when the value of the flag bit is "1", it may be indicated that the byte corresponding to the flag bit still operates normally; if the value of the flag bit is 0, it indicates that the byte corresponding to the flag bit is abnormal, so as to determine the fault byte in the Dword. Illustratively, how the host determines the faulty byte in the Dword can be understood with reference to the foregoing content of step 501 in fig. 5, which is not described herein.
Under the condition that the host determines that the bytes in the Dword have operation faults, the host can further acquire first information, and further determine that soft repair operation identified by the first field needs to be executed on the fault bytes in the Dword with operation faults in the HBM according to the value indication of the first field in the first information. When the values of the first fields are different, it can be identified that the repair operations to be performed are different. For example, in the case that the first value is used to indicate the hard repair operation, when the value of the first field is the first value, the need for performing the hard repair operation can be known through the first value, and the embodiment shown in fig. 5 is specifically referred to for understanding, which is not described herein. Similarly, in the case where the second value is used to indicate a soft repair operation, when the value of the first field is the second value, it is known from the second value that the soft repair operation needs to be performed. Wherein the first value is different from the second value.
Illustratively, the first information is carried in a repair instruction such as SOFT_LANE_REPAIR (soft repair), and the host may execute the repair instruction through a configuration instruction. After the repair instruction is parsed, the first information carried in it is obtained. The first information comprises a first field, and the first field identifies the corresponding repair operation. Taking the SOFT_LANE_REPAIR instruction as an example, the first information can be understood with reference to the following Table 11, namely:
TABLE 11
In addition, WIR[7:0] in Table 11 above can be understood as the first field. The first field takes a second value, for example 12h, which can be used to indicate a soft repair operation. In other words, when the value of the first field is the second value, the second value indicates that the soft repair operation needs to be performed. Illustratively, performing a soft repair operation can also be understood as executing a soft repair instruction such as SOFT_LANE_REPAIR. In combination with Table 4, it can be seen from Table 11 that the identifier 0h in WIR[11:8] identifies channel 0 in the HBM DRAM. After the first information is obtained, the identifier 0h is determined from the first information, which shows that the SOFT_LANE_REPAIR instruction needs to be executed on channel 0 identified by the identifier 0h; the soft repair operation on HBM DRAM channel 0 is then completed through the IEEE 1500 interface as directed by the SOFT_LANE_REPAIR instruction. Similarly, the identifier 1h identifies channel 1 in the HBM DRAM. After the first information is obtained, the identifier 1h is determined from the first information, which shows that the SOFT_LANE_REPAIR instruction needs to be executed on channel 1 identified by the identifier 1h; the soft repair operation on HBM DRAM channel 1 is then completed through the IEEE 1500 interface as directed by the SOFT_LANE_REPAIR instruction. The soft repair operation on the channels corresponding to the other identifiers in Table 2 can be understood with reference to the handling of the channels identified by 0h and 1h, and is not described in detail here.
It should be noted that, the value of the first field in table 11 is the second value, and the value of the second value is 12h, which is only an exemplary description, and other values may be used in practical applications, which is not limited in the embodiment of the present application.
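The way the host might read the channel identifier and the first field out of the instruction word can be sketched as follows. This is only an illustration: it assumes the instruction word is the 12-bit value WIR[11:0], and the names `RepairRequest`, `parse_wir`, and `is_soft_repair` are hypothetical. Only the example value 12h for the soft repair operation comes from the text; no value for the hard repair operation is assumed here.

```python
from dataclasses import dataclass

SOFT_REPAIR_VALUE = 0x12  # example "second value" from the text (12h)


@dataclass(frozen=True)
class RepairRequest:
    channel: int      # WIR[11:8], e.g. 0h identifies channel 0
    first_field: int  # WIR[7:0], identifies the repair operation


def parse_wir(wir: int) -> RepairRequest:
    """Split an assumed 12-bit WIR value into channel and first field."""
    if not 0 <= wir <= 0xFFF:
        raise ValueError("expected a 12-bit WIR value")
    return RepairRequest(channel=(wir >> 8) & 0xF, first_field=wir & 0xFF)


def is_soft_repair(request: RepairRequest, soft_value: int = SOFT_REPAIR_VALUE) -> bool:
    """True when the first field carries the value that indicates soft repair."""
    return request.first_field == soft_value


request = parse_wir(0x112)
print(request.channel, hex(request.first_field), is_soft_repair(request))
```

Keeping the soft repair value as a parameter reflects the statement above that 12h is only an example and other values may be used in practice.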
In addition, as can be seen from the description of table 2 above, for the 8 sets of channels provided by HBM DRAM, dwords are included in each set of channels. Taking any set of channels as an example, dword includes DQ, DM, and DBI signals, where DQ, DM, and DBI add up to 160 bits. When at most 8 bits of the 160 bits have operation faults, soft repair of the bit with operation faults can be realized by a remapping mode.
In addition, in this example, the dwords are divided, which can be specifically understood with reference to the foregoing descriptions in table 5, and will not be described herein.
602. After determining that the soft repair operation identified by the first field needs to be executed on the fault byte in the Dword with the operation fault according to the first information, acquiring a first bit of the fault byte, wherein the first bit is a bit with the operation fault in the Dword.
In this example, after receiving the first information and determining that the soft repair operation needs to be performed on the faulty byte in the Dword where the operation fault has occurred according to the indication of the value of the first field in the first information, the host may further obtain the first bit of the faulty byte.
For example, if the flag bit of Byte1 in Dword0 is "0", it indicates that Byte1 is a faulty Byte. At this time, if it is determined that the flag bit value of some bits in Byte1 is also "0" based on a similar manner, it may also be considered that the bit with the flag bit value of "0" has an operation failure, and at this time, the first bit may be determined from Byte1, that is, the first bit is the bit with the operation failure in Dword. For example, if the flag bit of DQ8 in the Byte1 is "0", the DQ8 may be determined as the first bit in which the operation failure occurs. It should be noted that, the process of determining the faulty byte is described only by taking the flag bit as an example, and the process of determining the faulty byte can be specifically understood with reference to the foregoing content in step 501 in fig. 5, which is not described herein.
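The flag-bit check described above can be sketched in a few lines of Python. The dictionaries and the helper name `find_faulty` are hypothetical; the only convention taken from the text is that a flag value of 1 means normal operation and 0 means an operation fault.

```python
def find_faulty(flags: dict[str, int]) -> list[str]:
    """Return the names whose flag bit is 0 (fault); a flag of 1 means normal."""
    return [name for name, flag in flags.items() if flag == 0]


# Hypothetical flag values for Dword0: Byte1 is faulty, and inside Byte1
# the DQ8 signal is the bit with the operation fault.
byte_flags = {"Byte0": 1, "Byte1": 0, "Byte2": 1, "Byte3": 1}
byte1_bit_flags = {"DQ8": 0, "DQ9": 1, "DQ10": 1}

print(find_faulty(byte_flags))       # ['Byte1']
print(find_faulty(byte1_bit_flags))  # ['DQ8']
```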
603. Acquire a bit width value of a second bit of a target byte, wherein the second bit is a bit in the HBM in which no operation fault has occurred, and the target byte is the faulty byte or a redundant byte in the HBM.
In this example, in the case that the first bit of the failed byte fails, the bit width value of the first bit may be updated using the bit width values of the other bits in the failed byte that do not fail, thereby completing the repair of the first bit. Or, since the HBM provides a redundant channel, that is, the Redundant Data (RD) bit in the foregoing table 2, in this case, if the first bit of the failed byte has an operation failure, the bit width value of the bit in the redundant byte corresponding to the redundant channel may be used to update the bit width value of the first bit, so as to complete the failure repair of the first bit. Therefore, regardless of the manner in which the failover to the first bit is accomplished, the bit width value of the second bit of the target byte needs to be obtained. The described bit width values can be understood as data widths. The described target byte may be a faulty byte or may be a redundant byte corresponding to RD in HBM.
Illustratively, the host may obtain the bit width value of the second bit of the target byte by obtaining the target configuration information. The described target configuration information may be understood with reference to the foregoing table 6, and will not be described herein.
The execution sequence of the step 602 and the step 603 is not particularly limited in the embodiment of the present application. For example, step 603 may be performed first, and then step 602 may be performed; alternatively, step 602 and step 603 are performed simultaneously.
604. The bit width value of the first bit of the failed byte is updated based on the bit width value of the second bit of the target byte to repair the first bit.
In this example, after determining, according to the first information, that the soft repair operation identified by the first field is performed on the faulty bit in the Dword in which the operation fault has occurred, the soft repair operation may be performed on the bit in the faulty byte in which the operation fault has occurred by using the obtained bit width value of the second bit, that is, the bit width value of the first bit of the faulty byte is updated based on the bit width value of the second bit of the target byte.
Depending on whether the faulty byte is composed of any one byte of the at least two bytes in the Dword or of two consecutive bytes, a different repair mode may be employed to update the bit width value of the first bit in the faulty byte. On this basis, the host can acquire the soft repair configuration information corresponding to the Dword, and soft-repair the bit with the operation fault in the faulty byte using the repair mode indicated in the soft repair configuration information. Illustratively, the configuration of the different repair modes in each Dword can be understood with reference to the following Table 12, namely:
Table 12
As can be seen from Table 12, taking Byte0 in Dword0 as an example, the corresponding soft repair configuration table describes the value of each bit in Byte0. "h9" indicates that when the DBI signal in Byte0 fails, mode 2 is used for the repair. Mode 2 can be understood with reference to case (1) in mode (3) and case (1) in mode (4) below, and is not described in detail here. In addition, "hE" indicates that when an operation fault occurs in a bit of one of two consecutive bytes, the bit width value of the bit in the byte without the operation fault needs to be set to a disabled state. "hF" indicates that the other bits without a fault do not need soft repair and keep their original bit width values. It should be noted that the identifiers such as "hE" shown in Table 12 above are only illustrative, and other identifiers may be used in practical applications, which is not limited by the embodiment of the present application.
In addition, the soft repair configuration table of each Byte in Dword0, other Dword1, dword2, and Dword3 may also be understood by referring to the soft repair configuration table of Byte0 in Dword0, which is not described herein.
In combination with the contents shown in table 6 and table 12 above, how to soft-repair the first bit is described in detail below for the case of a faulty byte composed of different bytes.
(3) The fault byte is composed of any one byte of at least two bytes
In this example, when the faulty byte is composed of only one byte, a different repair mode may be employed to update the bit width value of the first bit in the faulty byte depending on the type of the target byte. For example, when the target byte is the faulty byte, the bit width value of the first bit may be updated from the bit width value of a bit within the faulty byte in which no operation fault has occurred; when the target byte is a redundant byte in the HBM, the bit width value of the first bit may be updated from the bit width value of a bit of the redundant byte. In the following, how to soft-repair the first bit is described in detail for the different types of target byte.
(1) The target byte is a redundant byte
Illustratively, updating the bit width value of the first bit of the failed byte based on the bit width value of the second bit of the target byte may be accomplished in the following manner: when the first bit is a DBI signal, the bit width value of the DBI signal is changed to the bit width value of any bit in the redundant byte.
In this example, since the data bit width of the redundancy byte is 8 bits in total, when the faulty byte is composed of only one byte, the redundancy byte is sufficient to be allocated to the faulty byte for soft repair. Therefore, when the first bit with the operation fault is determined to be the DBI signal in the faulty byte, the bit width value of the DBI signal can be directly changed to the bit width value of any bit in the redundancy byte.
(2) The target byte is the fault byte
Illustratively, updating the bit width value of the first bit of the failed byte based on the bit width value of the second bit of the target byte may be accomplished in the following manner: when the first bit is a Data Mask (DM) signal or a Data Queue (DQ) signal, the bit width value of the DM signal or the bit width value of the DQ signal is changed to the bit width value of the DBI signal of the faulty byte.
In this example, in the case that the target byte is the faulty byte, if an operation fault occurs in the DM signal or the DQ signal in the faulty byte and no operation fault occurs in the DBI signal, the bit width value of the DBI signal in the current faulty byte may be used directly to change the bit width value of the first bit in which the operation fault has occurred, that is, to change the bit width value of the DM signal or the DQ signal in the faulty byte.
It should be noted that, after the repair of the first bit is implemented by the repair method of (1) or (2) in the above manner (3), the original DBI function in the faulty byte corresponding to the first bit is not used any more, and the normal use of the DBI function in the other bytes not performing the soft repair operation can still be continued.
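The choice between cases (1) and (2) of mode (3) can be summarised in a small decision function. The sketch below is illustrative only: the names `RepairSource` and `choose_source` are hypothetical, and signals are identified by strings such as "DBI1", "DM0", or "DQ8", which is an assumption about how a host might label them.

```python
from enum import Enum


class RepairSource(Enum):
    REDUNDANT_BYTE = "any bit of the redundant byte"   # case (1)
    OWN_DBI = "DBI signal of the faulty byte"          # case (2)


def choose_source(first_bit: str, dbi_faulty: bool = False) -> RepairSource:
    """Pick the source of the new bit width value for a single faulty byte."""
    kind = first_bit.rstrip("0123456789")  # "DQ8" -> "DQ", "DBI1" -> "DBI"
    if kind == "DBI":
        # The faulty bit is the DBI signal itself: use the redundant byte.
        return RepairSource.REDUNDANT_BYTE
    if kind in ("DM", "DQ") and not dbi_faulty:
        # A DM or DQ signal failed and the DBI signal is healthy: reuse the DBI.
        return RepairSource.OWN_DBI
    raise ValueError(f"combination not covered by the text: {first_bit!r}")


print(choose_source("DBI1").name)  # REDUNDANT_BYTE
print(choose_source("DQ8").name)   # OWN_DBI
```

A combination the text does not cover, such as a DQ fault together with a faulty DBI signal, is rejected rather than guessed at.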
(4) The fault byte is composed of two consecutive bytes of at least two bytes
In this example, when the faulty byte is composed of a first byte and a second byte, a different repair mode may be employed to update the bit width value of the first bit in the faulty byte depending on the type of the target byte. For example, when the target byte is the faulty byte, the bit width value of the first bit may be updated from the bit width value of a bit within the faulty byte in which no operation fault has occurred; when the target byte is a redundant byte in the HBM, the bit width value of the first bit may be updated from the bit width value of a bit of the redundant byte. It should be noted that the first byte and the second byte are two consecutive bytes of the at least two bytes. In the following, how to soft-repair the first bit is described in detail for the different types of target byte.
(1) The target byte is a redundant byte
Illustratively, updating the bit width value of the first bit of the failed byte based on the bit width value of the second bit of the target byte may be accomplished in the following manner: if the first bit of any one byte in the first bit of the first byte and the first bit of the second byte has operation faults, changing the bit width value of the first bit of the byte with operation faults into the bit width value of any bit in the redundant byte.
In this example, since the data bit width of the redundancy bytes is 8 bits in total, the 4 dwords shown in the foregoing table 5 have 16 independent bytes in total, and it is insufficient to allocate one independent redundancy byte for each of the 16 independent bytes for soft repair. At this time, two consecutive bytes may be set as one group, and soft repair may be performed on two consecutive bytes in the same group. For example, the two bytes can be repaired simultaneously by grouping consecutive Byte0 and Byte1 in Dword0 in table 5; and, taking continuous Byte2 and Byte3 in Dword0 as a group, and repairing at the same time. Similarly, for Dword1, byte0 and Byte1 in the Dword1 may be grouped together, and Byte2 and Byte3 may be grouped together. For Dword2, byte0 and Byte1 in the Dword2 may be grouped together, and Byte2 and Byte3 may be grouped together. For Dword3, byte0 and Byte1 in Dword3 may also be grouped together, and Byte2 and Byte3 may also be grouped together.
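The grouping described above can be enumerated with a short Python sketch. The function name `byte_groups` and the string labels are hypothetical; the counts (4 Dwords, 4 bytes per Dword, pairs of consecutive bytes) follow the text.

```python
def byte_groups(num_dwords: int = 4, bytes_per_dword: int = 4) -> list[tuple[str, str]]:
    """Pair consecutive bytes of each Dword: (Byte0, Byte1) and (Byte2, Byte3)."""
    groups = []
    for d in range(num_dwords):
        for b in range(0, bytes_per_dword, 2):
            groups.append((f"Dword{d}.Byte{b}", f"Dword{d}.Byte{b + 1}"))
    return groups


groups = byte_groups()
print(len(groups))  # 8
print(groups[0])    # ('Dword0.Byte0', 'Dword0.Byte1')
print(groups[1])    # ('Dword0.Byte2', 'Dword0.Byte3')
```

With 16 bytes paired this way there are 8 groups, the same number as the 8 bits of redundancy stated in the text.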
In the case of a faulty byte composed of two consecutive bytes, the two consecutive bytes are allocated within the same group, each group is allocated only one bit of the redundant byte for soft repair, and one bit of the redundant byte can repair only one byte. Therefore, when the first bit of only one of the first byte and the second byte fails, the allocated redundancy is sufficient for soft repair, and the bit width value of the first bit of the byte with the operation fault can be changed to the bit width value of any bit in the redundant byte. For example, if the first bit of the first byte fails and the first bit of the second byte does not fail, the allocated redundancy is sufficient to repair the first byte, and the bit width value of the first bit of the first byte is changed to the bit width value of any bit in the redundant byte. Similarly, if the first bit of the second byte fails but the first bit of the first byte does not fail, the bit width value of the first bit of the second byte may be changed to the bit width value of any bit in the redundant byte. Using the redundant byte to repair the byte with the operation fault preserves the full functionality of the DBI signal in the faulty byte.
In addition, since the redundant bytes can only be allocated to one byte for repair, for the byte in the same group in which no operation failure occurs, the bit width value of the first bit of the byte in which no operation failure occurs is changed to a target value, and the first bit of the byte in which no operation failure occurs is indicated to be in a disabled state by the target value. The target values described may include, but are not limited to, "hE", where "E" may be understood as a hexadecimal value, i.e., "1110". In some examples, the target value may be another value, which is not limited in the embodiments of the present application.
For example, if it is determined that the fault Byte is formed by Byte0 and Byte1 in Dword0 in the above table 5, and DQ12 in Byte1 is the first bit of the operation fault, the corresponding configuration information may be understood by referring to the content of the above table 8, which is not repeated herein.
Or, if it is determined that the fault Byte is formed by Byte3 in Dword2 in table 5, and DBI11 in Byte3 is the first bit of the operation fault, the corresponding configuration information may be understood with reference to table 13, that is:
Table 13
Bit position | Field | Configuration value
71:56 | DWORD3[15:0] | 16'hFFFF
55:40 | DWORD2[15:0] | 16'h9EFF
39:32 | Reserved | 8'hFF
31:16 | DWORD1[15:0] | 16'hFFFF
15:0 | DWORD0[15:0] | 16'hFFFF
In combination with the foregoing Table 5, it can be known that the DBI11 with the operation fault is the 9th bit of Byte3 in Dword2. As can be seen from Table 13, the bit width value of DBI11 can be changed in mode 2 to the bit width value of any bit in the redundant byte; for example, the configuration value becomes "h9". Moreover, since Byte2 and Byte3 are two consecutive bytes within the same group, the bit width value of the bits in Byte2 in Dword2 also needs to be configured as "hE", i.e., each bit in Byte2 is set to a disabled state. In addition, for bits in other bytes without an operation fault, such as those in Dword3, Dword0, and Dword1, and in Byte0 and Byte1 in Dword2, the configuration values may be set to "hF", i.e., the bit width values of the corresponding bits need not be changed.
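To make the Table 13 example concrete, the following Python sketch builds the DWORD2 value from per-byte codes and packs the four Dword fields into one 72-bit word using the bit positions listed in the table. The helper names `dword_value` and `pack_config` are hypothetical, and the per-byte nibble layout (Byte3 in the top four bits) is an assumption inferred from the example values.

```python
NO_REPAIR = 0xF          # "hF": bit width values are left unchanged
DISABLED = 0xE           # "hE": partner byte in the same group is disabled
DBI_VIA_REDUNDANT = 0x9  # "h9": mode 2, DBI repaired from the redundant byte


def dword_value(byte_codes: list[int]) -> int:
    """Combine four 4-bit codes; byte_codes[i] is the code for Byte i."""
    if len(byte_codes) != 4 or any(not 0 <= c <= 0xF for c in byte_codes):
        raise ValueError("expected four 4-bit codes")
    return sum(code << (4 * i) for i, code in enumerate(byte_codes))


def pack_config(dwords: list[int], reserved: int = 0xFF) -> int:
    """Pack DWORD0..DWORD3 into bits 15:0, 31:16, 55:40, 71:56; reserved in 39:32."""
    d0, d1, d2, d3 = dwords
    return (d3 << 56) | (d2 << 40) | (reserved << 32) | (d1 << 16) | d0


# DBI11 in Byte3 of Dword2 is faulty; Byte2 shares its group and is disabled.
dword2 = dword_value([NO_REPAIR, NO_REPAIR, DISABLED, DBI_VIA_REDUNDANT])
config = pack_config([0xFFFF, 0xFFFF, dword2, 0xFFFF])
print(f"{dword2:04X}")   # 9EFF
print(f"{config:018X}")  # FFFF9EFFFFFFFFFFFF
```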
(2) The target byte is the fault byte
In some examples, updating the bit width value of the first bit of the failed byte according to the bit width value of the second bit of the target byte includes: if the first bit of the first byte and the first bit of the second byte are both in operation failure, changing the bit width value of the first bit of the first byte to the bit width value of the DBI signal in the first byte and changing the bit width value of the first bit of the second byte to the bit width value of the DBI signal in the second byte.
In this example, the redundancy byte has a total data bit width of 8 bits, the faulty byte is composed of two consecutive bytes (i.e., the first byte and the second byte), and both the first byte and the second byte have an operation fault, so the allocated redundancy byte is insufficient to repair both faulty bytes. Therefore, each byte with an operation fault can be repaired using the DBI signal inside that byte. Illustratively, when the first bit of the first byte fails, the bit width value of the first bit of the first byte may be modified using the bit width value of the DBI signal inside the first byte. Specifically, when the first bit is a DQ signal or a DM signal in the first byte, the bit width value of the DBI signal in the first byte may be used to update the bit width value of the DQ signal or the DM signal in the first byte. Similarly, when the first bit of the second byte also fails, the bit width value of the DBI signal inside the second byte may be used to modify the bit width value of the first bit of the second byte. Specifically, when the first bit is a DQ signal or a DM signal in the second byte, the bit width value of the DBI signal in the second byte may be used to update the bit width value of the DQ signal or the DM signal in the second byte.
For example, if it is determined that the fault Byte is formed by Byte2 and Byte3 in Dword0 in the foregoing table 5, and the operation faults occur in DM2 in Byte2 and DQ25 in Byte3 in Dword0, the corresponding configuration information may be understood by referring to the content of the foregoing table 10, which is not repeated herein. In other examples, updating the bit width value of the first bit of the failed byte based on the bit width value of the second bit of the target byte may also be implemented in the following manner: if any one of the first bit of the first byte and the first bit of the second byte has an operation fault, changing the bit width value of the first bit of the byte with the operation fault into the bit width value of the DBI signal in the corresponding byte. It should be noted that, in this example, how to change the bit width value of the first bit of the byte with the operation failure to the bit width value of the DBI signal in the corresponding byte may be specifically understood by referring to the content of the operation failure of the bits in both bytes, which is not described herein.
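The two-byte cases of mode (4) can be summarised as one decision over a group of two consecutive bytes. The sketch below is illustrative, and the name `plan_pair` and the returned labels are hypothetical. For the case where only one byte is faulty it picks the redundant-byte option of case (1); the text also allows the faulty byte to use its own DBI signal instead.

```python
def plan_pair(first_faulty: bool, second_faulty: bool) -> tuple[str, str]:
    """Return the repair action for (first byte, second byte) of one group."""
    if first_faulty and second_faulty:
        # The redundancy of the group cannot cover both bytes,
        # so each byte falls back to its own DBI signal.
        return ("own DBI", "own DBI")
    if first_faulty:
        # Redundancy repairs the faulty byte; the partner is disabled ("hE").
        return ("redundant byte", "disabled")
    if second_faulty:
        return ("disabled", "redundant byte")
    return ("unchanged", "unchanged")  # "hF": nothing to repair


print(plan_pair(True, True))    # ('own DBI', 'own DBI')
print(plan_pair(False, True))   # ('disabled', 'redundant byte')
```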
In other examples, after the repair of the first bit is performed in step 604, the data may also be written to the repaired first bit; or, the data is read from the repaired first bit. It should be noted that, the specific implementation process of writing the data into the repaired first bit may be understood by referring to the foregoing contents of fig. 1A to 1D, which is not repeated here. In addition, the specific implementation process of reading the data from the repaired first bit can be understood by referring to the foregoing contents of fig. 2A to fig. 2D, which is not described herein.
In the embodiment of the present application, the first information indicates that the repair operation identified by the first field needs to be performed on the faulty byte in the Dword having an operation fault in the high bandwidth memory (HBM), where the Dword includes at least two bytes and the faulty byte is a byte in the at least two bytes. After the first information is obtained and it is determined according to the first information that the repair operation identified by the first field is to be performed on the faulty byte in the Dword having the operation fault, the first bit of the faulty byte can be obtained, where the first bit is the bit having the operation fault in the Dword. Then, with the faulty byte or the redundant byte in the HBM taken as the target byte, the bit width value of a second bit of the target byte is obtained, where the second bit is a bit in which no operation fault has occurred. Thus, whether the target byte is the faulty byte or the redundant byte in the HBM, the bit width value of the first bit of the faulty byte can be updated according to the bit width value of the second bit of the target byte, thereby completing the fault repair of the first bit, that is, of the bit with the operation fault in the Dword. In other words, the bit width value of the bit with the operation fault can be updated either with the bit width value of a bit in the faulty byte in which no operation fault has occurred, or with the bit width value of a bit in the redundant byte in the HBM, so that the bit with the operation fault is repaired in time and the accuracy of subsequent HBM data reads and writes is ensured.
The foregoing description of the solution provided by the embodiments of the present application has been mainly presented in terms of a method. It should be understood that, in order to implement the above-described functions, hardware structures and/or software modules corresponding to the respective functions are included. Those of skill in the art will readily appreciate that the various illustrative modules and algorithm steps described in connection with the embodiments disclosed herein may be implemented as hardware or combinations of hardware and computer software. Whether a function is implemented as hardware or computer software driven hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The embodiment of the application can divide the functional modules of the device according to the method example, for example, each functional module can be divided corresponding to each function, and two or more functions can be integrated in one processing module. The integrated modules may be implemented in hardware or in software functional modules. It should be noted that, in the embodiment of the present application, the division of the modules is schematic, which is merely a logic function division, and other division manners may be implemented in actual implementation.
The following describes the data repairing apparatus in detail in the embodiment of the present application, and fig. 7 is a schematic diagram of an embodiment of the data repairing apparatus provided in the embodiment of the present application. The described data repair device may include, but is not limited to, a server, a terminal device, etc., and the present application is not limited thereto. As shown in fig. 7, the data repair apparatus includes an acquisition unit 701 and a processing unit 702.
The obtaining unit 701 is configured to obtain first information, where the first information includes a first field, and the first information is used to indicate that a repair operation identified by the first field needs to be performed on a faulty byte in a Dword that has an operation fault in the high bandwidth memory HBM, where the Dword includes at least two bytes, and the faulty byte is a byte in the at least two bytes. It should be noted that, when the values of the first fields are different, it may be identified that the repair operations that need to be performed are different. For example, in the case that the first value is used to indicate the hard repair operation, when the value of the first field is the first value, it is known that the hard repair operation needs to be performed through the first value; similarly, in the case that the second value is used to indicate the soft repair operation, when the value of the first field is the second value, it is known through the second value that the soft repair operation needs to be performed, where the first value is different from the second value. It is specifically understood that the foregoing description of step 501 in fig. 5 or step 601 in fig. 6 is referred to, and details are not repeated herein.
An obtaining unit 701, configured to obtain, after determining, according to the first information, that the repair operation identified by the first field needs to be performed on the faulty byte in the Dword where the operation fault has occurred, a first bit of the faulty byte, where the first bit is a bit in the Dword where the operation fault has occurred. It is specifically understood that the foregoing description of step 502 in fig. 5 or step 602 in fig. 6 is referred to, and details are not repeated herein.
An obtaining unit 701, configured to obtain a bit width value of a second bit of the target byte, where the second bit is a bit in the HBM in which no operation failure occurs, and the target byte is a failure byte or the target byte is a redundant byte in the HBM. It is specifically understood that the foregoing description of step 503 in fig. 5 or step 603 in fig. 6 may be referred to, which is not described herein.
The processing unit 702 is configured to update the bit width value of the first bit of the faulty byte based on the bit width value of the second bit of the target byte, so as to repair the first bit. It is specifically understood that the foregoing description of step 506 in fig. 5 or step 604 in fig. 6 is referred to, and details are not repeated herein.
In some optional examples, the repair operation includes a hard repair operation. The obtaining unit 701 is further configured to obtain hard repair confirmation information when the value of the first field is a first value, before the bit width value of the first bit of the faulty byte is updated based on the bit width value of the second bit of the target byte to repair the first bit. The hard repair confirmation information represents at least one confirmation of whether to perform the hard repair operation on the faulty byte in the Dword in which the operation fault has occurred, and the first value is used to indicate the hard repair operation. The processing unit 702 is configured to determine, according to the hard repair confirmation information, to perform the hard repair operation on the faulty byte in the Dword in which the operation fault has occurred. For details, refer to the foregoing descriptions of steps 504 to 505 in fig. 5, which are not repeated herein.
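The confirmation gate for the hard repair operation can be sketched as follows. This is only an illustration under stated assumptions: the encoding `FIRST_VALUE`, the function name `hard_repair_confirmed`, and the rule that every recorded confirmation must be affirmative are all hypothetical, since the description states only that the confirmation information represents at least one confirmation.

```python
from typing import Sequence

# Hypothetical encoding of the first value, which indicates the hard repair operation.
FIRST_VALUE = 0b01


def hard_repair_confirmed(first_field: int, confirmations: Sequence[bool]) -> bool:
    """Decide whether the hard repair operation may proceed.

    Assumption of this sketch: at least one confirmation must be present,
    and every confirmation that is present must be affirmative.
    """
    if first_field != FIRST_VALUE:
        # The first field does not identify the hard repair operation.
        return False
    return len(confirmations) > 0 and all(confirmations)
```

Requiring a non-empty list guards against treating a missing confirmation as consent, which matters because a hard repair is typically not reversible.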
In other optional examples, the faulty byte consists of any one of the at least two bytes.
In other optional examples, the target byte is the redundant byte. The processing unit 702 is configured to: when the first bit is the data bus inversion (DBI) signal, change the bit width value of the DBI signal to the bit width value of any bit in the redundant byte. For details, refer to the foregoing description of step 506 in fig. 5 or step 604 in fig. 6, which is not repeated herein.
In other optional examples, the target byte is the faulty byte. The processing unit 702 is configured to: when the first bit is a data mask (DM) signal or a data queue (DQ) signal, change the bit width value of the DM signal or the bit width value of the DQ signal to the bit width value of the DBI signal of the faulty byte. For details, refer to the foregoing description of step 506 in fig. 5 or step 604 in fig. 6, which is not repeated herein.
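The two single-byte cases above can be sketched in Python as follows. The sketch models the "bit width value" of each signal as an integer lane-select code held in a dictionary; this modelling, the signal names `DQ0` to `DQ7`, `DM`, and `DBI`, and the function `repair_single_byte` are assumptions made for illustration and are not taken from the description.

```python
from typing import Dict, Sequence

LaneMap = Dict[str, int]  # signal name -> assumed lane-select code ("bit width value")


def repair_single_byte(lanes: LaneMap, faulty_signal: str,
                       redundant_lanes: Sequence[int]) -> LaneMap:
    """Return a copy of the lane map with the faulty signal re-pointed.

    - Faulty DBI signal: take the value of any bit in the redundant byte.
    - Faulty DM or DQ signal: take the value of the DBI signal of the same byte.
    """
    repaired = dict(lanes)
    if faulty_signal == "DBI":
        if not redundant_lanes:
            raise ValueError("no redundant bit available")
        # "Any bit" of the redundant byte; this sketch simply picks the first.
        repaired["DBI"] = redundant_lanes[0]
    elif faulty_signal == "DM" or faulty_signal.startswith("DQ"):
        repaired[faulty_signal] = lanes["DBI"]
    else:
        raise ValueError(f"unknown signal: {faulty_signal!r}")
    return repaired
```

The sketch does not model what becomes of the DBI function once its value is reused for a DM or DQ signal, because the passage above does not address it.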
In other optional examples, the faulty byte consists of a first byte and a second byte, the first byte and the second byte being two consecutive bytes of the at least two bytes.
In other optional examples, the target byte is the faulty byte. The processing unit 702 is configured to: if both the first bit of the first byte and the first bit of the second byte have an operation fault, change the bit width value of the first bit of the first byte to the bit width value of the DBI signal in the first byte, and change the bit width value of the first bit of the second byte to the bit width value of the DBI signal in the second byte. For details, refer to the foregoing description of step 506 in fig. 5 or step 604 in fig. 6, which is not repeated herein.
In other optional examples, the target byte is the faulty byte. The processing unit 702 is configured to: if the first bit of either the first byte or the second byte has an operation fault, change the bit width value of the first bit of the byte having the operation fault to the bit width value of the DBI signal in that byte. For details, refer to the foregoing description of step 506 in fig. 5 or step 604 in fig. 6, which is not repeated herein.
In other optional examples, the target byte is the redundant byte. The processing unit 702 is configured to: if the first bit of either the first byte or the second byte has an operation fault, change the bit width value of the first bit of the byte having the operation fault to the bit width value of any bit in the redundant byte. For details, refer to the foregoing description of step 506 in fig. 5 or step 604 in fig. 6, which is not repeated herein.
In other optional examples, the processing unit 702 is further configured to: change the bit width value of the first bit of the byte in which no operation fault has occurred to a target value, where the target value is used to indicate that the first bit of the byte in which no operation fault has occurred is in a disabled state. For details, refer to the foregoing description of step 506 in fig. 5 or step 604 in fig. 6, which is not repeated herein.
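The two-byte cases above can be sketched in Python as follows. As an assumption of this sketch, each byte is modelled as a dictionary from signal name to an integer lane-select code standing in for the "bit width value"; the names `repair_byte_pair`, `DISABLED`, and `redundant_lane` are hypothetical.

```python
from typing import Dict, Optional, Set

LaneMap = Dict[str, int]  # signal name -> assumed lane-select code ("bit width value")
DISABLED = -1             # hypothetical target value marking a bit as disabled


def repair_byte_pair(pair: Dict[str, LaneMap], bit: str, faulty: Set[str],
                     redundant_lane: Optional[int] = None) -> Dict[str, LaneMap]:
    """Repair the same bit position in a pair of consecutive bytes.

    - Both bytes faulty: each byte's bit takes the value of its own DBI signal.
    - One byte faulty: its bit takes the value of its own DBI signal, or of a
      redundant bit when `redundant_lane` is given; the matching bit of the
      healthy byte is set to DISABLED.
    """
    if len(pair) != 2 or not faulty or not faulty <= set(pair):
        raise ValueError("expected two bytes and a non-empty subset of faulty bytes")
    repaired = {name: dict(lanes) for name, lanes in pair.items()}
    for name in pair:
        if name in faulty:
            if len(faulty) == 1 and redundant_lane is not None:
                repaired[name][bit] = redundant_lane
            else:
                repaired[name][bit] = pair[name]["DBI"]
        else:
            # No operation fault in this byte: disable the matching bit.
            repaired[name][bit] = DISABLED
    return repaired
```

The input maps are copied rather than modified in place, so a caller can compare the state before and after the repair.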
In other optional examples, the obtaining unit 701 is configured to: obtain a first instruction, where the first instruction includes the first bit of the faulty byte. For details, refer to the foregoing description of step 502 in fig. 5 or step 602 in fig. 6, which is not repeated herein.
In other optional examples, the processing unit 702 is further configured to: after the bit width value of the first bit of the faulty byte is updated based on the bit width value of the second bit of the target byte to repair the first bit, write data into the repaired first bit; or read data from the repaired first bit.
The data repair device in the embodiment of the present application is described above from the perspective of modular functional entities, and is described below from the perspective of hardware processing. The data repair device may include, but is not limited to, a server, a terminal device, and the like, which is not limited in the present application. Fig. 8 is a schematic structural diagram of a data repair device according to an embodiment of the present application. The data repair device may vary considerably depending on configuration or performance. The data repair device may include at least one processor 801, a communication line 807, a memory 803, and at least one communication interface 804.
The processor 801 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of programs in accordance with aspects of the present application.
Communication line 807 may include a pathway to transfer information between the aforementioned components.
The communication interface 804 uses any transceiver-like device to communicate with other devices or communication networks, such as Ethernet, a radio access network (RAN), or a wireless local area network (WLAN).
The memory 803 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM) or another type of dynamic storage device that can store information and instructions. The memory may be stand-alone and coupled to the processor via the communication line 807. The memory may also be integrated with the processor.
The memory 803 is used for storing computer-executable instructions for performing the aspects of the present application, and is controlled by the processor 801 for execution. The processor 801 is configured to execute computer-executable instructions stored in the memory 803, thereby implementing the method for repairing data provided in the above-described embodiment of the present application.
Alternatively, the computer-executable instructions in the embodiments of the present application may be referred to as application program codes, which are not particularly limited in the embodiments of the present application.
In a specific implementation, as an embodiment, the data repair device may include multiple processors, such as processor 801 and processor 802 in fig. 8. Each of these processors may be a single-core (single-CPU) processor or may be a multi-core (multi-CPU) processor. A processor herein may refer to one or more devices, circuits, and/or processing cores for processing data (e.g., computer program instructions).
In a specific implementation, as an embodiment, the data repairing apparatus may further include an output device 805 and an input device 806. An output device 805 communicates with the processor 801 and can display information in a variety of ways. The input device 806 is in communication with the processor 801 and may receive input of a target object in a variety of ways. For example, the input device 806 may be a mouse, a touch screen device, a sensing device, or the like.
The data repair device described above may be a general-purpose device or a special-purpose device. In a specific implementation, the data repair device may be a server, a terminal, or the like, or a device having a structure similar to that in fig. 8. The embodiment of the application does not limit the type of the data repair device.
It should be noted that the processor 801 in fig. 8 may cause the data repairing apparatus to perform the method of repairing data in the method embodiment corresponding to fig. 5 or fig. 6 by calling the computer-executable instructions stored in the memory 803.
In particular, the functions/implementation of the processing unit 702 in fig. 7 may be implemented by the processor 801 in fig. 8 invoking computer-executable instructions stored in the memory 803. The functions/implementation of the obtaining unit 701 in fig. 7 may be implemented by the communication interface 804 in fig. 8.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of elements is merely a logical functional division, and there may be additional divisions of actual implementation, e.g., multiple elements or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes any medium capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above-described embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof, and when implemented in software, may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer-executable instructions are loaded and executed on a computer, the processes or functions in accordance with embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (e.g., coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (e.g., infrared, radio, or microwave). A computer-readable storage medium may be any usable medium that a computer can access, or a data storage device, such as a server or a data center, that integrates one or more usable media. A usable medium may be a magnetic medium (e.g., a floppy disk, a hard disk, or a magnetic tape), an optical medium (e.g., a DVD), a semiconductor medium (e.g., a solid state disk (SSD)), or the like.
The above embodiments are only for illustrating the technical solution of the present application, and are not limiting; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (16)

1. A method of data repair, comprising:
acquiring first information, wherein the first information comprises a first field, the first information is used for indicating that a repair operation identified by the first field needs to be executed on a fault byte in a Dword in which an operation fault has occurred in a high bandwidth memory HBM, the Dword comprises at least two bytes, and the fault byte is a byte in the at least two bytes;
after determining, according to the first information, that the repair operation identified by the first field needs to be executed on the fault byte in the Dword in which the operation fault has occurred, acquiring a first bit of the fault byte, wherein the first bit is a bit in the Dword at which the operation fault has occurred;
acquiring a bit width value of a second bit of a target byte, wherein the second bit is a bit in the HBM at which the operation fault has not occurred, and the target byte is the fault byte or the target byte is a redundant byte in the HBM;
and updating the bit width value of the first bit of the fault byte based on the bit width value of the second bit of the target byte so as to repair the first bit.
2. The method of claim 1, wherein the repair operation comprises a hard repair operation, the method further comprising, prior to updating the bit width value of the first bit of the failed byte based on the bit width value of the second bit of the target byte to repair the first bit:
when the value of the first field is a first value, obtaining hard repair confirmation information, wherein the hard repair confirmation information represents at least one confirmation of whether the hard repair operation is to be executed on the fault byte in the Dword in which the operation fault has occurred, and the first value is used for indicating the hard repair operation;
and determining, according to the hard repair confirmation information, to execute the hard repair operation on the fault byte in the Dword in which the operation fault has occurred.
3. The method according to claim 1 or 2, wherein the faulty byte is constituted by any one of the at least two bytes.
4. A method according to claim 3, wherein the target byte is the redundant byte; the updating the bit width value of the first bit of the fault byte based on the bit width value of the second bit of the target byte comprises:
when the first bit is a data bus inversion DBI signal, changing the bit width value of the DBI signal into the bit width value of any bit in the redundant byte.
5. A method according to claim 3, wherein the target byte is the faulty byte; the updating the bit width value of the first bit of the fault byte based on the bit width value of the second bit of the target byte comprises:
when the first bit is a data mask DM signal or a data queue DQ signal, changing the bit width value of the DM signal or the bit width value of the DQ signal into the bit width value of the DBI signal of the fault byte.
6. The method according to claim 1 or 2, wherein the faulty byte is constituted by a first byte and a second byte, the first byte and the second byte being consecutive two bytes of the at least two bytes.
7. The method of claim 6, wherein the target byte is the failed byte; the updating the bit width value of the first bit of the fault byte based on the bit width value of the second bit of the target byte comprises:
if the first bit of the first byte and the first bit of the second byte have operation faults, changing the bit width value of the first bit of the first byte into the bit width value of the DBI signal in the first byte, and changing the bit width value of the first bit of the second byte into the bit width value of the DBI signal in the second byte.
8. The method of claim 6, wherein the target byte is a failed byte; the updating the bit width value of the first bit of the fault byte based on the bit width value of the second bit of the target byte comprises:
and if the first bit of any one byte in the first bit of the first byte and the first bit of the second byte has operation faults, changing the bit width value of the first bit of the byte with operation faults into the bit width value of the DBI signal in the corresponding byte.
9. The method of claim 6, wherein the target byte is the redundant byte; the updating the bit width value of the first bit of the fault byte based on the bit width value of the second bit of the target byte comprises:
and if the first bit of any one byte in the first bit of the first byte and the first bit of the second byte has operation faults, changing the bit width value of the first bit of the byte with operation faults into the bit width value of any bit in the redundant byte.
10. The method according to claim 8 or 9, characterized in that the method further comprises:
and changing the bit width value of the first bit of the byte without the operation fault into a target value, wherein the target value is used for indicating that the first bit of the byte without the operation fault is in a disabled state.
11. The method according to claim 1 or 2, wherein said obtaining the first bit of the faulty byte comprises:
a first instruction is fetched, the first instruction comprising a first bit of the faulty byte.
12. The method of claim 1 or 2, wherein after the updating the bit width value of the first bit of the failed byte based on the bit width value of the second bit of the target byte to repair the first bit, the method further comprises:
writing data into the repaired first bit; or,
reading data from the repaired first bit.
13. A data repair device, comprising:
an obtaining unit, configured to obtain first information, where the first information includes a first field, where the first information is used to indicate that a repair operation identified by the first field needs to be performed on a faulty byte in a Dword that has an operation fault in the high bandwidth memory HBM, where the Dword includes at least two bytes, and the faulty byte is a byte in the at least two bytes;
the obtaining unit is configured to obtain a first bit of the failure byte after determining, according to the first information, that a repair operation identified by the first field needs to be performed on the failure byte in the Dword where the operation failure occurs, where the first bit is a bit of the Dword where the operation failure occurs;
the acquiring unit is configured to acquire a bit width value of a second bit of a target byte, where the second bit is a bit in the HBM in which the operation failure does not occur, and the target byte is the failure byte or the target byte is a redundant byte in the HBM;
and the processing unit is used for updating the bit width value of the first bit of the fault byte based on the bit width value of the second bit of the target byte so as to repair the first bit.
14. A data repair device, comprising: an input/output (I/O) interface, a processor, and a memory, the memory having program instructions stored therein;
the processor is configured to execute program instructions stored in a memory to perform the method of any one of claims 1 to 12.
15. A computer readable storage medium comprising instructions which, when run on a computer device, cause the computer device to perform the method of any of claims 1 to 12.
16. A computer program product comprising instructions which, when run on a computer device, cause the computer device to perform the method of any of claims 1 to 12.
CN202211174901.3A 2022-09-26 2022-09-26 Data restoration method and related device Pending CN116991626A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211174901.3A CN116991626A (en) 2022-09-26 2022-09-26 Data restoration method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211174901.3A CN116991626A (en) 2022-09-26 2022-09-26 Data restoration method and related device

Publications (1)

Publication Number Publication Date
CN116991626A true CN116991626A (en) 2023-11-03

Family

ID=88528982

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211174901.3A Pending CN116991626A (en) 2022-09-26 2022-09-26 Data restoration method and related device

Country Status (1)

Country Link
CN (1) CN116991626A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination