WO2021179213A1 - Method and device for repairing a memory chip
- Publication number
- WO2021179213A1 (application PCT/CN2020/078839)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- temperature
- chip
- repair
- data unit
- failed data
Classifications
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11C—STATIC STORES
- G11C11/00—Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor
- G11C11/21—Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements
- G11C11/34—Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements using semiconductor devices
- G11C11/40—Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements using semiconductor devices using transistors
- G11C11/401—Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements using semiconductor devices using transistors forming cells needing refreshing or charge regeneration, i.e. dynamic cells
- G11C11/406—Management or control of the refreshing or charge-regeneration cycles
Definitions
- This application relates to the field of computers, and in particular to a method and device for repairing a storage chip.
- the chip is a key component of the computer system, which can control the operation of the computer system and the reading of data.
- Dynamic random access memory (DRAM) is a large-capacity and high-density semiconductor memory. As the scale of DRAM chips becomes larger and the operating frequency becomes higher and higher, there will be varying degrees of local failure probability for the chips no matter in the chip production process or in the working state of the chip. As a basic technology that can effectively improve chip yield, the repair of DRAM failed cells is playing an increasingly important role in the DRAM field.
- Conventional DRAM repair technology often requires the chip to be offline before a self-check operation is performed, so it cannot self-check and/or repair failures that occur while the chip is in the working state.
- Existing solutions provide a method of repairing DRAM chips online, mainly by testing the chip in the working state and replacing failed data units when they are found.
- However, the failure situation changes constantly, and the repair capability of this replacement method is limited. The existing solution therefore cannot cope with the constantly changing failure situation of a chip in the working state.
- The present application provides a method and device for repairing a memory chip that can adjust the repair method according to the continuously changing failure conditions of the chip in the working state, so as to make full use of repair resources and improve the repair effect.
- A method for repairing a memory chip includes: acquiring self-test parameters of the chip, and performing repair operations on the chip according to the self-test parameters, wherein the self-test parameters include temperature and failure information, and the failure information includes the address of the failed data unit and the error detection information of the failed data unit.
- The failure information may include information about the failed data unit; for example, it may include the address of the failed data unit and may also include the error detection information of the failed data unit.
- The error detection information is explained further below.
- The error detection information may include, for example, the number of times the failed data unit is detected as erroneous during the detection process, such as the number of errors detected by the error checking and correction (ECC) mechanism.
- The error detection information may include the cumulative number of errors detected in the failed data unit, the time at which an error was last detected, the number of consecutive errors detected in the failed data unit, and so on.
- a failure judgment criterion can be set, that is, an evaluation criterion for judging whether a certain data unit is invalid.
- the number of errors can be used as the criterion to set a threshold of the number of errors. When the number of errors in operations such as reading and writing of a data unit is greater than or equal to the threshold of the number of errors, it is determined that the data unit is invalid. When the number of operation errors such as reading and writing of a data unit is less than the threshold of the number of errors, it is determined that the data unit is normal.
- When the chip is repaired according to the self-test parameters, the temperature can be compared with a set temperature threshold to choose between different repair operations.
- When the temperature is high, the refresh period of the chip can be shortened, so that data units that failed because of the temperature increase no longer fail.
- After the temperature is obtained, it can be judged whether the temperature is greater than or equal to the first temperature threshold.
- When the temperature is greater than or equal to the first temperature threshold, the refresh period can be adjusted so that data units that failed because of the temperature rise no longer fail.
- In this case, many failed data units in the various components of the chip may be caused by temperature, and this kind of failure can be resolved by refreshing. The refresh period of the chip can therefore be shortened so that the data units are refreshed in time, which resolves the failure. In this implementation, the more serious failure situation caused by excessive temperature can be resolved by shortening the refresh period.
- the data of the memory chip is stored in the capacitor in the form of electric charge, so it needs to be constantly refreshed to supplement the electric charge of the capacitor.
- the holding time of the charge in the capacitor will change with the change of temperature. In some cases, it can be understood that the higher the temperature, the easier the capacitor will leak, resulting in faster charge leakage, and the faster the stored data will disappear.
- If the refresh period is unchanged, the capacitor is in effect not recharged in time, causing the data unit to fail. If the refresh period is shortened at this point, the capacitor can be recharged in time, the data is refreshed in time, and the failed data unit no longer fails.
- The first temperature threshold may be determined by comparing the number of failed data units at different temperatures. For example, assuming that the number of failed data units is 20 when the temperature is 30°C and reaches 200 when the temperature is 75°C, this indicates that the temperature increase causes a large proportion of the failures, and the number of failures has very likely far exceeded the maximum number of repairs that the repair capability allows. Shortening the refresh period can effectively resolve this more serious failure situation.
- the first temperature threshold can also be determined by comparing the difference between the number of failed data units at different temperatures and the number of failed data units at the factory.
- the first temperature threshold can also be determined by comparing the difference between the number of failed data units at different temperatures and the maximum number of repairs corresponding to the maximum repair capability.
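One of the approaches above, comparing failure counts at different temperatures against the maximum number of repairs, can be sketched as follows. This is a minimal illustration, not the applicant's procedure: the function name, the 55°C data point, and the `max_repairs` value of 64 are assumptions; only the 30°C and 75°C counts come from the text.

```python
from typing import Mapping, Optional


def first_temperature_threshold(
    failures_by_temp: Mapping[float, int], max_repairs: int
) -> Optional[float]:
    """Return the lowest measured temperature at which the number of failed
    data units exceeds the maximum number of repairs, or None if none does."""
    for temp in sorted(failures_by_temp):
        if failures_by_temp[temp] > max_repairs:
            return temp
    return None


# 30 -> 20 and 75 -> 200 are the figures quoted in the text;
# 55 -> 45 and max_repairs=64 are hypothetical values for illustration.
measurements = {30: 20, 55: 45, 75: 200}
threshold = first_temperature_threshold(measurements, max_repairs=64)
print(threshold)
```

In practice the threshold would more likely be set somewhat below the temperature at which capacity is first exceeded, which is consistent with the 70°C to 80°C range suggested for a chip rated to 85°C.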
- The first temperature threshold may also be determined according to the upper temperature limit for normal operation of the chip. For example, if the upper limit of the normal working temperature of the chip is known to be 85°C, the first temperature threshold can be set to a value between 70°C and 80°C, so that repair can keep the chip working normally as far as possible.
- Temperature parameters such as the normal working temperature range of the chip, the upper limit of the normal working temperature, and the lower limit of the normal working temperature all depend on the performance of the chip itself, and may differ from chip to chip.
- the upper limit of the normal working temperature of the chip may also be 90°C, 82°C, 100°C, 105°C, etc., or other upper temperature limits or lower temperature limits. This is not limited.
- the data unit failure may be caused by temperature or other factors, such as physical damage.
- Temperature usually does not change suddenly. Failures that are not caused by temperature (that is, failures caused by other factors) may therefore already have been found and repaired during self-tests performed before the temperature reached the first temperature threshold, and failures caused by other factors that occur while the temperature is rising can still be found and repaired in subsequent self-tests.
- Data units that still fail after the refresh period is shortened can be detected in subsequent self-tests and repaired further based on the failure information. For example, a selection can be made among these failed data units so that some are repaired first, or they can be repaired in a certain order. It can be determined whether to repair some or all of them according to the repair capability and the actual number of failed data units. Alternatively, the number of repairs may be left out of consideration and the failed data units repaired in order of error count, giving priority to those with the most errors. As a further example, priority can be given to repairing failed data units in a certain area of the chip.
- part of the failed data units can also be repaired according to the failure information.
- the failed data unit with the most errors in the failure information can be repaired first, and for example, the failed data unit of certain parts of the chip or a specific area of the chip can be repaired first.
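The priority rule described above (repair the failed data units with the most errors first, within the available repair capability) can be illustrated with a short sketch. The `FailedUnit` record and the `free_replacements` parameter are hypothetical names introduced for this example; the text does not define a data layout.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FailedUnit:
    address: int      # address of the failed data unit
    error_count: int  # cumulative errors reported for this unit


def select_for_repair(
    failed_units: List[FailedUnit], free_replacements: int
) -> List[FailedUnit]:
    """Pick the units to repair now: highest error count first,
    limited by the number of replacement units still available."""
    ranked = sorted(failed_units, key=lambda u: u.error_count, reverse=True)
    return ranked[: max(free_replacements, 0)]


units = [FailedUnit(0x10, 3), FailedUnit(0x20, 9), FailedUnit(0x30, 5)]
chosen = select_for_repair(units, free_replacements=2)
print([hex(u.address) for u in chosen])
```

A region-based priority, which the text also mentions, could be added by sorting on a tuple such as (is in priority region, error count).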
- The above-mentioned temperature-induced failure can be understood as a kind of "soft failure", that is, a failure that can be eliminated by changing the chip parameters.
- The embodiments of the present application are mainly introduced using the "soft failure" caused by temperature as an example, but for other "soft failures" the same method (adjusting parameters) can also be used so that the failed data unit no longer fails; this is not repeated here.
- When the temperature falls to or below a second temperature threshold, the shortened refresh period can be extended or restored, where the second temperature threshold is less than the first temperature threshold.
- Parameters that were changed while the temperature was higher can also be restored; for example, the refresh period can be appropriately extended. That is, suppose that in an earlier detection pass the temperature was greater than or equal to the first temperature threshold, so the refresh period was shortened.
- Once the temperature has dropped, the shortened refresh period can be appropriately extended, or the refresh period can be restored to its value before the shortening.
- When the temperature is less than the first temperature threshold, it may be determined, according to the failure information and/or the temperature, whether to repair some or all of the failed data units.
- When the temperature is less than the first temperature threshold, it can further be determined whether the temperature is greater than the second temperature threshold, and corresponding operations performed. For example, when the temperature is greater than the second temperature threshold and less than the first temperature threshold, the number of failed data units to be repaired may exceed the repair capability, so that not all of them can be repaired. A selection can therefore be made among the failed data units detected in the self-test process: some failed data units are repaired first, or the failed data units are repaired in a certain order.
- Alternatively, the comparison with the second temperature threshold may be skipped; when the temperature is less than the first temperature threshold, some or all of the failed data units are repaired according to the failure information and/or the temperature, or the failed data units are repaired according to a repair order.
- With reference to the first aspect, it is also possible to repair all failed data units when the temperature is less than or equal to the second temperature threshold.
- When the temperature is less than or equal to the second temperature threshold, the repair capability is in effect sufficient to repair the failed data units that may occur.
- In that case, operations such as selecting among the failed data units need not be performed, which amounts to repairing however many failed data units there are. It should be understood, however, that some failed data units can still be given priority at this point without affecting the repair effect.
- the failed data unit to be repaired can be selected from the failed data units to be repaired according to the failure information.
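The three temperature bands described above can be summarized as a small decision function. This is a sketch under stated assumptions: the `Action` names and the example thresholds of 75°C and 50°C are hypothetical, and the boundary handling follows the inequalities given in the text (greater than or equal to the first threshold, less than or equal to the second).

```python
from enum import Enum


class Action(Enum):
    SHORTEN_REFRESH = "shorten the refresh period"
    REPAIR_SELECTED = "repair a prioritized subset of failed units"
    REPAIR_ALL = "repair all failed units"


def choose_action(
    temp_c: float, first_threshold_c: float, second_threshold_c: float
) -> Action:
    """Map the measured temperature to a repair operation."""
    if second_threshold_c >= first_threshold_c:
        raise ValueError("second threshold must be below the first threshold")
    if temp_c >= first_threshold_c:
        return Action.SHORTEN_REFRESH
    if temp_c > second_threshold_c:
        return Action.REPAIR_SELECTED
    return Action.REPAIR_ALL


# Hypothetical thresholds: first = 75 degC, second = 50 degC.
for t in (80, 60, 50):
    print(t, choose_action(t, 75, 50).name)
```

Units that still fail after the refresh period has been shortened would then go through the selective repair path in a later self-test pass, as the text describes.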
- the criterion for determining whether the data unit is invalid can also be adjusted according to parameters such as temperature and refresh period. For example, when the probability of failure is low, the criterion can be set to be relatively strict, and when the probability of failure is high, the criterion can be set to be relatively loose.
- a device for repairing a memory chip includes a unit for executing the method of any one of the implementation manners of the above-mentioned first aspect.
- the device includes an acquisition unit and a processing unit.
- In a third aspect, a chip is provided that includes the device of any one of the implementation manners provided in the second aspect.
- In a fourth aspect, a chip is provided that includes a processor and a data interface.
- The processor reads instructions stored in a memory through the data interface and executes the method of any one of the implementation manners of the first aspect.
- the chip may further include a memory in which instructions are stored, and the processor is configured to execute instructions stored on the memory.
- the processor is configured to execute the method in any one of the implementation manners in the first aspect.
- A computer-readable medium stores program code for execution by a device, and the program code includes instructions for executing the method of any one of the implementation manners of the first aspect.
- a computer program product containing instructions is provided.
- When the computer program product runs on a computer, the computer executes the method of any one of the implementation manners of the first aspect.
- FIG. 1 is a schematic diagram of the application of a method for repairing a chip in the prior art.
- FIG. 2 is an application schematic diagram of a method for repairing a chip provided by an embodiment of the present application.
- FIG. 3 is a schematic flowchart of a method for repairing a chip provided by an embodiment of the present application.
- FIG. 4 is a schematic flowchart of a repair operation of a chip according to a self-check parameter according to an embodiment of the present application.
- FIGS. 5 to 8 are schematic diagrams of the process of repairing the chip according to the self-check parameters provided by the embodiments of the present application.
- FIG. 9 is a schematic block diagram of an apparatus for repairing a chip according to an embodiment of the present application.
- the method for repairing chips provided by the embodiments of the present application can be applied to the repair of various types of chips, especially to the repair of memory chips.
- the memory chips may include, for example, DRAM chips.
- The functions in the chip are divided among one or more dies, for example a logic die and a memory die.
- The memory die can be, for example, a DRAM die. In the embodiments of the present application, the introduction mainly takes a DRAM chip as the example memory chip.
- FIG. 1 is a schematic diagram of the application of a solution for repairing a chip in the prior art
- FIG. 2 is a schematic diagram of the application of a solution for repairing a chip provided by an embodiment of the present application.
- chip repair is implemented by a chip repair device 100, which may include a control logic module 101, an address management module 102, a read and write data processing module 103, and a built-in self-check control module 104.
- the device 100 can be set in a logic chip to repair failures in the DRAM chip.
- The device 100 needs to perform operations such as self-check and repair while the chip is offline, that is, while the chip is in a non-working state, for example when the chip is tested before leaving the factory, or when repairs are performed after a problem such as a chip failure.
- In this detection and repair mode of the chip, the built-in self-check control module 104 can initiate a self-check operation according to a configurable algorithm.
- Failed data units can be discovered through, for example, an error checking and correction (ECC) mechanism.
- When the built-in self-check control module 104 finds a failed data unit, it reports the information of the failed data unit to the testing software.
- the test software is usually independent of the device 100, and is used to analyze the data or information obtained during the self-inspection process to determine a repair strategy for the chip.
- the test software delivers the determined repair strategy to the redundancy control register of each component (bank) or sub-component (subbank) of the DRAM chip, and configures the corresponding replacement unit for the failed data unit.
- the redundant control register may be a module provided in the DRAM chip, and is used to repair the DRAM according to the control signal sent by the device 100 through the control logic module 101.
- The redundancy control register directs read and write operations to the replacement unit that corresponds to the failed data unit. That is, the device 100 replaces the failed data unit on-chip: when the failed data unit is read or written again, the access is redirected to the corresponding replacement unit, which keeps chip reads and writes accurate, while nothing outside the chip perceives that the failed data unit has been replaced.
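The redirection performed through the redundancy control register can be pictured as an address remap table. The sketch below is a software analogy only: the class `RedundancyMap`, its methods, and the row numbers are hypothetical, and a real DRAM implements this in hardware.

```python
from typing import Dict, Iterable, List


class RedundancyMap:
    """Toy model of row replacement: failed rows are remapped to spare rows."""

    def __init__(self, spare_rows: Iterable[int]) -> None:
        self._free: List[int] = list(spare_rows)  # unused replacement rows
        self._remap: Dict[int, int] = {}          # failed row -> spare row

    def repair(self, failed_row: int) -> bool:
        """Assign a spare row to a failed row; False if no spare is left."""
        if failed_row in self._remap:
            return True
        if not self._free:
            return False
        self._remap[failed_row] = self._free.pop(0)
        return True

    def resolve(self, row: int) -> int:
        """Return the row that is physically accessed for a requested row."""
        return self._remap.get(row, row)


rmap = RedundancyMap(spare_rows=[2000, 2001])
rmap.repair(7)
print(rmap.resolve(7), rmap.resolve(8))
```

The caller always addresses row 7; only `resolve` knows that a spare row answers, which mirrors the statement that the replacement is not visible from outside the chip.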
- When the chip is in the working state (also called working mode), the device 100 does not work, that is, the chip cannot perform detection and repair operations. When a failed data unit is read or written, the operation can only be redirected according to the repair strategy determined earlier in the detection and repair mode. If a new failed data unit (one that has not been repaired) is read or written in the working mode, no matching repair strategy can be found, resulting in read and write errors.
- the control logic module 101 may be used to control the mode of the chip, for example, using a control bus to input instructions to the control logic module 101, so as to switch the detection and repair mode and the working mode of the chip.
- The address management module 102 can be used to store addresses.
- the built-in self-inspection control module 104 is used to send a self-inspection instruction, perform a self-inspection on the chip, and report the information of the detected failed data unit to the testing software.
- the built-in self-check control module 104 can be configured with algorithms through the configuration bus.
- the read and write data processing module 103 is used to process the self-inspection data from the built-in self-inspection control module 104 and the read data and write data from the chip.
- The redundant storage resources used for failure repair must be evenly distributed in each component (bank) or sub-component (sub bank) of the chip; for example, 16 redundant rows of replacement units are configured for every 2,000 rows of data units.
- the minimum granularity of redundant replacement can be rows, and each redundant row can have 8 kilobits (bit) of storage bits.
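To make the cost of this fixed reservation concrete, the figures quoted above can be worked through in a few lines. The only assumption added here is that one kilobit equals 1,024 bits; the text does not say which convention it uses.

```python
ROWS_PER_GROUP = 2000          # data rows per group (from the text)
REDUNDANT_ROWS_PER_GROUP = 16  # redundant rows per group (from the text)
BITS_PER_ROW = 8 * 1024        # 8 kilobits per row, assuming 1 kbit = 1024 bits


def redundancy_cost(rows: int, redundant_rows: int, bits_per_row: int):
    """Return (redundant bits per group, redundancy overhead as a fraction)."""
    redundant_bits = redundant_rows * bits_per_row
    overhead = redundant_rows / rows
    return redundant_bits, overhead


redundant_bits, overhead = redundancy_cost(
    ROWS_PER_GROUP, REDUNDANT_ROWS_PER_GROUP, BITS_PER_ROW
)
print(redundant_bits, f"{overhead:.1%}")
```

Under these assumptions each group of 2,000 rows reserves 131,072 bits, an overhead of 0.8%, whether or not any row in that group ever fails.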
- the device 100 may be arranged on a logic chip of a chip.
- this setting is a hard reservation method, and it is easy to cause some replacement units to be hardly used, resulting in a waste of resources; it is also easy to happen that the replacement units corresponding to some components are not enough, and so on.
- the replacement unit can be dynamically invoked, and therefore will not be subject to the above limitations, that is, there is no need to set up uniformly distributed redundant storage resources in each component of the chip.
- FIG. 2 is a schematic diagram of the application of a solution for repairing a chip provided by an embodiment of the present application.
- the chip may be, for example, a memory chip, and the memory chip may be, for example, a DRAM chip.
- chip repair is implemented by the chip repair device 200.
- the device 200 may include a control module 201, an address management module 202, a self-check module 203, a read-write management module 204, a failure adjustment module 205, and a failure repair module 206.
- The device 200 can be installed in the logic die of the chip to detect and repair the memory die of the chip (for example, a DRAM die).
- the device 200 can also be used to detect and repair other memory chips other than DRAM.
- the control module 201 can be used to set the mode or parameters of the chip, for example, it is assumed that the mode of the chip can be set to include the online detection mode and the working mode. The two modes are introduced below.
- In the online detection mode, the chip does not receive or process external access. In other words, the chip cannot be read or written from the outside in this mode.
- the online detection mode is different from the detection and repair mode described in Figure 1.
- In the online detection mode, the chip appears from the outside to be in working condition, which is equivalent to using idle gaps of the chip to test it; there is no need to take the chip offline or detach it from its original position.
- The detection and repair mode described in Figure 1 usually occurs only before the chip leaves the factory or when it is returned to the factory for repair, whereas the online detection mode described in Figure 2 is performed while the chip is still in use, with no need to return it to the factory for maintenance.
- the repair of Figure 2 can be called online repair or dynamic repair.
- the chip described in FIG. 2 may also additionally include a detection and repair mode, that is, the chip described in FIG. 2 may also perform the offline repair or static repair described in FIG. 1 on it.
- the chip can use the device 200 to detect and repair the chip.
- the device 200 may obtain the self-check data generated during the self-check process through, for example, a built-in online self-check control module 203, so as to find out whether there is an abnormality in each part of the chip.
- the self-inspection module 203 can initiate a self-inspection operation according to a configurable algorithm, and the failed data unit can be found through the ECC mechanism during the self-inspection process.
- In the working mode, the chip can receive and process accesses such as read and write operations from the outside.
- In the working mode the chip may not perform detection; however, because external access is not continuous, the chip can still be tested in the gaps between external accesses, which is equivalent to switching the chip to the online detection mode during those gaps.
- The chip can be accessed internally during the gaps between external accesses, for example through internal read and write operations, so that failed data units can be detected.
- the above-mentioned online detection mode and working mode are to facilitate the understanding of the technical solutions of the embodiments of the present application, and such a mode division process does not necessarily exist.
- The mode division may or may not be performed.
- The online detection mode can be regarded as the response process when accesses such as read and write operations from inside the chip are received, and the working mode as the response process when accesses such as read and write operations from outside the chip are received.
- When the modes are divided, the control module 201 can be used to switch the mode of the chip.
- When the modes are not divided, this is equivalent to using the control module 201 to control whether the chip receives and processes internal access or external access.
- For example, the chip can be set to initiate M internal accesses each time N external accesses are received, where M and N are both positive integers. As a further example, it can be set to initiate 1 internal access for every 4 external accesses received. M and N can also be variable; for example, 2 internal accesses are initiated for every 5 external accesses received during one period of time, and 3 internal accesses for every 3 external accesses received during another period, and so on.
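The "M internal accesses per N external accesses" rule can be sketched as a simple interleaving schedule. The function name and the "ext"/"int" labels are hypothetical; the sketch only shows the ordering and does not model timing or the control module itself.

```python
from typing import List


def access_schedule(
    n_external: int, m_internal: int, total_external: int
) -> List[str]:
    """Interleave accesses: after every n_external external accesses,
    insert m_internal internal (self-check) accesses."""
    if n_external <= 0 or m_internal <= 0:
        raise ValueError("N and M must be positive integers")
    schedule: List[str] = []
    for i in range(1, total_external + 1):
        schedule.append("ext")
        if i % n_external == 0:
            schedule.extend(["int"] * m_internal)
    return schedule


# The 1-per-4 example from the text, over 8 external accesses.
print(access_schedule(n_external=4, m_internal=1, total_external=8))
```

Variable M and N, as mentioned in the text, would correspond to calling the function with different arguments for different periods.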
- The mode of the chip can also be controlled by setting time intervals. For example, the chip can be set to receive external access in a first time interval and not to receive external access in a second time interval. The numerical relationship between the first and second time intervals can also be set, and the time intervals can be variable; the possibilities are not listed one by one here.
- A control bus can be used to input commands to the control module 201 to control the mode in real time, or preset commands can be configured in the control module 201, so that the control module 201 controls the mode of the chip, or controls whether the chip receives and/or processes accesses such as internal and/or external read and write operations.
- the address management module 202 may be used to store the address of the data unit, so as to find the storage content such as the corresponding data of a certain data unit according to the address of the data unit.
- the self-inspection module 203 can be used to initiate a self-inspection request, perform a self-inspection on the chip, and send the information of the detected failed data unit to the failure adjustment module 205, so that the failure adjustment module 205 can perform subsequent operations.
- the information of the detected failed data units can also be reported to the testing software at the same time.
- The test software can be independent of the chip, and can also be used to process the test results of the chip. It should be noted that the self-check module 203 differs from the built-in self-check control module 104 shown in FIG. 1: the self-check module 203 can test the chip while the chip is online, whereas the built-in self-check control module 104 requires the chip to be put in an offline state before the chip can be tested.
- the self-check module 203 can obtain error detection information and perform operations such as query and update on the failure repair module 206, so that the repair of the failed data unit can be adjusted in real time.
- the self-checking module 203 can run when the chip is online, and can monitor the result of the ECC error detection in real time, and the newly discovered failed data unit can be dynamically added to the failure repairing module 206 to repair it.
- the self-checking module 203 can be configured with an algorithm, the configuration of the algorithm can be a preset configuration, or the algorithm configuration can be set to be variable, and so on.
- The read-write management module 204 can be used to process the self-check data from the self-check module 203, the repair instructions from the failure repair module 206 and/or the data of repaired failed data units, the read data and write data from the chip, and the ECC detection result information of that data.
- the read-write management module 204 may also be used to send the generated error detection information of the failed data unit to the failure adjustment module 205.
- the failure adjustment module 205 can be used to monitor the parameter information of each component of the chip in real time, and determine how to repair the chip according to the acquired parameter information.
- The parameters can include temperature, failure information, refresh period, and so on. Several of these parameters are introduced below.
- Temperature refers to the temperature of each component of the chip, for example, the real-time temperature of each component of the chip when it is working.
- temperature sensors arranged at various parts of the chip can be used to obtain the temperature.
- the failure information may include the information of the failed data unit, for example, it may include the address of the failed data unit, and may also include the error detection information of the failed data unit.
- the error detection information will be further explained below.
- The error detection information may include, for example, the number of times the failed data unit is detected as erroneous during the detection process, such as the number of errors detected by the ECC mechanism.
- the error detection information may include the cumulative number of errors detected in the failed data unit, the last time the error was detected, the number of consecutive errors detected in the failed data unit, and so on.
- a failure judgment criterion can be set, that is, an evaluation criterion for judging whether a certain data unit is invalid.
- the number of errors can be used as the criterion to set a threshold of the number of errors. When the number of errors in operations such as reading and writing of a data unit is greater than or equal to the threshold of the number of errors, it is determined that the data unit is invalid. When the number of operation errors such as reading and writing of a data unit is less than the threshold of the number of errors, it is determined that the data unit is normal.
- For example, if the cumulative number of read and write errors on data unit A within a period of time before time T1 reaches the error-count threshold, data unit A can be considered to have failed at time T1;
- if, within a period of time before time T2, the cumulative number of read and write errors on data unit A is less than the error-count threshold, data unit A can be considered normal at time T2.
- In the latter case, data unit A may have been repaired, or it may have recovered automatically as the environmental parameters changed.
- the number of consecutive errors or the number of consecutive correctness of the failed data unit during the detection process can be used to determine whether the data unit is invalid. For example, it can be set to determine that the data unit is invalid when reading and writing to a data unit P times are wrong, and when reading and writing to a data unit Q times are correct, it is determined that the data unit is normal, etc., where P, Q Can be a positive integer.
- the time interval between the last error and the current time can also be taken into account when determining whether the data unit is invalid. For example, the data unit can be considered invalid when the cumulative number of errors is greater than or equal to 4; an additional rule can be added so that, when the cumulative number of errors is between 2 and 4 and the time interval between the last error and the current time is less than or equal to 1 minute, the data unit is also considered invalid.
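- the combined criterion in the example above (cumulative errors plus recency of the last error) could be written roughly as follows. The function name and parameter names are hypothetical; the values 4, 2 and 1 minute come from the example, and treating "between 2 and 4" as including 2 is an assumption.

```python
def is_failed(cumulative_errors: int,
              seconds_since_last_error: float,
              hard_threshold: int = 4,
              soft_threshold: int = 2,
              max_gap_seconds: float = 60.0) -> bool:
    """Return True if the data unit should be treated as invalid.

    Rule 1: the cumulative error count alone reaches the hard threshold.
    Rule 2: the count is in [soft_threshold, hard_threshold) and the last
            error happened no more than max_gap_seconds ago.
    """
    if cumulative_errors >= hard_threshold:
        return True
    recent = seconds_since_last_error <= max_gap_seconds
    return soft_threshold <= cumulative_errors < hard_threshold and recent
```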
- the repair operation on the chip may include multiple operations. For example, it may include repair in the traditional sense, such as replacing failed data units, and it may also include adjusting (changing) other chip parameters, such as the refresh period of the chip.
- the refresh period refers to the time interval from the end of the last refresh to the next refresh of the chip. Refreshing the chip is equivalent to a charging process, enabling the data on the chip to be "reinforced" to prevent or reduce data loss.
- DRAM can only retain data for a short period of time. In order to retain data, DRAM uses capacitor storage, so it must be refreshed at intervals. If the memory cell is not refreshed in time, the stored information will be lost. Therefore, it is generally believed that the shorter the refresh period is, the less likely the data loss is when other parameters remain unchanged. Conversely, the longer the refresh period is, the greater the probability of data loss is.
- refreshing consumes energy and resources, and refreshing too frequently wastes both. The refresh period is also limited by the performance of the chip itself, so it cannot be shortened indefinitely.
- the data of the memory chip is stored in the capacitor in the form of electric charge, so it needs to be constantly refreshed to supplement the electric charge of the capacitor.
- the holding time of the charge in the capacitor will change with the change of temperature. In some cases, it can be understood that the higher the temperature, the easier the capacitor will leak, resulting in faster charge leakage, and the faster the stored data will disappear.
- if the refresh cycle is left unchanged as the temperature rises, the capacitor is not recharged in time, causing the data unit to fail. If the refresh cycle is shortened at this point, the capacitor can be recharged in time, so the data is refreshed in time and the failed data unit no longer fails.
- the failure repair module 206 can be used to control the repair operation of the chip, and can also be used to update the failure information table online.
- for example, parameters of the chip, such as the refresh cycle, can be changed according to the self-inspection parameters; as another example, the repair sequence can be determined according to the self-inspection parameters, and the failure information table can be modified online and in real time according to the obtained self-inspection parameters.
- the failure information table of the failure repair module 206 may be composed of a content addressable memory (CAM) and a static random access memory (SRAM). Among them, the address of each failed data unit can be recorded in the CAM for read and write address matching.
- the SRAM failure information table can record the error detection information of the failed data unit. For example, the SRAM failure information table can record the number of ECC errors detected for each failed data unit and the time of the last detection.
- the failure repair module 206 may be used to receive the failure information update instruction from the failure adjustment module 205 and to add or delete entries in the failure information table. For example, when the error detection information of a failed data unit, such as the number of ECC errors and/or the time of the last detection, changes at a certain time, the corresponding failure information in the failure information table can be updated.
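- as a software analogy only, the failure information table described above (addresses matched in the CAM, error detection information held in the SRAM) can be modelled as a dictionary keyed by address. The names `ErrorRecord` and `FailureInfoTable` and their methods are hypothetical; the application describes hardware and does not define such an interface.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ErrorRecord:
    """Plays the role of one SRAM entry: per-unit error detection information."""
    ecc_error_count: int
    last_detected: float  # time of the last detection, in arbitrary units


class FailureInfoTable:
    """Dictionary keys stand in for the CAM address match."""

    def __init__(self) -> None:
        self._records: Dict[int, ErrorRecord] = {}

    def lookup(self, address: int) -> Optional[ErrorRecord]:
        """Return the record for a failed data unit, or None if not listed."""
        return self._records.get(address)

    def report_error(self, address: int, timestamp: float) -> ErrorRecord:
        """Add the unit if it is new, then update its count and last time."""
        record = self._records.get(address)
        if record is None:
            record = ErrorRecord(ecc_error_count=0, last_detected=timestamp)
            self._records[address] = record
        record.ecc_error_count += 1
        record.last_detected = timestamp
        return record

    def delete(self, address: int) -> None:
        """Remove a unit that is no longer considered failed."""
        self._records.pop(address, None)
```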
- the self-inspection module 203 may initiate a self-inspection operation request to the control module 201, and the control module 201 responds to the self-inspection request, and performs a self-inspection operation on the DRAM through the control signal, that is, reads and writes data to the DRAM.
- the self-inspection module 203 collects the self-inspection data generated in the self-inspection process and sends this data to the read-write management module 204. The read-write management module 204 processes the self-inspection data, generates error detection information, and sends it to the failure adjustment module 205. The failure adjustment module 205 receives the error detection information, determines the repair operation of the chip based on the error detection information and other self-inspection parameters, such as the temperature and refresh cycle obtained from other modules, and instructs the failure repair module 206 to execute the corresponding repair operation.
- the failure repair module 206 changes the refresh cycle and other parameters according to the instructions from the failure adjustment module 205, or sends a repair instruction or repair data corresponding to the failed data unit to the read-write management module 204, so that the read-write management module 204 performs the corresponding repair operation on the failed data unit.
- FIG. 3 is a schematic flowchart of a method for repairing a chip provided by an embodiment of the present application.
- the chip may be, for example, a memory chip. The steps in Figure 3 are described below.
- the self-inspection module 203 may initiate a self-inspection operation request to the control module 201 according to a pre-configured algorithm, so that the device 200 can perform a self-inspection on the chip to obtain the self-inspection parameters of each component of the chip or each failed data unit .
- the self-check parameters may include temperature and failure information, where the failure information may include the address of the failed data unit and error detection information.
- not every data unit has its own temperature sensor. Therefore, a certain number of data units share the temperature detected by one temperature sensor. For example, assuming that the chip includes 4 components and each component is equipped with a temperature sensor, the temperature of each component can be used to represent the temperature of every data unit in that component.
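- a minimal sketch of sharing one sensor reading among the data units of a component is given below. It assumes, purely for illustration, that each component covers a contiguous and equally sized address range; the function name and parameters are hypothetical.

```python
from typing import Sequence


def unit_temperature(address: int,
                     component_temps: Sequence[float],
                     units_per_component: int) -> float:
    """Return the temperature that represents one data unit.

    Assumption: component i covers addresses
    [i * units_per_component, (i + 1) * units_per_component).
    """
    component = address // units_per_component
    return component_temps[component]
```

- for a chip with 4 components, `component_temps` holds 4 readings and every address in a component maps to the same reading.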
- the frequency of obtaining temperature and the frequency of obtaining failure information may be different, that is, failure information may be obtained 100 times in a period of time, but temperature data only 10 times, and so on.
- each self-check may detect a failed data unit, whereas the temperature usually does not change suddenly in a short time, so within a short time interval the temperature can be obtained less frequently.
- the self-check parameters may also include other parameters such as a refresh period.
- regarding the refresh period, it should be noted that during the use of the chip the refresh cycle rarely needs to change. Re-acquiring the refresh cycle at every test is therefore redundant, and omitting this step reduces power consumption and the resources the chip needs to execute the process.
- the new value can be saved synchronously whenever the chip changes the refresh period, so the refresh period only needs to be read once when the chip starts working. As long as no refresh period change information (a new refresh cycle) is received, the refresh period is unchanged and does not need to be read again, which also saves the corresponding resource consumption.
- the relevant module sends the refresh cycle to, for example, the control module 201.
- the control module 201 obtains the refresh cycle and sends it to the failure adjustment module 205. In subsequent self-checks or normal operation of the chip, the control module 201 acquires and sends the new refresh period only when the refresh period has changed.
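- the "read once, update only on change" handling of the refresh period described above resembles a simple cache. The sketch below is an assumption-laden illustration: `RefreshPeriodCache`, `read_from_chip` and `on_period_changed` are hypothetical names, and the real mechanism is implemented by the modules of the device rather than in software like this.

```python
from typing import Callable, Optional


class RefreshPeriodCache:
    """Keep the last known refresh period so it is not re-read on every self-check."""

    def __init__(self, read_from_chip: Callable[[], float]) -> None:
        self._read_from_chip = read_from_chip  # hypothetical register read
        self._value_ms: Optional[float] = None
        self.reads = 0  # counts actual reads, for illustration

    def on_period_changed(self, new_period_ms: float) -> None:
        """Called when the chip changes its refresh period; saves the new value."""
        self._value_ms = new_period_ms

    def get(self) -> float:
        """Return the refresh period, reading the chip only the first time."""
        if self._value_ms is None:
            self._value_ms = self._read_from_chip()
            self.reads += 1
        return self._value_ms
```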
- there is no limitation on the acquisition time of the parameters. That is to say, in the detection (self-test) process of the chip initiated at the current moment, only one or more of the above-mentioned self-check parameters may be obtained. For example, the temperature information of each component of the chip and/or the failure information of each component of the chip can be obtained during the detection process at the current moment.
- the changed refresh period can be saved synchronously when the chip changes the refresh period, and the new refresh period is acquired during the first self-check after the change. When a self-check is performed while the refresh period is unchanged, it is not necessary to obtain the refresh cycle again.
- when the time interval between several detections is small, the temperature usually does not change suddenly, so it is not necessary to obtain temperature information in each of these detections, and so on. That is to say, the self-check parameters of the chip may be acquired partially or completely, and when and how they are acquired may be controlled by the control module 201.
- each component can be divided into multiple groups, and the detection can be performed in rounds according to groups, etc., which will not be repeated here.
- the chip can also be set to detect some or all of the data units.
- the data units can be divided into multiple groups, and the detection can be performed in rounds, etc., which will not be repeated here.
- the chip can be repaired in different situations according to the acquired self-check parameters of the chip.
- the repair operation may include multiple situations, which will be described below with reference to FIG. 4 as an example.
- FIG. 4 is a schematic flowchart of a repair operation of a chip according to a self-check parameter according to an embodiment of the present application.
- the chip may be, for example, a memory chip, and the memory chip may be, for example, a DRAM chip.
- first, the terminology involved in Figure 4 is introduced.
- Repair capability: it is assumed that the actual repair uses a method similar to traditional replacement. That is to say, read and write operations on the failed data unit are replaced internally with read and write operations on the corresponding replacement unit, and this replacement is not perceived outside the chip.
- this repair capability is limited, because a storage module for repairing data needs to be set in the chip, and other resources may need to be allocated for performing the above-mentioned repairing process, all of which make it impossible to repair data units without restrictions. Therefore, the repair ability of the chip can be evaluated to a certain extent. For example, the repair ability can be evaluated by the ability parameters of the failed data units that can be repaired at the same time.
- for example, the maximum number of replacement units for repair data can represent the repair capability, or the storage space required by the failed data or the storage space of the repair data can be used to represent the repair capability. For example, if the repair data can occupy at most 4 megabytes (MB), 4 MB can be used to represent the repair capability.
- First temperature threshold: when the temperature is greater than or equal to the first temperature threshold, the repair capability can be considered far from sufficient to deal with the failures that may occur. Normally, with other parameters unchanged, the higher the temperature, the higher the probability that data units fail. For example, it can be determined through experiments, simulations or experience that when the temperature is higher than a certain value, the probability of data unit failure is very high, and the number of failed data units is far greater than the maximum number supported by the repair capability.
- Second temperature threshold: the second temperature threshold may be less than the first temperature threshold.
- when the temperature is between the second temperature threshold and the first temperature threshold, the number of failed data units may or may not exceed the maximum number supported by the repair capability. In this case, a selection or certain rules can be applied to give priority to some of the failed data units.
- when the temperature is less than or equal to the second temperature threshold, the number of failed data units generally does not exceed the repair capability. In this case, operations such as selecting failed data units may be skipped, which is equivalent to repairing every failure that occurs. Of course, it can also be set to process some of the failed data units first, which does not affect the repair result.
- when the temperature is equal to a temperature threshold, it can be processed either as if it were greater than the threshold or as if it were less than the threshold.
- the above-mentioned “greater than or equal to” and “less than or equal to” may not include “equal to”.
- the above-mentioned first temperature threshold and the second temperature threshold do not necessarily have to be applied; in different embodiments of the present application, only one temperature threshold may be used, or both may be used at the same time.
- the method provided above can be used to obtain temperature information using a temperature sensor, which will not be repeated here.
- step 402. Determine whether the temperature obtained in step 401 is greater than or equal to the first temperature threshold, and execute step 403 when the determination result is "yes”, and execute step 404 when the determination result is "no". It should be noted that only the case when the determination result is "yes” may be considered, or only the case when the determination result is "no" may be considered, or both may be considered.
- when the temperature is greater than or equal to the first temperature threshold, the refresh period can be adjusted so that data units that failed due to the temperature rise no longer fail. That is to say, after the temperature is acquired, it can be judged whether the temperature is greater than or equal to the first temperature threshold.
- if so, the refresh cycle can be shortened so that the data units that failed because of the temperature rise no longer fail. In this case, the temperature is higher than the first temperature threshold.
- many failed data units in the various components of the chip may be caused by temperature, and this kind of failure can be resolved by refreshing. Therefore, the refresh cycle of the chip can be shortened so that the data units are refreshed in time, thereby resolving the failure.
- the more serious failure situation caused by the excessive temperature can be effectively solved by adjusting the parameters (for example, shortening the refresh cycle).
- the data of the memory chip is stored in the capacitor in the form of electric charge, so it needs to be constantly refreshed to supplement the electric charge of the capacitor.
- the holding time of the charge in the capacitor will change with the change of temperature. In some cases, it can be understood that the higher the temperature, the easier the capacitor will leak, resulting in faster charge leakage, and the faster the stored data will disappear.
- if the refresh cycle is left unchanged as the temperature rises, the capacitor is not recharged in time, causing the data unit to fail. If the refresh cycle is shortened at this point, the capacitor can be recharged in time, so the data is refreshed in time and the failed data unit no longer fails.
- the first temperature threshold can be determined by comparing the number of failed data units at different temperatures. For example, if the number of failed data units is 20 at 30°C and reaches 200 at 75°C, this indicates that the temperature increase causes a large proportion of the failures and that the number of failures has very likely far exceeded the maximum number of repairs supported by the repair capability. Shortening the refresh cycle can effectively resolve this more serious failure situation.
- the first temperature threshold can also be determined by comparing the difference between the number of failed data units at different temperatures and the number of failed data units at the factory.
- the first temperature threshold can also be determined by comparing the difference between the number of failed data units at different temperatures and the maximum number of repairs corresponding to the maximum repair capability.
- the first temperature threshold may also be determined according to the proportion of temperature-induced failures among all failed data units at different temperatures. For example, suppose that this proportion is 30% at 50°C and 80% at 75°C. Based on this statistical data, 75°C can be set as the first temperature threshold. Alternatively, the temperature at which the proportion reaches a certain value can be set as the first temperature threshold; for example, the temperature at which the proportion is greater than or equal to 60%.
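- the proportion-based choice of the first temperature threshold can be illustrated as follows. The function name and the `(temperature, proportion)` input format are hypothetical; the sample values and the 60% cut-off are the ones used in the example above, and picking the lowest qualifying temperature is an assumption.

```python
from typing import Iterable, Optional, Tuple


def first_threshold_from_proportion(
        samples: Iterable[Tuple[float, float]],
        min_proportion: float = 0.60) -> Optional[float]:
    """Pick the lowest sampled temperature whose share of temperature-induced
    failures reaches min_proportion; return None if no sample qualifies.

    samples: (temperature in degrees C, proportion in the range 0..1) pairs.
    """
    qualifying = [temp for temp, share in samples if share >= min_proportion]
    return min(qualifying) if qualifying else None
```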
- the first temperature threshold may also be determined according to the upper temperature limit of the normal operation of the chip. For example, if the upper temperature limit of the normal operation of the chip is known to be 85°C, the first temperature threshold can be set to a value between 70°C and 80°C so that the chip can, as far as possible, still work normally through repair. It should be noted that temperature parameters such as the normal working temperature range of the chip and its upper and lower limits all depend on the performance of the chip itself, so they may differ between chips. For example, the upper limit of the normal working temperature of the chip may also be 90°C, 82°C, 100°C, 105°C, etc., or other upper or lower temperature limits may apply. This is not limited.
- for example, the maximum number of failed data units that can be repaired at the same time is used to evaluate the repair capability of the chip. Suppose that each chip can repair up to 2,000 failed data units at the same time and that the first temperature threshold is 70°C. When the actual temperature obtained at a certain moment is 75°C, the refresh cycle can be shortened to resolve the failures.
- as another example, suppose experimental data shows that when the temperature is higher than 80°C, at least 20,000 data units of the chip fail, which is far beyond the capability of repairing 2,000 failed data units at the same time. If the temperature information obtained shows that most parts of the chip are around 80°C, 20,000 data units cannot be repaired, and the refresh cycle of the chip can be shortened instead. For example, if the original refresh cycle is 2 milliseconds, it can be changed to 1 millisecond. By refreshing the data in time, the data does not disappear, and the chip does not need to be repaired at this time.
- if the obtained temperature of each component of the chip later drops significantly, for example to 30°C, and experimental data shows that the probability of failure of each component is very low at around 30°C, the shortened refresh period can be restored to the original refresh period, or appropriately increased.
- the data unit failure may be caused by temperature or other factors, such as physical damage.
- the temperature usually does not change suddenly, so failures that are not caused by temperature (that is, caused by other factors) may already have been found and repaired during self-tests before the temperature reached the first temperature threshold; failures caused by other factors that occur while the temperature is rising can still be found and repaired in subsequent self-tests.
- data units that are still invalid after the refresh period is shortened can be detected in subsequent self-checks and repaired further based on the failure information. For example, some of these failed data units can be selected for priority repair, or they can be repaired in a certain order. Whether to repair part or all of them can be determined according to the repair capability and the actual number of failed data units. Alternatively, the number of repairs may be ignored and the failed data units repaired in order of their number of errors, so that units with more errors are repaired first. It is also possible to give priority to failed data units in a certain area of the chip.
- part of the failed data units can also be repaired according to the failure information.
- the failed data unit with the most errors in the failure information can be repaired first, and for example, the failed data unit of certain parts of the chip or a specific area of the chip can be repaired first.
- the above-mentioned temperature-induced failure can be understood as a kind of "soft failure", that is, a failure that can be eliminated by changing the chip parameters.
- the embodiments of the present application are mainly described using the "soft failure" caused by temperature as an example, but for other "soft failures" the same method (adjusting parameters) can also be used so that the failed data unit no longer fails, which will not be repeated here.
- the shortened refresh period may be adjusted to the original refresh period according to the temperature, or the originally shortened refresh period may be appropriately increased.
- step 404 and subsequent steps can be performed.
- step 404 Determine whether the temperature is greater than the second temperature threshold, and perform step 405 when the determination result is "yes”, and perform step 406 when the determination result is "no".
- step 405. It is judged whether the failure condition exceeds the repair capability, and step 407 is executed when the judgment result is "Yes", and step 406 is executed when the judgment result is "No".
- when the temperature is less than the first temperature threshold, it can be further judged whether the temperature is greater than the second temperature threshold, and the corresponding operations are performed. For example, when the temperature is greater than the second temperature threshold and less than the first temperature threshold, some failed data units may not be repairable because the number of failures exceeds the repair capability. Therefore, a selection can be made among the failed data units detected in the self-test process, and some of them can be repaired first, or the failed data units can be repaired in a certain order.
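- the branching of steps 402, 404 and 405 can be summarised in a short decision function. This is a sketch only: the names are hypothetical, the repair capability is assumed to be expressed as a maximum number of repairable units, and the actions attached to each branch (shorten the refresh period, repair everything, repair a prioritised subset) are inferred from the surrounding description rather than quoted from the flowchart.

```python
from enum import Enum


class Action(Enum):
    SHORTEN_REFRESH = "shorten the refresh period"
    REPAIR_ALL = "repair all failed data units"
    REPAIR_PRIORITY = "repair selected failed data units first"


def decide_repair(temp_c: float,
                  failed_count: int,
                  first_threshold_c: float,
                  second_threshold_c: float,
                  max_repairable: int) -> Action:
    """Choose a repair operation from the temperature and failure count."""
    # Step 402: temperature at or above the first threshold.
    if temp_c >= first_threshold_c:
        return Action.SHORTEN_REFRESH
    # Step 404, then step 405: between the thresholds, check the capability.
    if temp_c > second_threshold_c and failed_count > max_repairable:
        return Action.REPAIR_PRIORITY
    # Otherwise the repair capability is treated as sufficient.
    return Action.REPAIR_ALL
```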
- alternatively, the comparison with the second temperature threshold may be skipped; when the temperature is less than the first temperature threshold, it is determined to repair some or all of the failed data units according to the failure information and/or temperature, or the failed data units are repaired according to their repair sequence.
- the failure information of the failed data unit may be used to determine the repair sequence for the failed data unit, and when a failure occurs, the failed data unit may be repaired according to the repair sequence.
- the repair order of failed data units can be sorted according to the error detection information, for example by prioritizing failed data units with a large number of errors, by prioritizing the failed data units whose last error is closest to the current time, or by prioritizing the failed data unit with the largest number of consecutive errors.
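- one way to express such an ordering is a sort with a compound key. The sketch assumes a hypothetical `FailedUnit` record and, as an arbitrary choice not stated in the application, ranks by number of errors first, then by recency of the last error, then by consecutive errors.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class FailedUnit:
    address: int
    error_count: int         # cumulative number of errors
    last_error_time: float   # larger means more recent
    consecutive_errors: int


def repair_order(units: Iterable[FailedUnit]) -> List[FailedUnit]:
    """Sort failed units so the highest-priority unit comes first."""
    return sorted(
        units,
        key=lambda u: (-u.error_count, -u.last_error_time, -u.consecutive_errors),
    )
```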
- the above-mentioned number of errors, the time of the last error, and the number of consecutive errors can all be read from the error detection information.
- the failure information and the repair capability may be combined to comprehensively determine the failed data unit to be repaired.
- the failed data unit may be repaired according to the repair sequence, and all the failed data units may also be repaired when the failure condition does not exceed the repair capability.
- if the repair capability is represented by the maximum number of failed data units that can be processed at the same time, it is possible to first determine whether the number of failed data units at the current moment is greater than the maximum repairable number supported by the repair capability, and then perform the corresponding operations.
- if the number of failed data units exceeds the maximum repairable number, the same or a similar method as above can be used to select, according to the failure information, which failed data units to process first; for example, priority is given to failed data units that have a large number of errors or whose last error is relatively recent.
- if the number of failed data units at the current moment is less than or equal to the maximum repairable number supported by the repair capability, the repair capability is sufficient. In this case, selecting failed data units is not necessary, which is equivalent to repairing however many failed data units are present. It should be understood that some of the failed data units can still be set to be processed first, and the repair effect will not be affected.
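- the capability check above can be sketched as a selection function. The name `select_for_repair` and the address-to-error-count mapping are hypothetical, and using the error count as the only priority key is a simplification of the failure information described in the text.

```python
from typing import Dict, List


def select_for_repair(error_counts: Dict[int, int],
                      max_repairable: int) -> List[int]:
    """Return the addresses to repair now.

    error_counts maps the address of each failed data unit to its number of
    errors. If everything fits within the repair capability, all units are
    returned; otherwise only the units with the most errors are kept.
    """
    if len(error_counts) <= max_repairable:
        return sorted(error_counts)  # capability is sufficient: repair all
    by_errors = sorted(error_counts,
                       key=lambda addr: (-error_counts[addr], addr))
    return by_errors[:max_repairable]
```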
- the failure information of a certain data unit at a certain time indicates that the data unit has only experienced an error, and the time of error is far from the current time
- the repair operation of the chip may include the repair of the traditional repair method and other operations.
- the repairing of failed data units, shortening the refresh period, and extending the refresh period mentioned above are all repair operations.
- repairing a failed data unit may refer to repairing with traditional methods such as replacement, while operating on the refresh cycle is an example of changing the chip parameters, such as the refresh cycle adjustment described above.
- adjusting (changing, including shortening, extending, restoring, and resetting) the refresh period belongs to the repair operations described in the embodiments of the present application. Adjusting the refresh period and changing the refresh period both refer to changing the refresh period.
- for example, assuming the refresh cycle is 2 milliseconds at time T1, the refresh cycle can be shortened to 1 millisecond at time T2 after time T1, or extended to 4 milliseconds at time T2 after time T1. If the refresh period at T1 is shortened at T2, it can be restored to 2 milliseconds at T3 after T2.
- the refresh period can also be adjusted according to the second temperature threshold.
- the shortened refresh period can be extended or restored when the temperature is less than or equal to the second temperature threshold. Suppose the refresh period is 2 milliseconds before being shortened, and it is shortened to 1 millisecond when the temperature is detected to be greater than or equal to the first temperature threshold. After the temperature drops, for example when it is detected to be less than or equal to the second temperature threshold, the refresh cycle is extended to 1.5 milliseconds, restored to 2 milliseconds, extended to 3 milliseconds, and so on.
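- the shorten-then-restore behaviour in this example amounts to a two-threshold (hysteresis) rule. The sketch below uses the 2 millisecond and 1 millisecond values from the example; the function name, the parameter names and the choice to restore exactly to the original value (rather than to 1.5 or 3 milliseconds) are assumptions.

```python
def next_refresh_period_ms(temp_c: float,
                           current_ms: float,
                           first_threshold_c: float,
                           second_threshold_c: float,
                           shortened_ms: float = 1.0,
                           normal_ms: float = 2.0) -> float:
    """Return the refresh period to use after one temperature reading."""
    if temp_c >= first_threshold_c:
        # Hot: refresh at least as often as the shortened period.
        return min(current_ms, shortened_ms)
    if temp_c <= second_threshold_c:
        # Cool again: restore the period used before shortening.
        return max(current_ms, normal_ms)
    # Between the two thresholds: leave the period unchanged.
    return current_ms
```

- keeping the period unchanged between the two thresholds avoids switching back and forth when the temperature hovers near a single threshold.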
- the repair operations described in the embodiments of the present application include repairing failed data units, changing chip parameters, making judgments based on the set temperature threshold, and performing different operations.
- when the temperature is less than or equal to the second temperature threshold, the repair capability is sufficient to repair the failed data units that may occur at this time.
- it can also be set as the priority processing part of the invalid data unit, which does not affect the repair effect.
- the failed data unit to be repaired can be selected from the failed data units to be repaired according to the failure information.
- the refresh period can be appropriately extended. That is to say, assuming that in the detection process at the previous moment, the temperature is greater than or equal to the first temperature threshold, an operation such as shortening the refresh period is adopted, so that the refresh period is shortened.
- if the detection process at the current moment shows that the temperature is less than or equal to the second temperature threshold, the shortened refresh period can be appropriately extended or restored to the value it had before the shortening.
- the temperature obtained in step 401 can be multiple temperatures of different regions (components) of the chip, so steps 402-407 can be performed separately for each region.
- the temperature of each area can also be considered comprehensively. For example, if a certain self-test finds that the temperature of some areas of the chip is greater than the first temperature threshold while the temperature of other areas is less than the first temperature threshold, the refresh cycle can be shortened for the areas above the first temperature threshold, while for the areas below it, part or all of the failed data units are repaired according to the failure information and/or temperature. In this case, since the areas where the refresh period is shortened do not occupy other repair resources, the repair resources can be used preferentially to repair the areas where the temperature is less than the first temperature threshold.
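- the per-region handling described above can be sketched as a mapping from each region to an operation. The region labels, the function name and the string action names are hypothetical, and treating a temperature equal to the threshold like one above it is only one of the options the application allows.

```python
from typing import Dict


def plan_regions(region_temps: Dict[str, float],
                 first_threshold_c: float) -> Dict[str, str]:
    """Assign an operation to each region from its own temperature.

    Regions at or above the first temperature threshold get a shorter
    refresh period; the remaining regions use the replacement-style repair,
    so the repair resources are spent only on those regions.
    """
    return {
        region: ("shorten_refresh" if temp >= first_threshold_c
                 else "repair_units")
        for region, temp in region_temps.items()
    }
```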
- Figure 4 shows an example of a judgment process that can be performed, but there are many other situations.
- first of all, there is no restriction on the order in which the two temperature thresholds are judged; they can be judged at the same time or sequentially.
- multiple temperature threshold judgments can be performed at the same time, for example, it can be judged at the same time whether the temperature is greater than or equal to the first temperature threshold and whether it is less than or equal to the second temperature threshold.
- only one or more temperature thresholds can be judged. For example, only the first temperature threshold can be considered, or only the second temperature threshold can be considered, and the first temperature threshold and the second temperature threshold can also be considered at the same time. As long as it can be executed logically, it is within the scope described in the embodiments of the present application. It should also be understood that other temperature thresholds can also be set. The following describes some of the possible situations with reference to Figs. 5 to 8.
- the chip may be, for example, a memory chip, and the memory chip may be, for example, a DRAM chip.
- the process (a) in FIG. 5 includes steps 501a to 505a, which will be introduced separately below.
- step 502a: determine whether the temperature obtained in step 501a is less than or equal to the second temperature threshold, and execute step 504a when the determination result is "yes".
- step 503a: determine whether the temperature obtained in step 501a is greater than or equal to the first temperature threshold, and execute step 505a when the determination result is "yes".
- step 502a and step 503a are independent of each other; the two may be executed at the same time or at different times, and there is no restriction on their execution order.
- step 504a: the appropriate extension of the refresh period in step 504a does not presuppose that the refresh period has been shortened; that is, even if the refresh period was not shortened before step 504a is performed, step 504a can still be performed.
- in other words, both the relationship between the temperature and the first temperature threshold and the relationship between the temperature and the second temperature threshold can be judged.
- when the temperature is greater than or equal to the first temperature threshold, step 505a is executed.
- when the temperature is less than or equal to the second temperature threshold, the refresh period can be appropriately extended; especially if the previous refresh period is short, appropriately extending it does not affect the repair and reduces the resources occupied by refresh.
- it should be noted that in this judgment, it is considered that when the temperature is greater than or equal to the first temperature threshold, the possibility of failure is high and the repair capability is insufficient, whereas when the temperature is less than or equal to the second temperature threshold, the failure probability is relatively low and the repair capability is usually sufficient.
- the process (b) in FIG. 5 includes steps 501b to 505b, which will be introduced separately below.
- step 502b: determine whether the temperature obtained in step 501b is greater than or equal to the first temperature threshold. When the determination result is "yes", step 503b is executed, and when the determination result is "no", step 504b is executed.
- step 504b: determine whether the temperature obtained in step 501b is less than or equal to the second temperature threshold, and execute step 505b when the determination result is "yes".
- that is, in process (b), step 502b is executed first, and then step 504b is executed.
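The difference between processes (a) and (b) of FIG. 5 can be illustrated with a minimal sketch. The function names and the returned step labels are hypothetical. Because this passage does not restate what steps 503b, 505a, and 505b do, the sketch only reports which step would be executed. It also assumes that the first temperature threshold is higher than the second.

```python
def process_a(temp: float, first_threshold: float, second_threshold: float) -> list[str]:
    """FIG. 5(a): steps 502a and 503a are independent checks."""
    executed = []
    if temp <= second_threshold:   # step 502a
        executed.append("504a")    # appropriately extend the refresh period
    if temp >= first_threshold:    # step 503a
        executed.append("505a")
    return executed


def process_b(temp: float, first_threshold: float, second_threshold: float) -> list[str]:
    """FIG. 5(b): step 502b is judged first; step 504b is reached only on "no"."""
    if temp >= first_threshold:    # step 502b
        return ["503b"]
    if temp <= second_threshold:   # step 504b
        return ["505b"]
    return []
```

Under the stated assumption the two variants select corresponding branches for every temperature. They differ only in whether the two comparisons are evaluated independently or in sequence.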
- FIG. 6 includes steps 601 to 604, which will be introduced separately below.
- step 602: determine whether the temperature obtained in step 601 is greater than or equal to the first temperature threshold. When the determination result is "yes", step 603 is executed, and when the determination result is "no", step 604 is executed.
- that is, the relationship between the temperature and the first temperature threshold can be judged: when the temperature is greater than or equal to the first temperature threshold, the refresh period is shortened; and when the temperature is less than the first temperature threshold, it is determined, according to the failure information and/or temperature, to repair part or all of the failed data units.
- the failed data units to be repaired can be determined based only on the failure information, or the temperature can be used further, for example by determining whether the temperature is less than the second temperature threshold: when the temperature is less than the second temperature threshold, all failed data units are repaired.
- the failed data units to be repaired can also be determined comprehensively based on the failure information and temperature.
- part or all of the failed data units can be selected for repair. It should be noted that, whether some or all of the failed data units are to be repaired, they can be repaired according to the repair order, or without following the repair order, for example, at the same time. For an introduction to repairing the failed data units according to the repair order, please refer to the related description of FIG. 7.
- the relationship between the temperature and the second temperature threshold can also be determined first, so as to determine the failed data units to be repaired: when the temperature is less than or equal to the second temperature threshold, all failed data units are repaired, and when the temperature is greater than the second temperature threshold, the failure information is used to determine the failed data units to be repaired, or the failed data units are repaired according to the repair order, and so on.
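One possible reading of the FIG. 6 judgment is sketched below. It is not the definitive procedure: `FailedUnit`, `process_fig6`, and `repair_capacity` are hypothetical names, and choosing the units with the most errors first when only some can be repaired is an assumed policy, since the text leaves the selection rule open.

```python
from dataclasses import dataclass


@dataclass
class FailedUnit:
    address: int      # address of the failed data unit
    error_count: int  # number of errors recorded for it


def process_fig6(temp, first_threshold, second_threshold, failed_units, repair_capacity):
    """Return (action, units_to_repair) following the judgment of FIG. 6."""
    if temp >= first_threshold:
        # Step 602 "yes": shorten the refresh period; nothing is repaired here.
        return "shorten_refresh", []
    if temp < second_threshold:
        # Low temperature: repair all failed data units.
        return "repair", list(failed_units)
    # Otherwise select a subset from the failure information (assumed policy:
    # most errors first, limited by a hypothetical repair capacity).
    ranked = sorted(failed_units, key=lambda u: u.error_count, reverse=True)
    return "repair", ranked[:repair_capacity]
```

Returning the selected units rather than repairing them in place keeps the decision separate from the repair mechanism, so the same selection could be followed by repair in order or repair at the same time.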
- FIG. 7 includes steps 701 to 704, which will be introduced separately below.
- step 702: determine whether the temperature obtained in step 701 is greater than or equal to the first temperature threshold; perform step 703 when the determination result is "yes", and perform step 704 when the determination result is "no".
- that is, the relationship between the temperature and the first temperature threshold is judged: when the temperature is greater than or equal to the first temperature threshold, the refresh period is shortened; and when the temperature is less than the first temperature threshold, the failed data units can be repaired according to the repair sequence.
- part or all of the failed data units may be repaired according to the repair sequence of the failed data units, or according to the failure information and repair capability. Whether to repair some or all of the failed data units can also be further determined according to the temperature. The following are examples.
- the repair sequence of the failed data units can be determined from the failure information. For example, failed data units with a larger number of errors can be repaired first, failed data units in certain specific areas or specific components can be processed first, or failed data units that have experienced errors most recently can be given priority.
- the failed data units can then be repaired according to the determined repair sequence.
- the repair sequence can also be determined in combination with temperature. For example, when the acquired temperature includes multiple temperature values for different parts of the chip, it can be set that failed data units in the parts with lower temperature are repaired first.
- repair according to the repair sequence can be combined with other operations or performed independently; that is, whenever the chip is repaired according to the self-check parameters, the repair can follow the repair sequence.
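A repair sequence of this kind can be expressed as a sort over the failed data units. The sketch below is illustrative: the field names are hypothetical, and combining the criteria into one key (lower-temperature part first, then more errors, then more recent error) is an assumption, since the text presents them as alternatives that can be chosen freely.

```python
from dataclasses import dataclass


@dataclass
class FailedUnit:
    address: int
    error_count: int        # number of errors recorded for the unit
    last_error_time: float  # time of the last error (hypothetical clock, seconds)
    region_temp: float      # temperature of the chip part that holds the unit


def repair_order(units: list[FailedUnit]) -> list[FailedUnit]:
    """Sort failed units: cooler part first, then more errors, then more recent error."""
    return sorted(
        units,
        key=lambda u: (u.region_temp, -u.error_count, -u.last_error_time),
    )
```

Any single criterion from the text can be obtained by reducing the key to one element, for example `key=lambda u: -u.error_count` to repair the units with the most errors first.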
- FIG. 8 includes steps 801 to 803, which will be introduced separately below.
- step 802: determine whether the temperature acquired in step 801 is less than or equal to the second temperature threshold, and execute step 803 when the determination result is "yes".
- step 803: repair all failed data units. In other words, the relationship between the temperature and the second temperature threshold is determined.
- when the temperature is less than or equal to the second temperature threshold, the temperature is considered to be low, and the repair capability will not be insufficient under normal circumstances, so it can be set to repair all failed data units without having to consider which failed data units should be repaired first. In this case, all failed data units can also be repaired according to the repair sequence.
- the repair here may refer to repair by a traditional replacement method, which is one of the repair operations described above.
- the read-write management module 204 can be used to copy the write data to the failure repair module 206 and store it in the corresponding repair data storage unit.
- the failure repair module 206 can be used to read the data in the corresponding repair data storage unit and send it to the read-write management module 204, and the read-write management module 204 performs data replacement and then sends the read data to the external data bus. The repair and replacement of the failed data unit is not perceived outside the chip.
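The replacement-style repair described above can be modelled with a toy example. The class `RepairedMemory` and its attributes are hypothetical and do not correspond to the module interfaces of the device 200. The sketch only illustrates the idea that writes to a failed address are copied into repair storage and that reads are transparently served from it.

```python
class RepairedMemory:
    """Toy model of replacement repair: failed addresses are served from spare storage."""

    def __init__(self, size: int, failed_addresses: set[int]):
        self.main = [0] * size                   # main array (may contain bad cells)
        self.failed = set(failed_addresses)      # addresses of failed data units
        self.repair_store: dict[int, int] = {}   # repair data storage units

    def write(self, address: int, value: int) -> None:
        self.main[address] = value
        if address in self.failed:
            # Copy the write data into the corresponding repair storage unit.
            self.repair_store[address] = value

    def read(self, address: int) -> int:
        if address in self.failed and address in self.repair_store:
            # Data replacement: return the repaired copy instead of the main cell.
            return self.repair_store[address]
        return self.main[address]
```

A caller only uses `read` and `write`, which mirrors the statement that the replacement is not perceived outside the chip.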
- a cooling device can be provided to reduce the temperature, thereby reducing the possibility of failure, so that the number of failed data units that may appear stays within the maximum range that the repair capability can support.
- the judgment criterion for a failed data unit can also be adjusted according to the self-check parameters of the chip. This is equivalent to setting the criterion to be relatively strict when the probability of failure is low, and relatively loose when the probability of failure is high.
- for example, when the temperature is low, the probability of failure is low, and a data unit can be regarded as failed if at least 2 errors occur; when the temperature is high, the probability of failure is higher, and a data unit can be regarded as failed if at least 4 errors occur, or if at least 2 errors occur and the time interval between the last error and the current time is less than a seconds, where a is a real number. It should be understood that the above is only an example, and there is no limitation on the numerical values and the judgment conditions.
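The temperature-dependent failure criterion can be sketched as a small predicate. The counts 2 and 4 and the time window come from the example above. The name `is_failed`, the single `hot_threshold` separating "low" from "high" temperature, and the choice to combine the two high-temperature conditions with a logical OR are assumptions made for illustration.

```python
def is_failed(error_count: int, last_error_time: float, now: float,
              temp: float, hot_threshold: float, window_s: float) -> bool:
    """Decide whether a data unit counts as failed under a temperature-dependent rule."""
    if temp < hot_threshold:
        # Low temperature: stricter criterion, 2 errors are enough.
        return error_count >= 2
    # High temperature: looser criterion, 4 errors, or 2 errors with a recent one.
    recent = (now - last_error_time) < window_s
    return error_count >= 4 or (error_count >= 2 and recent)
```

Here `window_s` plays the role of the real number a in the example, expressed in seconds.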
- FIG. 9 is a schematic block diagram of an apparatus for repairing a chip according to an embodiment of the present application.
- the chip may be, for example, a memory chip, and the memory chip may be, for example, a DRAM chip.
- the device 1000 for repairing a chip shown in FIG. 9 includes an acquiring unit 1001 and a processing unit 1002.
- the device 1000 may be used to execute each step of the method for repairing a chip in the embodiment of the present application.
- the acquiring unit 1001 may be used to execute step 301 in the method shown in FIG. 3
- the processing unit 1002 may be used to execute step 302 in the method shown in FIG.
- the obtaining unit 1001 may be used to obtain self-check parameters of the chip, and the self-check parameters may include information such as temperature and failure information.
- the failure information may include the address of the failed data unit and the error detection information of the failed data unit.
- the error detection information may include any one or more of the number of errors of the data unit, the time of the last error, the number of consecutive errors, and so on.
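The self-check parameters listed above map naturally onto a small record type. The following is a sketch under assumptions: the class and field names are hypothetical, and `record_error` is an illustrative helper rather than part of the described device.

```python
from dataclasses import dataclass, field


@dataclass
class ErrorDetectionInfo:
    error_count: int = 0          # number of errors of the data unit
    last_error_time: float = 0.0  # time of the last error
    consecutive_errors: int = 0   # number of consecutive errors


@dataclass
class FailureInfo:
    address: int                  # address of the failed data unit
    detection: ErrorDetectionInfo = field(default_factory=ErrorDetectionInfo)


@dataclass
class SelfCheckParams:
    temperature: float
    failures: list[FailureInfo] = field(default_factory=list)

    def record_error(self, address: int, now: float) -> FailureInfo:
        """Update (or create) the failure entry for an address."""
        for info in self.failures:
            if info.address == address:
                break
        else:
            info = FailureInfo(address)
            self.failures.append(info)
        info.detection.error_count += 1
        # Resetting this counter after an error-free check is not modelled here.
        info.detection.consecutive_errors += 1
        info.detection.last_error_time = now
        return info
```

Keeping the error detection information per address lets a later repair step rank or filter the failed data units without consulting the self-check logic again.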
- the processing unit 1002 may be configured to perform a repair operation on the chip according to the acquired self-check parameters. For example, when the temperature is high, the refresh period is shortened so that some or all of the failed data units no longer fail. For another example, when the number of failed data units is greater than the maximum processable number of the processing unit 1002, the failed data units to be repaired can be selected from the failed data units according to the failure information, that is, the repair strategy is adjusted according to the failure information. For another example, when the number of failed data units is less than or equal to the maximum processable number of the processing unit 1002 and the temperature is low, all detected failed data units can be repaired, and so on.
- the acquiring unit 1001 may include the self-checking module 203 of the device 200 shown in FIG. 2, that is, it may realize the function of the self-checking module 203.
- the module 205 obtains self-check parameters such as temperature, failure information, and refresh period.
- the processing unit 1002 may include the failure repair module 206 and the read-write management module 204 of the apparatus 200 shown in FIG. 2, and the processing unit 1002 may also include a control module 201 and an address management module 202.
- the device 1000 may further include a storage unit for storing various types of data such as self-check parameters.
- the storage unit may be independent of the acquisition unit 1001 and the processing unit 1002, or it may be integrated in the processing unit 1002.
- the failure adjustment module 205 may be independent of the acquisition unit 1001 and the processing unit 1002, or it may be integrated in the processing unit 1002.
- the device 1000 may be provided in a logic component of the chip.
- An embodiment of the present application also provides a chip, which includes any device for repairing a chip provided in the embodiment of the present application.
- An embodiment of the present application also provides a computer-readable storage medium on which an instruction is stored, and when the instruction is executed, the method for repairing a chip in the foregoing method embodiment is executed.
- the embodiment of the present application also provides a computer program product containing instructions that, when executed, execute the method for repairing the chip in the above method embodiment.
- the disclosed system, device, and method can be implemented in other ways.
- the device embodiments described above are only illustrative.
- the division of the units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
- the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
- the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
- the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
Landscapes
- Engineering & Computer Science (AREA)
- Microelectronics & Electronic Packaging (AREA)
- Computer Hardware Design (AREA)
- For Increasing The Reliability Of Semiconductor Memories (AREA)
- Techniques For Improving Reliability Of Storages (AREA)
Abstract
The present application provides a method and device for repairing a memory chip, relating to the field of computers. The method includes the following steps: acquiring self-check parameters of a chip, and performing a repair operation on the chip according to the self-check parameters, where the self-check parameters include temperature and failure information, and the failure information further includes the address of a failed data unit and error detection information of the failed data unit. Performing a corresponding repair operation on a chip according to the self-check parameters can achieve a better repair function and improve the repair effect, thereby improving the robustness of the chip.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202080091993.1A CN114930457A (zh) | 2020-03-11 | 2020-03-11 | 修复存储芯片的方法和装置 |
PCT/CN2020/078839 WO2021179213A1 (fr) | 2020-03-11 | 2020-03-11 | Procédé et dispositif de réparation de puce de mémoire |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2020/078839 WO2021179213A1 (fr) | 2020-03-11 | 2020-03-11 | Procédé et dispositif de réparation de puce de mémoire |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021179213A1 true WO2021179213A1 (fr) | 2021-09-16 |
Family
ID=77671684
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/078839 WO2021179213A1 (fr) | 2020-03-11 | 2020-03-11 | Procédé et dispositif de réparation de puce de mémoire |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN114930457A (fr) |
WO (1) | WO2021179213A1 (fr) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060013052A1 (en) * | 2004-07-15 | 2006-01-19 | Klein Dean A | Method and system for controlling refresh to avoid memory cell data losses |
US7233538B1 (en) * | 2004-08-02 | 2007-06-19 | Sun Microsystems, Inc. | Variable memory refresh rate for DRAM |
CN102272849A (zh) * | 2008-12-30 | 2011-12-07 | 美光科技公司 | 可变存储器刷新装置和方法 |
CN104795109A (zh) * | 2014-01-22 | 2015-07-22 | 南亚科技股份有限公司 | 动态随机存取存储器与选择性地执行刷新操作的方法 |
-
2020
- 2020-03-11 CN CN202080091993.1A patent/CN114930457A/zh active Pending
- 2020-03-11 WO PCT/CN2020/078839 patent/WO2021179213A1/fr active Application Filing
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116304956A (zh) * | 2023-05-15 | 2023-06-23 | 济宁市质量计量检验检测研究院(济宁半导体及显示产品质量监督检验中心、济宁市纤维质量监测中心) | 一种芯片温度异常在线检测方法 |
CN116304956B (zh) * | 2023-05-15 | 2023-08-15 | 济宁市质量计量检验检测研究院(济宁半导体及显示产品质量监督检验中心、济宁市纤维质量监测中心) | 一种芯片温度异常在线检测方法 |
CN117711473A (zh) * | 2024-02-06 | 2024-03-15 | 南京扬贺扬微电子科技有限公司 | 一种基于存储器设备的自检数据管理系统及方法 |
CN117711473B (zh) * | 2024-02-06 | 2024-05-14 | 南京扬贺扬微电子科技有限公司 | 一种基于存储器设备的自检数据管理系统及方法 |
CN118427026A (zh) * | 2024-07-02 | 2024-08-02 | 深圳市大族半导体测试技术有限公司 | 用于电力通信芯片性能调优的修调设备控制方法及系统 |
CN118427026B (zh) * | 2024-07-02 | 2024-09-20 | 深圳市大族半导体测试技术有限公司 | 用于电力通信芯片性能调优的修调设备控制方法及系统 |
Also Published As
Publication number | Publication date |
---|---|
CN114930457A (zh) | 2022-08-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021179213A1 (fr) | Procédé et dispositif de réparation de puce de mémoire | |
JP4939234B2 (ja) | フラッシュメモリモジュール、そのフラッシュメモリモジュールを記録媒体として用いたストレージ装置及びそのフラッシュメモリモジュールのアドレス変換テーブル検証方法 | |
US8015438B2 (en) | Memory circuit | |
CN101996689B (zh) | 存储器错误处理方法 | |
US20180158535A1 (en) | Storage device including repairable volatile memory and method of operating the same | |
US7409580B2 (en) | System and method for recovering from errors in a data processing system | |
US20120131382A1 (en) | Memory controller and information processing system | |
US10803972B2 (en) | Flash memory module, storage system, and method of controlling flash memory | |
US20150026537A1 (en) | Memory device with over-refresh and method thereof | |
US20070113121A1 (en) | Repair of semiconductor memory device via external command | |
CN112667445A (zh) | 封装后的内存修复方法及装置、存储介质、电子设备 | |
EP4109271A2 (fr) | Puce de mémoire avec un compte d'activation par rangée ayant une protection par code de correction d'erreur | |
TW201626398A (zh) | 測試及識別記憶體裝置之系統及方法 | |
TW202133178A (zh) | 記憶體系統 | |
CN110990187B (zh) | 一种内存巡检方法及系统 | |
CN111552500B (zh) | 一种适用于星载fpga的刷新方法 | |
CN117971539A (zh) | 一种内存故障处理方法、计算设备及管理平台 | |
WO2016101177A1 (fr) | Procédé de détection de mémoire à accès aléatoire de dispositif informatique et dispositif informatique | |
JP2002109895A (ja) | 半導体記憶装置 | |
CN110659150A (zh) | 微控制单元内存的检测方法以及相关装置 | |
CN113608911B (zh) | 面向SoC中ScratchPad存储器的自愈方法 | |
US20220350500A1 (en) | Embedded controller and memory to store memory error information | |
EP4246329B1 (fr) | Procédé et appareil de correction d'erreur | |
US10922023B2 (en) | Method for accessing code SRAM and electronic device | |
CN114764596A (zh) | 延长硬盘寿命方法、装置、计算机设备和存储介质 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 20924615 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 20924615 Country of ref document: EP Kind code of ref document: A1 |