WO2021159360A1 - Memory fault repair method and device - Google Patents

Memory fault repair method and device

Info

Publication number
WO2021159360A1
Authority
WO
WIPO (PCT)
Prior art keywords
fault
storage
memory
bus
storage space
Prior art date
Application number
PCT/CN2020/074986
Other languages
English (en)
French (fr)
Inventor
张先富
王正波
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to PCT/CN2020/074986
Priority to CN202080078352.2A (CN114730607A)
Publication of WO2021159360A1

Classifications

    • G: PHYSICS
    • G11: INFORMATION STORAGE
    • G11C: STATIC STORES
    • G11C29/00: Checking stores for correct operation; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/04: Detection or location of defective memory elements, e.g. cell construction details, timing of test signals
    • G11C29/08: Functional testing, e.g. testing during refresh, power-on self testing [POST] or distributed testing
    • G11C29/12: Built-in arrangements for testing, e.g. built-in self testing [BIST] or interconnection details
    • G11C29/44: Indication or identification of errors, e.g. for repair

Definitions

  • This application relates to the field of memory technology, and in particular to a method and device for repairing memory failures.
  • Memory is the main component for storing information in various electronic devices, and can be used to store information such as operation codes and data files.
  • the present application provides a memory fault repair method and device, which are used to implement fault repairs of different granularities or different fault ranges, and improve the fault repair function of the memory.
  • a method for repairing a memory fault is provided, applied to a storage system that includes control logic and a memory. The method includes: detecting the storage units in the memory to obtain at least one fault; analyzing the at least one fault based on different storage granularities to determine the fault range of the memory; and repairing the memory according to the fault repair strategy that corresponds to the fault range among the preset fault repair strategies.
  • the control logic detects the faults in the memory and analyzes them, for example by classification and statistics, to obtain the fault range, and then repairs the memory according to the fault repair strategy that corresponds to the fault range among the preset fault repair strategies. Therefore, when the memory fails, faults of different granularities or different fault ranges can be repaired promptly and effectively to ensure normal use of the memory, which improves the accuracy and reliability of the memory and also extends its service life.
  • detecting the storage units in the memory to obtain at least one fault includes: generating at least one set of read and write operations according to a fault detection algorithm, where the fault detection algorithm includes the Checkerboard algorithm (checkerboard method), the Gallop algorithm (galloping method), the March algorithm (marching method), the MSCAN algorithm (all-0/all-1 algorithm), the butterfly algorithm (butterfly method), and so on; and reading and writing the storage units in the memory based on the at least one set of read and write operations to obtain at least one fault, where each fault can indicate the storage unit that has failed and the type of fault.
  • one or more failures of the storage unit in the memory can be detected through the failure detection algorithm, so that the accuracy and efficiency of failure detection can be improved.
  • detecting the storage units in the memory to obtain at least one fault includes: determining at least one fault according to data check information of the read and write operations corresponding to the storage units in the memory, where the data check can use an error correcting code (ECC) or a parity check.
  • the at least one fault includes at least one of the following types of faults: stuck-at fault (SAF), transition fault (TF), coupling fault (CF), neighboring pattern sensitive fault (NPSF), or address decoder fault (AF).
  • the different storage granularities include at least two of the following: stack (storage die), channel, bank, plane, super block, block, sub-block, row, column, page, or storage unit (cell).
  • because a variety of storage granularities are available, the at least one fault can be analyzed at different granularities, which makes fault repair more targeted and flexible.
  • the fault range includes at least one of the following: a bus fault, or a storage space fault at one of the different storage granularities. Bus faults may include address bus faults, data bus faults, and information bus faults; storage space faults at different storage granularities can include row faults, column faults, block faults, bank faults, and channel faults.
  • when the fault range is a bus fault, the fault repair strategy corresponding to the fault range includes one of the following: switching the faulty bus to a redundant bus, switching the channel corresponding to the faulty bus to an unused channel, switching the stack corresponding to the faulty bus to an unused stack, or reducing the bit width of the bus in use;
  • when the fault range is a storage space fault, the fault repair strategy corresponding to the fault range includes one of the following: mapping the faulty storage space to redundant storage space, or mapping the faulty storage space to unused storage space.
  • the method further includes: when the fault range is a bus fault and the corresponding fault repair strategy is to switch the channel corresponding to the faulty bus to an unused channel, migrating the data in the channel corresponding to the faulty bus to the unused channel; when the fault range is a bus fault and the corresponding fault repair strategy is to switch the stack corresponding to the faulty bus to an unused stack, migrating the data in the storage die corresponding to the faulty bus to the unused storage die; when the fault range is a storage space fault and the corresponding fault repair strategy is to map the faulty storage space to redundant storage space, migrating the data in the faulty storage space to the redundant storage space; and when the fault range is a storage space fault and the corresponding fault repair strategy is to map the faulty storage space to unused storage space, migrating the data in the faulty storage space to the unused storage space.
  • the memory is a high-bandwidth memory.
  • the service life of the high-bandwidth memory can be prolonged and the cost can be reduced.
  • a device for repairing a memory fault is provided, including: a detection unit, configured to detect the storage units in a memory to obtain at least one fault; an analysis unit, configured to analyze the at least one fault based on different storage granularities to determine the fault range of the memory; and a repair unit, configured to repair the memory according to the fault repair strategy that corresponds to the fault range among the preset fault repair strategies.
  • the detection unit is specifically configured to: generate at least one set of read and write operations according to the fault detection algorithm; and read and write the storage units in the memory based on the at least one set of read and write operations to obtain at least one fault.
  • the detection unit is further specifically configured to: determine at least one fault according to data verification information of a read and write operation corresponding to the storage unit in the memory.
  • the at least one fault includes at least one of the following types of faults: stuck-at fault (SAF), transition fault (TF), coupling fault (CF), neighboring pattern sensitive fault (NPSF), or address decoder fault (AF).
  • the different storage granularities include at least two of the following: storage die, channel, bank, plane, super block, block, sub-block, row, column, page, or storage unit (cell).
  • the fault scope includes at least one of the following: a bus fault and a storage space fault with different storage granularities.
  • when the fault range is a bus fault, the fault repair strategy corresponding to the fault range includes one of the following: switching the faulty bus to a redundant bus, switching the channel corresponding to the faulty bus to an unused channel, switching the storage die corresponding to the faulty bus to an unused storage die, or reducing the bit width of the bus in use; when the fault range is a storage space fault, the fault repair strategy corresponding to the fault range includes one of the following: mapping the faulty storage space to redundant storage space, or mapping the faulty storage space to unused storage space.
  • the device further includes a migration unit, configured to: when the fault range is a bus fault and the corresponding fault repair strategy is to switch the channel corresponding to the faulty bus to an unused channel, migrate the data in the channel corresponding to the faulty bus to the unused channel; when the fault range is a bus fault and the corresponding fault repair strategy is to switch the storage die corresponding to the faulty bus to an unused storage die, migrate the data in the storage die corresponding to the faulty bus to the unused storage die; when the fault range is a storage space fault and the corresponding fault repair strategy is to map the faulty storage space to redundant storage space, migrate the data in the faulty storage space to the redundant storage space; and when the fault range is a storage space fault and the corresponding fault repair strategy is to map the faulty storage space to unused storage space, migrate the data in the faulty storage space to the unused storage space.
  • the memory is a high-bandwidth memory.
  • in a third aspect, an electronic device is provided. The electronic device includes a processor, a storage system, a communication interface, and a bus.
  • the processor, the storage system, and the communication interface are connected by the bus.
  • the storage system includes control logic and a memory.
  • the control logic is used to support the electronic device in executing the memory fault repair method provided by the foregoing first aspect or any one of the possible implementations of the first aspect.
  • a computer-readable storage medium is provided, which stores instructions. When the instructions run on a device, the device executes the memory fault repair method provided by the first aspect or any possible implementation of the first aspect.
  • a computer program product is provided. When the computer program product runs on a device, the device executes the memory fault repair method provided by the first aspect or any one of the possible implementations of the first aspect.
  • FIG. 1 is a schematic structural diagram of an HBM provided by an embodiment of the application.
  • FIG. 2 is a schematic diagram of a functional model of a memory provided by an embodiment of the application.
  • FIG. 3 is a schematic flowchart of a method for repairing a memory failure according to an embodiment of the application.
  • FIG. 4 is a schematic diagram of different storage granularities in a memory provided by an embodiment of this application.
  • FIG. 5 is a schematic flowchart of another method for repairing a fault in a memory according to an embodiment of the application.
  • FIG. 6 is a schematic structural diagram of a storage system provided by an embodiment of this application.
  • FIG. 7 is a schematic structural diagram of a memory failure repair device provided by an embodiment of the application.
  • FIG. 8 is a schematic structural diagram of an electronic device provided by an embodiment of the application.
  • "At least one" refers to one or more, and "multiple" refers to two or more.
  • "And/or" describes the association relationship of the associated objects and indicates that three relationships are possible. For example, "A and/or B" can mean: A exists alone, A and B exist at the same time, or B exists alone, where A and B can be singular or plural.
  • the character "/" generally indicates that the associated objects before and after it are in an "or" relationship.
  • "at least one of the following items" or similar expressions refers to any combination of these items, including any combination of a single item or multiple items.
  • for example, at least one of a, b, or c can mean: a; b; c; a and b; a and c; b and c; or a, b, and c, where a, b, and c can each be single or multiple.
  • the embodiments of the present application use words such as "first" and "second" to distinguish identical or similar items that have substantially the same function and effect.
  • for example, the first threshold and the second threshold are only used to distinguish different thresholds, and their order is not limited. Those skilled in the art can understand that words such as "first" and "second" do not limit quantity or execution order.
  • RAM (random access memory) has fast access speed but loses its data after power failure.
  • ROM (read-only memory) retains its data after power failure but has slower access speed.
  • the technical solution of the present application can be applied to various types of ROM, such as erasable programmable read-only memory (erasable programmable ROM, EPROM), electrically erasable programmable read-only memory (electrically erasable programmable ROM, E²PROM), and flash memory (flash).
  • the technical solution of the present application can be applied to various types of RAM, such as static random access memory (static RAM, SRAM), dynamic random access memory (dynamic RAM, DRAM), synchronous dynamic random access memory (synchronous DRAM, SDRAM), etc.
  • DRAM may also include multiple data rate DRAM, high bandwidth memory (HBM), and so on.
  • multiple data rate DRAM may include double data rate (DDR) DRAM, DDR4, and DDR5.
  • the technical solution of the present application can also be applied to a new type of memory, such as FeRAM, MRAM, etc., and the embodiments of the present application will not be listed and described one by one.
  • HBM is a high-bandwidth memory formed by stacking and packaging multiple DRAM dies (for example, 2, 4, or 8) using a 3D stacking process.
  • HBM is suitable for equipment with high memory bandwidth requirements, such as graphics processing equipment and network switching and forwarding equipment such as routers or switches.
  • the technical solution of the present application can also be applied to a memory formed by stacking multiple dies such as SRAM, MRAM, etc., and the embodiments of the present application will not be listed and described one by one.
  • FIG. 1 is a schematic structural diagram of an HBM die provided by an embodiment of the application.
  • the HBM die may include a logic die and multiple DRAM dies.
  • the multiple DRAM dies are stacked together by means of through-silicon vias (TSVs) and micro-bumps and are connected to the logic die.
  • the logic die may be a die integrated with control logic, and the control logic may be a memory controller, which can be used to manage and control the reading and writing of the multiple DRAM dies.
  • each DRAM die may include two channels, and each channel may be 128 bits wide.
  • FIG. 1 takes as an example an HBM die that includes 4 DRAM dies, each of which includes two channels, CH0 and CH1. (a) in FIG. 1 is a top view of the HBM die, and (b) in FIG. 1 is a side view of the HBM die.
  • HBM is formed by stacking multiple DRAM dies. FIG. 1 only takes 4 layers of DRAM dies as an example; an HBM can also include more layers of DRAM dies, for example, 5 or 6 layers, which is not specifically limited in the embodiments of the present application.
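  • As a rough worked example of the figures quoted above (two channels per DRAM die, 128 bits per channel), the following Python sketch computes the channel count and aggregate interface width of a stack. The function name and the assumption that all channels are independent and simply add up are illustrative, not taken from the application.

```python
def hbm_interface_width(num_dram_dies, channels_per_die=2, bits_per_channel=128):
    """Return (total channels, aggregate width in bits) for a hypothetical stack.

    Assumes every die contributes the same number of independent channels,
    so the aggregate width is a plain product.
    """
    channels = num_dram_dies * channels_per_die
    return channels, channels * bits_per_channel


# The 4-die example of FIG. 1: 8 channels, 1024 bits in total.
print(hbm_interface_width(4))
```

  • Under the same assumption, a 5-layer or 6-layer stack would simply scale the channel count linearly.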
  • FIG. 2 is a schematic diagram of a functional model of a memory provided by an embodiment of the application.
  • the functional model of the memory includes: address latch 201, column decoder 202, row decoder 203, memory cell array 204, write driver 205, sense amplifier 206, data register 207, and refresh logic 208.
  • the write driver 205 and the sense amplifier 206 can also be combined as a read/write circuit.
  • the address latch 201 can be used to receive and latch an address.
  • the address can include a row address and a column address.
  • the address latch 201 can also transmit the row address to the row decoder 203 and the column address to the column decoder 202.
  • the row decoder 203 is used to decode the row address, and the column decoder 202 is used to decode the column address. After a row address and a column address are decoded, they can be used to select the corresponding memory cell in the memory cell array 204.
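  • The address path described above can be sketched in Python as follows. This is a minimal illustration that assumes the column address occupies the low-order bits of a flat address; the function names and the `column_bits` parameter are hypothetical and do not come from the application.

```python
def split_address(address, column_bits):
    """Split a flat address into (row, column).

    Assumption: the low `column_bits` bits are the column address and the
    remaining high bits are the row address.
    """
    row = address >> column_bits
    column = address & ((1 << column_bits) - 1)
    return row, column


def select_cell(cell_array, address, column_bits):
    """Return the cell selected by the decoded row and column addresses."""
    row, column = split_address(address, column_bits)
    return cell_array[row][column]


# An 8 x 16 array whose cells record their own coordinates.
cells = [[(r, c) for c in range(16)] for r in range(8)]
print(select_cell(cells, 0b1010110, column_bits=4))  # row 5, column 6
```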
  • the memory cell array 204 is used to store data.
  • the memory cell is the smallest storage unit in the memory cell array 204, and one memory cell can be used to store a binary code.
  • the read/write circuit is used to control the working state of the memory, for example, control data to be read from the memory cell array 204 and control data to be written to the memory cell array 204, etc.
  • the data register 207 is used to register data to be written in the memory cell array 204, data read from the memory cell array 204, and the like.
  • the refresh logic 208 is used to refresh the row decoder 203 and the column decoder 202.
  • the faults of the memory can be divided into three categories: address decoder faults, read-write logic module faults, and memory cell array faults. These three types of faults are introduced and explained respectively below.
  • The first category: address decoder faults.
  • An address decoder fault is a fault in the address decoding logic. It mainly takes four forms: for a certain address, no storage unit corresponds to it; for a certain storage unit, no address can select it; for a certain address, two or more storage units are selected at the same time; or multiple addresses select the same storage unit.
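  • The four forms listed above can be illustrated with a small Python sketch that inspects a decoder modelled as a mapping from each address to the set of cells it selects. The model, the function name, and the result keys are hypothetical and serve only to make the four cases concrete.

```python
def classify_decoder_faults(decode, addresses, cells):
    """Group decoder anomalies into the four forms described above.

    decode: dict mapping an address to the set of cells it selects.
    """
    findings = {
        "address_selects_no_cell": [],
        "cell_never_selected": [],
        "address_selects_multiple_cells": [],
        "cell_selected_by_multiple_addresses": [],
    }
    selected_by = {cell: 0 for cell in cells}
    for address in addresses:
        selected = decode.get(address, set())
        if not selected:
            findings["address_selects_no_cell"].append(address)
        elif len(selected) > 1:
            findings["address_selects_multiple_cells"].append(address)
        for cell in selected:
            selected_by[cell] = selected_by.get(cell, 0) + 1
    for cell in cells:
        if selected_by[cell] == 0:
            findings["cell_never_selected"].append(cell)
        elif selected_by[cell] > 1:
            findings["cell_selected_by_multiple_addresses"].append(cell)
    return findings


faulty_decoder = {0: {"c0"}, 1: set(), 2: {"c0", "c2"}, 3: {"c3"}}
print(classify_decoder_faults(faulty_decoder, [0, 1, 2, 3], ["c0", "c1", "c2", "c3"]))
```

  • In a real memory the mapping is not directly observable; it has to be inferred from read and write patterns, which is what the detection algorithms discussed later do.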
  • The second category: read-write logic module faults.
  • Faults of the read-write logic module mainly appear in the read/write circuit: the read or write driver logic of some sense amplifiers may have open-circuit, short-circuit, or stuck-at input/output (I/O) faults, and there may be cross-coupling interference between the data lines of the read/write circuit.
  • The third category: memory cell array faults.
  • the memory cell array is the most complex module in the memory; it has the highest probability of failure and the most complicated fault types, which are mainly caused by open circuits, short circuits, and crosstalk of the data lines in the memory cell array.
  • memory faults can be divided into the following five functional faults: stuck-at fault (SAF), transition fault (TF), coupling fault (CF), neighboring pattern sensitive fault (NPSF), and address decoder fault (AF).
  • Transition fault (TF): a memory cell fault prevents the 0→1 transition or the 1→0 transition from occurring.
  • In English: "A cell or a line that fails to undergo a 0→1 or 1→0 transition."
  • Coupling fault (CF): a write operation to one storage unit changes the content of another storage unit. In English: "A write operation to one cell changes the content of the second cell."
  • Neighboring pattern sensitive fault (NPSF): the content of a storage unit, or the ability to change the content of the storage unit, is affected by the content of other storage units in the memory cell array. In English: "The content of a cell, or the ability to change its content, is influenced by the content of some other cell in memory."
  • Address decoder fault (AF): any fault that affects the address decoder. In English: "Any fault that affects the address decoder." It mainly takes the four forms described above for the first category, address decoder faults.
  • memory failures may also include other failure forms.
  • the embodiment of the present application only takes the above five types of failures as examples for description, and other failure forms are not listed and described here.
  • the embodiment of the present application provides a method for repairing a memory failure.
  • the basic principle of the method is to detect faults in the memory and to analyze and count these faults to obtain the fault range of the memory, so that the fault range can be repaired in a timely and effective manner to ensure the correctness and reliability of the memory function.
  • FIG. 3 is a schematic flowchart of a method for repairing a memory failure according to an embodiment of the application.
  • the method is applied to a storage system including control logic and a memory.
  • the method includes the following steps.
  • S301: Detect the storage units in the memory to obtain at least one fault.
  • the at least one fault may include one or more faults, and these may include at least one of the following types: stuck-at fault (SAF), transition fault (TF), coupling fault (CF), neighboring pattern sensitive fault (NPSF), or address decoder fault (AF).
  • each of the at least one fault may indicate the storage unit that has failed and the type of fault of that storage unit.
  • detecting the storage units in the memory to obtain at least one fault may include: generating at least one set of read and write operations according to the fault detection algorithm; and reading and writing the storage units in the memory based on the at least one set of read and write operations to obtain at least one fault.
  • the fault detection algorithm can be any of a variety of algorithms, for example, the Checkerboard algorithm (checkerboard method), the Gallop algorithm (galloping method), the March algorithm (marching method), the MSCAN algorithm (all-0/all-1 algorithm), and the butterfly algorithm (butterfly method).
  • each fault detection algorithm corresponds to a different detection pattern.
  • corresponding read and write operations can be generated for each detection pattern.
  • the generated read and write operations can include at least one set, that is, one or more sets, of read and write operations, so that at least one fault can be obtained by reading and writing the storage units in the memory based on the one or more sets of read and write operations.
  • for example, the detection process corresponding to the March algorithm may be: write 0 to memory cells A0 to An-1 in sequence; read 0 from memory cells A0 to An-1 and write 1, in sequence; read 1 from memory cells An-1 to A0 and write 0, in sequence; and read 0 from memory cells An-1 to A0.
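  • The four march elements just described can be sketched in Python against a simulated memory. The `FaultyMemory` class and its stuck-at model are hypothetical test scaffolding, not part of the application; the sketch only shows how the read-and-compare sequence exposes cells that cannot hold a value.

```python
class FaultyMemory:
    """Simulated bit-addressable memory with optional stuck-at cells (hypothetical model)."""

    def __init__(self, size, stuck_at=None):
        self.cells = [0] * size
        self.stuck_at = dict(stuck_at or {})  # address -> forced value (0 or 1)

    def write(self, addr, value):
        self.cells[addr] = self.stuck_at.get(addr, value)

    def read(self, addr):
        return self.stuck_at.get(addr, self.cells[addr])


def march_test(mem, size):
    """Run the four march elements and return the sorted failing addresses."""
    failing = set()

    def expect(addr, value):
        if mem.read(addr) != value:
            failing.add(addr)

    for addr in range(size):            # element 1: ascending, write 0
        mem.write(addr, 0)
    for addr in range(size):            # element 2: ascending, read 0 then write 1
        expect(addr, 0)
        mem.write(addr, 1)
    for addr in reversed(range(size)):  # element 3: descending, read 1 then write 0
        expect(addr, 1)
        mem.write(addr, 0)
    for addr in reversed(range(size)):  # element 4: descending, read 0
        expect(addr, 0)
    return sorted(failing)


# Cell 3 is stuck at 1 and cell 5 is stuck at 0.
print(march_test(FaultyMemory(8, {3: 1, 5: 0}), 8))
```

  • A stuck-at-1 cell fails the first "read 0", and a stuck-at-0 cell fails the "read 1"; the sketch reports only the failing address, whereas the method described here also records the type of fault.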
  • the types of faults that can be detected by different fault detection algorithms may be the same or different.
  • for example, the March algorithm can detect stuck-at faults (SAF), address decoder faults (AF), and transition faults (TF).
  • the Gallop algorithm can detect stuck-at faults (SAF).
  • the Checkerboard algorithm can detect stuck-at faults (SAF) and neighboring pattern sensitive faults (NPSF).
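  • As an illustration of a checkerboard-style detection pattern, the Python sketch below writes an alternating 0/1 pattern and then its inverse, reading each back. The simulated cell array and its stuck-at model are hypothetical, and the simulation covers only stuck-at behaviour; it does not model neighbour-dependent effects.

```python
def checkerboard(rows, cols, inverted=False):
    """Return a rows x cols pattern of alternating 0/1 values."""
    return [[(r + c + int(inverted)) % 2 for c in range(cols)] for r in range(rows)]


def checkerboard_test(write, read, rows, cols):
    """Write the pattern and its inverse, read each back, and collect mismatching cells."""
    failing = set()
    for inverted in (False, True):
        pattern = checkerboard(rows, cols, inverted)
        for r in range(rows):
            for c in range(cols):
                write(r, c, pattern[r][c])
        for r in range(rows):
            for c in range(cols):
                if read(r, c) != pattern[r][c]:
                    failing.add((r, c))
    return sorted(failing)


def make_cell_array(rows, cols, stuck_at=None):
    """Hypothetical simulated array; stuck_at maps (row, col) to a forced value."""
    stuck_at = dict(stuck_at or {})
    cells = [[0] * cols for _ in range(rows)]

    def write(r, c, value):
        cells[r][c] = stuck_at.get((r, c), value)

    def read(r, c):
        return cells[r][c]

    return write, read


write, read = make_cell_array(3, 4, {(1, 2): 0})
print(checkerboard_test(write, read, 3, 4))
```

  • Running both the pattern and its inverse ensures that every cell is asked to hold both 0 and 1 while its immediate neighbours hold the opposite value.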
  • detecting the storage unit in the memory to obtain at least one failure may include: determining the at least one failure according to data verification information of a read and write operation corresponding to the storage unit in the memory.
  • during a write operation, a data check bit can be generated according to the written data; during a read operation, the read data can be checked against the check bit generated at write time.
  • in this way the data check information is obtained. The data check may use an error correcting code (ECC), a parity check, or the like.
  • when the check of the read data succeeds, it can be determined that the storage unit in use has not failed; when the check fails, it can be determined that the storage unit in use has failed.
  • based on the success or failure indicated by the obtained data check information, the above-mentioned at least one fault can be determined.
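  • The write-time check bit and read-time verification described above can be sketched with a single even-parity bit per word. The class and method names are hypothetical, and a real controller would more likely use ECC; parity is used here only because it is the simplest of the two checks mentioned.

```python
def parity_bit(word):
    """Even-parity bit of a non-negative integer word: 1 if the count of 1 bits is odd."""
    return bin(word).count("1") % 2


class ParityProtectedMemory:
    """Hypothetical word store that keeps one parity bit per word."""

    def __init__(self, size):
        self.data = [0] * size
        self.check = [0] * size

    def write(self, addr, word):
        self.data[addr] = word
        self.check[addr] = parity_bit(word)  # check bit generated at write time

    def read(self, addr):
        """Return (word, ok); ok is False when the stored check bit no longer matches."""
        word = self.data[addr]
        return word, parity_bit(word) == self.check[addr]


mem = ParityProtectedMemory(4)
mem.write(2, 0b1011)
print(mem.read(2))        # check succeeds
mem.data[2] ^= 0b0100     # simulate a single-bit fault in the stored word
print(mem.read(2))        # check fails, so the cell is reported as faulty
```

  • A single parity bit detects any odd number of flipped bits but cannot locate or correct them, which is one reason ECC is listed as an alternative.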
  • S302: Analyze the at least one fault based on different storage granularities to determine the fault range of the memory.
  • the different storage granularities can include at least two of the following: storage die, channel, bank, plane, super block, block, sub-block, row, column, page, or storage unit (cell).
  • the storage die here can also be referred to as a stack.
  • a DRAM die in the HBM can be referred to as a storage die or a stack.
  • the division of storage granularity in different memories may be the same or different.
  • a storage unit (cell) is the smallest storage granularity; a row can include multiple storage units, which can be contiguous and located on a straight line;
  • a block can include multiple contiguous storage units, which can form a rectangle;
  • a bank can include multiple blocks;
  • a channel can include multiple banks;
  • a storage die can include multiple channels.
  • the fault range of the memory may include bus faults and storage space faults at different storage granularities.
  • bus faults can include data bus faults, address bus faults, and control bus faults; storage space faults at different storage granularities can include storage die faults, channel faults, bank faults, plane faults, super block faults, block faults, sub-block faults, row faults, column faults, page faults, or storage unit faults.
  • the types of bus faults and of storage space faults listed here are only examples. In actual applications, other types of faults may also be included, which is not specifically limited in the embodiments of this application.
  • the control logic may perform statistical analysis on the at least one fault based on different storage granularities, such as storage dies, channels, banks, blocks, rows, and storage units, to obtain the fault range of the memory. For example, count the number of storage units in the same row that have a certain fault; if the number reaches a first threshold, the row can be determined to be faulty. As another example, count the number of failed storage units in the same bank; if the number reaches a second threshold, the bank can be determined to be faulty.
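  • The threshold-based statistics described above can be sketched as follows. The tuple layout `(bank, row, column)`, the function name, and the rule that a coarser fault range absorbs the finer ones inside it are assumptions made for illustration; the application does not specify them.

```python
from collections import Counter


def analyze_fault_range(faults, row_threshold, bank_threshold):
    """Classify failed cells into bank, row, and single-cell fault ranges.

    faults: iterable of (bank, row, column) tuples, one per failed cell.
    A row is faulty when its failed-cell count reaches row_threshold (the
    "first threshold"); a bank is faulty when its count reaches
    bank_threshold (the "second threshold").
    """
    faults = set(faults)
    per_bank = Counter(bank for bank, _, _ in faults)
    per_row = Counter((bank, row) for bank, row, _ in faults)

    bank_faults = {b for b, n in per_bank.items() if n >= bank_threshold}
    # Assumption: rows inside an already faulty bank are not reported again.
    row_faults = {
        key for key, n in per_row.items()
        if n >= row_threshold and key[0] not in bank_faults
    }
    cell_faults = {
        f for f in faults
        if f[0] not in bank_faults and (f[0], f[1]) not in row_faults
    }
    return {
        "bank": sorted(bank_faults),
        "row": sorted(row_faults),
        "cell": sorted(cell_faults),
    }


detected = [(0, 1, 0), (0, 1, 1), (0, 1, 2), (0, 4, 7), (1, 0, 0)]
print(analyze_fault_range(detected, row_threshold=3, bank_threshold=5))
```

  • With these inputs, row 1 of bank 0 reaches the row threshold and is reported as a row fault, while the two isolated cells remain storage unit faults.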
  • the control logic can perform model training in advance based on multiple different faults to obtain trained models for different fault ranges.
  • the trained models for different fault ranges may include fault models of different buses and fault models of storage spaces at different storage granularities.
  • for example, the fault models of different buses can include a data bus fault model, an address bus fault model, and a control bus fault model; the fault models of storage spaces at different storage granularities can include a storage die fault model, a channel fault model, a bank fault model, a row fault model, a storage unit fault model, and so on.
  • the control logic can analyze at least one fault according to training models of different fault ranges, so as to determine the fault range of the memory.
  • the preset fault repair strategies can be configured in advance. They can include multiple fault repair strategies corresponding to different fault ranges, and each fault range can correspond to one or more fault repair strategies.
  • when the fault range is a bus fault, the fault repair strategy corresponding to the bus fault may include at least one of the following: switching the faulty bus to a redundant bus, switching the channel corresponding to the faulty bus to an unused channel, switching the storage die corresponding to the faulty bus to an unused storage die, or reducing the bit width of the bus in use.
  • the bus may include a data bus, an address bus, and a control bus. For example, when the fault range is a data bus fault, the corresponding fault repair strategy may include at least one of the following: switching the faulty data bus to a redundant data bus, switching the channel corresponding to the faulty data bus to an unused channel, switching the storage die corresponding to the faulty data bus to an unused storage die, or reducing the bit width of the data bus in use.
  • the fault repair strategies corresponding to an address bus fault and to a control bus fault are similar to the strategy for a data bus fault described above.
  • when the fault range is a storage space fault, the corresponding fault repair strategy includes one of the following: mapping the faulty storage space to redundant storage space, or mapping the faulty storage space to unused storage space.
  • storage space faults at different storage granularities may include storage die faults, channel faults, bank faults, row faults, and storage unit faults.
  • for example, when the fault range is a bank fault, the corresponding fault repair strategy may include at least one of the following: mapping the faulty bank to a redundant bank, or mapping the faulty bank to an unused bank.
  • the fault repair strategies for storage space faults at granularities other than the bank are similar to the strategy for a bank fault described above, and are not repeated here in the embodiments of this application.
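  • A preset fault repair strategy table of the kind described above could be represented as a simple lookup, sketched below in Python. The strategy identifiers, the priority order within each list, and the idea of passing in the set of currently usable strategies are all assumptions made for illustration.

```python
# Hypothetical preset table: fault range -> candidate strategies, in an assumed priority order.
PRESET_REPAIR_STRATEGIES = {
    "bus": [
        "switch_to_redundant_bus",
        "switch_channel_to_unused_channel",
        "switch_die_to_unused_die",
        "reduce_bus_bit_width",
    ],
    "storage_space": [
        "map_to_redundant_storage_space",
        "map_to_unused_storage_space",
    ],
}


def select_repair_strategy(fault_range, available):
    """Return the first preset strategy for fault_range that is in `available`, else None.

    available: set of strategy names whose resources (spare bus, unused
    channel, redundant rows, and so on) currently exist.
    """
    for strategy in PRESET_REPAIR_STRATEGIES.get(fault_range, []):
        if strategy in available:
            return strategy
    return None


usable = {"reduce_bus_bit_width", "switch_channel_to_unused_channel"}
print(select_repair_strategy("bus", usable))
```

  • Returning `None` when no strategy applies leaves it to the caller to decide what happens when neither redundant nor unused resources remain, a case the text above does not address.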
  • the method may further include S304.
  • S304: Migrate the stored data related to the fault range in the memory.
  • when the fault range is a bus fault and the corresponding fault repair strategy is to switch the channel corresponding to the faulty bus to an unused channel, migrating the stored data related to the fault range may specifically be: migrating the data in the channel corresponding to the faulty bus to the unused channel.
  • when the fault range is a bus fault and the corresponding fault repair strategy is to switch the storage die corresponding to the faulty bus to an unused storage die, migrating the stored data related to the fault range may specifically be: migrating the data in the storage die corresponding to the faulty bus to the unused storage die.
  • when the fault range is a storage space fault and the corresponding fault repair strategy is to map the faulty storage space to redundant storage space, migrating the stored data related to the fault range may specifically be: migrating the data in the faulty storage space to the redundant storage space.
  • when the fault range is a storage space fault and the corresponding fault repair strategy is to map the faulty storage space to unused storage space, migrating the stored data related to the fault range may specifically be: migrating the data in the faulty storage space to the unused storage space.
  • In FIG. 6, the storage system includes: a storage test module 401, a storage verification module 402, a fault analysis module 403, a fault processing module 404, a memory 405, and a replacement resource storage module 406.
  • The storage test module 401 can test the storage units in the memory 405 according to a fault detection algorithm. For example, it can test the storage units in the memory 405 according to the Checkerboard, Gallop, March, or butterfly algorithm to obtain at least one fault.
  • The storage verification module 402 can determine at least one fault according to the data check information of the read and write operations on the storage units in the memory 405. For example, during normal reads and writes of the memory 405, it can generate data check bits from the written data during a write operation and, during a read operation, check the read data against the check bits generated at write time, thereby obtaining the data check information.
  • The fault analysis module 403 can collect the at least one fault determined by the storage test module 401 and/or the storage verification module 402, and analyze it based on different storage granularities such as storage units, rows, blocks, and banks to determine the fault range of the memory, for example by classifying and counting the faults.
  • The fault processing module 404 can repair the memory 405 according to the fault repair strategy that corresponds to the fault range among the preset fault repair strategies. For example, the fault analysis module 403 can generate operation instructions for the corresponding fault repair strategy and send them to the fault processing module 404, which repairs the memory 405 according to the received instructions.
  • The replacement resource storage module 406 may include storage spaces of different storage granularities, for example redundant storage units, redundant rows, redundant blocks, and redundant banks. When the fault repair strategy corresponding to the fault range is to replace faulty storage space with redundant storage space, the redundant storage space in the replacement resource storage module 406 can be used to repair the faulty storage space in the memory 405, for example using a redundant block to repair a faulty block in the memory 405.
  • In the embodiments of this application, the control logic detects faults in the memory and performs analysis operations such as classification and counting on the detected faults to obtain the fault range, and then repairs the memory according to the fault repair strategy that corresponds to that fault range among the preset fault repair strategies. As a result, when the memory fails, faults of various granularities or fault ranges can be repaired promptly and effectively to keep the memory usable, which improves the correctness and reliability of the memory and extends its service life.
  • the storage system includes hardware structures and/or software modules corresponding to each function.
  • the present application can be implemented in the form of hardware or a combination of hardware and computer software. Whether a certain function is executed by hardware or computer software-driven hardware depends on the specific application and design constraint conditions of the technical solution. Professionals and technicians can use different methods for each specific application to implement the described functions, but such implementation should not be considered beyond the scope of this application.
  • the embodiment of the present application may divide the storage system into functional modules according to the foregoing method examples.
  • each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module.
  • the above-mentioned integrated modules can be implemented in the form of hardware or software functional modules. It should be noted that the division of modules in the embodiments of the present application is illustrative, and is only a logical function division, and there may be other division methods in actual implementation.
  • FIG. 7 shows a memory fault repair device provided by an embodiment of this application.
  • the device includes a detection unit 501, an analysis unit 502, and a repair unit 503.
  • the detection unit 501 is used to detect the storage unit in the memory to obtain at least one fault;
  • the analysis unit 502 is used to analyze the at least one fault based on different storage granularities to determine the fault range of the memory;
  • The repair unit 503 is used to repair the memory according to the fault repair strategy that corresponds to the fault range among the preset fault repair strategies.
  • The at least one fault includes at least one of the following types of faults: stuck-at fault (SAF), transition fault (TF), coupling fault (CF), neighborhood pattern sensitive fault (NPSF), or address decoder fault (AF).
  • The different storage granularities include at least two of the following: storage die, channel, bank, plane, super block, block, sub-block, row, column, page, or storage unit.
  • the fault scope includes at least one of the following: bus faults, storage space faults with different storage granularities.
  • the memory may be a high-bandwidth memory.
  • the detection unit 501 is specifically configured to generate at least one set of read and write operations according to a fault detection algorithm; read and write a storage unit in the memory based on the at least one set of read and write operations to obtain at least one fault. And/or, the detection unit 501 is further specifically configured to determine at least one failure according to the data verification information of the read and write operation corresponding to the storage unit in the memory.
  • When the fault range is a bus fault, the corresponding fault repair strategy includes one of the following: switching the faulty bus to a redundant bus, switching the channel corresponding to the faulty bus to an unused channel, switching the storage die corresponding to the faulty bus to an unused storage die, or reducing the bit width of the bus in use. When the fault range is a storage space fault, the corresponding fault repair strategy includes one of the following: mapping the faulty storage space to redundant storage space, or mapping the faulty storage space to unused storage space.
  • the device further includes: a migration unit 504.
  • The migration unit 504 is configured to: when the fault range is a bus fault and the corresponding fault repair strategy is to switch the channel corresponding to the faulty bus to an unused channel, migrate the data in the channel corresponding to the faulty bus to the unused channel; when the fault range is a bus fault and the corresponding fault repair strategy is to switch the storage die corresponding to the faulty bus to an unused storage die, migrate the data in the storage die corresponding to the faulty bus to the unused storage die; when the fault range is a storage space fault and the corresponding fault repair strategy is to map the faulty storage space to redundant storage space, migrate the data in the faulty storage space to the redundant storage space; and when the fault range is a storage space fault and the corresponding fault repair strategy is to map the faulty storage space to unused storage space, migrate the data in the faulty storage space to the unused storage space.
  • the electronic device includes a processor 601, a storage system 602, a communication interface 603, and a bus 604.
  • The processor 601, the storage system 602, and the communication interface 603 are connected through the bus 604.
  • the storage system 602 includes a control logic and a memory, and the control logic is used to support the electronic device to perform the memory failure repair method provided above.
  • the control logic may be integrated with the processor 601 or may be integrated with the memory. In FIG. 8, the integration of the control logic and the memory is taken as an example for illustration.
  • the processor 601 may be a central processing unit, a general-purpose processor, a digital signal processor, an application specific integrated circuit, a field programmable gate array, or other programmable logic devices, transistor logic devices, hardware components, or any combination thereof. It can implement or execute various exemplary logical blocks, modules, and circuits described in conjunction with the disclosure of this application.
  • the processor may also be a combination that implements computing functions, for example, a combination of one or more microprocessors, a combination of a digital signal processor and a microprocessor, and so on.
  • the bus 604 in FIG. 8 may be a peripheral component interconnection standard (PCI) bus or an extended industry standard architecture (EISA) bus or the like. The bus can be divided into an address bus, a data bus, a control bus, and so on. For ease of presentation, only one thick line is used in the foregoing FIG. 8, but it does not mean that there is only one bus or one type of bus.
  • In the embodiments of this application, faults in the memory are detected, and analysis operations such as classification and counting are performed on the detected faults to obtain the fault range; the memory is then repaired according to the fault repair strategy that corresponds to that fault range among the preset fault repair strategies. As a result, when the memory fails, faults of various granularities or fault ranges can be repaired promptly and effectively to keep the memory usable, which improves the correctness and reliability of the memory and extends its service life.
  • In another embodiment of this application, a computer-readable storage medium is provided. The computer-readable storage medium stores computer-executable instructions, and when at least one processor of a device executes the computer-executable instructions, the device performs the memory fault repair method provided above.
  • In another embodiment of this application, a computer program product is provided. The computer program product includes computer-executable instructions stored in a computer-readable storage medium; at least one processor of a device can read the computer-executable instructions from the computer-readable storage medium, and executing them causes the device to implement the memory fault repair method provided above.
  • the disclosed system, device, and method may be implemented in other ways.
  • The device embodiments described above are merely illustrative. For example, the division of units is only a division by logical function, and other divisions are possible in an actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • If the functions are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium.
  • Based on this understanding, the technical solutions of the embodiments of this application, in essence, or the part that contributes to the prior art, or a part of the technical solutions, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods in the embodiments of this application.
  • The aforementioned storage media include various media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Techniques For Improving Reliability Of Storages (AREA)

Abstract

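A memory fault repair method and apparatus, relating to the field of memory technology, are used to repair faults of different granularities or different fault ranges, improving the fault repair capability of a memory. The method is applied to a storage system that includes control logic and a memory, and includes: detecting storage units in the memory to obtain at least one fault (S301); analyzing the at least one fault based on different storage granularities to determine a fault range of the memory (S302); and repairing the memory according to the fault repair strategy that corresponds to the fault range among preset fault repair strategies (S303).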

Description

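Memory Fault Repair Method and Apparatus
Technical Field
This application relates to the field of memory technology, and in particular to a memory fault repair method and apparatus.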
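Background
Memory is the main component for storing information in various electronic devices and can be used to store information such as operation code and data files. As the density and complexity of memory chips increase and memories are ever more widely used, how to repair memory faults is particularly important for ensuring the correctness and reliability of memory functions.
In the prior art, manufacturers usually provide redundant rows and redundant columns in a memory. When a storage unit in the memory fails, the faulty storage unit can be isolated, and a redundant row or redundant column replaces the row or column in which the faulty storage unit is located. However, this isolation-based repair can only repair row or column faults, so its repair range is limited.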
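Summary
This application provides a memory fault repair method and apparatus for repairing faults of different granularities or different fault ranges, improving the fault repair capability of a memory.
To achieve the foregoing objective, the embodiments of this application adopt the following technical solutions:
According to a first aspect, a memory fault repair method is provided. The method is applied to a storage system that includes control logic and a memory, and includes: detecting storage units in the memory to obtain at least one fault; analyzing the at least one fault based on different storage granularities to determine a fault range of the memory; and repairing the memory according to the fault repair strategy that corresponds to the fault range among preset fault repair strategies.
In the foregoing technical solution, the control logic detects faults in the memory, performs analysis operations such as classification and counting on the detected faults to obtain the fault range, and then repairs the memory according to the fault repair strategy that corresponds to that fault range among the preset fault repair strategies. As a result, when the memory fails, faults of various granularities or fault ranges can be repaired promptly and effectively to keep the memory usable, which improves the correctness and reliability of the memory and extends its service life.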
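In a possible implementation of the first aspect, detecting the storage units in the memory to obtain at least one fault includes: generating at least one group of read and write operations according to a fault detection algorithm, where the fault detection algorithm includes, for example, the Checkerboard algorithm, the Gallop algorithm, the March algorithm, the MSCAN algorithm (all-0/all-1 algorithm), and the butterfly algorithm; and reading and writing the storage units in the memory based on the at least one group of read and write operations to obtain at least one fault, where each fault can indicate the faulty storage unit and the type of the fault. In this implementation, a fault detection algorithm can detect one or more faults in the storage units of the memory, which improves the accuracy and efficiency of fault detection.
In a possible implementation of the first aspect, detecting the storage units in the memory to obtain at least one fault includes: determining at least one fault according to the data check information of the read and write operations on the storage units in the memory, where the data check can use, for example, ECC or parity. In this implementation, determining at least one fault from the data check information of read and write operations is simple and effective.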
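In a possible implementation of the first aspect, the at least one fault includes at least one of the following types of faults: stuck-at fault (SAF), transition fault (TF), coupling fault (CF), neighborhood pattern sensitive fault (NPSF), or address decoder fault (AF). This implementation increases the diversity of the at least one fault that is determined.
In a possible implementation of the first aspect, the different storage granularities include at least two of the following: stack, channel, bank, plane, super block, block, sub-block, row, column, page, or storage unit. This implementation increases the diversity of storage granularities, so analyzing the at least one fault at different granularities makes fault repair more targeted and flexible.
In a possible implementation of the first aspect, the fault range includes at least one of the following: a bus fault, or a storage space fault at one of the different storage granularities. Bus faults may include address bus faults, data bus faults, and information bus faults; storage space faults at different storage granularities may include row faults, column faults, block faults, bank faults, channel faults, and so on. This implementation increases the diversity of fault ranges, so that different fault repair strategies can be set for different fault ranges, improving the effectiveness and flexibility of fault repair.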
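In a possible implementation of the first aspect, when the fault range is a bus fault, the corresponding fault repair strategy includes one of the following: switching the faulty bus to a redundant bus, switching the channel corresponding to the faulty bus to an unused channel, switching the stack corresponding to the faulty bus to an unused stack, or reducing the bit width of the bus in use. When the fault range is a storage space fault, the corresponding fault repair strategy includes one of the following: mapping the faulty storage space to redundant storage space, or mapping the faulty storage space to unused storage space. In this implementation, different fault repair strategies are set for different fault ranges, which improves the effectiveness and flexibility of fault repair.
In a possible implementation of the first aspect, the method further includes: when the fault range is a bus fault and the corresponding fault repair strategy is to switch the channel corresponding to the faulty bus to an unused channel, migrating the data in the channel corresponding to the faulty bus to the unused channel; when the fault range is a bus fault and the corresponding fault repair strategy is to switch the stack corresponding to the faulty bus to an unused stack, migrating the data in the storage die corresponding to the faulty bus to the unused storage die; when the fault range is a storage space fault and the corresponding fault repair strategy is to map the faulty storage space to redundant storage space, migrating the data in the faulty storage space to the redundant storage space; and when the fault range is a storage space fault and the corresponding fault repair strategy is to map the faulty storage space to unused storage space, migrating the data in the faulty storage space to the unused storage space. In this implementation, the stored data related to the fault range can be migrated after the fault is repaired, which preserves the integrity and security of the stored data without affecting subsequent normal access to it.
In a possible implementation of the first aspect, the memory is a high-bandwidth memory. In this implementation, using the method to repair different fault ranges in a high-bandwidth memory can extend its service life and reduce cost.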
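According to a second aspect, a memory fault repair apparatus is provided, including: a detection unit, configured to detect storage units in a memory to obtain at least one fault; an analysis unit, configured to analyze the at least one fault based on different storage granularities to determine a fault range of the memory; and a repair unit, configured to repair the memory according to the fault repair strategy that corresponds to the fault range among preset fault repair strategies.
In a possible implementation of the second aspect, the detection unit is specifically configured to: generate at least one group of read and write operations according to a fault detection algorithm; and read and write the storage units in the memory based on the at least one group of read and write operations to obtain at least one fault.
In a possible implementation of the second aspect, the detection unit is further specifically configured to: determine at least one fault according to the data check information of the read and write operations on the storage units in the memory.
In a possible implementation of the second aspect, the at least one fault includes at least one of the following types of faults: stuck-at fault (SAF), transition fault (TF), coupling fault (CF), neighborhood pattern sensitive fault (NPSF), or address decoder fault (AF).
In a possible implementation of the second aspect, the different storage granularities include at least two of the following: storage die, channel, bank, plane, super block, block, sub-block, row, column, page, or storage unit.
In a possible implementation of the second aspect, the fault range includes at least one of the following: a bus fault, or a storage space fault at one of the different storage granularities.
In a possible implementation of the second aspect, when the fault range is a bus fault, the corresponding fault repair strategy includes one of the following: switching the faulty bus to a redundant bus, switching the channel corresponding to the faulty bus to an unused channel, switching the storage die corresponding to the faulty bus to an unused storage die, or reducing the bit width of the bus in use. When the fault range is a storage space fault, the corresponding fault repair strategy includes one of the following: mapping the faulty storage space to redundant storage space, or mapping the faulty storage space to unused storage space.
In a possible implementation of the second aspect, the apparatus further includes a migration unit, configured to: when the fault range is a bus fault and the corresponding fault repair strategy is to switch the channel corresponding to the faulty bus to an unused channel, migrate the data in the channel corresponding to the faulty bus to the unused channel; when the fault range is a bus fault and the corresponding fault repair strategy is to switch the storage die corresponding to the faulty bus to an unused storage die, migrate the data in the storage die corresponding to the faulty bus to the unused storage die; when the fault range is a storage space fault and the corresponding fault repair strategy is to map the faulty storage space to redundant storage space, migrate the data in the faulty storage space to the redundant storage space; and when the fault range is a storage space fault and the corresponding fault repair strategy is to map the faulty storage space to unused storage space, migrate the data in the faulty storage space to the unused storage space.
In a possible implementation of the second aspect, the memory is a high-bandwidth memory.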
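According to a third aspect, an electronic device is provided. The electronic device includes a processor, a storage system, a communication interface, and a bus. The processor, the storage system, and the communication interface are connected through the bus. The storage system includes control logic and a memory, and the control logic is used to support the electronic device in performing the memory fault repair method provided by the first aspect or any possible implementation of the first aspect.
In yet another aspect of this application, a computer-readable storage medium is provided. The computer-readable storage medium stores instructions that, when run on a device, cause the device to perform the memory fault repair method provided by the first aspect or any possible implementation of the first aspect.
In yet another aspect of this application, a computer program product is provided. When the computer program product runs on a device, it causes the device to perform the memory fault repair method provided by the first aspect or any possible implementation of the first aspect.
It can be understood that the apparatus, electronic device, computer-readable storage medium, and computer program product provided above are all used to perform the corresponding method provided above. For their beneficial effects, refer to the beneficial effects of the corresponding method, which are not repeated here.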
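Brief Description of the Drawings
FIG. 1 is a schematic structural diagram of an HBM provided by an embodiment of this application;
FIG. 2 is a schematic diagram of a functional model of a memory provided by an embodiment of this application;
FIG. 3 is a schematic flowchart of a memory fault repair method provided by an embodiment of this application;
FIG. 4 is a schematic diagram of different storage granularities in a memory provided by an embodiment of this application;
FIG. 5 is a schematic flowchart of another memory fault repair method provided by an embodiment of this application;
FIG. 6 is a schematic structural diagram of a storage system provided by an embodiment of this application;
FIG. 7 is a schematic structural diagram of a memory fault repair apparatus provided by an embodiment of this application;
FIG. 8 is a schematic structural diagram of an electronic device provided by an embodiment of this application.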
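Detailed Description
In this application, "at least one" means one or more, and "a plurality of" means two or more. "And/or" describes an association between associated objects and indicates that three relationships may exist. For example, A and/or B may indicate: A alone, both A and B, or B alone, where A and B may be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects. "At least one of the following items" or a similar expression means any combination of these items, including a single item or any combination of multiple items. For example, at least one of a, b, or c may indicate: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, and c may each be single or multiple. In addition, the embodiments of this application use words such as "first" and "second" to distinguish between identical or similar items whose functions and effects are basically the same. For example, a first threshold and a second threshold are merely different thresholds, and no order is implied. A person skilled in the art will understand that words such as "first" and "second" do not limit quantity or execution order.
It should be noted that in this application, words such as "exemplary" or "for example" are used to present an example, illustration, or explanation. Any embodiment or design described as "exemplary" or "for example" in this application should not be construed as preferred or more advantageous than other embodiments or designs. Rather, such words are intended to present the related concepts in a concrete manner.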
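The technical solutions of this application can be applied to various memories, such as random access memory (RAM) and read-only memory (ROM). RAM is fast to access but loses its data when powered off, whereas ROM retains its data when powered off but is slow to access. In another possible embodiment, the technical solutions of this application can be applied to various types of ROM, such as erasable programmable read-only memory (erasable programmable ROM, EPROM), electrically erasable programmable read-only memory (electrically-erasable programmable ROM, EEPROM), and flash memory. In a possible embodiment, the technical solutions of this application can be applied to various types of RAM, such as static random access memory (static RAM, SRAM), dynamic random access memory (dynamic RAM, DRAM), and synchronous dynamic random access memory (synchronous DRAM, SDRAM). DRAM may also include multi-rate DRAM and high bandwidth memory (HBM); multi-rate DRAM may include double data rate DDR, DDR4, DDR5, and so on. In addition, the technical solutions of this application can also be applied to emerging memories such as FeRAM and MRAM, which are not listed one by one in the embodiments of this application.
The HBM mentioned above is a high-bandwidth memory formed by stacking multiple DRAM dies (for example 2, 4, or 8) using a 3D stacking process and packaging them together. HBM is suitable for devices that require high memory bandwidth, such as graphics processing devices and network switching and forwarding devices such as routers and switches. The technical solutions of this application can also be applied to memories formed by stacking multiple dies of SRAM, MRAM, and the like, which are not listed one by one in the embodiments of this application.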
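The following uses an HBM formed by stacking multiple DRAM dies as an example of a stacked memory; FIG. 1 does not limit the embodiments of this application. FIG. 1 is a schematic structural diagram of an HBM die provided by an embodiment of this application. The HBM die may include a logic die and multiple DRAM dies; the DRAM dies are stacked together and connected to the logic die through through-silicon vias (TSVs) and micro-bumps. The logic die may be a die in which control logic is integrated. The control logic may be a memory controller, which can be used to manage and control reads and writes of the multiple DRAM dies.
Each DRAM die may include two channels, and each channel may be 128 bits wide. In FIG. 1, the HBM die includes four DRAM dies, and each DRAM die includes two channels, CH0 and CH1, as an example. Part (a) of FIG. 1 is a top view of the HBM die, and part (b) of FIG. 1 is a side view of the HBM die.
It should be noted that FIG. 1 uses four layers of DRAM dies only as an example. In practice, an HBM may include more DRAM layers, such as five or six, which is not specifically limited in the embodiments of this application.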
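FIG. 2 is a schematic diagram of a functional model of a memory provided by an embodiment of this application. As shown in FIG. 2, the functional model of the memory includes: an address latch 201, a column decoder 202, a row decoder 203, a storage unit array 204, a write driver 205, a sense amplifier 206, a data register 207, and refresh logic 208. The write driver 205 and the sense amplifier 206 may together be called the read/write circuit.
The address latch 201 receives and latches an address, which may include a row address and a column address, and can pass the row address to the row decoder 203 and the column address to the column decoder 202. The row decoder 203 decodes the row address and the column decoder 202 decodes the column address; once decoded, a row address and a column address together select one storage unit (cell) to be read or written in the storage unit array 204. The storage unit array 204 stores data; a storage unit is the smallest unit of storage in the array, and one storage unit can store one binary code. The read/write circuit controls the operating state of the memory, for example controlling the reading of data out of the storage unit array 204 and the writing of data into it. The data register 207 holds data to be written into the storage unit array 204 and data read out of it. The refresh logic 208 is used to refresh the row decoder 203 and the column decoder 202.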
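Based on the functional model of the memory shown in FIG. 2, memory faults can be divided into three categories: address decoder faults, read/write logic faults, and storage unit array faults. Each category is described below.
Category 1: address decoder faults
An address decoder fault is a fault that arises in the address decoding logic. It mainly takes four forms: for a given address, no storage unit corresponds to that address; for a given storage unit, no address can select that storage unit; a given address selects two or more storage units at the same time; or multiple addresses select the same storage unit.
Category 2: read/write logic faults
Read/write logic faults mainly appear as open circuits, short circuits, or stuck input/output (I/O) faults in the readout logic of some sense amplifiers or the logic of the write drivers within the read/write circuit, and as cross-coupling interference between the data lines of the write circuit.
Category 3: storage unit array faults
Because the storage unit array is the most complex module in the memory, it is the most likely to fail and its fault types are the most complex. These faults are mainly caused by open circuits, short circuits, and crosstalk on the data lines within the storage unit array.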
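Combining the different forms of address decoder faults, read/write logic faults, and storage unit array faults described above, memory faults can be classified into the following five functional faults: stuck-at fault (SAF), transition fault (TF), coupling fault (CF), neighborhood pattern sensitive fault (NPSF), and address decoder fault (AF). Each is described below.
Stuck-at fault (SAF): the logic value of a storage unit or a line is always 0 or always 1. In the original English wording: "The logic value of a cell or a line is always 0 or 1."
Transition fault (TF): a storage unit fails such that a 0→1 transition or a 1→0 transition cannot occur. In the original English wording: "A cell or a line that fails to undergo a 0→1 or 1→0 transition."
Coupling fault (CF): a write operation to one storage unit changes the content of another storage unit. In the original English wording: "A write operation to one cell changes the content of the second cell."
Neighborhood pattern sensitive fault (NPSF): the content of a storage unit, or the ability to change that content, is affected by the content of other storage units in the storage unit array. In the original English wording: "The content of a cell, or the ability to change its content, is influenced by the content of some other cell in memory."
Address decoder fault (AF): any fault that affects the address decoder. In the original English wording: "Any fault that affect address decoder." It mainly takes the four forms described under Category 1 above.
It should be noted that, in practice, memory faults may take other forms. The embodiments of this application use the five faults above only as examples; other fault forms are not listed one by one here.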
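On this basis, an embodiment of this application provides a memory fault repair method. Its basic principle is to detect faults in the memory and to analyze and count these faults to obtain the fault range of the memory, so that the fault range can be repaired promptly and effectively, ensuring the correctness and reliability of memory functions.
FIG. 3 is a schematic flowchart of a memory fault repair method provided by an embodiment of this application. The method is applied to a storage system that includes control logic and a memory, and includes the following steps.
S301: Detect the storage units in the memory to obtain at least one fault.
The at least one fault may include one or more faults, which may include at least one of the following types: stuck-at fault (SAF), transition fault (TF), coupling fault (CF), neighborhood pattern sensitive fault (NPSF), or address decoder fault (AF). Each of the at least one fault can indicate the faulty storage unit and the fault type of that storage unit.
In a possible implementation, detecting the storage units in the memory to obtain at least one fault may include: generating at least one group of read and write operations according to a fault detection algorithm; and reading and writing the storage units in the memory based on the at least one group of read and write operations to obtain at least one fault.
The fault detection algorithm may include a variety of different algorithms, such as the Checkerboard algorithm, the Gallop algorithm, the March algorithm, the MSCAN algorithm (all-0/all-1 algorithm), and the butterfly algorithm. Each fault detection algorithm corresponds to a different detection pattern, and the corresponding read and write operations can be generated from that pattern. The generated operations may include one or more groups of read and write operations, and reading and writing the storage units in the memory based on these groups yields the at least one fault.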
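For example, if the fault detection algorithm includes the March algorithm and the storage unit array contains n storage units (n is an integer greater than 1), denoted A0 to An-1, the detection process of the March algorithm may be: write 0 to storage units A0 to An-1; read 0 from storage units A0 to An-1 in order, writing 1 to each after reading; read 1 from storage units An-1 to A0 in order, writing 0 to each after reading; and finally read 0 from storage units An-1 to A0.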
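The four March passes described above can be sketched in Python against a simulated memory. This is only an illustration under stated assumptions: `SimulatedMemory`, its `stuck` fault-injection argument, and `march_test` are hypothetical names that do not appear in the application, and the sketch records only which cells misbehave, not the fault type.

```python
class SimulatedMemory:
    """Hypothetical one-bit-per-cell memory; `stuck` forces chosen cells to a fixed value."""

    def __init__(self, n, stuck=None):
        self.stuck = dict(stuck or {})
        self.cells = [0] * n
        # A stuck cell holds its forced value from the start.
        for addr, value in self.stuck.items():
            self.cells[addr] = value

    def write(self, addr, bit):
        # Writes to a stuck cell have no effect.
        self.cells[addr] = self.stuck.get(addr, bit)

    def read(self, addr):
        return self.cells[addr]


def march_test(mem, n):
    """Run the four passes from the text and return the sorted list of faulty cell addresses."""
    faulty = set()

    def expect(addr, bit):
        if mem.read(addr) != bit:
            faulty.add(addr)

    for addr in range(n):              # pass 1: write 0 to A0..An-1
        mem.write(addr, 0)
    for addr in range(n):              # pass 2: ascending, read 0 then write 1
        expect(addr, 0)
        mem.write(addr, 1)
    for addr in reversed(range(n)):    # pass 3: descending, read 1 then write 0
        expect(addr, 1)
        mem.write(addr, 0)
    for addr in reversed(range(n)):    # pass 4: descending, read 0
        expect(addr, 0)
    return sorted(faulty)
```

A cell stuck at 1 fails the read-0 checks, and a cell stuck at 0 fails the read-1 check, so both kinds of stuck-at fault are reported by this sequence.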
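It should be noted that different fault detection algorithms have different detection processes. For the specific detection process of each algorithm, refer to descriptions in the related art; the embodiments of this application use the March algorithm only as an example.
In addition, the types of faults that different fault detection algorithms can detect may be the same or different. For example, the March algorithm can detect stuck-at faults (SAF), address decoder faults (AF), and transition faults (TF); the Gallop algorithm can detect stuck-at faults (SAF), transition faults (TF), and coupling faults (CF); and the Checkerboard algorithm can detect stuck-at faults (SAF) and neighborhood pattern sensitive faults (NPSF).
In another possible implementation, detecting the storage units in the memory to obtain at least one fault may include: determining at least one fault according to the data check information of the read and write operations on the storage units in the memory.
Specifically, during normal reads and writes of the memory, data check bits can be generated from the written data during a write operation, and during a read operation the read data can be checked against the check bits generated at write time, thereby obtaining the data check information. The data check can use an error-correcting code (ECC), parity, or the like. For example, if the read data passes the check, the storage unit currently in use can be determined not to be faulty; if the read data fails the check, the storage unit currently in use can be determined to be faulty. The at least one fault can be determined from whether the data check information indicates success or failure.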
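As a rough illustration of the write-time check bit and read-time verification described above, the following Python sketch uses a single even-parity bit per word. The names `parity_bit` and `ParityProtectedMemory` are hypothetical, and real memories typically use ECC with several check bits per word rather than one parity bit.

```python
def parity_bit(value):
    """Even-parity check bit: 1 when the word contains an odd number of 1 bits."""
    return bin(value).count("1") % 2


class ParityProtectedMemory:
    """Hypothetical word-addressable memory that stores one check bit next to each word."""

    def __init__(self, size):
        self.data = [0] * size
        self.check = [0] * size

    def write(self, addr, value):
        # Generate the check bit from the written data at write time.
        self.data[addr] = value
        self.check[addr] = parity_bit(value)

    def read(self, addr):
        # Verify the read data against the check bit generated at write time.
        value = self.data[addr]
        ok = parity_bit(value) == self.check[addr]
        return value, ok
```

A read that returns `ok == False` would be reported as a fault for the storage units at that address. A single parity bit only catches an odd number of flipped bits, which is one reason ECC is commonly preferred.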
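S302: Analyze the at least one fault based on different storage granularities to determine the fault range of the memory.
The different storage granularities may include at least two of the following: storage die, channel, bank, plane, super block, block, sub-block, row, column, page, or storage unit (cell). A storage die may also be called a stack; for example, one DRAM die in an HBM may be called a storage die or a stack.
In addition, the division into storage granularities may be the same or different in different memories. The storage granularities within one memory may be ordered by size. The following uses a memory whose granularities include storage die, channel, bank, block, row, and storage unit as an example to illustrate this ordering.
For example, as shown in FIG. 4, the storage unit is the smallest storage granularity (it may also be called the unit of storage). A row may include multiple storage units, which may be contiguous and lie on one straight line. A block may include multiple contiguous storage units, which may form a rectangle. A bank may include multiple blocks, a channel may include multiple banks, and a storage die may include multiple channels.
Furthermore, the fault range of the memory may include bus faults and storage space faults at different storage granularities. Bus faults may include data bus faults, address bus faults, and control bus faults. Storage space faults at different storage granularities may include storage die faults, channel faults, bank faults, plane faults, super block faults, block faults, sub-block faults, row faults, column faults, page faults, or storage unit faults. It should be noted that the types of bus faults and storage space faults listed here are only examples; in practice, other types of faults may also be included, which is not specifically limited in the embodiments of this application.
Specifically, the control logic can perform statistical analysis on the at least one fault at different storage granularities, such as storage die, channel, bank, block, row, and storage unit, to obtain the fault range of the memory. For example, it can count the storage units in the same row that have a given fault, and if the number of faulty storage units reaches a first threshold, determine that the row is faulty. As another example, it can count the faulty storage units in the same bank, and if the number reaches a second threshold, determine that the bank is faulty.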
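The threshold-based counting described above might look like the following Python sketch. It is an illustration only: the `(bank, row, column)` coordinate format, the function name `analyze_fault_range`, and the rule that a coarser fault range absorbs the finer ones inside it are assumptions, not details given in the application.

```python
from collections import Counter


def analyze_fault_range(faults, row_threshold, bank_threshold):
    """Classify faulty cells, given as (bank, row, column) tuples, into bank, row, and cell faults.

    row_threshold plays the role of the "first threshold" and bank_threshold the
    "second threshold" in the text. Reporting only the coarsest matching range is
    an assumption made for this sketch.
    """
    cells = set(faults)
    per_bank = Counter(bank for bank, _, _ in cells)
    per_row = Counter((bank, row) for bank, row, _ in cells)

    failed_banks = sorted(b for b, count in per_bank.items() if count >= bank_threshold)
    failed_rows = sorted(
        key for key, count in per_row.items()
        if count >= row_threshold and key[0] not in failed_banks
    )
    failed_cells = sorted(
        c for c in cells
        if c[0] not in failed_banks and (c[0], c[1]) not in failed_rows
    )
    return {"bank": failed_banks, "row": failed_rows, "cell": failed_cells}
```

With a row threshold of 3, three faulty cells in row 1 of bank 0 are reported as one row fault rather than three separate cell faults, which lets the repair step pick a row-level strategy.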
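Optionally, the control logic may perform model training in advance on multiple different faults to obtain trained models for different fault ranges. These may include fault models for different buses and fault models for storage spaces of different storage granularities. For example, the bus fault models may include a data bus fault model, an address bus fault model, and a control bus fault model; the storage space fault models may include a storage die fault model, a channel fault model, a bank fault model, a row fault model, a storage unit fault model, and so on. The control logic can then analyze the at least one fault using the trained models for different fault ranges to determine the fault range of the memory.
S303: Repair the memory according to the fault repair strategy that corresponds to the fault range among the preset fault repair strategies.
The preset fault repair strategies may be configured in advance and may include fault repair strategies for multiple different fault ranges. The fault repair strategy for each fault range may include one or more strategies.
Optionally, when the fault range is a bus fault, the fault repair strategy corresponding to the bus fault may include at least one of the following: switching the faulty bus to a redundant bus, switching the channel corresponding to the faulty bus to an unused channel, switching the storage die corresponding to the faulty bus to an unused storage die, or reducing the bit width of the bus in use.
The bus may include a data bus, an address bus, and a control bus. Specifically, when the fault range is a data bus fault, the corresponding fault repair strategy may include at least one of the following: switching the faulty data bus to a redundant data bus, switching the channel corresponding to the faulty data bus to an unused channel, switching the storage die corresponding to the faulty data bus to an unused storage die, or reducing the bit width of the data bus in use.
It should be noted that the fault repair strategies for address bus faults and control bus faults are similar to the strategy for data bus faults described above. For details, see the description of the strategy for data bus faults, which is not repeated here.
Optionally, when the fault range is a storage space fault at one of the different storage granularities, the fault repair strategy corresponding to the storage space fault includes one of the following: mapping the faulty storage space to redundant storage space, or mapping the faulty storage space to unused storage space.
Storage space faults at different storage granularities may include storage die faults, channel faults, bank faults, row faults, storage unit faults, and so on. Specifically, when the fault range is a bank fault, the corresponding fault repair strategy may include at least one of the following: mapping the faulty bank to a redundant bank, or mapping the faulty bank to an unused bank.
It should be noted that the fault repair strategies for storage space faults at granularities other than the bank are similar to the strategy for bank faults described above. For details, see the description of the strategy for bank faults, which is not repeated here.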
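The bank-level mapping strategy above can be pictured as a small remap table consulted on every access. The sketch below is hypothetical: the class name `BankRemapper`, the integer bank ids, and the choice to consume redundant banks before unused ones are all assumptions; the application only states that a faulty bank may be mapped to either kind of replacement.

```python
class BankRemapper:
    """Hypothetical remap table that redirects accesses from faulty banks to replacements."""

    def __init__(self, redundant, unused):
        self.redundant = list(redundant)   # spare banks reserved for repair
        self.unused = list(unused)         # ordinary banks that currently hold no data
        self.remap = {}                    # faulty bank id -> replacement bank id

    def repair(self, failed_bank):
        """Pick a replacement for failed_bank and record the mapping."""
        if failed_bank in self.remap:
            return self.remap[failed_bank]
        if self.redundant:                 # assumption: prefer redundant banks
            target = self.redundant.pop(0)
        elif self.unused:
            target = self.unused.pop(0)
        else:
            raise RuntimeError("no replacement bank available")
        self.remap[failed_bank] = target
        return target

    def translate(self, bank):
        """Return the bank that should actually serve an access to `bank`."""
        return self.remap.get(bank, bank)
```

The same table shape would apply to rows, blocks, or channels by swapping the id type, which matches the statement that other granularities follow a similar strategy.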
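Further, as shown in FIG. 5, when a fault range has occurred in the memory and the stored data related to that fault range has not been corrupted, the method may further include S304.
S304: Migrate the stored data in the memory that is related to the fault range.
Specifically, when the fault range is a bus fault and the corresponding fault repair strategy is to switch the channel corresponding to the faulty bus to an unused channel, migrating the stored data related to the fault range may specifically be: migrating the data in the channel corresponding to the faulty bus to the unused channel. When the fault range is a bus fault and the corresponding fault repair strategy is to switch the storage die corresponding to the faulty bus to an unused storage die, migrating the stored data related to the fault range may specifically be: migrating the data in the storage die corresponding to the faulty bus to the unused storage die.
Specifically, when the fault range is a storage space fault and the corresponding fault repair strategy is to map the faulty storage space to redundant storage space, migrating the stored data related to the fault range may specifically be: migrating the data in the faulty storage space to the redundant storage space. When the fault range is a storage space fault and the corresponding fault repair strategy is to map the faulty storage space to unused storage space, migrating the stored data related to the fault range may specifically be: migrating the data in the faulty storage space to the unused storage space.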
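The four migration cases in S304 share one shape: a (fault range, strategy) pair decides whether data moves from a source space to a target space. A minimal Python sketch follows; the string labels, the dictionary-of-lists storage model, and the function name `migrate_if_needed` are hypothetical, and the caller is assumed to have already confirmed that the source data is not corrupted.

```python
# (fault range, repair strategy) pairs for which the text describes a data migration.
MIGRATING_STRATEGIES = {
    ("bus", "switch_channel"),
    ("bus", "switch_die"),
    ("storage_space", "map_to_redundant"),
    ("storage_space", "map_to_unused"),
}


def migrate_if_needed(fault_range, strategy, storage, source, target):
    """Copy the contents of `source` into `target` when the strategy calls for migration.

    `storage` is a hypothetical dict mapping a space id (channel, die, bank, ...)
    to the list of words it holds. Returns True if data was migrated.
    """
    if (fault_range, strategy) not in MIGRATING_STRATEGIES:
        # For example, switching to a redundant bus or reducing the bus bit width
        # leaves the data where it is.
        return False
    storage[target] = list(storage[source])
    return True
```

For instance, with `storage = {"ch0": [1, 2, 3], "ch5": [0, 0, 0]}`, calling `migrate_if_needed("bus", "switch_channel", storage, "ch0", "ch5")` copies the three words into `ch5`, whereas the `"reduce_bit_width"` strategy leaves `storage` unchanged.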
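For ease of understanding, the following uses the storage system shown in FIG. 6 as an example to illustrate the technical solutions provided by the embodiments of this application. In FIG. 6, the storage system includes: a storage test module 401, a storage verification module 402, a fault analysis module 403, a fault processing module 404, a memory 405, and a replacement resource storage module 406.
The storage test module 401 can test the storage units in the memory 405 according to a fault detection algorithm; for example, it can test them according to the Checkerboard, Gallop, March, or butterfly algorithm to obtain at least one fault. The storage verification module 402 can determine at least one fault according to the data check information of the read and write operations on the storage units in the memory 405; for example, during normal reads and writes of the memory 405, it can generate data check bits from the written data during a write operation and, during a read operation, check the read data against the check bits generated at write time, thereby obtaining the data check information. The fault analysis module 403 can collect the at least one fault determined by the storage test module 401 and/or the storage verification module 402 and analyze it based on different storage granularities such as storage units, rows, blocks, and banks to determine the fault range of the memory, for example by classifying and counting the faults. The fault processing module 404 can repair the memory 405 according to the fault repair strategy that corresponds to the fault range among the preset fault repair strategies; for example, the fault analysis module 403 can generate operation instructions for the corresponding fault repair strategy and send them to the fault processing module 404, which repairs the memory 405 according to the received instructions. The replacement resource storage module 406 may include storage spaces of different storage granularities, for example redundant storage units, redundant rows, redundant blocks, and redundant banks. When the fault repair strategy corresponding to the fault range is to replace faulty storage space with redundant storage space, the redundant storage space in the replacement resource storage module 406 can be used to repair the faulty storage space in the memory 405, for example using a redundant block to repair a faulty block in the memory 405.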
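In the embodiments of this application, the control logic detects faults in the memory, performs analysis operations such as classification and counting on the detected faults to obtain the fault range, and then repairs the memory according to the fault repair strategy that corresponds to that fault range among the preset fault repair strategies. As a result, when the memory fails, faults of various granularities or fault ranges can be repaired promptly and effectively to keep the memory usable, which improves the correctness and reliability of the memory and extends its service life.
The foregoing mainly describes the solutions provided by the embodiments of this application from the perspective of the storage system. It can be understood that, to implement the foregoing functions, the storage system includes corresponding hardware structures and/or software modules for performing each function. A person skilled in the art should readily appreciate that, in combination with the network elements and algorithm steps of the examples described in the embodiments disclosed herein, this application can be implemented in hardware or in a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the particular application and the design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each particular application, but such implementation should not be considered beyond the scope of this application.
In the embodiments of this application, the storage system may be divided into functional modules according to the foregoing method examples. For example, each functional module may correspond to one function, or two or more functions may be integrated into one processing module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. It should be noted that the division into modules in the embodiments of this application is illustrative and is only a division by logical function; other divisions are possible in an actual implementation.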
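FIG. 7 shows a memory fault repair apparatus provided by an embodiment of this application. The apparatus includes a detection unit 501, an analysis unit 502, and a repair unit 503. The detection unit 501 is configured to detect the storage units in the memory to obtain at least one fault; the analysis unit 502 is configured to analyze the at least one fault based on different storage granularities to determine the fault range of the memory; and the repair unit 503 is configured to repair the memory according to the fault repair strategy that corresponds to the fault range among the preset fault repair strategies.
The at least one fault includes at least one of the following types of faults: stuck-at fault (SAF), transition fault (TF), coupling fault (CF), neighborhood pattern sensitive fault (NPSF), or address decoder fault (AF). The different storage granularities include at least two of the following: storage die, channel, bank, plane, super block, block, sub-block, row, column, page, or storage unit. The fault range includes at least one of the following: a bus fault, or a storage space fault at one of the different storage granularities.
Optionally, the memory may be a high-bandwidth memory.
In a possible implementation, the detection unit 501 is specifically configured to generate at least one group of read and write operations according to a fault detection algorithm, and to read and write the storage units in the memory based on the at least one group of read and write operations to obtain at least one fault. And/or, the detection unit 501 is further specifically configured to determine at least one fault according to the data check information of the read and write operations on the storage units in the memory.
In another possible implementation, when the fault range is a bus fault, the corresponding fault repair strategy includes one of the following: switching the faulty bus to a redundant bus, switching the channel corresponding to the faulty bus to an unused channel, switching the storage die corresponding to the faulty bus to an unused storage die, or reducing the bit width of the bus in use. When the fault range is a storage space fault, the corresponding fault repair strategy includes one of the following: mapping the faulty storage space to redundant storage space, or mapping the faulty storage space to unused storage space.
Further, as shown in FIG. 7, the apparatus also includes a migration unit 504.
The migration unit 504 is configured to: when the fault range is a bus fault and the corresponding fault repair strategy is to switch the channel corresponding to the faulty bus to an unused channel, migrate the data in the channel corresponding to the faulty bus to the unused channel; when the fault range is a bus fault and the corresponding fault repair strategy is to switch the storage die corresponding to the faulty bus to an unused storage die, migrate the data in the storage die corresponding to the faulty bus to the unused storage die; when the fault range is a storage space fault and the corresponding fault repair strategy is to map the faulty storage space to redundant storage space, migrate the data in the faulty storage space to the redundant storage space; and when the fault range is a storage space fault and the corresponding fault repair strategy is to map the faulty storage space to unused storage space, migrate the data in the faulty storage space to the unused storage space.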
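FIG. 8 is a schematic structural diagram of an electronic device provided by an embodiment of this application. The electronic device includes a processor 601, a storage system 602, a communication interface 603, and a bus 604. The processor 601, the storage system 602, and the communication interface 603 are connected through the bus 604. The storage system 602 includes control logic and a memory, and the control logic is used to support the electronic device in performing the memory fault repair method provided above. Optionally, the control logic may be integrated with the processor 601 or with the memory; FIG. 8 uses integration of the control logic with the memory as an example.
The processor 601 may be a central processing unit, a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field programmable gate array or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It can implement or execute the various exemplary logical blocks, modules, and circuits described in connection with the disclosure of this application. The processor may also be a combination that implements computing functions, for example a combination of one or more microprocessors, or a combination of a digital signal processor and a microprocessor. The bus 604 in FIG. 8 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The bus can be divided into an address bus, a data bus, a control bus, and so on. For ease of presentation, FIG. 8 shows only one thick line, but this does not mean that there is only one bus or one type of bus.
In the embodiments of this application, faults in the memory are detected, and analysis operations such as classification and counting are performed on the detected faults to obtain the fault range; the memory is then repaired according to the fault repair strategy that corresponds to that fault range among the preset fault repair strategies. As a result, when the memory fails, faults of various granularities or fault ranges can be repaired promptly and effectively to keep the memory usable, which improves the correctness and reliability of the memory and extends its service life.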
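In another embodiment of this application, a computer-readable storage medium is also provided. The computer-readable storage medium stores computer-executable instructions, and when at least one processor of a device executes the computer-executable instructions, the device performs the memory fault repair method provided above.
In another embodiment of this application, a computer program product is also provided. The computer program product includes computer-executable instructions stored in a computer-readable storage medium; at least one processor of a device can read the computer-executable instructions from the computer-readable storage medium, and executing them causes the device to implement the memory fault repair method provided above.
A person of ordinary skill in the art may be aware that the units and algorithm steps of the examples described in the embodiments disclosed herein can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the particular application and the design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each particular application, but such implementation should not be considered beyond the scope of the embodiments of this application.
A person skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above may be found in the corresponding processes of the foregoing method embodiments and are not repeated here.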
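In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. For example, the division of units is only a division by logical function, and other divisions are possible in an actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of this application, in essence, or the part that contributes to the prior art, or a part of the technical solutions, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods in the embodiments of this application. The aforementioned storage media include various media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.
Finally, it should be noted that the foregoing is merely specific implementations of this application, but the protection scope of this application is not limited thereto. Any variation or replacement within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.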

Claims (20)

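  1. A memory fault repair method, characterized in that it is applied to a memory and includes:
    detecting storage units in the memory to obtain at least one fault;
    analyzing the at least one fault based on different storage granularities to determine a fault range of the memory;
    repairing the memory according to the fault repair strategy that corresponds to the fault range among preset fault repair strategies.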
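  2. The method according to claim 1, characterized in that detecting the storage units in the memory to obtain at least one fault includes:
    generating at least one group of read and write operations according to a fault detection algorithm;
    reading and writing the storage units in the memory based on the at least one group of read and write operations to obtain at least one fault.
  3. The method according to claim 1 or 2, characterized in that detecting the storage units in the memory to obtain at least one fault includes:
    determining at least one fault according to the data check information of the read and write operations on the storage units in the memory.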
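  4. The method according to any one of claims 1 to 3, characterized in that the at least one fault includes at least one of the following types of faults: stuck-at fault (SAF), transition fault (TF), coupling fault (CF), neighborhood pattern sensitive fault (NPSF), or address decoder fault (AF).
  5. The method according to any one of claims 1 to 4, characterized in that the different storage granularities include at least two of the following: storage die, channel, bank, plane, super block, block, sub-block, row, column, page, or storage unit.
  6. The method according to any one of claims 1 to 5, characterized in that the fault range includes at least one of the following: a bus fault, or a storage space fault at one of the different storage granularities.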
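  7. The method according to claim 6, characterized in that when the fault range is a bus fault, the fault repair strategy corresponding to the fault range includes one of the following: switching the faulty bus to a redundant bus, switching the channel corresponding to the faulty bus to an unused channel, switching the storage die corresponding to the faulty bus to an unused storage die, or reducing the bit width of the bus in use;
    when the fault range is a storage space fault, the fault repair strategy corresponding to the fault range includes one of the following: mapping the faulty storage space to redundant storage space, or mapping the faulty storage space to unused storage space.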
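  8. The method according to claim 7, characterized in that the method further includes:
    when the fault range is a bus fault and the fault repair strategy corresponding to the fault range is the switching of the channel corresponding to the faulty bus to an unused channel, migrating the data in the channel corresponding to the faulty bus to the unused channel;
    when the fault range is a bus fault and the fault repair strategy corresponding to the fault range is the switching of the storage die corresponding to the faulty bus to an unused storage die, migrating the data in the storage die corresponding to the faulty bus to the unused storage die;
    when the fault range is the storage space fault and the fault repair strategy corresponding to the fault range is the mapping of the faulty storage space to redundant storage space, migrating the data in the faulty storage space to the redundant storage space;
    when the fault range is the storage space fault and the fault repair strategy corresponding to the fault range is the mapping of the faulty storage space to unused storage space, migrating the data in the faulty storage space to the unused storage space.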
  9. The method according to any one of claims 1 to 8, characterized in that the memory is a high-bandwidth memory.
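  10. A memory fault repair apparatus, characterized in that it includes:
    a detection unit, configured to detect storage units in the memory to obtain at least one fault;
    an analysis unit, configured to analyze the at least one fault based on different storage granularities to determine a fault range of the memory;
    a repair unit, configured to repair the memory according to the fault repair strategy that corresponds to the fault range among preset fault repair strategies.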
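  11. The apparatus according to claim 10, characterized in that the detection unit is specifically configured to:
    generate at least one group of read and write operations according to a fault detection algorithm;
    read and write the storage units in the memory based on the at least one group of read and write operations to obtain at least one fault.
  12. The apparatus according to claim 10 or 11, wherein the detection unit is further specifically configured to:
    determine at least one fault according to the data check information of the read and write operations on the storage units in the memory.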
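  13. The apparatus according to any one of claims 10 to 12, characterized in that the at least one fault includes at least one of the following types of faults: stuck-at fault (SAF), transition fault (TF), coupling fault (CF), neighborhood pattern sensitive fault (NPSF), or address decoder fault (AF).
  14. The apparatus according to any one of claims 10 to 13, characterized in that the different storage granularities include at least two of the following: storage die, channel, bank, plane, super block, block, sub-block, row, column, page, or storage unit.
  15. The apparatus according to any one of claims 10 to 14, characterized in that the fault range includes at least one of the following: a bus fault, or a storage space fault at one of the different storage granularities.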
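  16. The apparatus according to claim 15, characterized in that when the fault range is a bus fault, the fault repair strategy corresponding to the fault range includes one of the following: switching the faulty bus to a redundant bus, switching the channel corresponding to the faulty bus to an unused channel, switching the storage die corresponding to the faulty bus to an unused storage die, or reducing the bit width of the bus in use;
    when the fault range is a storage space fault, the fault repair strategy corresponding to the fault range includes one of the following: mapping the faulty storage space to redundant storage space, or mapping the faulty storage space to unused storage space.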
  17. 根据权利要求16所述的装置,其特征在于,所述装置还包括迁移单元,用于:
    当所述故障范围为总线故障、且所述故障范围对应的故障修复策略为所述将故障总线对应的通道切换至未使用的通道时,将所述故障总线对应的通道中的数据迁移至所述未使用的通道;
    当所述故障范围为总线故障、且所述故障范围对应的故障修复策略为所述将故障总线对应的存储裸片切换至未使用的存储裸片时,将所述故障总线对应的存储裸片中的数据迁移至所述未使用的存储裸片;
    当所述故障范围为所述存储空间故障、且所述故障范围对应的故障修复策略为所述将故障存储空间映射至冗余存储空间时,将所述故障存储空间中的数据迁移至所述冗余存储空间;
    当所述故障范围为所述存储空间故障、且所述故障范围对应的故障修复策略为所述将故障存储空间映射至未使用的存储空间时,将所述故障存储空间中的数据迁移至所述未使用的存储空间。
  18. 根据权利要求10-17任一项所述的装置,其特征在于,所述存储器为高带宽存储器。
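  19. A computer-readable storage medium, wherein the computer-readable storage medium stores instructions, and when the instructions are run on a device, the device is caused to perform the memory fault repair method according to any one of claims 1 to 9.
  20. A computer program product, wherein when the computer program product is run on a device, the device is caused to perform the memory fault repair method according to any one of claims 1 to 9.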
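
As a non-limiting illustration only, the detect, analyze, and repair steps recited in claims 1, 2, 5, and 7 might be sketched as follows. Everything in this sketch is an assumption made for illustration and is not taken from the application: the `ToyMemory` model, the `(channel, bank, row, column)` address tuple, the array sizes, the simple write/read-back pattern test standing in for the fault detection algorithm, the rule that promotes a fault to row or bank scope only when every cell of that unit is faulty, and the contents of `REPAIR_POLICY`.

```python
from collections import Counter

# Hypothetical geometry, chosen only for this illustration.
BANKS, ROWS_PER_BANK, COLS_PER_ROW = 2, 4, 8


class ToyMemory:
    """Simulated memory in which some cells are stuck at a fixed value (SAF)."""

    def __init__(self, stuck_cells):
        self.cells = {}
        self.stuck_cells = dict(stuck_cells)  # address -> stuck value

    def write(self, addr, value):
        self.cells[addr] = value

    def read(self, addr):
        # A stuck cell ignores whatever was written to it.
        return self.stuck_cells.get(addr, self.cells.get(addr, 0))


def detect(memory, addresses):
    """Write each pattern to every cell, read it back, and record mismatches."""
    faults = set()
    for pattern in (0, 1):
        for addr in addresses:
            memory.write(addr, pattern)
        for addr in addresses:
            if memory.read(addr) != pattern:
                faults.add(addr)
    return faults


def analyze(faults):
    """Promote cell faults to row or bank scope when the coarser unit is fully faulty."""
    per_row = Counter(addr[:3] for addr in faults)
    bad_rows = {row for row, n in per_row.items() if n == COLS_PER_ROW}
    per_bank = Counter(row[:2] for row in bad_rows)
    bad_banks = {bank for bank, n in per_bank.items() if n == ROWS_PER_BANK}

    scopes = [("bank", bank) for bank in sorted(bad_banks)]
    scopes += [("row", row) for row in sorted(bad_rows) if row[:2] not in bad_banks]
    scopes += [("cell", addr) for addr in sorted(faults) if addr[:3] not in bad_rows]
    return scopes


# Preset strategy table (an assumed mapping from fault scope to remapping target).
REPAIR_POLICY = {
    "bank": "map to unused storage space",
    "row": "map to redundant storage space",
    "cell": "map to redundant storage space",
}


def plan_repairs(scopes):
    """Look up the preset strategy for each fault scope."""
    return [(gran, loc, REPAIR_POLICY[gran]) for gran, loc in scopes]


addresses = [(0, b, r, c)
             for b in range(BANKS)
             for r in range(ROWS_PER_BANK)
             for c in range(COLS_PER_ROW)]
stuck = {(0, 0, 1, c): 0 for c in range(COLS_PER_ROW)}  # an entire row stuck at 0
stuck[(0, 1, 2, 3)] = 1                                  # one isolated cell stuck at 1

faults = detect(ToyMemory(stuck), addresses)
plan = plan_repairs(analyze(faults))
```

In this sketch the eight faulty cells of one row are reported as a single row-scope fault, while the isolated cell stays at cell scope, so each fault range is matched to a repair at its own granularity rather than cell by cell.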
PCT/CN2020/074986 2020-02-13 2020-02-13 Memory fault repair method and apparatus WO2021159360A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2020/074986 WO2021159360A1 (zh) 2020-02-13 2020-02-13 Memory fault repair method and apparatus
CN202080078352.2A CN114730607A (zh) 2020-02-13 2020-02-13 Memory fault repair method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/074986 WO2021159360A1 (zh) 2020-02-13 2020-02-13 Memory fault repair method and apparatus

Publications (1)

Publication Number Publication Date
WO2021159360A1 (zh) 2021-08-19

Family

ID=77291867

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/074986 WO2021159360A1 (zh) 2020-02-13 2020-02-13 Memory fault repair method and apparatus

Country Status (2)

Country Link
CN (1) CN114730607A (zh)
WO (1) WO2021159360A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
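CN113889176A (zh) * 2021-09-29 2022-01-04 深圳市金泰克半导体有限公司 Test method, apparatus, device, and storage medium for storage cells of a DDR chip
CN115168087A (zh) * 2022-07-08 2022-10-11 超聚变数字技术有限公司 Method and apparatus for determining repair resource granularity for memory faults
CN118351926A (zh) * 2024-06-18 2024-07-16 深圳超盈智能科技有限公司 Fault testing device and method for memory chips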

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117672328B (zh) * 2024-02-02 2024-04-09 深圳市奥斯珂科技有限公司 Data recovery method, apparatus, device, and storage medium for solid-state drives

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6574757B1 (en) * 2000-01-28 2003-06-03 Samsung Electronics Co., Ltd. Integrated circuit semiconductor device having built-in self-repair circuit for embedded memory and method for repairing the memory
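CN101329918A (zh) * 2008-07-30 2008-12-24 中国科学院计算技术研究所 Memory built-in self-repair system and self-repair method
CN106782666A (zh) * 2015-11-25 2017-05-31 北京大学深圳研究生院 Three-dimensional stacked memory
CN107704333A (zh) * 2017-10-11 2018-02-16 郑州云海信息技术有限公司 Fault saving method and apparatus for a SAN storage system, and readable storage medium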

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
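CN113889176A (zh) * 2021-09-29 2022-01-04 深圳市金泰克半导体有限公司 Test method, apparatus, device, and storage medium for storage cells of a DDR chip
CN115168087A (zh) * 2022-07-08 2022-10-11 超聚变数字技术有限公司 Method and apparatus for determining repair resource granularity for memory faults
CN115168087B (zh) * 2022-07-08 2024-03-19 超聚变数字技术有限公司 Method and apparatus for determining repair resource granularity for memory faults
CN118351926A (zh) * 2024-06-18 2024-07-16 深圳超盈智能科技有限公司 Fault testing device and method for memory chips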

Also Published As

Publication number Publication date
CN114730607A (zh) 2022-07-08

Similar Documents

Publication Publication Date Title
WO2021159360A1 (zh) Memory fault repair method and apparatus
KR102001800B1 (ko) Regeneration circuit
Agarwal et al. A process-tolerant cache architecture for improved yield in nanoscale technologies
US11119857B2 (en) Substitute redundant memory
US8566669B2 (en) Memory system and method for generating and transferring parity information
CN101477480B (zh) 内存控制方法、装置及内存读写系统
KR20200074884A (ko) Memory testing techniques
US11409601B1 (en) Memory device protection
Zhang et al. Dynamic partitioning to mitigate stuck-at faults in emerging memories
Jeong et al. PAIR: Pin-aligned In-DRAM ECC architecture using expandability of Reed-Solomon code
US9891976B2 (en) Error detection circuitry for use with memory
NL2029789B1 (en) Adaptive error correction to improve for system memory reliability, availability, and serviceability (ras)
Patel Enabling Effective Error Mitigation in Memory Chips That Use On-Die Error-Correcting Codes
US11929136B2 (en) Reference bits test and repair using memory built-in self-test
Li et al. An error detection and correction scheme for RAMs with partial-write function
JP2011238330A (ja) Test circuit and semiconductor memory device using the same
US12032443B2 (en) Shadow DRAM with CRC+RAID architecture, system and method for high RAS feature in a CXL drive
CN117079686A (zh) Memory
WO2023142429A1 (zh) Method for predicting uncorrectable errors in a volatile storage medium, and related device
Jung et al. Predicting Future-System Reliability with a Component-Level DRAM Fault Model
CN105027084B (zh) Apparatus and method for controlling memory in a mobile communication system
Lee et al. ECMO: ECC Architecture Reusing Content-Addressable Memories for Obtaining High Reliability in DRAM
CN112102875B (zh) LPDDR test method and apparatus, readable storage medium, and electronic device
Agarwal et al. Process variation in nano-scale memories: failure analysis and process tolerant architecture
US20240086090A1 (en) Memory channel disablement

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20918480

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20918480

Country of ref document: EP

Kind code of ref document: A1