WO2024007765A1 - Method and apparatus for determining the granularity of repair resources for memory faults - Google Patents

Method and apparatus for determining the granularity of repair resources for memory faults

Info

Publication number
WO2024007765A1
Authority
WO
WIPO (PCT)
Prior art keywords
repair
fault
information
resource
memory
Prior art date
Application number
PCT/CN2023/096640
Other languages
English (en)
French (fr)
Inventor
张光彪
鲍全洋
曹瑞
韦炜玮
Original Assignee
超聚变数字技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 超聚变数字技术有限公司 filed Critical 超聚变数字技术有限公司
Publication of WO2024007765A1 publication Critical patent/WO2024007765A1/zh

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/073Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a memory management context, e.g. virtual memory or cache management
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079Root cause analysis, i.e. error or fault diagnosis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793Remedial or corrective actions

Definitions

  • the present application relates to the field of server technology, and in particular, to a method and device for determining the granularity of repair resources for memory faults.
  • DRAM Dynamic random access memory
  • Embodiments of the present application provide a memory fault repair method.
  • small-granularity self-healing resources that have almost no impact on performance are prioritized for fault repair. This avoids the system performance degradation and noticeable service impact caused by directly using large-granularity self-healing resources, and also makes full use of small-granularity resources.
  • embodiments of the present application provide a method for determining the granularity of repair resources for memory faults.
  • the method is executed by a processing module.
  • the method includes: obtaining fault information of the memory, where the fault information includes the type of fault; determining the fault type; obtaining repair resource information, which includes the types of repair resources in the repair resource set, where the repair resources can repair the fault; and determining repair information based on the fault type and the repair resource information, where the repair information includes at least one of the type of the first repair resource or the granularity of the first repair resource, and the first repair resource is the repair resource with the smallest granularity in the repair resource set.
  • the method for determining the granularity of repair resources for memory faults provided by the embodiment of the present application prioritizes the repair resource with the smallest granularity among the repair resources to repair the fault, so that fault repair has less impact on system performance and repair resources are fully used.
  • the above repair resource information also includes the number of repair resources included in the repair resource set, and then the repair information also includes the number of first repair resources.
  • the repair resource information not only includes the type of repair resource, but also includes the quantity of each type of repair resource.
  • the repair information includes the first repair resource, which is the smallest granular repair resource, and the number of the first repair resource. That is, the repair information indicates the type of repair resources used and the number of repair resources of this type for the type of fault that occurs in the memory, so as to repair the memory fault.
  • the method for determining the granularity of repair resources for a memory fault also includes: when the number of first repair resources does not meet the repair resource requirements of the fault, the repair information also includes at least one of the type of the second repair resource or the granularity of the second repair resource, where the second repair resource is the repair resource with the second smallest granularity in the repair resource set.
  • when the small-granularity repair resources are exhausted, the repair resources with the next smallest granularity are used to repair the memory fault, and so on.
  • by preferentially using small-granularity self-healing resources that have almost no impact on performance, this avoids the system performance degradation and noticeable service impact caused by directly using large-granularity self-healing resources, and also makes full use of small-granularity resources.
  • the repair information also includes the number of the second repair resources.
  • the fault information includes the first type and quantity of faults
  • the processing module determines the type of fault, including: the processing module determines the second type of fault according to the first type and quantity of faults. For example, if 16 (number) Cell (first type) faults occur at different locations in a bank of memory, the type of the fault is determined to be a Bank (second type) fault.
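The escalation from a first fault type to a second fault type can be sketched as follows. This is a minimal illustration, not the patented implementation: the `CellFault` record, the `classify` function, and the use of 16 as a fixed threshold are assumptions (the figure 16 appears in the text only as an example).

```python
from dataclasses import dataclass

# Assumption: 16 distinct faulty cells in one bank escalate to a Bank fault.
# The text gives 16 only as an example, not as a fixed rule.
BANK_ESCALATION_THRESHOLD = 16


@dataclass(frozen=True)
class CellFault:
    """Hypothetical record for one Cell fault (first fault type)."""
    bank: int
    row: int
    col: int


def classify(faults):
    """Return {bank: second fault type} for a list of CellFault records."""
    distinct_cells = {}
    for fault in faults:
        # Count each faulty location once, even if it is reported repeatedly.
        distinct_cells.setdefault(fault.bank, set()).add((fault.row, fault.col))
    return {
        bank: "Bank" if len(cells) >= BANK_ESCALATION_THRESHOLD else "Cell"
        for bank, cells in distinct_cells.items()
    }


# Bank 0 has 16 Cell faults at different locations; bank 1 has a single one.
faults = [CellFault(bank=0, row=r, col=r) for r in range(16)]
faults.append(CellFault(bank=1, row=0, col=0))
result = classify(faults)
```

Here bank 0 is classified as a Bank fault and bank 1 remains a Cell fault.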
  • the processing module is a processor or BMC chip.
  • embodiments of the present application provide a memory fault repair method, which includes: a processing module obtains memory fault information, where the fault information includes the fault type; the processing module obtains first repair resource information, where the first repair resource information includes the type of the first repair resource, and the first repair resource can repair the fault; the processing module determines first repair information according to the fault information and the first repair resource information, where the first repair information is used to instruct the BIOS to use the second repair resource to repair the fault, and the second repair resource is the repair resource with the smallest granularity among the first repair resources; the BIOS obtains the first repair information and then repairs the fault based on it.
  • by preferentially using small-granularity self-healing resources that have almost no impact on performance, the memory fault repair method provided by the embodiments of the present application avoids the system performance degradation and noticeable service impact caused by directly using large-granularity self-healing resources, and also makes full use of fine-grained resources.
  • the fault information includes the first type and number of faults
  • the processing module determining the first repair information based on the fault information and the first repair resource information includes: the processing module determining the second type of the fault according to the first type and number of faults.
  • for example, if the first type is Cell fault and a number of such faults occur in one Bank, the fault type is determined to be a Bank fault.
  • the first repair information includes the type of the second repair resource.
  • for example, the first repair information includes the type of the second repair resource, which is a PCLS type repair resource, instructing the BIOS to use PCLS type repair resources to repair the fault. In other words, the first repair information only indicates the type of the second repair resource, so that the BIOS uses this type of repair resource until the fault is repaired.
  • the first repair information also includes the quantity of the second repair resource; that is, the first repair information includes both the type and the quantity of the second repair resource.
  • for example, the type is PCLS and the quantity is 16, which instructs the BIOS to use 16 PCLS type repair resources to repair the fault.
  • the method further includes: when the number of the second repair resources does not meet the repair resource requirements of the fault, determining the second repair information, and the second repair information is used to instruct the BIOS to use the third repair resource to repair the fault.
  • the third repair resource is the repair resource with the second smallest granularity among the second repair resources. That is to say, when the repair resources with the smallest granularity cannot meet the repair resource requirements of the fault, the repair resources with the next smallest granularity are used to continue to repair the fault, and so on, until the fault is repaired.
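The fallback from the smallest granularity to the next smallest can be sketched as below. This is one possible reading under stated assumptions: `build_repair_info`, the `needed` dictionary (units required if a given type alone repaired the fault), and the rule of consuming whatever is available before falling back are all hypothetical; the text does not specify how quantities of different resource types combine.

```python
# Ascending granularity, as stated in the text.
GRANULARITY_ORDER = ["PCLS", "PPR", "ADDDC Sparing", "Device Sparing", "Rank Sparing"]


def build_repair_info(needed, available):
    """Return [(resource_type, count), ...] in ascending granularity.

    needed:    {type: units required if this type alone repaired the fault}
               (hypothetical input; only types able to repair the fault appear)
    available: {type: units currently available}
    """
    info = []
    for rtype in GRANULARITY_ORDER:
        need = needed.get(rtype)
        have = available.get(rtype, 0)
        if need is None or have == 0:
            continue  # this type cannot repair the fault, or is exhausted
        info.append((rtype, min(need, have)))
        if have >= need:
            break  # requirement met, no coarser resource is added
    return info


needed = {"PCLS": 16, "ADDDC Sparing": 1}
enough = build_repair_info(needed, {"PCLS": 16, "ADDDC Sparing": 2})
short = build_repair_info(needed, {"PCLS": 10, "ADDDC Sparing": 2})
```

When 16 PCLS resources are available, only PCLS is selected; when only 10 are available, the result also names ADDDC Sparing as the next candidate.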
  • the fault information also includes the memory address of the fault; the BIOS repairing the fault based on the first repair information includes: performing fault repair on the memory corresponding to the memory address based on the first repair information, so as to repair the fault.
  • the types of faults include one or more of Cell fault, Row fault, Col fault, Bank fault, Device fault, and Rank fault.
  • the repair resources include at least two of PCLS repair resources, PPR repair resources, ADDDC Sparing repair resources, Device Sparing repair resources, and Rank Sparing repair resources, where the granularity of PCLS repair resources is smaller than that of PPR repair resources, the granularity of PPR repair resources is smaller than that of ADDDC Sparing repair resources, the granularity of ADDDC Sparing repair resources is smaller than that of Device Sparing repair resources, and the granularity of Device Sparing repair resources is smaller than that of Rank Sparing repair resources.
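The granularity ordering above lends itself to a simple ranked lookup. The following is a minimal sketch; the function name `smallest_granularity` and the representation of the repair resource set as a set of strings are assumptions made for illustration.

```python
# Ordered from smallest to largest granularity, as stated in the text.
GRANULARITY_ORDER = ["PCLS", "PPR", "ADDDC Sparing", "Device Sparing", "Rank Sparing"]
_RANK = {name: index for index, name in enumerate(GRANULARITY_ORDER)}


def smallest_granularity(repair_resource_set):
    """Return the repair resource type with the smallest granularity in the set."""
    return min(repair_resource_set, key=_RANK.__getitem__)


first = smallest_granularity({"ADDDC Sparing", "PCLS", "PPR"})
fallback = smallest_granularity({"Rank Sparing", "Device Sparing"})
```

Given PCLS, PPR, and ADDDC Sparing, the first repair resource is PCLS; given only Device Sparing and Rank Sparing, it is Device Sparing.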
  • the processing module is the server's processor, which means that the steps performed by the processing module can be implemented in the processor; or the processing module is the server's out-of-band management single board management controller (BaseBoard Management Controller, BMC). That is, the steps performed by the processing module can be implemented in the BMC.
  • embodiments of the present application provide a device for determining the granularity of repair resources for memory faults, including a processing module, where the processing module includes:
  • the first acquisition unit is used to acquire fault information of the memory, where the fault information includes the type of fault;
  • a first determination unit used to determine the type of the fault
  • a second acquisition unit configured to acquire repair resource information, where the repair resource information includes the types of repair resources included in the repair resource set and the repair resources can repair the fault;
  • a second determination unit configured to determine repair information according to the fault type and the repair resource information, where the repair information includes at least one of the type of the first repair resource or the granularity of the first repair resource, and the first repair resource is the repair resource with the smallest granularity in the repair resource set.
  • the repair resource information further includes the number of repair resources included in the repair resource set, then the repair information further includes the number of the first repair resources.
  • the second determination unit is further configured so that, when the number of the first repair resources does not meet the repair resource requirements of the fault, the repair information further includes at least one of the type of the second repair resource or the granularity of the second repair resource, where the second repair resource is the repair resource with the second smallest granularity in the repair resource set.
  • the repair information further includes the number of the second repair resources.
  • the fault information includes a first type and quantity of faults
  • the first determining unit is specifically configured to:
  • a second type of fault is determined.
  • the processing module is a processor or BMC chip.
  • embodiments of the present application provide a memory fault repair device, which includes: a processing module and a BIOS module;
  • processing module includes:
  • a first acquisition unit configured to acquire fault information of the memory, where the fault information includes the type of fault
  • the first repair resource information includes the type of the first repair resource, and the first repair resource can repair the fault
  • Determining unit configured to determine first repair information according to the fault information and the first repair resource information, the first repair information is used to instruct the BIOS to use the second repair resource to repair the fault, the The second repair resource is the repair resource with the smallest granularity among the first repair resources;
  • the BIOS module includes:
  • a second acquisition unit configured to acquire the first repair information
  • a repair unit configured to repair the fault based on the first repair information.
  • the fault information includes a first type and number of faults;
  • the determining unit is further configured to determine a second type of fault based on the first type and quantity of the fault.
  • the first repair information includes the type of the second repair resource.
  • the first repair information further includes the number of the second repair resources.
  • the determining unit is also used to:
  • second repair information is determined.
  • the second repair information is used to instruct the BIOS to use the third repair resource to repair the fault.
  • the third repair resource is used to repair the fault.
  • the third repair resource is the repair resource with the second smallest granularity among the second repair resources.
  • the fault information also includes the memory address of the fault
  • the repair unit is specifically used for:
  • the type of failure includes one or more of Cell failure, Row failure, col failure, Bank failure, Device failure, and Rank failure.
  • the repair resources include at least two of PCLS repair resources, PPR repair resources, ADDDC Sparing repair resources, Device Sparing repair resources, and Rank Sparing repair resources;
  • the granularity of the PCLS repair resources is smaller than the granularity of the PPR repair resources
  • the granularity of the PPR repair resources is smaller than the granularity of the ADDDC Sparing repair resources
  • the granularity of the ADDDC Sparing repair resources is smaller than the Device Sparing repair resources
  • the granularity of the Device Sparing repair resource is smaller than the granularity of the Rank Sparing repair resource.
  • the processing module is a processor of a server or a processor of a BMC.
  • embodiments of the present application provide a chip, including at least one processor and a communication interface.
  • the processor is configured to execute the method described in the first or second aspect.
  • embodiments of the present application provide a computing device, including a processor and a memory.
  • the processor is configured to execute the method described in the first aspect or the second aspect to repair faults in the memory.
  • embodiments of the present application provide a computing device, including a processor, a BMC, and a memory. Either the processor or the BMC is configured to execute the method described in the first aspect or the second aspect to repair faults in the memory.
  • embodiments of the present application provide a computing device, including a processing unit, BIOS and memory:
  • the BIOS is used to obtain the first fault information and the first repair resource information of the memory
  • the processing module is used to obtain the second fault information of the memory, where the second fault information includes the type of the fault; to determine the type of the fault; to obtain the second repair resource information, where the second repair resource information includes the types of repair resources in the repair resource set, and the repair resources can repair the fault; and to determine the repair information according to the fault type and the second repair resource information, where the repair information includes at least one of the type of the first repair resource or the granularity of the first repair resource, the first repair resource being the repair resource with the smallest granularity in the repair resource set;
  • the BIOS is also used to obtain the first repair information; and to repair the fault based on the first repair information.
  • the processing module is a processor or a BMC chip.
  • embodiments of the present application provide a computer-readable storage medium on which a computer program is stored.
  • when the computer program is executed by a processor, the method described in the first aspect or the second aspect is implemented.
  • embodiments of the present application further provide a computer program or computer program product.
  • the computer program or computer program product includes instructions. When the instructions are executed, the computer is caused to perform the method described in the first aspect or the second aspect.
  • Figure 1 shows a schematic diagram of the hardware structure of a computing device that can execute a memory fault repair method provided by an embodiment of the present application
  • Figure 2 is a flow chart of a method for determining the granularity of repair resources for memory faults provided by an embodiment of the present application
  • Figure 3 is a flow chart of a memory fault repair method provided by an embodiment of the present application.
  • Figure 4 is a schematic diagram of the implementation process of a memory fault repair method provided by an embodiment of the present application when a memory fault is suddenly reported;
  • Figure 5 is a schematic diagram of the implementation process of a memory fault repair method provided by an embodiment of the present application when memory faults are reported sequentially;
  • FIG. 6 is a schematic structural diagram of a device for determining the granularity of repair resources for memory faults provided by an embodiment of the present application.
  • the server's out-of-band baseboard management controller (BMC) collects fault information, performs fault prediction, and defines fault feature patterns and repair mechanisms. For example, Cell faults are repaired with partial cache line sparing (PCLS) repair resources; Row faults are repaired with PPR repair resources; Col faults and Bank faults are repaired with ADDDC Sparing repair resources; Device faults are repaired with Device Sparing repair resources; and Rank faults are repaired with Rank Sparing repair resources. The BMC sends the corresponding repair request to the BIOS, and the BIOS executes the corresponding isolation mechanism according to the request to complete the system repair.
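The fixed, one-to-one assignment used by this scheme amounts to a static lookup table. The sketch below is illustrative only; the dictionary name `FIXED_REPAIR` and the shape of the request returned by `build_repair_request` are assumptions, not an actual BMC or BIOS interface.

```python
# Fixed fault-type to repair-resource mapping, as listed in the text.
FIXED_REPAIR = {
    "Cell": "PCLS",
    "Row": "PPR",
    "Col": "ADDDC Sparing",
    "Bank": "ADDDC Sparing",
    "Device": "Device Sparing",
    "Rank": "Rank Sparing",
}


def build_repair_request(fault_type):
    """Build a hypothetical repair request that would be sent to the BIOS."""
    return {"fault_type": fault_type, "repair_resource": FIXED_REPAIR[fault_type]}


request = build_repair_request("Bank")
```

A Bank fault is always assigned ADDDC Sparing here, regardless of whether smaller-granularity resources are still available.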
  • the main feature of this solution is to implement corresponding self-healing mechanisms for different fault types, but it lacks consideration of the priority use of memory repair resources at different granularities.
  • multiple small-granularity repair resources to perform self-healing for example, multiple PCLS repair resources can be used to perform the repair of the fault
  • the fault repair resources with larger memory granularity such as ADDDC Sparing
  • small-granularity repair resources such as PCLS
  • this solution uses fixed-granularity repair resources for fixed types of faults, without considering the preferential use of repair resources with different granularities. Repair resources are therefore used coarsely, which not only wastes repair resources but also affects system performance and leads to noticeable service impact.
  • memory includes multiple channels, and each channel includes multiple dual inline memory modules (DIMMs).
  • a DIMM includes multiple Ranks (memory modules)
  • a Rank includes multiple Devices (memory chips)
  • a Device includes multiple Banks (memory banks)
  • a Bank includes multiple Row (memory row) or Col (memory column)
  • a Row or Col includes multiple Cells (memory bits).
  • Cell fault, Row fault, Col fault, Bank fault, Device fault, and Rank fault respectively indicate that a memory fault occurs at Cell granularity, Row granularity, Col granularity, Bank granularity, Device granularity, and Rank granularity.
  • Granularity is used to measure the size of memory space. For example, the granularity of DIMM is larger than rank, the granularity of rank is larger than Device, the granularity of Device is larger than Bank, the granularity of Bank is larger than Row/Col, and the granularity of Row or Col is larger than Cell.
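The ordering of granularities described above can be captured with an ordered enumeration. This is a minimal sketch: the `Granularity` class and its integer values are assumptions, and only their relative order carries meaning.

```python
from enum import IntEnum


class Granularity(IntEnum):
    """Memory granularity levels; larger value means larger memory space."""
    CELL = 1
    ROW_COL = 2  # Row and Col are treated as the same level, as in the text
    BANK = 3
    DEVICE = 4
    RANK = 5
    DIMM = 6


# Sorting faults by granularity, from smallest to largest.
observed = [Granularity.BANK, Granularity.CELL, Granularity.RANK]
ordered = sorted(observed)
```

Using `IntEnum` lets the levels be compared directly, so "Bank is larger than Row/Col" becomes `Granularity.BANK > Granularity.ROW_COL`.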
  • when a Cell in the memory fails, it is called a Cell fault; when a Row or Col fails (for example, a Row or Col is determined to have failed if multiple Cell faults occur in it), it is called a Row fault or Col fault; when a Bank fails, it is called a Bank fault; when a Device fails, it is called a Device fault; and when a Rank fails, it is called a Rank fault.
  • memory faults can also be divided into different granularities according to the size of the memory space in which the fault occurs. For example, from small to large, memory faults can be divided into Cell faults, Row faults or Col faults, Bank faults, Device faults, and Rank faults.
  • Memory is an important part of a computing device and commonly fails for various reasons during use. To ensure that system operation is not affected after a memory fault, redundancy is usually provided in the memory. For example, each Bank has redundant Rows or Cols, and when a Row in the Bank fails, the failed Row is replaced with a redundant Row. The redundant Cells, Rows, Cols, Banks, Devices, Ranks, and so on that may exist in the memory are called repair resources.
  • Cell repair resources include partial cache line sparing (PCLS) repair resources
  • Row repair resources or Col repair resources include post-package repair (PPR) repair resources
  • Bank repair resources include adaptive double device data correction sparing (ADDDC Sparing) repair resources
  • Device repair resources include Device Sparing repair resources
  • Rank repair resources include Rank Sparing repair resources.
  • granularity measures the size of repair resources. For example, the granularity of PCLS fault repair resources is smaller than that of PPR fault repair resources. The larger the granularity of repair resources, the greater the impact on system performance, and the business perception will be more obvious.
  • the embodiment of the present application proposes a method for repairing memory faults.
  • in this method, memory faults are repaired preferentially with small-granularity fault repair resources.
  • when the small-granularity fault repair resources are exhausted, the repair resources with the next smallest granularity are used to repair the memory fault, and so on.
  • this avoids the system performance degradation and noticeable service impact caused by directly using large-granularity self-healing resources, and also makes full use of small-granularity resources.
  • FIG. 1 shows a schematic diagram of the hardware structure of a computing device that can execute a memory fault repair method provided by an embodiment of the present application.
  • the computing device includes a processor 105, a memory controller 102, a memory 103, and a basic input output system (BIOS) chip 104.
  • the processor 105 and the memory controller 102 may be integrated together, or may be provided independently.
  • the memory 103 is used to store data required for processor operations, and can also exchange data with external memories such as hard disks in the computing device.
  • the memory 103 can cache operating systems and software applications.
  • the memory controller 102 is used to manage data/programs in the memory 103; the BIOS chip 104 is used to detect various hardware in the computing device 100, such as CPU, memory, and motherboard.
  • the computing device 100 may further include a baseboard management controller (BMC) 106, used for remote management of the computing device and other operations.
  • the computing device may be a terminal device, such as a personal computer, a smartphone, a smart wearable device, etc.
  • the computing device may also be a server, for example, an X86 architecture server, which may be a blade server, a high-density server, a rack server or a high-performance server.
  • the processor 105 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • a general-purpose processor can be a microprocessor or any conventional processor, etc.
  • the memory 103 is a volatile memory that can be random access memory (RAM), which is used as an external cache.
  • many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (synchronous DRAM, SDRAM), double data rate synchronous dynamic random access memory (double data rate SDRAM, DDR SDRAM), enhanced synchronous dynamic random access memory (enhanced SDRAM, ESDRAM), synchronous link dynamic random access memory (synchlink DRAM, SLDRAM), and direct memory bus random access memory (direct rambus RAM, DR RAM).
  • the computing device 100 can perform the memory fault repair method provided in the embodiment of the present application below.
  • the processor 105 and the BMC 106 can be collectively referred to as a processing module; that is, in the following text, the processing module can be the processor 105 or the BMC 106.
  • the BIOS is used to collect memory fault information (including the location of memory faults, the number of faults, the fault types, etc.) and fault repair resource information.
  • the processing module is used to determine the memory fault repair strategy based on the fault information and the fault repair resource information, and to send the repair strategy to the BIOS; the BIOS repairs the memory fault based on the repair strategy.
  • the processing module collects memory fault information reported by the BIOS, and collects and counts the number of memory fault repair resources reported by the BIOS.
  • the fault self-healing reasoning and decision-making module in the processing module defines, based on the memory fault information and the fault repair resource information, a strategy that prioritizes fine-grained fault repair resources for memory fault repair, and sends the fault repair mechanism request to the BIOS to execute the corresponding fault repair.
  • BIOS can be a memory or a functional chip.
  • the functions implemented by the BIOS can be executed by the processor calling a program stored in the BIOS memory, or can be directly executed by the BIOS chip.
  • the computing device shown in Figure 1 is only a schematic diagram of the hardware structure of a computing device that can execute a method for repairing memory faults provided by embodiments of the present application. It does not limit the computing devices to which embodiments of the present application are applicable.
  • the computing device may also include a persistent storage medium, a communication interface, a communication line, etc., which are not shown in Figure 1 .
  • the following uses the computing device as a server as an example to introduce the solution provided by the embodiment of the present application.
  • Other computing devices are similar and will not be described again here.
  • FIG. 2 is a flow chart of a method for determining the repair granularity of memory faults provided by an embodiment of the present application. This method can be executed in the server shown in Figure 1 to determine the repair granularity of memory faults and avoid wasting repair resources. As shown in Figure 2, the embodiment of the present application provides a method for determining the granularity of repair resources for memory faults, which at least includes steps S201 to S206.
  • In step S201, the BIOS obtains memory fault information.
  • The BIOS obtains information about faults that exist in the memory, including, for example, the type of fault, the memory address of the fault, and the number of faults.
  • In step S202, the processing module obtains the memory fault information.
  • The processing module can obtain the memory fault information by receiving a report from the BIOS, or by sending a request to the BIOS. For example, the processing module sends the BIOS a request for the memory fault information, and the BIOS responds by sending the memory fault information to the processing module.
  • Memory faults can be divided into different types according to granularity.
  • For example, memory fault types can include one or more of Cell faults, Row faults, Column faults, Bank faults, Device (memory chip) faults, and Rank faults.
  • The embodiments of this application do not limit the specific granularity of memory fault types; the list above is only illustrative. As technology advances, fault types of smaller or larger granularity may appear.
  • It should be noted that the processing module can obtain all the fault information obtained by the BIOS, or a part of it that includes at least the fault type.
  • For example, suppose the fault information obtained by the BIOS includes the type of fault, the memory address of the fault, and the number of faults.
  • The processing module can then obtain only the type of fault, or both the type of fault and the number of faults.
  • In step S203, the processing module determines the fault type of the memory.
  • The processing module works from the memory fault information it has obtained.
  • When the fault information includes a fault type and the number of faults of that type, the processing module can determine the type of the memory fault from both. For example, if the fault information obtained from the BIOS shows that 16 Cell faults have occurred at different locations in one Bank of the memory, the processing module determines that the memory fault is a Bank fault.
  • In another example, the processing module can take the fault type directly from the BIOS. For example, if the fault type reported by the BIOS is a Cell fault, the processing module determines that the fault type is a Cell fault. In other words, the processing module processes the fault information reported by the BIOS in real time.
  • Optionally, the processing module may also determine the number of memory faults.
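The escalation described in step S203 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `determine_fault_type` and the threshold of 16 are assumptions taken from the single example in the text (16 Cell faults at different locations in one Bank), and the source does not state the actual criterion.

```python
# Hypothetical threshold: the text only gives the example of 16 Cell faults
# at different locations in one Bank being treated as a Bank fault.
BANK_ESCALATION_THRESHOLD = 16


def determine_fault_type(reported_type, addresses, threshold=BANK_ESCALATION_THRESHOLD):
    """Return the fault type the processing module would act on.

    reported_type: fault type reported by the BIOS (the "first type").
    addresses: fault addresses reported for one Bank.
    """
    distinct_locations = len(set(addresses))
    if reported_type == "Cell" and distinct_locations >= threshold:
        # Many Cell faults at different locations in one Bank: treat as a Bank fault.
        return "Bank"
    # Otherwise keep the type exactly as the BIOS reported it.
    return reported_type
```

With this sketch, `determine_fault_type("Cell", range(16))` yields `"Bank"`, while a single Cell fault stays a Cell fault.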
  • In step S204, the BIOS obtains repair resource information.
  • The BIOS scans the memory to find the repair resources in it, which can be used to repair memory faults.
  • Memory is an important part of a computing device and often fails for various reasons during use. To ensure that system operation is not affected after a memory fault, memory is usually provisioned with redundancy. For example, each Bank has redundant Rows or Cols; when a Row in the Bank fails, the failed Row is replaced with a redundant Row. The redundant Cells, Rows, Cols, Banks, Devices, Ranks and so on that may exist in the memory are called repair resources.
  • Obtaining repair resource information therefore means obtaining information about the redundant Cells, Rows, Cols, Banks, Devices and Ranks in the memory.
  • By scanning the memory, the BIOS can obtain the types of all repair resources in the memory.
  • Repair resource types can include PCLS repair resources, PPR repair resources, ADDDC Sparing repair resources, Device Sparing repair resources, and Rank Sparing repair resources.
  • The granularity of PCLS repair resources is smaller than that of PPR repair resources, the granularity of PPR repair resources is smaller than that of ADDDC Sparing repair resources, the granularity of ADDDC Sparing repair resources is smaller than that of Device Sparing repair resources, and the granularity of Device Sparing repair resources is smaller than that of Rank Sparing repair resources.
  • The correspondence between the granularity of memory faults and the granularity of repair resources is usually as follows: Cell faults correspond to PCLS repair resources, Row faults and Column faults correspond to PPR repair resources, Bank faults correspond to ADDDC Sparing repair resources, Device faults correspond to Device Sparing repair resources, and Rank faults correspond to Rank Sparing repair resources.
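The granularity ordering and the usual fault-to-resource correspondence can be written down as a small table. The sketch below is illustrative only; the names `RepairResource` and `DEFAULT_RESOURCE_FOR_FAULT` and the integer ranks are assumptions, and only the relative order and the pairings come from the text.

```python
from enum import IntEnum


class RepairResource(IntEnum):
    """Repair resource types ordered by granularity (smaller value = finer).

    The integer values are arbitrary ranks chosen for this sketch.
    """
    PCLS = 1
    PPR = 2
    ADDDC_SPARING = 3
    DEVICE_SPARING = 4
    RANK_SPARING = 5


# Usual correspondence between fault type and repair resource, as listed above.
DEFAULT_RESOURCE_FOR_FAULT = {
    "Cell": RepairResource.PCLS,
    "Row": RepairResource.PPR,
    "Column": RepairResource.PPR,
    "Bank": RepairResource.ADDDC_SPARING,
    "Device": RepairResource.DEVICE_SPARING,
    "Rank": RepairResource.RANK_SPARING,
}
```

Because `IntEnum` members compare as integers, `RepairResource.PCLS < RepairResource.PPR` expresses "PCLS is finer-grained than PPR" directly.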
  • The BIOS can also obtain the number of each repair resource in the memory.
  • For example, the number of PCLS repair resources is 16, the number of PPR repair resources is 8, and the number of ADDDC Sparing repair resources is 4.
  • In step S205, the processing module obtains the repair resource information.
  • The processing module can obtain the repair resource information by receiving a report from the BIOS.
  • The processing module can also send a request to the BIOS to obtain the repair resource information.
  • The processing module can obtain all of the repair resource information obtained by the BIOS, or a part of it that includes at least the repair resource type.
  • For example, if the repair resource information obtained by the BIOS includes the type and quantity of the repair resources, the processing module can obtain only the type, or both the type and the quantity.
  • In one example, the processing module uses the fault type to request from the BIOS only information about repair resources whose granularity is smaller than or equal to that of the fault type. For example, if the fault type is a Bank fault, the processing module requests information about repair resources whose granularity is smaller than or equal to that of ADDDC Sparing repair resources (that is, PCLS, PPR and ADDDC Sparing repair resources). The repair resource information includes the type of each repair resource.
  • In another example, the processing module obtains all repair resource information in the memory from the BIOS.
  • The processing module may also obtain part of the repair resource information obtained by the BIOS, where that part includes at least the repair resource type.
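The request in step S205 for resources "with a granularity smaller than or equal to the fault type" amounts to a filter over the resources the BIOS reports. The following sketch assumes the BIOS report is a simple mapping from resource type to remaining count; `candidate_resources` and both lookup tables are hypothetical names introduced here.

```python
# Granularity rank of each repair resource type (smaller = finer), per the text.
GRANULARITY_RANK = {
    "PCLS": 1,
    "PPR": 2,
    "ADDDC Sparing": 3,
    "Device Sparing": 4,
    "Rank Sparing": 5,
}

# Repair resource that usually matches each fault type, per the text.
MATCHING_RESOURCE = {
    "Cell": "PCLS",
    "Row": "PPR",
    "Column": "PPR",
    "Bank": "ADDDC Sparing",
    "Device": "Device Sparing",
    "Rank": "Rank Sparing",
}


def candidate_resources(fault_type, available):
    """Keep only resources no coarser than the one matching the fault type.

    available: dict of resource type -> remaining count (assumed report format).
    """
    limit = GRANULARITY_RANK[MATCHING_RESOURCE[fault_type]]
    return {
        name: count
        for name, count in available.items()
        if GRANULARITY_RANK[name] <= limit
    }
```

For a Bank fault and a report of 16 PCLS, 8 PPR, 4 ADDDC Sparing and 1 Rank Sparing, the filter keeps the first three and drops Rank Sparing.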
  • In step S206, the processing module determines repair information that preferentially uses repair resources of a first granularity to repair the fault, where the first granularity is the minimum granularity.
  • The processing module determines the repair information for this type of fault based on the memory fault information and the fault repair resource information.
  • The repair information indicates that, among the memory repair resources currently available, the repair resource with the smallest granularity is used first to repair this type of fault.
  • Optionally, the repair information may include the granularity of the repair resource, or the type of the repair resource (such as ADDDC Sparing), or both the granularity and the type. Optionally, the repair information may also include the granularity and quantity of the repair resources, or the type and quantity, or the granularity, type and quantity.
  • For example, if the memory fault type is a Bank fault and the obtained repair resources include PCLS repair resources, PPR repair resources, and ADDDC Sparing repair resources, the repair information gives priority to the PCLS repair resources, which have the smallest granularity.
  • Optionally, if the memory fault information also includes the faulty memory address, the repair information may also include the faulty memory address.
  • Once the repair information for repairing the fault with the minimum-granularity repair resources has been determined, and those resources turn out to be insufficient, the repair resources with the second-smallest granularity can be used to repair the fault, and so on, until the fault is repaired.
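The decision in step S206 can be sketched as picking the finest-grained resource that still has units left, and recording the next-finest one as the fallback. The function name `determine_repair_info` and the dictionary keys are assumptions made for this illustration; the text only says that repair information may carry a type, granularity and quantity.

```python
# Resource types from finest to coarsest granularity, per the text.
GRANULARITY_ORDER = ["PCLS", "PPR", "ADDDC Sparing", "Device Sparing", "Rank Sparing"]


def determine_repair_info(available):
    """Build repair information that prefers the smallest available granularity.

    available: dict of resource type -> remaining count (assumed format).
    Returns None when no repair resource is left.
    """
    usable = [name for name in GRANULARITY_ORDER if available.get(name, 0) > 0]
    if not usable:
        return None
    info = {"type": usable[0], "quantity": available[usable[0]]}
    if len(usable) > 1:
        # Second-smallest granularity, to be used if the first one runs out.
        info["fallback_type"] = usable[1]
    return info
```

When PCLS is exhausted, the same call naturally moves on to PPR, which mirrors the "and so on" fallback described above.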
  • Figure 3 is a flow chart of a memory fault repair method provided by an embodiment of the present application. This method can be executed in the server shown in Figure 1 to achieve the repair of memory faults. As shown in Figure 3, it includes at least step S301 to step S303.
  • In step S301, the processing module determines the repair information of the memory fault repair resource. Specifically, steps S201 to S206 shown in FIG. 2 may be used, which will not be described again here.
  • In step S302, the BIOS obtains the repair information.
  • After the processing module determines the repair information, it sends the repair information to the BIOS.
  • The repair information instructs the BIOS which repair strategy to execute for the memory fault.
  • In step S303, the BIOS repairs the fault based on the repair information.
  • The processing module sends the repair information to the BIOS so that the BIOS repairs the fault based on it.
  • When the repair information includes the granularity of the repair resource, the BIOS can use repair resources of that granularity repeatedly until the fault is repaired.
  • When the repair information includes the type of the repair resource, the BIOS can use repair resources of that type repeatedly until the fault is repaired.
  • When the repair information includes both the granularity and the type, the BIOS can use repair resources of that granularity, or of that type, repeatedly until the fault is repaired.
  • When the repair information includes the granularity and the quantity, the BIOS can use the indicated number of repair resources of that granularity to repair the fault.
  • When the repair information includes the type and the quantity, the BIOS can use the indicated number of repair resources of that type to repair the fault. For example, if the repair information sent by the processing module to the BIOS is PCLS repair resources and 8, the BIOS uses 8 PCLS repair resources to repair the fault.
  • When the repair information includes the granularity, the type and the quantity, the BIOS can use the indicated number of repair resources of that granularity, or of that type, to repair the fault.
  • When the repair information includes the faulty memory address, the BIOS can perform fault repair on the memory corresponding to that address.
  • In one example, the BIOS detects a fault and reports the fault information to the processing module, which receives it. The BIOS also detects the repair resource information in the memory.
  • The processing module determines the repair information based on the fault information and the repair resource information and sends it to the BIOS, and the BIOS repairs the fault based on the repair information. That is, when the BIOS detects multiple faults, the processing module and the BIOS interact multiple times: the processing module sends repair information for each reported fault, and the BIOS executes that repair information to repair the fault.
  • In another example, the BIOS detects multiple faults (for example, 16) and reports all of the detected fault information to the processing module.
  • The processing module receives the fault information and determines the fault type from it. For example, if 17 Cell faults are reported in one Bank, the fault type is determined to be a Bank fault. The BIOS detects the repair resource information in the memory, and the processing module determines the repair information based on the fault type and the repair resource information and sends it to the BIOS.
  • The BIOS repairs this type of fault based on the repair information. That is, when the BIOS detects multiple faults, it collects them and reports them together to the processing module. The processing module determines the fault type from the reported fault information and sends repair information for that fault type, and the BIOS executes the repair information to repair faults of that type.
  • The BIOS obtains fault information by scanning the memory; for example, 17 Cell faults have occurred on Bank0.
  • The BIOS reports the obtained memory fault information to the BMC.
  • The BMC collects the 17 fault records reported by the BIOS and determines that the fault is a Bank fault.
  • The BIOS detects the fault self-healing resource information in the memory and sends it to the BMC.
  • The BMC collects the fault self-healing resource information reported by the BIOS and counts 16 remaining PCLS repair resources and 2 remaining online PPR repair resources.
  • Based on the fault type information and the self-healing resource information, the BMC makes a fault self-healing decision: it determines a repair strategy of using PCLS for the first 16 faults and sends the repair strategy to the BIOS.
  • Based on the same information, the BMC decides not to execute the Bank-level ADDDC Sparing repair strategy directly. Instead, it determines a repair strategy of performing online PPR for the 17th fault and sends this repair strategy to the BIOS.
  • After the BIOS executes the above repair strategies, the fault repair of the memory is complete.
  • This is how the memory fault repair method provided by the embodiment of this application handles memory faults that are reported in a burst.
  • Here the memory fault is a Bank fault.
  • The Bank-level ADDDC Sparing repair method is not used directly. Instead, priority is given to the small-granularity sparing methods PCLS and online PPR, which have almost no impact on performance, to repair the faults in the Bank. This avoids the drop in system performance, and the resulting noticeable impact on services, that the ADDDC Sparing repair method would cause, and it also makes full use of the small-granularity repair resources.
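The burst scenario above (17 Cell faults, 16 PCLS and 2 online PPR resources remaining) can be reproduced with a small allocation routine. This is a simplified sketch under a stated assumption that one repair resource unit repairs one reported Cell fault, which matches the worked numbers in the text but is not stated as a general rule; `plan_burst_repair` is a hypothetical name.

```python
def plan_burst_repair(fault_count, available,
                      order=("PCLS", "online PPR", "ADDDC Sparing")):
    """Assign faults to repair resources, finest granularity first.

    Assumption for this sketch: one resource unit repairs one reported fault.
    Returns (plan, unrepaired), where plan is a list of (resource, quantity).
    """
    plan = []
    remaining = fault_count
    for name in order:
        take = min(remaining, available.get(name, 0))
        if take:
            plan.append((name, take))
            remaining -= take
        if remaining == 0:
            break
    return plan, remaining


# Numbers from the burst example: 17 Cell faults on Bank0.
plan, unrepaired = plan_burst_repair(17, {"PCLS": 16, "online PPR": 2})
```

Here `plan` is `[("PCLS", 16), ("online PPR", 1)]` and nothing is left unrepaired, so the Bank-level ADDDC Sparing resource is never touched.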
  • In another example, memory faults are reported sequentially. Assume that the remaining number of bit-level self-healing resources (PCLS) in the system is 16 and the remaining number of row-level self-healing resources (online PPR) is 2. The specific implementation process of this solution is then as shown in Figure 5:
  • The BIOS obtains fault information by scanning the memory; for example, 16 Cell faults have occurred on Bank0.
  • The BIOS reports the obtained memory fault information to the BMC.
  • The BMC collects, in sequence, the 16 fault records reported by the BIOS.
  • The BIOS detects the fault self-healing resource information in the memory and sends it to the BMC.
  • The BMC collects the fault self-healing resource information reported by the BIOS and counts 16 remaining PCLS repair resources.
  • Based on the fault type information and the self-healing resource information, the BMC makes a fault self-healing decision: it gives priority to the PCLS repair strategy for each of the 16 faults in sequence and sends the repair strategy to the BIOS.
  • The BIOS continues to scan the memory and obtains further fault information; for example, two more Cell faults have occurred on Bank0.
  • The BIOS reports the obtained memory fault information to the BMC.
  • The BMC collects, in sequence, the two fault records reported by the BIOS.
  • The BMC collects the fault self-healing resource information reported by the BIOS and counts 2 remaining online PPR repair resources.
  • Based on the fault type information and the self-healing resource information, the BMC makes a fault self-healing decision: it gives priority to the online PPR repair strategy for each of the 2 faults in sequence and sends the repair strategy to the BIOS.
  • The BMC continues to collect, in sequence, the information about further memory faults reported by the BIOS.
  • The BMC collects the information about other fault self-healing resources reported by the BIOS and finds that they are sufficient (for example, the other fault self-healing resources are ADDDC Sparing).
  • Based on the fault type information and the self-healing resource information, the BMC makes a fault self-healing decision: it determines the self-healing strategy that matches the other faults (such as a Bank fault) and sends the repair strategy (for example, ADDDC Sparing) to the BIOS.
  • After the BIOS executes the above repair strategies, the fault repair of the memory is complete.
  • When memory faults occur sequentially, the memory fault repair method provided by the embodiment of the present application gives priority to fine-grained self-healing resources, which have almost no impact on performance. This avoids the drop in system performance, and the resulting noticeable impact on services, caused by directly using large-grained self-healing resources, and it also makes full use of fine-grained resources.
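The sequential scenario differs from the burst one in that the decision is made per reported fault against a pool whose counts shrink over time. The sketch below models that with a small stateful class; `SelfHealingPool` and its `allocate` method are hypothetical names, and it again assumes one resource unit per reported fault, which simplifies the final ADDDC Sparing step in the text.

```python
class SelfHealingPool:
    """Remaining self-healing resources, consumed finest granularity first."""

    def __init__(self, counts, order=("PCLS", "online PPR", "ADDDC Sparing")):
        self.counts = dict(counts)  # copy so the caller's dict is not modified
        self.order = order

    def allocate(self):
        """Pick the repair resource for the next reported fault, or None."""
        for name in self.order:
            if self.counts.get(name, 0) > 0:
                self.counts[name] -= 1
                return name
        return None


# Counts from the sequential example, plus one ADDDC Sparing unit (assumed).
pool = SelfHealingPool({"PCLS": 16, "online PPR": 2, "ADDDC Sparing": 1})
decisions = [pool.allocate() for _ in range(19)]
```

The first 16 decisions are PCLS, the next 2 are online PPR, and only the 19th falls through to ADDDC Sparing, which follows the order of events shown for Figure 5.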
  • The embodiment of the present application also provides a memory fault repair device 600.
  • The memory fault repair device 600 includes units or modules for implementing the steps of the memory fault repair methods shown in Figures 2 to 5.
  • In another example, the processing module is the processor of the server.
  • The steps described above as executed in the BMC can instead be executed in the processor of the server. When the processing module is the processor, the specific implementation of the memory fault repair method provided by the embodiment of the present application follows the description above and, for brevity, is not repeated here.
  • FIG. 6 is a schematic structural diagram of a device for determining the repair resource granularity of memory faults provided by an embodiment of the present application. This device can be implemented by a processing module. As shown in Figure 6, the device 600 for determining the repair resource granularity of memory faults includes at least a processing module 601.
  • The processing module 601 includes:
  • a first obtaining unit 6011, configured to obtain fault information of the memory, where the fault information includes the type of the fault;
  • a first determining unit 6012, configured to determine the type of the fault;
  • a second obtaining unit 6013, configured to obtain repair resource information, where the repair resource information includes the types of the repair resources in a repair resource set, and the repair resources can repair the fault;
  • a second determining unit 6014, configured to determine repair information according to the fault type and the repair resource information, where the repair information includes at least one of the type of a first repair resource or the granularity of the first repair resource, and the first repair resource is the repair resource with the smallest granularity in the repair resource set.
  • In a possible implementation, the repair resource information further includes the number of repair resources in the repair resource set, and the repair information further includes the number of first repair resources.
  • In another possible implementation, the second determining unit 6014 is further configured so that, when the number of first repair resources does not meet the repair resource requirement of the fault, the repair information further includes at least one of the type of a second repair resource or the granularity of the second repair resource, where the second repair resource is the repair resource with the second-smallest granularity in the repair resource set.
  • In another possible implementation, the repair information further includes the number of second repair resources.
  • In another possible implementation, the fault information includes a first type and a number of faults, and the first determining unit 6012 is specifically configured to determine a second type of the fault according to the first type and number of faults.
  • the processing module is a processor or BMC chip.
  • The memory fault repair device 600 may correspondingly execute the methods described in the embodiments of the present application, and the above and other operations and/or functions of each module in the memory fault repair device 600 implement the corresponding processes of the methods in Figures 2 to 5. For brevity, they are not repeated here.
  • The method for determining the repair granularity of memory faults provided by the embodiments of the present application can be executed by the processor of the server.
  • The processor calls program code.
  • The program code includes one or more software modules.
  • The computing device uses the processor to execute the program code and thereby implements the method for determining the repair granularity of memory faults provided by the embodiment of this application.
  • The memory fault repair method provided by the embodiment of the present application can also be executed by the out-of-band management BMC in Figure 1.
  • The out-of-band management BMC stores the program code corresponding to the method for determining the repair granularity of memory faults provided by the embodiment of the present application.
  • The program code includes one or more software modules, and the computing device executes the program code through the out-of-band management BMC to implement the method for determining the repair granularity of memory faults provided by embodiments of the present application.
  • Embodiments of the present application provide a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, the above-mentioned method is implemented.
  • Embodiments of the present application provide a chip, which includes at least one processor and an interface.
  • The at least one processor determines program instructions or data through the interface; the at least one processor is used to execute the program instructions to implement the methods mentioned above.
  • Embodiments of the present application provide a computer program or computer program product, which includes instructions that, when executed, cause the computer to perform the above-mentioned method.
  • The software modules may reside in random access memory (RAM), read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disks, removable disks, CD-ROMs, or any other form of storage medium known in the technical field.


Abstract

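A method for repairing a memory fault is provided, including: a processing module obtains fault information of a memory, where the fault information includes the type of the fault; the processing module obtains first repair resource information, where the first repair resource information includes the type of first repair resources and the first repair resources can repair the fault; the processing module determines first repair information according to the fault information and the first repair resource information, where the first repair information instructs the BIOS to repair the fault using a second repair resource, and the second repair resource is the repair resource with the smallest granularity among the first repair resources; and the BIOS repairs the fault based on the first repair information. The memory fault repair method provided by this application gives priority to fine-grained self-healing resources, which have almost no impact on performance, when repairing faults. This avoids the drop in system performance, and the resulting noticeable impact on services, caused by directly using coarse-grained self-healing resources, and it also makes full use of fine-grained resources.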

Description

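Method and apparatus for determining the repair resource granularity for a memory fault
This application claims priority to Chinese patent application No. 202210797526.1, filed on July 8, 2022 and entitled "Method and apparatus for determining the repair resource granularity for a memory fault", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of server technologies, and in particular to a method and an apparatus for determining the repair resource granularity for a memory fault.
Background
Dynamic random access memory (DRAM) is a type of random access memory widely used in the storage and IT fields. As DRAM integration density increases and process nodes shrink, the base failure rate also rises, and memory faults have become one of the major causes of server downtime.
Summary
Embodiments of this application provide a memory fault repair method that, for a memory fault, gives priority to small-granularity self-healing resources, which have almost no impact on performance. This avoids the drop in system performance, and the resulting noticeable impact on services, caused by directly using large-granularity self-healing resources, and it also makes full use of small-granularity resources.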
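In a first aspect, an embodiment of this application provides a method for determining the repair resource granularity for a memory fault. The method is performed by a processing module and includes: obtaining fault information of a memory, where the fault information includes the type of the fault; determining the type of the fault; obtaining repair resource information, where the repair resource information includes the types of the repair resources in a repair resource set and the repair resources can repair the fault; and determining repair information according to the fault type and the repair resource information, where the repair information includes at least one of the type of a first repair resource or the granularity of the first repair resource, and the first repair resource is the repair resource with the smallest granularity in the repair resource set.
In other words, the method determines that the repair resource with the smallest granularity is used first to repair the fault, so that fault repair has little impact on system performance and repair resources are fully used.
In a possible implementation, the repair resource information further includes the number of repair resources in the repair resource set, and the repair information further includes the number of first repair resources.
For example, the repair resource information includes not only the types of the repair resources but also the number of repair resources of each type. Correspondingly, the repair information includes the first repair resource, that is, the repair resource with the smallest granularity, and the number of first repair resources. The repair information thus indicates, for the type of fault that has occurred in the memory, which type of repair resource to use and how many repair resources of that type to use to repair the memory fault.
In another possible implementation, the method further includes: when the number of first repair resources does not meet the repair resource requirement of the fault, the repair information further includes at least one of the type of a second repair resource or the granularity of the second repair resource, where the second repair resource is the repair resource with the second-smallest granularity in the repair resource set.
In other words, in this implementation small-granularity repair resources are used first; once they are exhausted, repair resources of the next-smallest granularity are used to repair the memory fault, and so on. Giving priority to small-granularity self-healing resources, which have almost no impact on performance, avoids the drop in system performance, and the resulting noticeable impact on services, caused by directly using large-granularity self-healing resources, and it also makes full use of small-granularity resources.
In another possible implementation, the repair information further includes the number of second repair resources.
In another possible implementation, the fault information includes a first type and a number of faults, and determining the type of the fault includes: the processing module determines a second type of the fault according to the first type and number of faults. For example, if 16 (the number) Cell (the first type) faults occur at different locations in one Bank of the memory, the type of the fault is determined to be a Bank (the second type) fault.
In another possible implementation, the processing module is a processor or a BMC chip.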
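In a second aspect, an embodiment of this application provides a memory fault repair method, including: a processing module obtains fault information of a memory, where the fault information includes the type of the fault; the processing module obtains first repair resource information, where the first repair resource information includes the type of first repair resources and the first repair resources can repair the fault; the processing module determines first repair information according to the fault information and the first repair resource information, where the first repair information instructs the BIOS to repair the fault using a second repair resource, and the second repair resource is the repair resource with the smallest granularity among the first repair resources; and the BIOS obtains the first repair information and then repairs the fault based on it.
The memory fault repair method provided by the embodiments of this application gives priority to small-granularity self-healing resources, which have almost no impact on performance. This avoids the drop in system performance, and the resulting noticeable impact on services, caused by directly using large-granularity self-healing resources, and it also makes full use of fine-grained resources.
In a possible implementation, the fault information includes a first type and a number of faults, and determining the first repair information according to the fault information and the first repair resource information includes: the processing module determines a second type of the fault according to the first type and number of faults.
For example, if 16 Cell faults occur in one Bank of the memory, the fault type is determined to be a Bank fault.
In another possible implementation, the first repair information includes the type of the second repair resource. For example, the first repair information indicates that the second repair resource is of the PCLS type, instructing the BIOS to repair the fault with PCLS repair resources. In other words, the first repair information indicates only the type of the second repair resource, and the BIOS uses repair resources of that type until the fault is repaired.
In another possible implementation, the first repair information further includes the number of second repair resources, that is, it includes both the type and the number of second repair resources. For example, the first repair information indicates PCLS repair resources and a number of 16, instructing the BIOS to repair the fault with 16 PCLS repair resources.
In another possible implementation, the method further includes: when the number of second repair resources does not meet the repair resource requirement of the fault, determining second repair information, where the second repair information instructs the BIOS to repair the fault using a third repair resource, and the third repair resource is the repair resource with the second-smallest granularity among the second repair resources. That is, when the smallest-granularity repair resources cannot meet the repair resource requirement of the fault, the repair continues with repair resources of the next-smallest granularity, and so on, until the fault is repaired.
In another possible implementation, the fault information further includes the memory address of the fault, and repairing the fault by the BIOS based on the first repair information includes: performing fault repair on the memory corresponding to the memory address based on the first repair information.
In another possible implementation, the type of the fault includes one or more of a Cell fault, a Row fault, a Col fault, a Bank fault, a Device fault, and a Rank fault.
In another possible implementation, the repair resources include at least two of PCLS repair resources, PPR repair resources, ADDDC Sparing repair resources, Device Sparing repair resources, and Rank Sparing repair resources, where the granularity of PCLS repair resources is smaller than that of PPR repair resources, the granularity of PPR repair resources is smaller than that of ADDDC Sparing repair resources, the granularity of ADDDC Sparing repair resources is smaller than that of Device Sparing repair resources, and the granularity of Device Sparing repair resources is smaller than that of Rank Sparing repair resources.
Optionally, the processing module is the processor of the server, that is, the steps performed by the processing module can be performed in the processor; or the processing module is the out-of-band baseboard management controller (BMC) of the server, that is, the steps performed by the processing module can be performed in the BMC.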
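In a third aspect, an embodiment of this application provides an apparatus for determining the repair resource granularity for a memory fault, including a processing module, where the processing module includes:
a first obtaining unit, configured to obtain fault information of a memory, where the fault information includes the type of the fault;
a first determining unit, configured to determine the type of the fault;
a second obtaining unit, configured to obtain repair resource information, where the repair resource information includes the types of the repair resources in a repair resource set and the repair resources can repair the fault;
a second determining unit, configured to determine repair information according to the fault type and the repair resource information, where the repair information includes at least one of the type of a first repair resource or the granularity of the first repair resource, and the first repair resource is the repair resource with the smallest granularity in the repair resource set.
In a possible implementation, the repair resource information further includes the number of repair resources in the repair resource set, and the repair information further includes the number of first repair resources.
In another possible implementation, the second determining unit is further configured so that, when the number of first repair resources does not meet the repair resource requirement of the fault, the repair information further includes at least one of the type of a second repair resource or the granularity of the second repair resource, where the second repair resource is the repair resource with the second-smallest granularity in the repair resource set.
In another possible implementation, the repair information further includes the number of second repair resources.
In another possible implementation, the fault information includes a first type and a number of faults, and the first determining unit is specifically configured to:
determine a second type of the fault according to the first type and number of faults.
In another possible implementation, the processing module is a processor or a BMC chip.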
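In a fourth aspect, an embodiment of this application provides a memory fault repair apparatus, including a processing module and a BIOS module;
where the processing module includes:
a first obtaining unit, configured to obtain fault information of a memory, where the fault information includes the type of the fault; and
to obtain first repair resource information, where the first repair resource information includes the type of first repair resources and the first repair resources can repair the fault;
a determining unit, configured to determine first repair information according to the fault information and the first repair resource information, where the first repair information instructs the BIOS to repair the fault using a second repair resource, and the second repair resource is the repair resource with the smallest granularity among the first repair resources;
and the BIOS module includes:
a second obtaining unit, configured to obtain the first repair information;
a repair unit, configured to repair the fault based on the first repair information.
In a possible implementation, the fault information includes a first type and a number of faults;
the determining unit is further configured to determine a second type of the fault according to the first type and number of faults.
In another possible implementation, the first repair information includes the type of the second repair resource.
In another possible implementation, the first repair information further includes the number of second repair resources.
In another possible implementation, the determining unit is further configured to:
when the number of second repair resources does not meet the repair resource requirement of the fault, determine second repair information, where the second repair information instructs the BIOS to repair the fault using a third repair resource, and the third repair resource is the repair resource with the second-smallest granularity among the second repair resources.
In another possible implementation,
the fault information further includes the memory address of the fault;
the repair unit is specifically configured to:
perform fault repair on the memory corresponding to the memory address based on the first repair information, thereby repairing the fault.
In another possible implementation, the type of the fault includes one or more of a Cell fault, a Row fault, a Col fault, a Bank fault, a Device fault, and a Rank fault.
In another possible implementation, the repair resources include at least two of PCLS repair resources, PPR repair resources, ADDDC Sparing repair resources, Device Sparing repair resources, and Rank Sparing repair resources;
where the granularity of PCLS repair resources is smaller than that of PPR repair resources, the granularity of PPR repair resources is smaller than that of ADDDC Sparing repair resources, the granularity of ADDDC Sparing repair resources is smaller than that of Device Sparing repair resources, and the granularity of Device Sparing repair resources is smaller than that of Rank Sparing repair resources.
In another possible implementation, the processing module is the processor of the server or the processor of the BMC.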
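In a fifth aspect, an embodiment of this application provides a chip, including at least one processor and a communication interface, where the processor is configured to perform the method of the first aspect or the second aspect.
In a sixth aspect, an embodiment of this application provides a computing device, including a processor and a memory, where the processor is configured to perform the method of the first aspect or the second aspect to repair faults in the memory.
In a seventh aspect, an embodiment of this application provides a computing device, including a processor, a BMC and a memory, where either the processor or the BMC is configured to perform the method of the first aspect or the second aspect to repair faults in the memory.
In an eighth aspect, an embodiment of this application provides a computing device, including a processing module, a BIOS and a memory:
the BIOS is configured to obtain first fault information and first repair resource information of the memory;
the processing module is configured to obtain second fault information of the memory, where the second fault information includes the type of the fault; to determine the type of the fault; to obtain second repair resource information, where the second repair resource information includes the types of the repair resources in a repair resource set and the repair resources can repair the fault; and to determine repair information according to the fault type and the second repair resource information, where the repair information includes at least one of the type of a first repair resource or the granularity of the first repair resource, and the first repair resource is the repair resource with the smallest granularity in the repair resource set;
the BIOS is further configured to obtain the repair information and to repair the fault based on the repair information.
In a possible implementation, the processing module is a processor or a BMC chip.
In a ninth aspect, an embodiment of this application provides a computer-readable storage medium storing a computer program that, when executed by a processor, causes the method of the first aspect or the second aspect to be implemented.
In a tenth aspect, an embodiment of this application further provides a computer program or computer program product including instructions that, when executed, cause a computer to perform the method of the first aspect or the second aspect.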
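Brief Description of the Drawings
FIG. 1 is a schematic diagram of the hardware structure of a computing device that can perform the memory fault repair method provided by an embodiment of this application;
FIG. 2 is a flow chart of a method for determining the repair resource granularity for a memory fault provided by an embodiment of this application;
FIG. 3 is a flow chart of a memory fault repair method provided by an embodiment of this application;
FIG. 4 is a schematic diagram of how the memory fault repair method provided by an embodiment of this application proceeds when memory faults are reported in a burst;
FIG. 5 is a schematic diagram of how the memory fault repair method provided by an embodiment of this application proceeds when memory faults are reported sequentially;
FIG. 6 is a schematic structural diagram of an apparatus for determining the repair resource granularity for a memory fault provided by an embodiment of this application.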
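Detailed Description
The technical solutions of this application are described in further detail below with reference to the drawings and embodiments.
To address memory faults, one approach is as follows. The server's out-of-band baseboard management controller (BMC) collects fault information, performs fault prediction, and defines fault feature patterns and repair mechanisms. For example, Cell faults are repaired with partial cache line sparing (PCLS) repair resources; Row faults with PPR repair resources; Col faults and Bank faults with ADDDC Sparing repair resources; Device faults with Device Sparing repair resources; and Rank faults with Rank Sparing repair resources. The BMC sends the corresponding request to the BIOS, and the BIOS executes the corresponding isolation mechanism according to the repair request to complete the repair of the system.
The main characteristic of this approach is that it applies a matching self-healing mechanism to each fault type but does not consider which granularity of memory repair resource should be used first. For a fault that could be healed with several small-granularity repair resources (for example, several PCLS repair resources), repairing it first with a larger-granularity repair resource (for example, ADDDC Sparing) makes the small-granularity repair resources (for example, PCLS) unusable. This wastes the remaining small-granularity repair resources and also degrades system performance, with a noticeable impact on services.
That is, when repairing memory faults, this approach applies a fixed repair resource granularity to each fixed fault type and does not consider prioritizing among repair resources of different granularities. The use of repair resources is therefore coarse, which wastes repair resources and also affects system performance, with a noticeable impact on services.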
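It can be understood that memory includes multiple channels, each channel includes multiple dual inline memory modules (DIMMs), a DIMM includes multiple Ranks, a Rank includes multiple Devices (memory chips), a Device includes multiple Banks, a Bank includes multiple Rows or Cols (columns), and a Row or Col includes multiple Cells (memory bits). Cell faults, Row faults, Col faults, Bank faults, Device faults and Rank faults indicate that the memory has failed at Cell, Row, Col, Bank, Device and Rank granularity, respectively. Granularity measures the size of a memory region: the granularity of a DIMM is larger than that of a Rank, that of a Rank is larger than that of a Device, that of a Device is larger than that of a Bank, that of a Bank is larger than that of a Row/Col, and that of a Row or Col is larger than that of a Cell.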
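The containment hierarchy described above (channel, DIMM, Rank, Device, Bank, Row/Col, Cell) can be modeled as a hierarchical address, where a coarser region is identified by a prefix of the address. The sketch below is only an illustration: `CellAddress`, `PREFIX_LENGTH` and `region_of` are hypothetical names, and Col granularity is left out because a column is not a simple prefix of this field order.

```python
from typing import NamedTuple


class CellAddress(NamedTuple):
    """Hierarchical location of one memory Cell (field layout is an assumption)."""
    channel: int
    dimm: int
    rank: int
    device: int
    bank: int
    row: int
    col: int


# Number of leading address fields that identify a region at each granularity.
PREFIX_LENGTH = {"Rank": 3, "Device": 4, "Bank": 5, "Row": 6, "Cell": 7}


def region_of(address, granularity):
    """Return the tuple identifying the region that contains the address."""
    return tuple(address[:PREFIX_LENGTH[granularity]])


a = CellAddress(channel=0, dimm=1, rank=0, device=2, bank=3, row=100, col=7)
b = CellAddress(channel=0, dimm=1, rank=0, device=2, bank=3, row=200, col=9)
```

Here `a` and `b` are different Cells in different Rows, yet `region_of(a, "Bank") == region_of(b, "Bank")`, which is the kind of grouping needed to notice that many Cell faults fall in one Bank.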
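Typically, when a Cell of the memory fails, this is called a Cell fault; when a Row/Col fails (for example, when several Cell faults occur in a Row/Col, the Row/Col is determined to have failed), this is called a Row fault/Col fault; when a Bank fails, this is called a Bank fault; when a Device fails, this is called a Device fault; and when a Rank fails, this is called a Rank fault. Correspondingly, memory faults can be divided into different granularities according to the size of the failed memory region. From small to large, the fault granularities are Cell fault, Row fault/Col fault, Bank fault, Device fault, and Rank fault.
Memory is an important part of a computing device and often fails for various reasons during use. To ensure that system operation is not affected after a memory fault, memory is usually provisioned with redundancy. For example, each Bank has redundant Rows or Cols; when a Row in the Bank fails, the failed Row is replaced with a redundant Row. The redundant Cells, Rows, Cols, Banks, Devices, Ranks and so on that may exist in the memory are called repair resources. Correspondingly, Cell repair resources include partial cache line sparing (PCLS) repair resources; Row or Col repair resources include post-package repair (PPR) repair resources; Bank repair resources include adaptive double device data correction sparing (ADDDC Sparing) repair resources; Device repair resources include Device Sparing repair resources; and Rank repair resources include Rank Sparing repair resources. When applied to repair resources, granularity measures the size of the repair resource; for example, the granularity of PCLS repair resources is smaller than that of PPR repair resources. The larger the granularity of a repair resource, the greater its impact on system performance and the more noticeable the effect on services.
To address the problems of the above approach and of the prior art, the embodiments of this application propose a memory fault repair method that uses small-granularity fault repair resources first. Once the small-granularity fault repair resources are exhausted, repair resources of the next-smallest granularity are used to repair the memory fault, and so on. Giving priority to small-granularity self-healing resources, which have almost no impact on performance, avoids the drop in system performance, and the resulting noticeable impact on services, caused by directly using large-granularity self-healing resources, and it also makes full use of small-granularity resources.
It should be noted that, unless otherwise stated, "repair resource" and "self-healing resource" have the same meaning throughout the embodiments of this application, as do "repair" and "self-healing".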
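FIG. 1 is a schematic diagram of the hardware structure of a computing device that can perform the memory fault repair method provided by an embodiment of this application. As shown in FIG. 1, the computing device includes a processor 105, a memory controller 102, a memory 103, and a basic input output system (BIOS) chip 104. The processor 105 and the memory controller 102 may be integrated or provided separately. The memory 103 stores the data the processor needs for computation and can also exchange data with external storage such as a hard disk in the computing device; for example, the memory 103 may cache the operating system and software applications. The memory controller 102 manages the data and programs in the memory 103. The BIOS chip 104 checks the hardware in the computing device 100, such as the CPU, the memory and the motherboard.
In another possible embodiment, the computing device 100 may further include a baseboard management controller (BMC) 106, used for operations such as remote management of the computing device.
The computing device may be a terminal device, for example a personal computer, a smartphone, or a smart wearable device.
The computing device may also be a server, for example a server based on the X86 architecture, and specifically a blade server, a high-density server, a rack server, a high-performance server, or the like.
It should be understood that, in the embodiments of this application, the processor 105 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. A general-purpose processor may be a microprocessor or any conventional processor.
The memory 103 is a volatile memory and may be a random access memory (RAM), which serves as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM).
It should be understood that the computing device 100 according to the embodiments of this application can perform the memory fault repair method described below. For ease of description, the processor 105 and the BMC 106 are collectively referred to as the processing module; that is, wherever the processing module is mentioned below, it may be the processor 105 or the BMC 106.
Specifically, the BIOS collects memory fault information (including the location of the memory fault, the number of faults, the fault type, and so on) and fault repair resource information. The processing module determines a repair strategy for the memory fault based on the fault information and the fault repair resource information and sends the repair strategy to the BIOS, and the BIOS repairs the memory fault according to the repair strategy.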
例如,处理模块收集BIOS上报的内存故障信息,收集并统计BIOS上报的内存故障修复资源的数量。处理模块中的故障自愈推理决策模块根据内存故障信息以及故障修复资源信息,优先定义使用细粒度的故障修复资源执行内存故障修复的策略,并将该故障修复机制请求发送给BIOS,执行相应的故障修复。
需要说明的是,BIOS可以为一个存储器,也可以为一个具有功能的芯片。为了便于说明,在本申请后文中,BIOS实现的功能可以由处理器调用BIOS存储器中存储的程序执行,也可以由BIOS芯片直接执行。
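The computing device shown in FIG. 1 is only one schematic example of the hardware structure of a computing device that can execute the memory fault repair method provided by the embodiments of this application, and it does not limit the computing devices to which the embodiments apply. For example, the computing device may further include a persistent storage medium, a communication interface, communication lines and so on, which are not shown in FIG. 1.
The solution provided by the embodiments of this application is described below using a server as the computing device. Other computing devices are similar and are not described again here.
FIG. 2 is a flowchart of a method for determining the repair granularity for a memory fault according to an embodiment of this application. The method can be executed in the server shown in FIG. 1 to determine the repair granularity for a memory fault and avoid wasting repair resources. As shown in FIG. 2, the method for determining the repair resource granularity for a memory fault includes at least steps S201 to S206.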
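In step S201, the BIOS obtains fault information of the memory.
The BIOS obtains information on the faults present in the memory, including, for example, the fault type, the faulty memory address and the number of faults.
In step S202, the processing module obtains the fault information of the memory.
The processing module may obtain the fault information by receiving a report from the BIOS, or by sending a request to the BIOS. In the latter case, the processing module sends the BIOS a request for the memory fault information, and the BIOS responds by sending the fault information to the processing module.
Memory faults can be divided by granularity into several types. For example, the memory fault type may include one or more of a Cell fault, a Row fault, a Col fault, a Bank fault, a Device fault and a Rank fault. The embodiments of this application do not limit the specific granularities of memory fault types; the types listed here are only examples, and fault types of smaller or larger granularity may emerge as technology advances.
Note that the processing module may obtain all of the fault information obtained by the BIOS, or only a part of it that includes at least the fault type. For example, if the fault information obtained by the BIOS includes the fault type, the faulty memory address and the number of faults, the processing module may obtain only the fault type, or the fault type together with the number of faults.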
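In step S203, the processing module determines the fault type of the memory.
When the obtained fault information includes a fault type and the number of faults of that type, the processing module can determine the type of the memory fault from the two. For example, if the fault information obtained from the BIOS indicates that 16 Cell faults occurred at different locations in one Bank, the processing module determines that the memory fault is a Bank fault.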
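The inference in step S203 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the figure of 16 Cell faults per Bank comes from the example above, but treating it as a fixed threshold is an assumption, and the names `BANK_FAULT_THRESHOLD` and `infer_fault_types` are hypothetical. Fault addresses within a Bank are ignored for brevity.

```python
from collections import Counter

# Assumed threshold: the example treats 16 Cell faults in one Bank as a Bank fault.
BANK_FAULT_THRESHOLD = 16


def infer_fault_types(cell_fault_banks, threshold=BANK_FAULT_THRESHOLD):
    """Map each bank id to 'Bank' or 'Cell' based on its number of Cell faults.

    cell_fault_banks: iterable of bank ids, one entry per reported Cell fault.
    """
    counts = Counter(cell_fault_banks)
    return {
        bank: ("Bank" if n >= threshold else "Cell")
        for bank, n in counts.items()
    }


# 16 Cell faults reported for bank 0, 3 for bank 1.
reports = [0] * 16 + [1] * 3
print(infer_fault_types(reports))  # {0: 'Bank', 1: 'Cell'}
```

A real classifier would presumably also consider Row/Col patterns and the fault addresses; the sketch only shows the count-based escalation from Cell to Bank.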
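In another example, the processing module may obtain the fault type directly from the BIOS. For example, if the fault type reported by the BIOS is a Cell fault, the processing module determines that the fault type is a Cell fault. In other words, the processing module processes the fault information reported by the BIOS in real time.
Optionally, the processing module may also determine the number of memory faults.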
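In step S204, the BIOS obtains repair resource information.
The BIOS checks the memory to obtain the repair resources in the memory, which can be used to repair memory faults.
As described above, memory is usually provisioned with redundancy so that a fault does not affect system operation. For example, each Bank has redundant Rows or Cols; when a Row in the Bank fails, a redundant Row replaces the failed Row. The redundant Cells, Rows, Cols, Banks, Devices, Ranks and so on that may exist in memory are referred to as repair resources.
Obtaining the repair resource information therefore means obtaining information on the redundant Cells, Rows, Cols, Banks, Devices and Ranks in the memory.
By checking the memory, the BIOS can obtain the types of all repair resources in the memory.
The repair resource types may include PCLS repair resources, PPR repair resources, ADDDC Sparing repair resources, Device Sparing repair resources and Rank Sparing repair resources. The granularity of PCLS is smaller than that of PPR, PPR is smaller than ADDDC Sparing, ADDDC Sparing is smaller than Device Sparing, and Device Sparing is smaller than Rank Sparing. The granularity of a memory fault usually corresponds to the granularity of a repair resource as follows: a Cell fault corresponds to PCLS repair resources, a Row fault/Col fault to PPR repair resources, a Bank fault to ADDDC Sparing repair resources, a Device fault to Device Sparing repair resources, and a Rank fault to Rank Sparing repair resources.
Optionally, the BIOS may also obtain the quantities of all repair resources in the memory, for example 16 PCLS repair resources, 8 PPR repair resources and 4 ADDDC Sparing repair resources.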
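In step S205, the processing module obtains the repair resource information.
The processing module may obtain the repair resource information by receiving a report from the BIOS.
The processing module may also send an acquisition request to the BIOS to obtain the repair resource information.
Note that the processing module may obtain all of the repair resource information obtained by the BIOS, or only a part of it that includes at least the repair resource type. For example, if the repair resource information obtained by the BIOS includes the types and quantities of the repair resources, the processing module may obtain only the types, or the types together with the quantities.
In one example, the processing module may, based on the fault type, ask the BIOS only for information on repair resources whose granularity is less than or equal to the granularity corresponding to the fault type. For example, if the fault type is a Bank fault, the processing module asks the BIOS for information on repair resources whose granularity is less than or equal to that of ADDDC Sparing (that is, PCLS, PPR and ADDDC Sparing repair resources). The repair resource information includes the repair resource type.
In another example, the processing module obtains information on all repair resources in the memory from the BIOS.
Note that the processing module may also obtain only a part, including at least the repair resource type, of the information on some of the repair resources obtained by the BIOS.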
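The filtered request from the first example of step S205 can be sketched as follows. The ordering and the fault-to-resource correspondence are taken from the description; the names `REPAIR_ORDER`, `FAULT_TO_REPAIR` and `resources_to_request` are hypothetical and chosen only for illustration.

```python
# Repair resource types from smallest to largest granularity, as ordered in the text.
REPAIR_ORDER = ["PCLS", "PPR", "ADDDC Sparing", "Device Sparing", "Rank Sparing"]

# Typical fault-to-repair-resource correspondence stated in the text.
FAULT_TO_REPAIR = {
    "Cell": "PCLS",
    "Row": "PPR",
    "Col": "PPR",
    "Bank": "ADDDC Sparing",
    "Device": "Device Sparing",
    "Rank": "Rank Sparing",
}


def resources_to_request(fault_type: str) -> list[str]:
    """Repair resource types whose granularity does not exceed the fault's own."""
    limit = REPAIR_ORDER.index(FAULT_TO_REPAIR[fault_type])
    return REPAIR_ORDER[: limit + 1]


print(resources_to_request("Bank"))  # ['PCLS', 'PPR', 'ADDDC Sparing']
```

Restricting the request this way keeps larger-granularity resources out of the candidate set, so the later decision step cannot pick a resource coarser than the fault itself.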
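In step S206, the processing module determines repair information that gives priority to repairing the fault with repair resources of a first granularity, the first granularity being the smallest granularity.
Based on the memory fault information and the repair resource information, the processing module determines repair information for a fault of this type. The repair information indicates that, among the memory repair resources currently available, the repair resource with the smallest granularity is used first to repair the fault. The repair information may include the granularity of the repair resource. Optionally, it may include the type of the repair resource, for example ADDDC Sparing, or both the granularity and the type. Optionally, it may also include the granularity and quantity, the type and quantity, or the granularity, type and quantity of the repair resource.
For example, when the memory fault type is a Bank fault and the obtained repair resources include PCLS, PPR and ADDDC Sparing repair resources, the processing module determines that the smallest-granularity PCLS repair resources are used first to repair the Bank fault.
In another example, the memory fault information also includes the faulty memory address, in which case the repair information may include the faulty memory address as well.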
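The possible contents of the repair information listed above can be pictured as a small record type. This is only an illustrative sketch: the class name `RepairInfo`, its field names and the integer encoding of granularity are assumptions, not a message format defined by the application.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RepairInfo:
    """Hypothetical container for the repair information fields.

    At least one of granularity or resource_type is expected to be set;
    count and fault_address are optional extras.
    """
    granularity: Optional[int] = None    # assumed encoding, e.g. 1 = smallest
    resource_type: Optional[str] = None  # e.g. "ADDDC Sparing"
    count: Optional[int] = None          # number of repair resources to use
    fault_address: Optional[int] = None  # faulty memory address, if known

    def __post_init__(self):
        if self.granularity is None and self.resource_type is None:
            raise ValueError("repair info needs a granularity or a resource type")


info = RepairInfo(resource_type="PCLS", count=8)
print(info)
```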
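In one example, when the quantity of the smallest-granularity repair resources meets the repair resource demand of the fault, the processing module determines repair information that uses the smallest-granularity repair resources to repair the fault. When the quantity of the smallest-granularity repair resources does not meet the demand, repair resources of the second smallest granularity may be used, and so on, until the fault has been repaired.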
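The smallest-first fallback described above can be sketched as a simple allocation loop. It assumes, for illustration only, that each outstanding fault consumes one repair resource regardless of type; the function name `plan_repair` and the error behavior are hypothetical.

```python
# Repair resource types from smallest to largest granularity, as ordered in the text.
REPAIR_ORDER = ["PCLS", "PPR", "ADDDC Sparing", "Device Sparing", "Rank Sparing"]


def plan_repair(needed: int, available: dict[str, int]) -> list[tuple[str, int]]:
    """Allocate repair resources, smallest granularity first.

    needed: number of repair resources the fault requires (assumed: one per fault).
    available: remaining count per repair resource type.
    Returns (resource_type, count) pairs in the order they should be used.
    """
    plan = []
    for rtype in REPAIR_ORDER:
        if needed == 0:
            break
        take = min(needed, available.get(rtype, 0))
        if take:
            plan.append((rtype, take))
            needed -= take
    if needed:
        raise RuntimeError(f"{needed} fault(s) left without a repair resource")
    return plan


print(plan_repair(3, {"PCLS": 16, "PPR": 8}))  # [('PCLS', 3)]
print(plan_repair(5, {"PCLS": 4, "PPR": 8}))   # [('PCLS', 4), ('PPR', 1)]
```

The loop only moves to the next granularity once the current one is exhausted, which is the property the method relies on to keep large-granularity repairs as a last resort.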
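FIG. 3 is a flowchart of a memory fault repair method according to an embodiment of this application. The method can be executed in the server shown in FIG. 1 to repair a memory fault. As shown in FIG. 3, the method includes at least steps S301 to S303.
In step S301, the processing module determines the repair information for the repair resources of the memory fault. Specifically, steps S201 to S206 shown in FIG. 2 may be used, and they are not described again here.
In step S302, the BIOS obtains the repair information.
After determining the repair information, the processing module sends it to the BIOS. The repair information indicates the repair policy the BIOS is to apply to the memory fault.
In step S303, the BIOS repairs the fault based on the repair information.
The processing module sends the repair information to the BIOS so that the BIOS repairs the fault based on it.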
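When the repair information is the granularity of a repair resource, the BIOS may use repair resources of that granularity repeatedly until the fault has been repaired.
When the repair information is the type of a repair resource, the BIOS may use repair resources of that type repeatedly until the fault has been repaired.
When the repair information includes the granularity and type of a repair resource, the BIOS may use repair resources of that granularity, or of that type, repeatedly until the fault has been repaired.
When the repair information includes the granularity and quantity of a repair resource, the BIOS may use the given quantity of repair resources of that granularity to repair the fault.
When the repair information includes the type and quantity of a repair resource, the BIOS may use the given quantity of repair resources of that type to repair the fault. For example, if the repair information sent by the processing module to the BIOS is "PCLS repair resource" and "8", the BIOS uses 8 PCLS repair resources to repair the fault.
When the repair information includes the granularity, type and quantity of a repair resource, the BIOS may use the given quantity of repair resources of that granularity, or of that type, to repair the fault.
When the repair information includes the faulty memory address, the BIOS may perform the repair on the memory at that address.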
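The difference between repair information with and without a quantity can be sketched from the BIOS side as follows. The dictionary layout of `info` and the function `apply_repair` are illustrative assumptions; actual BIOS repair flows are platform specific and are not modeled here.

```python
def apply_repair(info: dict, pending_faults: int, spares: dict) -> int:
    """Apply one repair instruction and return how many faults remain unrepaired.

    info: {"type": str, "count": int (optional)}; field names are illustrative.
    With a count, at most that many resources of the named type are used;
    without one, the named type is used until faults or spares run out.
    spares is updated in place.
    """
    rtype = info["type"]
    budget = info.get("count", pending_faults)
    used = min(budget, pending_faults, spares.get(rtype, 0))
    spares[rtype] = spares.get(rtype, 0) - used
    return pending_faults - used


spares = {"PCLS": 16}
left = apply_repair({"type": "PCLS", "count": 8}, 10, spares)
print(left, spares)  # 2 {'PCLS': 8}
left = apply_repair({"type": "PCLS"}, left, spares)
print(left, spares)  # 0 {'PCLS': 6}
```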
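In one example, the BIOS reports each fault to the processing module as soon as it is detected. The processing module receives the fault information, the BIOS checks the repair resource information in the memory, and the processing module determines repair information from that fault's information and the repair resource information and sends it to the BIOS, which repairs the fault accordingly. That is, when the BIOS detects multiple faults, the processing module and the BIOS interact multiple times: for each reported fault the processing module sends repair information specific to that fault, and the BIOS applies it to repair that fault.
In another example, the BIOS detects multiple faults (for example, 16) and reports them to the processing module together. The processing module receives the fault information and determines the fault type from it; for example, if 17 Cell faults are reported within one Bank, it determines that the fault type is a Bank fault. The BIOS checks the repair resource information in the memory, and the processing module determines repair information from the fault type and the repair resource information and sends it to the BIOS, which repairs the fault of that type accordingly. That is, when the BIOS detects multiple faults, it reports them to the processing module in one batch; the processing module determines the fault type from the reported information and then sends repair information for that fault type, and the BIOS applies it to repair the fault of that type.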
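Two examples below show how the memory fault repair method provided by the embodiments of this application works when memory faults actually occur.
In one example, the memory faults are reported in a burst. Assume that the system has 16 remaining bit-level PCLS self-healing resources and 2 remaining row-level online PPR self-healing resources, and that 17 Cell faults occur on the same Bank0 of the memory. The flow of the solution is then as shown in FIG. 4: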
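The BIOS checks the memory and obtains the fault information, for example that 17 Cell faults occurred on Bank0;
The BIOS reports the obtained memory fault information to the BMC;
The BMC collects the 17 fault reports from the BIOS and identifies a Bank fault;
The BIOS checks the self-healing resource information in the memory and sends it to the BMC;
The BMC collects the self-healing resource information reported by the BIOS and counts 16 remaining PCLS repair resources and 2 remaining online PPR repair resources;
Based on the fault type information and the self-healing resource information, the BMC performs the fault self-healing inference and decision, determines a repair policy that applies PCLS to the first 16 faults, and sends the policy to the BIOS;
Based on the same information, the BMC decides not to apply Bank-level ADDDC Sparing directly, determines a repair policy that applies online PPR to the 17th fault, and sends the policy to the BIOS;
After the BIOS has executed these repair policies, the repair of the memory faults is complete.
With the memory fault repair method provided by the embodiments of this application, when faults are reported in a burst and the memory fault is a Bank fault, Bank-level ADDDC Sparing is not used directly. Instead, the small-granularity PCLS and online PPR resources, which have almost no impact on performance, are used first to repair the faults already present in the Bank. This avoids the service-visible performance degradation caused by ADDDC Sparing and makes full use of the small-granularity repair resources.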
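The burst scenario above (17 Cell faults, 16 PCLS and 2 online PPR resources remaining) can be replayed with a small simulation of the BIOS and BMC roles. The class `Bios`, the function `bmc_decide` and the single ADDDC Sparing spare are assumptions made for illustration; the description does not state an ADDDC Sparing count for this example, and the sketch does not handle the case where all spares run out.

```python
from dataclasses import dataclass, field

ORDER = ["PCLS", "PPR", "ADDDC Sparing"]  # smallest to largest granularity


def bmc_decide(fault_count, spares):
    """BMC side: spread the faults over the finest spares first."""
    policy, left = [], fault_count
    for rtype in ORDER:
        take = min(left, spares.get(rtype, 0))
        if take:
            policy.append((rtype, take))
            left -= take
    return policy


@dataclass
class Bios:
    """BIOS side: reports spare counts and applies the policy it receives."""
    spares: dict = field(
        default_factory=lambda: {"PCLS": 16, "PPR": 2, "ADDDC Sparing": 1}
    )

    def apply(self, policy):
        for rtype, count in policy:
            if self.spares[rtype] < count:
                raise ValueError(f"not enough {rtype} spares")
            self.spares[rtype] -= count


bios = Bios()
policy = bmc_decide(17, dict(bios.spares))  # 17 Cell faults reported on Bank0
bios.apply(policy)
print(policy)       # [('PCLS', 16), ('PPR', 1)]
print(bios.spares)  # {'PCLS': 0, 'PPR': 1, 'ADDDC Sparing': 1}
```

The result matches the flow in the text: PCLS covers the first 16 faults, online PPR covers the 17th, and ADDDC Sparing is left untouched.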
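In another example, the memory faults are reported one after another. Assume that the system has 16 remaining bit-level PCLS self-healing resources and 2 remaining row-level online PPR self-healing resources. The flow of the solution is then as shown in FIG. 5:
The BIOS checks the memory and obtains the fault information, for example that 16 Cell faults occurred on Bank0;
The BIOS reports the obtained memory fault information to the BMC;
The BMC collects the 16 fault reports from the BIOS one by one;
The BIOS checks the self-healing resource information in the memory and sends it to the BMC;
The BMC collects the self-healing resource information reported by the BIOS and counts 16 remaining PCLS repair resources;
Based on the fault type information and the self-healing resource information, the BMC performs the fault self-healing inference and decision, gives priority to a repair policy that applies PCLS to the 16 faults in turn, and sends the policy to the BIOS;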
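The BIOS continues to check the memory and obtains further fault information, for example that 2 more Cell faults occurred on Bank0;
The BIOS reports the obtained memory fault information to the BMC;
The BMC collects the 2 fault reports from the BIOS one by one;
The BMC collects the self-healing resource information reported by the BIOS and counts 2 remaining online PPR repair resources;
Based on the fault type information and the self-healing resource information, the BMC performs the fault self-healing inference and decision, gives priority to a repair policy that applies online PPR to the 2 faults in turn, and sends the policy to the BIOS;
The BMC continues to collect, one by one, the reports from the BIOS about further memory faults;
The BMC collects information on other self-healing resources reported by the BIOS and finds that other self-healing resources (for example ADDDC Sparing) are sufficient;
Based on the fault type information and the self-healing resource information, the BMC performs the fault self-healing inference and decision, determines a self-healing policy that matches the other fault (for example a Bank fault), and sends the repair policy (for example ADDDC Sparing) to the BIOS;
After the BIOS has executed these repair policies, the repair of the memory faults is complete.
With the memory fault repair method provided by the embodiments of this application, when memory faults occur one after another, fine-grained self-healing resources, which have almost no impact on performance, are used first. This avoids the service-visible performance degradation caused by directly using large-granularity self-healing resources and makes full use of the fine-grained resources.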
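The sequential scenario differs from the burst case in that a decision is made for every reported fault against the spares remaining at that moment. A minimal sketch, assuming one repair resource per fault and a single ADDDC Sparing spare (the text only says the other resources are sufficient), with the hypothetical class name `SequentialAllocator`:

```python
ORDER = ["PCLS", "PPR", "ADDDC Sparing"]  # smallest to largest granularity


class SequentialAllocator:
    """Decides one repair per reported fault, finest remaining resource first."""

    def __init__(self, spares):
        self.spares = dict(spares)

    def on_fault(self):
        """Return the repair resource type chosen for the next fault, or None."""
        for rtype in ORDER:
            if self.spares.get(rtype, 0) > 0:
                self.spares[rtype] -= 1
                return rtype
        return None  # no repair resource left


alloc = SequentialAllocator({"PCLS": 16, "PPR": 2, "ADDDC Sparing": 1})
decisions = [alloc.on_fault() for _ in range(19)]
print(decisions[:16] == ["PCLS"] * 16)  # True: first 16 faults use PCLS
print(decisions[16:18])                 # ['PPR', 'PPR']
print(decisions[18])                    # ADDDC Sparing
```

Faults 1 to 16 consume PCLS, faults 17 and 18 consume online PPR, and only the 19th falls through to ADDDC Sparing, mirroring the order of the steps in FIG. 5.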
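Based on the same concept as the foregoing method embodiments, the embodiments of this application also provide a memory fault repair apparatus 600, which includes units or modules for implementing the steps of the memory fault repair methods shown in FIGS. 2 to 5.
When the computing device has no BMC, for example when the server has no BMC, the processing module is the processor of the server, and the steps described above as executed by the BMC can be executed by that processor. When the processing module is the processor, the implementation of the memory fault repair method follows the description above and, for brevity, is not repeated here.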
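FIG. 6 is a schematic structural diagram of an apparatus for determining the repair granularity for a memory fault according to an embodiment of this application. The apparatus may be implemented by the processing module. As shown in FIG. 6, the apparatus 600 for determining the repair resource granularity for a memory fault includes at least:
a processing module 601;
wherein the processing module 601 includes:
a first obtaining unit 6011, configured to obtain fault information of the memory, the fault information including the type of a fault;
a first determining unit 6012, configured to determine the type of the fault;
a second obtaining unit 6013, configured to obtain repair resource information, the repair resource information including the types of the repair resources in a repair resource set, the repair resources being capable of repairing the fault;
a second determining unit 6014, configured to determine repair information based on the fault type and the repair resource information, the repair information including at least one of the type of a first repair resource or the granularity of the first repair resource, the first repair resource being the repair resource with the smallest granularity in the repair resource set.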
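In one possible implementation, the repair resource information further includes the quantities of the repair resources in the repair resource set, and the repair information further includes the quantity of the first repair resource.
In another possible implementation, the second determining unit 6014 is further configured so that, when the quantity of the first repair resource does not meet the repair resource demand of the fault, the repair information further includes at least one of the type of a second repair resource or the granularity of the second repair resource, the second repair resource being the repair resource with the second smallest granularity in the repair resource set.
In another possible implementation, the repair information further includes the quantity of the second repair resource.
In another possible implementation, the fault information includes a first type and a quantity of the fault, and the first determining unit 6012 is specifically configured to:
determine a second type of the fault based on the first type and the quantity of the fault.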
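In another possible implementation, the processing module is a processor or a BMC chip. The memory fault repair apparatus 600 according to the embodiments of this application may correspond to the entity executing the methods described in the embodiments, and the foregoing and other operations and/or functions of the modules in the apparatus 600 respectively implement the corresponding flows of the methods in FIGS. 2 to 5. For brevity, they are not described again here.
Note that the method for determining the repair granularity for a memory fault provided by the embodiments of this application may be executed by the processor of the server. For example, the processor invokes program code that includes one or more software modules, and the computing device implements the method by having the processor execute the program code.
The memory fault repair method provided by the embodiments of this application may also be executed by the out-of-band management BMC in FIG. 1. For example, the BMC stores program code corresponding to the method for determining the repair granularity for a memory fault, the program code includes one or more software modules, and the computing device implements the method by having the BMC execute the program code.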
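An embodiment of this application provides a computer-readable storage medium storing a computer program; when the computer program is executed by a processor, the method described above is implemented.
An embodiment of this application provides a chip that includes at least one processor and an interface. The at least one processor determines program instructions or data through the interface and is configured to execute the program instructions to implement the method described above.
An embodiment of this application provides a computer program or computer program product that includes instructions which, when executed, cause a computer to perform the method described above.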
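A person of ordinary skill in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software or a combination of the two. To illustrate the interchangeability of hardware and software clearly, the composition and steps of each example have been described above generally in terms of function. Whether these functions are performed in hardware or software depends on the specific application and the design constraints of the technical solution. A person of ordinary skill in the art may implement the described functions in different ways for each specific application, but such implementations should not be considered beyond the scope of this application.
The steps of the methods or algorithms described in connection with the embodiments disclosed herein can be implemented in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, a register, a hard disk, a removable disk, a CD-ROM or any other form of storage medium known in the art.
The specific implementations described above further explain the purpose, technical solutions and beneficial effects of this application in detail. It should be understood that they are only specific implementations of this application and are not intended to limit its scope of protection. Any modification, equivalent replacement or improvement made within the spirit and principles of this application shall fall within its scope of protection.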

Claims (17)

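  1. A method for determining the repair resource granularity for a memory fault, wherein the method is performed by a processing module and comprises:
    obtaining, by the processing module, fault information of a memory, the fault information comprising a type of a fault;
    determining, by the processing module, the type of the fault;
    obtaining, by the processing module, repair resource information, the repair resource information comprising the types of the repair resources in a repair resource set;
    determining, by the processing module, repair information based on the fault type and the repair resource information, the repair information comprising at least one of a type of a first repair resource or a granularity of the first repair resource, the first repair resource being the repair resource with the smallest granularity in the repair resource set.
  2. The method according to claim 1, wherein the repair resource information further comprises the quantities of the repair resources in the repair resource set, and the repair information further comprises the quantity of the first repair resource.
  3. The method according to claim 2, wherein the method further comprises: when the quantity of the first repair resource does not meet the repair resource demand of the fault, the repair information further comprises at least one of a type of a second repair resource or a granularity of the second repair resource, the second repair resource being the repair resource with the second smallest granularity in the repair resource set.
  4. The method according to claim 3, wherein the repair information further comprises the quantity of the second repair resource.
  5. The method according to any one of claims 1 to 4, wherein the fault information comprises a first type and a quantity of the fault, and the determining, by the processing module, the type of the fault comprises:
    determining, by the processing module, a second type of the fault based on the first type and the quantity of the fault.
  6. The method according to any one of claims 1 to 5, wherein the processing module is a processor or a BMC chip.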
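  7. An apparatus for determining the repair resource granularity for a memory fault, comprising a processing module;
    wherein the processing module comprises:
    a first obtaining unit, configured to obtain fault information of a memory, the fault information comprising a type of a fault;
    a first determining unit, configured to determine the type of the fault;
    a second obtaining unit, configured to obtain repair resource information, the repair resource information comprising the types of the repair resources in a repair resource set, the repair resources being capable of repairing the fault;
    a second determining unit, configured to determine repair information based on the fault type and the repair resource information, the repair information comprising at least one of a type of a first repair resource or a granularity of the first repair resource, the first repair resource being the repair resource with the smallest granularity in the repair resource set.
  8. The apparatus according to claim 7, wherein the repair resource information further comprises the quantities of the repair resources in the repair resource set, and the repair information further comprises the quantity of the first repair resource.
  9. The apparatus according to claim 8, wherein the second determining unit is further configured so that, when the quantity of the first repair resource does not meet the repair resource demand of the fault, the repair information further comprises at least one of a type of a second repair resource or a granularity of the second repair resource, the second repair resource being the repair resource with the second smallest granularity in the repair resource set.
  10. The apparatus according to claim 9, wherein
    the repair information further comprises the quantity of the second repair resource.
  11. The apparatus according to any one of claims 7 to 10, wherein
    the fault information comprises a first type and a quantity of the fault, and the first determining unit is specifically configured to:
    determine a second type of the fault based on the first type and the quantity of the fault.
  12. The apparatus according to any one of claims 7 to 11, wherein the processing module is a processor or a BMC chip.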
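  13. A chip, comprising at least one processor and a communication interface, wherein the processor is configured to perform the method according to any one of claims 1 to 6.
  14. A computing device, comprising a processor and a memory, wherein the processor is configured to perform the method according to any one of claims 1 to 6, so as to repair a fault in the memory.
  15. A computing device, comprising a processor, a BMC and a memory, wherein either the processor or the BMC is configured to perform the method according to any one of claims 1 to 6, so as to repair a fault in the memory.
  16. A computing device, comprising a processing module, a BIOS and a memory, wherein:
    the BIOS is configured to obtain first fault information and first repair resource information of the memory;
    the processing module is configured to: obtain second fault information of the memory, the second fault information comprising a type of a fault; determine the type of the fault; obtain second repair resource information, the second repair resource information comprising the types of the repair resources in a repair resource set, the repair resources being capable of repairing the fault; and determine repair information based on the fault type and the second repair resource information, the repair information comprising at least one of a type of a first repair resource or a granularity of the first repair resource, the first repair resource being the repair resource with the smallest granularity in the repair resource set;
    the BIOS is further configured to obtain the repair information and to repair the fault based on the repair information.
  17. The computing device according to claim 16, wherein the processing module is a processor or a BMC chip.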
PCT/CN2023/096640 2022-07-08 2023-05-26 Method and apparatus for determining the repair resource granularity for a memory fault WO2024007765A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210797526.1 2022-07-08
CN202210797526.1A CN115168087B (zh) 2022-07-08 2022-07-08 Method and apparatus for determining the repair resource granularity for a memory fault

Publications (1)

Publication Number Publication Date
WO2024007765A1 true WO2024007765A1 (zh) 2024-01-11

Family

ID=83491492

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/096640 WO2024007765A1 (zh) 2022-07-08 2023-05-26 Method and apparatus for determining the repair resource granularity for a memory fault

Country Status (2)

Country Link
CN (2) CN118295838A (zh)
WO (1) WO2024007765A1 (zh)

Also Published As

Publication number Publication date
CN115168087A (zh) 2022-10-11
CN115168087B (zh) 2024-03-19
CN118295838A (zh) 2024-07-05


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23834539

Country of ref document: EP

Kind code of ref document: A1