WO2023193396A1 - Memory fault handling method, device, and computer-readable storage medium - Google Patents

Memory fault handling method, device, and computer-readable storage medium

Info

Publication number
WO2023193396A1
Authority
WO
WIPO (PCT)
Prior art keywords
memory
fault
memory address
physical memory
fault information
Prior art date
Application number
PCT/CN2022/115340
Other languages
English (en)
French (fr)
Inventor
张玉峰
Original Assignee
苏州浪潮智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏州浪潮智能科技有限公司
Publication of WO2023193396A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/073 Error or fault processing not based on redundancy, the processing taking place in a memory management context, e.g. virtual memory or cache management
    • G06F11/0793 Remedial or corrective actions (under G06F11/0703)

Definitions

  • The present application relates to the field of computer technology, and in particular to a memory fault handling method, device, and computer-readable storage medium.
  • Server memory, i.e. random access memory (RAM), has some distinctive technologies, such as error checking and correction (Error Correcting Code, ECC), and therefore offers very high stability and error-correction performance.
  • Modern operating systems do not access a server's physical memory directly, but through an intermediate layer that the operating system calls virtual memory (VM); through the VM, the operating system accesses the physical memory that the VM maps to. The physical memory address that a virtual memory address maps to can also be changed, so that the operating system accesses the new physical memory address instead.
  • In the operation of servers, hardware fault diagnosis and fault prediction are technical difficulties in the field of server operation and maintenance. Server failures caused by memory account for the highest proportion of all failures, yet there is currently no effective solution for diagnosing server memory failures.
  • An embodiment of the present application provides a memory fault handling method, whose steps are listed as S10 to S16 below.
  • An embodiment of the present application also provides a memory fault handling device, including:
  • a monitoring module, used to monitor the fault information of the server's memory to confirm that the memory has failed;
  • a first acquisition module, used to acquire the redundant space of the memory;
  • a judgment module, used to judge whether the redundant space is less than the first threshold, and to trigger the second acquisition module in response to the redundant space being not less than the first threshold;
  • a second acquisition module, used to obtain the faulty physical memory address and its corresponding virtual memory address based on the fault information;
  • a redundancy module, used to isolate the faulty physical memory address through the memory redundancy mechanism and obtain a new physical memory address, where the space of the new physical memory address is equal in size to the space of the faulty physical memory address;
  • a data backup module, used to back up the data in the faulty physical memory address corresponding to the virtual memory address; and
  • a mapping module, used to map the virtual memory address to the new physical memory address, for migrating the data to the new physical memory address.
  • An embodiment of the present application also provides yet another memory fault handling device, including:
  • a memory, used to store computer-readable instructions; and
  • a processor, configured to execute the computer-readable instructions to implement the memory fault handling method in any embodiment.
  • Embodiments of the present application also provide one or more non-volatile computer-readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to execute the steps of the memory fault handling method in any embodiment.
  • Figure 1 is a flow chart of a memory fault handling method according to one or more embodiments
  • Figure 2 is a flow chart of yet another memory fault handling method according to one or more embodiments.
  • Figure 3 is a schematic structural diagram of a memory fault handling device according to one or more embodiments.
  • Figure 4 is a schematic structural diagram of yet another memory fault handling device according to one or more embodiments.
  • One of the cores of the embodiments of the present application is to provide reliable memory failure processing methods, devices and computer-readable storage media.
  • Figure 1 is a flow chart of a memory fault handling method provided by an embodiment of the present application. As shown in Figure 1, the memory fault handling method may include:
  • Step S10: Monitor the fault information of the server's memory to confirm that the memory has failed.
  • Step S11: Obtain the redundant space of the memory.
  • Step S12: Determine whether the redundant space is smaller than the first threshold; if not, proceed to step S13.
  • Step S13: Obtain the faulty physical memory address and its corresponding virtual memory address according to the fault information.
  • Step S14: Isolate the faulty physical memory address through the memory redundancy mechanism, and obtain a new physical memory address, where the space of the new physical memory address is equal in size to the space of the faulty physical memory address.
  • Step S15: Back up the data in the faulty physical memory address corresponding to the virtual memory address.
  • Step S16: Map the virtual memory address to the new physical memory address, for migrating the data to the new physical memory address.
  • Memory may fail for a number of reasons while the server is running.
  • Memory faults fall into two categories: correctable errors (CE) and uncorrectable errors (UCE).
  • When a CE occurs, the memory can correct it automatically through the ECC mechanism, but excessive or frequent CEs often foreshadow a UCE; a UCE is generally accompanied by server downtime and is a serious server failure. A CE therefore has to be handled according to the situation once it is discovered. In this embodiment, the fault information of the memory is monitored first; CEs are detected as fault information, and a strategy matched to how the CEs occur is adopted to avoid a UCE. This embodiment places no restriction on how fault information is monitored; it depends on the specific implementation.
  • After the memory failure information is obtained, the redundant space of the memory is obtained. Memory manufacturers use memory space redundancy to prevent damage to part of the physical space from making the memory unusable. For example, a memory chip with a nominal capacity of 128M may actually have 130M of usable space; the extra 2M is the redundant space of the memory. Before the memory leaves the factory, the manufacturer tests it comprehensively, finds the damaged regions of the normal physical memory, and redirects each damaged region to a region of the same size in the redundant physical memory space through memory firmware address encoding. This ensures that all 128M of space can be used.
  • To determine whether the redundant space of the failed memory is enough to provide redundancy for the faulty memory, the redundant space is compared with the first threshold once it has been obtained. This embodiment places no restriction on the first threshold; it depends on the specific implementation. If the redundant space is not less than the first threshold, the redundant space of the memory is deemed sufficient and the subsequent redundancy operations can be performed.
  • A modern operating system generally does not access real physical addresses directly; it manages physical memory through virtual memory (VM). A program's memory accesses are translated by the memory management unit (MMU) of the central processing unit (CPU), and there is a mapping relationship between VM addresses and real physical addresses.
  • The operating system divides the memory into multiple spaces and assigns them to different programs.
  • Application programs use the memory through the virtual memory address space. Therefore, the faulty physical memory address of the faulty memory also has a corresponding virtual memory address.
  • Memory redundancy is implemented through Post Package Repair (PPR), a memory repair technique.
  • PPR can replace damaged rows in the memory with redundant rows to achieve memory redundancy; the space of the new physical memory address is equal in size to the space of the faulty physical memory address, so that the data of the faulty physical memory address can be stored.
  • The data in the faulty physical memory address corresponding to the obtained virtual memory address is backed up to prevent that data from being lost.
  • Finally, the virtual memory address is mapped to the new physical memory address so that the backed-up data can be migrated to the new physical memory address, which completes the handling of the faulty memory.
  • In this embodiment, the fault information of the server's memory is monitored, the redundant space of the memory is obtained, and it is judged whether the redundant space is less than the first threshold; if not, the faulty physical memory address and its corresponding virtual memory address are obtained according to the fault information; the faulty physical memory address is isolated through the memory redundancy mechanism and a new physical memory address of equal size is obtained; the data in the faulty physical memory address is backed up, and the virtual memory address is mapped to the new physical memory address so that the data can be migrated there. The solution thus permanently isolates the faulty memory through the memory redundancy mechanism.
  • At the same time, while the operating system is running, the faulty memory can be isolated at the software level by changing the virtual memory mapping, without losing the data in the faulty memory. This can effectively reduce the downtime rate caused by memory failure, reduce unnecessary memory replacement, and greatly reduce the cost of operation and maintenance.
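  • In some embodiments, monitoring the fault information of the server's memory includes:
  • monitoring the fault information of the memory through MCA technology, recording the fault information in the interrupt mask control register, and generating a fault log.
  • MCA (Machine Check Architecture) is a hardware error detection architecture. It detects hardware errors such as system bus errors, error correcting code (ECC) errors, and parity errors, and is implemented through a number of model-specific registers (MSRs). In the MCA architecture, dual in-line memory modules (DIMMs) serve as the memory.
  • The interrupt mask control register (IMC) is a group of registers in the MCA architecture that can store the detailed fault information.
  • In this embodiment, the fault information of the memory is monitored through MCA technology, the fault information is recorded in the interrupt mask control register, and a fault log is generated, so that the fault information is both monitored and saved.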
  • In some embodiments, monitoring the fault information of the server's memory includes determining whether the number of fault information entries within a first preset time is greater than a second threshold.
  • While the fault information of the server's memory is being monitored, some memory faults that can be corrected or that are within the allowable range may be detected, so the memory cannot be judged to have failed merely because fault information appears. Specifically, it is judged whether the number of fault information entries within the first preset time is greater than the second threshold, that is, whether the number of faults within a time period exceeds the allowable number.
  • The number of fault information entries is decremented with the second preset time as its period, and the second preset time is less than the first preset time. In other words, when only correctable or tolerable faults occur, the count is decremented periodically and eventually returns to zero, without triggering memory fault handling.
  • This embodiment places no limit on the second threshold; it depends on the specific implementation.
  • Nor is there any limit on the first preset time or the second preset time; they depend on the specific implementation.
  • In some embodiments, obtaining the faulty physical memory address and its corresponding virtual memory address based on the fault information includes parsing the interrupt mask control register and querying the memory management unit.
  • The faulty physical memory address can be obtained by parsing the detailed fault information in the IMC.
  • In a Linux system, the operating system also records the detailed information of the error in the fault log MCELOG.
  • The operating system manages the physical memory through virtual memory. Therefore, based on the fault log, the virtual memory address corresponding to the faulty physical memory address can be obtained through the MMU address translation unit.
  • Obtaining the faulty physical memory address and the virtual memory address in this way makes it possible to change the mapped address later.
  • FIG. 2 is a flow chart of yet another memory fault processing method provided by an embodiment of the present application. As shown in Figure 2, after mapping the virtual memory address to the new physical memory address, the following steps can be performed:
  • Step S17 Mark the faulty physical memory address.
  • Step S18 Trigger a memory fault alarm.
  • the failed physical memory address is marked in the operating system kernel to ensure that subsequent applications will not be allocated to the physical memory and prevent memory failures from occurring again.
  • a memory fault alarm is triggered to prompt maintenance personnel to perform maintenance on the faulty memory.
  • the memory fault processing method is described in detail, and this application also provides corresponding embodiments of a memory fault processing device. It should be noted that this application describes the embodiments of the device part from two perspectives, one is based on the functional module perspective, and the other is based on the hardware structure perspective.
  • FIG 3 is a schematic structural diagram of a memory fault handling device provided by an embodiment of the present application. As shown in Figure 3, the memory fault handling device includes:
  • the monitoring module 10 is used to monitor the fault information of the memory of the server to confirm that the memory has failed.
  • the first acquisition module 11 is used to acquire the redundant space of the memory.
  • the judgment module 12 is used to judge whether the redundant space is less than the first threshold, and triggers the second acquisition module in response to the redundant space being not less than the first threshold.
  • the second acquisition module 13 is used to acquire the faulty physical memory address and its corresponding virtual memory address according to the fault information.
  • the redundancy module 14 is used to isolate the faulty physical memory address through the memory redundancy mechanism and obtain a new physical memory address; wherein the space of the new physical memory address is equal to the space size of the faulty physical memory address.
  • the data backup module 15 is used to back up the data in the faulty physical memory address corresponding to the virtual memory address.
  • the mapping module 16 is used to map the virtual memory address to a new physical memory address for migrating data to the new physical memory address.
  • Part of the embodiments of the device shown in Figure 3 correspond to the embodiments of the method part. Therefore, for the embodiments of the device part, please refer to the description of the embodiment of the method part, and will not be described again here.
  • Each module in the device shown in Figure 3 above can be implemented in whole or in part by software, hardware and combinations thereof.
  • Each of the above modules may be embedded in or independent of the processor of the computer device in the form of hardware, or may be stored in the memory of the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.
  • FIG 4 is a schematic structural diagram of another memory fault handling device provided by an embodiment of the present application. As shown in Figure 4, the memory fault handling device includes:
  • Memory 20 for storing computer readable instructions.
  • the processor 21 is configured to execute computer-readable instructions to implement the memory fault handling method mentioned in any embodiment.
  • In some optional implementations, the computer-readable instructions, when executed by the one or more processors 21, cause the one or more processors 21 to perform the steps of the memory fault handling method in any embodiment.
  • The memory fault handling device provided by the embodiment of Figure 4 may include but is not limited to smartphones, tablet computers, notebook computers, or desktop computers.
  • The processor 21 may include one or more processing cores, such as a 4-core processor or an 8-core processor.
  • The processor 21 can be implemented in at least one hardware form among a digital signal processor (DSP), a field-programmable gate array (FPGA), and a programmable logic array (PLA).
  • The processor 21 may also include a main processor and a co-processor.
  • The main processor is a processor used to process data in the wake-up state, also called a central processing unit (CPU); the co-processor is a low-power processor used to process data in standby mode.
  • In some embodiments, the processor 21 may be integrated with a graphics processing unit (GPU), which is responsible for rendering and drawing the content to be shown on the display screen.
  • In some embodiments, the processor 21 may also include an artificial intelligence (AI) processor, which is used to process computing operations related to machine learning.
  • Memory 20 may include one or more computer-readable storage media, which may be non-transitory.
  • the memory 20 may also include high-speed random access memory, and non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices.
  • the memory 20 is at least used to store the following computer readable instructions 201. After the computer readable instructions are loaded and executed by the processor 21, the relevant steps of the memory fault handling method disclosed in any of the foregoing embodiments can be implemented.
  • the resources stored in the memory 20 may also include the operating system 202, data 203, etc., and the storage method may be short-term storage or permanent storage.
  • the operating system 202 may include Windows, Unix, Linux, etc.
  • the data 203 may include but is not limited to data related to the memory fault handling method.
  • the memory fault handling device may also include a display screen 22 , an input/output interface 23 , a communication interface 24 , a power supply 25 and a communication bus 26 .
  • FIG. 4 does not constitute a limitation on the memory fault handling device, and may include more or fewer components than shown in the figure.
  • This application also provides a corresponding embodiment of a computer-readable storage medium.
  • Computer-readable instructions are stored on the computer-readable storage medium. When the computer-readable instructions are executed by the processor, the steps recorded in the above method embodiments are implemented.
  • Embodiments of the present application also provide one or more non-volatile computer-readable storage media storing computer-readable instructions. When executed by one or more processors, the computer-readable instructions cause one or more processors to Perform the steps of the memory fault handling method in any embodiment.
  • If the methods in the above embodiments are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium.
  • On this understanding, the technical solution of the present application in essence, or the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and executes all or part of the steps of the methods of the various embodiments of this application.
  • The aforementioned storage media include: USB flash drives, removable hard disks, read-only memory (ROM), random access memory (RAM), magnetic disks, optical disks, and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Techniques For Improving Reliability Of Storages (AREA)

Abstract

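Embodiments of the present application disclose a memory fault handling method. Fault information of a server's memory is monitored and the redundant space of the memory is obtained; in response to the redundant space being not less than a first threshold, a faulty physical memory address and its corresponding virtual memory address are obtained according to the fault information; the faulty physical memory address is isolated through the redundancy mechanism of the memory, and a new physical memory address is obtained; the data in the faulty physical memory address is backed up, and the virtual memory address is mapped to the new physical memory address so that the data can be migrated to the new physical memory address.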

Description

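Memory fault handling method, device, and computer-readable storage medium
Cross-reference to related applications
This application claims priority to the Chinese patent application filed with the China Patent Office on April 8, 2022, with application number CN202210362920.2 and entitled "一种内存故障处理方法、装置及计算机可读存储介质" (Memory fault handling method, device, and computer-readable storage medium), the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to the field of computer technology, and in particular to a memory fault handling method, device, and computer-readable storage medium.
Background
Server memory, i.e. random access memory (RAM), has some distinctive technologies, such as error checking and correction (Error Correcting Code, ECC), and therefore offers very high stability and error-correction performance. Modern operating systems do not access a server's physical memory directly, but through an intermediate layer that the operating system calls virtual memory (VM); through the VM, the operating system accesses the physical memory that the VM maps to. The physical memory address that a virtual memory address maps to can also be changed, so that the operating system accesses the new physical memory address instead.
In the operation of servers, hardware fault diagnosis and fault prediction are technical difficulties in the field of server operation and maintenance. Server failures caused by memory account for the highest proportion of all failures, yet there is currently no effective solution for diagnosing server memory faults.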
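Summary
An embodiment of the present application provides a memory fault handling method, including:
monitoring fault information of a server's memory to confirm that the memory has failed;
obtaining the redundant space of the memory;
determining whether the redundant space is less than a first threshold;
in response to the redundant space being not less than the first threshold, obtaining a faulty physical memory address and its corresponding virtual memory address according to the fault information;
isolating the faulty physical memory address through the redundancy mechanism of the memory, and obtaining a new physical memory address, where the space of the new physical memory address is equal in size to the space of the faulty physical memory address;
backing up the data in the faulty physical memory address corresponding to the virtual memory address; and
mapping the virtual memory address to the new physical memory address, for migrating the data to the new physical memory address.
An embodiment of the present application also provides a memory fault handling device, including:
a monitoring module, used to monitor fault information of a server's memory to confirm that the memory has failed;
a first acquisition module, used to obtain the redundant space of the memory;
a judgment module, used to determine whether the redundant space is less than a first threshold and, in response to the redundant space being not less than the first threshold, to trigger the second acquisition module;
a second acquisition module, used to obtain a faulty physical memory address and its corresponding virtual memory address according to the fault information;
a redundancy module, used to isolate the faulty physical memory address through the redundancy mechanism of the memory and obtain a new physical memory address, where the space of the new physical memory address is equal in size to the space of the faulty physical memory address;
a data backup module, used to back up the data in the faulty physical memory address corresponding to the virtual memory address; and
a mapping module, used to map the virtual memory address to the new physical memory address, for migrating the data to the new physical memory address.
An embodiment of the present application also provides yet another memory fault handling device, including:
a memory, used to store computer-readable instructions; and
a processor, used to execute the computer-readable instructions to implement the memory fault handling method in any embodiment.
Embodiments of the present application also provide one or more non-volatile computer-readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the memory fault handling method in any embodiment.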
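Brief description of the drawings
To explain the embodiments of the present application more clearly, the drawings needed in the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Figure 1 is a flow chart of a memory fault handling method according to one or more embodiments;
Figure 2 is a flow chart of yet another memory fault handling method according to one or more embodiments;
Figure 3 is a schematic structural diagram of a memory fault handling device according to one or more embodiments;
Figure 4 is a schematic structural diagram of yet another memory fault handling device according to one or more embodiments.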
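Detailed description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments herein without creative effort fall within the scope of protection of the present application.
One of the cores of the embodiments of the present application is to provide a reliable memory fault handling method, device, and computer-readable storage medium.
To help those skilled in the art better understand the solutions of the embodiments, the embodiments are described in further detail below with reference to the drawings and specific implementations.
In the operation of servers, hardware fault diagnosis and fault prediction are both a pain point and a technical difficulty in the field of server operation and maintenance. Server failures caused by memory account for the highest proportion of all failures, so if server memory faults can be diagnosed effectively and isolated technically, server failures can be reduced effectively. An embodiment of the present application provides a memory fault handling method. Figure 1 is a flow chart of a memory fault handling method provided by an embodiment of the present application. As shown in Figure 1, the memory fault handling method may include:
Step S10: Monitor the fault information of the server's memory to confirm that the memory has failed.
Step S11: Obtain the redundant space of the memory.
Step S12: Determine whether the redundant space is less than the first threshold; if not, proceed to step S13.
Step S13: Obtain the faulty physical memory address and its corresponding virtual memory address according to the fault information.
Step S14: Isolate the faulty physical memory address through the redundancy mechanism of the memory, and obtain a new physical memory address, where the space of the new physical memory address is equal in size to the space of the faulty physical memory address.
Step S15: Back up the data in the faulty physical memory address corresponding to the virtual memory address.
Step S16: Map the virtual memory address to the new physical memory address, for migrating the data to the new physical memory address.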
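Steps S10 to S16 can be sketched as a small simulation. This is only an illustration under stated assumptions, not the implementation described in the application: `SimulatedMemory`, `handle_memory_fault`, the dictionary-based page table, and measuring redundant space as a count of spare regions are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SimulatedMemory:
    """Toy model (hypothetical): data regions, spare regions, and a page table."""
    pages: dict        # physical address -> stored bytes
    spare: list        # unused redundant physical addresses
    page_table: dict   # virtual address -> physical address
    isolated: set = field(default_factory=set)


def handle_memory_fault(mem, faulty_phys, first_threshold):
    """Run S11 to S16 for one faulty address; return the new address or None."""
    # S11/S12: obtain the redundant space and compare it with the first threshold.
    if len(mem.spare) < first_threshold:
        return None
    # S13: find the virtual address mapped to the faulty physical address
    # (this sketch assumes such a mapping exists).
    virt = next(v for v, p in mem.page_table.items() if p == faulty_phys)
    # S14: isolate the faulty address and take an equally sized spare region.
    mem.isolated.add(faulty_phys)
    new_phys = mem.spare.pop(0)
    # S15: back up the data held at the faulty address.
    backup = bytes(mem.pages[faulty_phys])
    # S16: remap the virtual address and migrate the backed-up data.
    mem.page_table[virt] = new_phys
    mem.pages[new_phys] = backup
    return new_phys
```

The virtual address seen by the application never changes in this sketch; only the physical address behind it does, which is the property the method relies on.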
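It can be understood that memory may fail for various reasons while the server is running. Memory faults fall into two categories: correctable errors (corrected error, CE) and uncorrectable errors (uncorrected error, UCE). When a CE occurs, the memory can correct it automatically through the ECC mechanism, but excessive or frequent CEs often foreshadow a UCE; if a UCE occurs, it is generally accompanied by server downtime and counts as a serious server failure. A CE therefore has to be handled according to the situation once it is discovered. In this embodiment, the fault information of the memory is monitored first; CEs are detected as fault information, and a strategy matched to how the CEs occur is adopted to avoid a UCE. This embodiment places no restriction on how fault information is monitored; it depends on the specific implementation.
After the information that the memory has failed is obtained, the redundant space of the memory is obtained. It can be understood that memory manufacturers use memory space redundancy to prevent damage to part of the physical space from making the memory unusable. For example, a memory chip with a nominal capacity of 128M may actually have 130M of usable space; the extra 2M is the redundant space of the memory. Before the memory leaves the factory, the manufacturer tests it comprehensively, finds the damaged regions of the normal physical memory, and then redirects each damaged physical memory region to a region of the same size in the redundant physical memory space by means of memory firmware address encoding. This guarantees that all 128M of space is usable. Note that if the damaged space exceeds 2M, the redundancy is no longer sufficient and the memory must be discarded. Therefore, to determine whether the redundant space of the failed memory can provide redundancy for the faulty memory, the redundant space is compared with the first threshold once it has been obtained. This embodiment places no restriction on the first threshold; it depends on the specific implementation. If the redundant space is not less than the first threshold, the redundant space of the memory is deemed sufficient and the subsequent redundancy operations can be performed.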
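The 128M/130M example and the threshold comparison can be worked through in a few lines. The function names and the `already_remapped_bytes` parameter are assumptions made for illustration; the application does not say how the redundant space is measured or what value the first threshold takes.

```python
MIB = 1024 * 1024


def redundant_space(actual_bytes, nominal_bytes, already_remapped_bytes=0):
    """Spare capacity left after earlier repairs (hypothetical accounting)."""
    return actual_bytes - nominal_bytes - already_remapped_bytes


def redundancy_sufficient(redundant_bytes, first_threshold_bytes):
    """Step S12: repair proceeds only if the space is not less than the threshold."""
    return redundant_bytes >= first_threshold_bytes
```

With a 130M chip sold as 128M, `redundant_space(130 * MIB, 128 * MIB)` gives 2M of spare capacity; once that 2M has been consumed by earlier remapping, the check fails and the memory has to be replaced.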
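Once the redundant space of the memory has been deemed sufficient, the faulty physical memory address and its corresponding virtual memory address are obtained according to the fault information. Note that a modern operating system generally does not access real physical addresses directly; it manages physical memory through a mechanism called virtual memory (VM). Specifically, the addresses a program accesses are not real physical memory addresses; they are translated by the operating system through the memory management unit (MMU) of the central processing unit (CPU), and there is a mapping relationship between VM addresses and real physical addresses. The operating system divides the memory into multiple spaces and assigns them to different programs, and application programs use the memory through the virtual memory address space. Therefore, the faulty physical memory address of the faulty memory also has a corresponding virtual memory address. This embodiment places no restriction on how the faulty memory address and the virtual memory address are obtained; it depends on the specific implementation.
After the faulty physical memory address is obtained, it has to be isolated through the redundancy mechanism of the memory, and a new physical memory address has to be obtained. Specifically, memory redundancy is implemented through Post Package Repair (PPR). PPR is a memory repair technique that can replace damaged rows in the memory with redundant rows, thereby achieving memory redundancy. The space of the new physical memory address is equal in size to the space of the faulty physical memory address, so that it can hold the data of the faulty physical memory address. At the same time, the data in the faulty physical memory address corresponding to the obtained virtual memory address is backed up to prevent that data from being lost. Finally, the virtual memory address is mapped to the new physical memory address so that the backed-up data can be migrated to the new physical memory address, which completes the handling of the faulty memory.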
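The row replacement performed by PPR can be pictured with a toy model in which a repair table redirects a faulty row to a spare row. The `Bank` class and its methods are hypothetical; real PPR is carried out by the memory controller and firmware on the DRAM device, not by application code.

```python
class Bank:
    """Toy DRAM bank (hypothetical): normal rows followed by spare rows."""

    def __init__(self, rows, spare_rows):
        self.rows = rows
        # Spare rows are numbered after the normal rows in this model.
        self.free_spares = list(range(rows, rows + spare_rows))
        self.repair_map = {}  # faulty row -> spare row that replaces it

    def repair(self, row):
        """Replace a damaged row with a spare row of the same size."""
        if row in self.repair_map:
            return self.repair_map[row]
        if not self.free_spares:
            raise RuntimeError("no spare rows left; the module must be replaced")
        spare = self.free_spares.pop(0)
        self.repair_map[row] = spare
        return spare

    def resolve(self, row):
        """Return the row actually used when the given row is addressed."""
        return self.repair_map.get(row, row)
```

Because each spare row replaces exactly one damaged row, the replacement region has the same size as the faulty one, which is the equal-size condition stated above.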
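In this embodiment, the fault information of the server's memory is monitored, the redundant space of the memory is obtained, and it is determined whether the redundant space is less than the first threshold; if not, the faulty physical memory address and its corresponding virtual memory address are obtained according to the fault information; the faulty physical memory address is isolated through the redundancy mechanism of the memory, and a new physical memory address of equal size is obtained; the data in the faulty physical memory address corresponding to the virtual memory address is backed up, and the virtual memory address is mapped to the new physical memory address so that the data can be migrated there. The solution thus permanently isolates the faulty memory through the memory redundancy mechanism and, while the operating system is running, isolates the faulty memory at the software level by changing the virtual memory mapping, without losing the data in the faulty memory. This not only effectively reduces the downtime rate caused by memory faults, but also reduces unnecessary memory replacement and greatly lowers the cost of operation and maintenance.
In some embodiments, monitoring the fault information of the server's memory includes:
monitoring the fault information of the memory through MCA technology, recording the fault information in the interrupt mask control register, and generating a fault log.
The way fault information is monitored need not be restricted; it depends on the specific implementation.
Note that, starting with the Pentium 4, Intel added to its CPUs a mechanism called the Machine Check Architecture (MCA), a hardware error detection architecture. It is used to detect hardware errors, such as system bus errors, error correcting code (ECC) errors, and parity errors. The mechanism is implemented through a number of model-specific registers (MSRs), which fall into two groups: one is used for configuration, and the other describes the hardware errors that have occurred. In the MCA architecture, dual in-line memory modules (DIMMs) serve as the memory; the DIMM is a type of memory module that appeared after the Pentium CPU was introduced and provides a 64-bit data path.
With MCA technology, the detailed fault information is recorded in the interrupt mask control register (IMC) whether a CE or a UCE occurs. The IMC is a group of registers in the MCA architecture that can store the detailed fault information. If a CE occurs in the memory, the CPU reports the detailed information of the error based on the MCA architecture. In a Linux system, the operating system also records the detailed information of the error in the fault log MCELOG for use in subsequent operations.
In this embodiment, the fault information of the memory is monitored through MCA technology, the fault information is recorded in the interrupt mask control register, and a fault log is generated, so that the fault information is both monitored and saved.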
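To make the CE/UCE distinction concrete, the sketch below classifies a raw machine-check status word. The bit positions (VAL at bit 63, UC at bit 61, ADDRV at bit 58, error code in bits 15:0) follow the IA32_MCi_STATUS layout as commonly documented for Intel processors and are an assumption here, since the application does not describe any register layout; the function names are hypothetical, and the values should be checked against the processor manual before use.

```python
VAL_BIT = 63     # status register holds a valid error (assumed layout)
UC_BIT = 61      # error was not corrected by hardware (assumed layout)
ADDRV_BIT = 58   # the companion address register is valid (assumed layout)


def decode_status(status):
    """Split a 64-bit status word into the fields this sketch cares about."""
    return {
        "valid": bool((status >> VAL_BIT) & 1),
        "uncorrected": bool((status >> UC_BIT) & 1),
        "addr_valid": bool((status >> ADDRV_BIT) & 1),
        "error_code": status & 0xFFFF,
    }


def classify(status):
    """Return 'CE', 'UCE', or None when no valid error is logged."""
    fields = decode_status(status)
    if not fields["valid"]:
        return None
    return "UCE" if fields["uncorrected"] else "CE"
```

A monitor built on this would count only the `"CE"` results toward the fault threshold and treat `"UCE"` as an immediate serious failure.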
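In some embodiments, monitoring the fault information of the server's memory includes:
determining whether the number of fault information entries within a first preset time is greater than a second threshold, where the number of fault information entries is decremented with a second preset time as its period, and the second preset time is less than the first preset time;
if so, confirming that the memory has failed and proceeding to the step of obtaining the redundant space of the memory.
In some optional implementations, in response to the number of fault information entries within the first preset time being greater than the second threshold, it is confirmed that the memory has failed, and the step of obtaining the redundant space of the memory is performed.
In a specific implementation, while the fault information of the server's memory is being monitored, some memory faults that can be corrected or that are within the allowable range may be detected, so the memory cannot be judged to have failed merely because fault information appears. Specifically, during monitoring it is determined whether the number of fault information entries within the first preset time is greater than the second threshold, that is, whether the number of faults within a time period exceeds the allowable number. The number of fault information entries is decremented with the second preset time as its period, and the second preset time is less than the first preset time; in other words, when only correctable or tolerable faults occur, the count is decremented periodically and eventually returns to zero, without triggering memory fault handling. This embodiment places no limit on the second threshold, the first preset time, or the second preset time; they depend on the specific implementation. When the number of fault information entries within the first preset time is found to be greater than the second threshold, the number of faults within the first preset time exceeds the allowable range, so it is determined that the memory has failed, and the subsequent steps for handling the memory fault can be carried out.
In this embodiment, it is determined whether the number of fault information entries within the first preset time is greater than the second threshold, where the count is decremented with the second preset time as its period and the second preset time is less than the first preset time; if so, it is confirmed that the memory has failed, and the step of obtaining the redundant space of the memory is entered. This allows an accurate judgment of whether the memory has failed.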
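The periodically decremented fault count behaves like a leaky bucket. The sketch below is one possible reading: `DecayingFaultCounter` and its parameters are hypothetical, timestamps are passed in explicitly so the behaviour is deterministic, and the first preset time is not modelled as a separate window, which is a simplification.

```python
class DecayingFaultCounter:
    """Counts fault reports and decrements the count once per decay period."""

    def __init__(self, second_threshold, decay_period):
        self.second_threshold = second_threshold
        self.decay_period = decay_period  # plays the role of the second preset time
        self.count = 0
        self.last_decay = 0.0

    def _decay(self, now):
        # Subtract one for every full decay period elapsed, never below zero.
        periods = int((now - self.last_decay) // self.decay_period)
        if periods > 0:
            self.count = max(0, self.count - periods)
            self.last_decay += periods * self.decay_period

    def record(self, now):
        """Register one fault report; return True if the memory counts as failed."""
        self._decay(now)
        self.count += 1
        return self.count > self.second_threshold
```

Sparse correctable errors drain away before they accumulate, while a burst of reports pushes the count past the threshold and triggers the handling flow.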
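In some embodiments, obtaining the faulty physical memory address and its corresponding virtual memory address according to the fault information includes:
parsing the interrupt mask control register to obtain the faulty physical memory address;
obtaining, according to the fault log and through the memory management unit, the virtual memory address corresponding to the faulty physical memory address.
The way the faulty memory address and the virtual memory address are obtained need not be restricted; it depends on the specific implementation. In some cases, because the detailed fault information is recorded in the interrupt mask control register (IMC), the faulty physical memory address can be obtained by parsing the detailed fault information in the IMC. Because a Linux system also records the detailed information of the error in the fault log MCELOG, and because the operating system manages physical memory through virtual memory, the virtual memory address corresponding to the faulty physical memory address can be obtained through the MMU address translation unit according to the fault log.
By parsing the interrupt mask control register to obtain the faulty physical memory address, and by obtaining the corresponding virtual memory address through the memory management unit according to the fault log, both addresses are obtained so that the mapping can be changed later.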
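Going from a faulty physical address back to the virtual addresses that map to it amounts to a reverse lookup over the page table. The sketch assumes 4 KiB pages and a plain dictionary from virtual page number to physical frame number; both are illustrative assumptions, as are the function names.

```python
PAGE_SIZE = 4096  # assumed page size


def build_reverse_map(page_table):
    """Invert a {virtual page number: physical frame number} mapping."""
    reverse = {}
    for vpn, pfn in page_table.items():
        reverse.setdefault(pfn, []).append(vpn)
    return reverse


def virtual_addresses_for(phys_addr, reverse_map):
    """Return every virtual address that currently maps to phys_addr."""
    pfn, offset = divmod(phys_addr, PAGE_SIZE)
    return [vpn * PAGE_SIZE + offset for vpn in reverse_map.get(pfn, [])]
```

A frame can be mapped by more than one virtual page, so the lookup returns a list; every returned mapping would need to be redirected to the new physical address.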
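Figure 2 is a flow chart of yet another memory fault handling method provided by an embodiment of the present application. As shown in Figure 2, after the virtual memory address is mapped to the new physical memory address, the following steps can be performed:
Step S17: Mark the faulty physical memory address.
Step S18: Trigger a memory fault alarm.
In a specific implementation, the faulty physical memory address is marked in the operating system kernel, which ensures that later applications will not be allocated that physical memory and prevents the memory fault from recurring.
After the faulty physical memory address is marked, a memory fault alarm is triggered to prompt maintenance personnel to service the faulty memory.
As shown in Figure 2, in a specific implementation, if the redundant space is determined to be less than the first threshold, the following step can also be performed:
Step S19: Output a message prompting replacement of the memory.
It can be understood that when the redundant space is determined to be less than the first threshold, the faulty space of the current memory is larger than the redundant space, so the redundant space is not enough to replace the faulty memory space. The memory fault then cannot be handled, and the memory can only be replaced. Therefore, when the redundant space is determined to be less than the first threshold, a message prompting replacement of the memory is output to prompt maintenance personnel to replace the faulty memory.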
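Steps S17 to S19 can be sketched with a toy frame allocator that refuses to hand out marked frames and collects alarm messages. `FrameAllocator`, its method names, and the alert strings are hypothetical; the application only states that the mark is made in the operating system kernel.

```python
class FrameAllocator:
    """Toy allocator (hypothetical): marked frames never return to the free list."""

    def __init__(self, frames):
        self.free = list(frames)
        self.bad = set()
        self.alerts = []

    def mark_faulty(self, frame):
        """S17: mark the faulty frame; S18: raise a memory fault alarm."""
        self.bad.add(frame)
        if frame in self.free:
            self.free.remove(frame)
        self.alerts.append(f"memory fault: frame {frame:#x} isolated")

    def prompt_replacement(self):
        """S19: redundant space is below the first threshold."""
        self.alerts.append("redundant space insufficient: replace the memory")

    def allocate(self):
        """Hand out the next healthy frame."""
        if not self.free:
            raise MemoryError("no healthy frames available")
        return self.free.pop(0)

    def release(self, frame):
        """Return a frame to the free list unless it has been marked faulty."""
        if frame not in self.bad:
            self.free.append(frame)
```

Checking the mark on release as well as on marking covers the case where the faulty frame was still in use when the fault was detected.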
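The above embodiments describe the memory fault handling method in detail; the present application also provides corresponding embodiments of a memory fault handling device. Note that the device embodiments are described from two perspectives: one based on functional modules, and the other based on hardware structure.
Figure 3 is a schematic structural diagram of a memory fault handling device provided by an embodiment of the present application. As shown in Figure 3, the memory fault handling device includes:
a monitoring module 10, used to monitor the fault information of the server's memory to confirm that the memory has failed;
a first acquisition module 11, used to obtain the redundant space of the memory;
a judgment module 12, used to determine whether the redundant space is less than the first threshold and, in response to the redundant space being not less than the first threshold, to trigger the second acquisition module;
a second acquisition module 13, used to obtain the faulty physical memory address and its corresponding virtual memory address according to the fault information;
a redundancy module 14, used to isolate the faulty physical memory address through the redundancy mechanism of the memory and obtain a new physical memory address, where the space of the new physical memory address is equal in size to the space of the faulty physical memory address;
a data backup module 15, used to back up the data in the faulty physical memory address corresponding to the virtual memory address;
a mapping module 16, used to map the virtual memory address to the new physical memory address, for migrating the data to the new physical memory address.
The embodiments of the device shown in Figure 3 correspond to the embodiments of the method, so for the device embodiments please refer to the description of the method embodiments; they are not repeated here.
For the specific limitations of the device shown in Figure 3, refer to the limitations on the memory fault handling method in any of the embodiments above. Each module in the device shown in Figure 3 can be implemented in whole or in part by software, hardware, or a combination of the two. The modules may be embedded in, or independent of, the processor of a computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can call them and perform the operations corresponding to each module.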
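Figure 4 is a schematic structural diagram of yet another memory fault handling device provided by an embodiment of the present application. As shown in Figure 4, the memory fault handling device includes:
a memory 20, used to store computer-readable instructions;
a processor 21, used to execute the computer-readable instructions to implement the memory fault handling method mentioned in any embodiment. In some optional implementations, the computer-readable instructions, when executed by the one or more processors 21, cause the one or more processors 21 to perform the steps of the memory fault handling method in any embodiment.
The memory fault handling device provided by the embodiment of Figure 4 may include but is not limited to smartphones, tablet computers, notebook computers, or desktop computers.
The processor 21 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 21 can be implemented in at least one hardware form among a digital signal processor (DSP), a field-programmable gate array (FPGA), and a programmable logic array (PLA). The processor 21 may also include a main processor and a co-processor: the main processor is a processor used to process data in the wake-up state, also called a central processing unit (CPU); the co-processor is a low-power processor used to process data in standby mode. In some embodiments, the processor 21 may be integrated with a graphics processing unit (GPU), which is responsible for rendering and drawing the content to be shown on the display screen. In some embodiments, the processor 21 may also include an artificial intelligence (AI) processor, which is used to process computing operations related to machine learning.
The memory 20 may include one or more computer-readable storage media, which may be non-transitory. The memory 20 may also include high-speed random access memory and non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. The memory 20 is at least used to store the computer-readable instructions 201 which, after being loaded and executed by the processor 21, implement the relevant steps of the memory fault handling method disclosed in any of the foregoing embodiments. In addition, the resources stored in the memory 20 may also include an operating system 202, data 203, and so on, and the storage may be temporary or permanent. The operating system 202 may include Windows, Unix, Linux, and the like. The data 203 may include but is not limited to data involved in the memory fault handling method.
In some embodiments, the memory fault handling device may also include a display screen 22, an input/output interface 23, a communication interface 24, a power supply 25, and a communication bus 26.
Those skilled in the art can understand that the structure shown in Figure 4 does not limit the memory fault handling device, which may include more or fewer components than shown.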
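The present application also provides a corresponding embodiment of a computer-readable storage medium. Computer-readable instructions are stored on the computer-readable storage medium; when executed by a processor, they implement the steps recorded in the above method embodiments.
Embodiments of the present application also provide one or more non-volatile computer-readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the memory fault handling method in any embodiment.
It can be understood that if the methods in the above embodiments are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. On this understanding, the technical solution of the present application in essence, or the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and executes all or part of the steps of the methods of the various embodiments of the present application. The aforementioned storage media include USB flash drives, removable hard disks, read-only memory (ROM), random access memory (RAM), magnetic disks, optical disks, and other media that can store program code.
The memory fault handling method, device, and computer-readable storage medium provided by the present application have been described in detail above. The embodiments in the specification are described progressively; each embodiment focuses on its differences from the others, and the parts the embodiments share can be cross-referenced. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant details can be found in the description of the method. It should be pointed out that a person of ordinary skill in the art can make improvements and modifications to the present application without departing from its principles, and these improvements and modifications also fall within the scope of protection of the claims of the present application.
It should also be noted that, in this specification, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any variants of them are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element qualified by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.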

Claims (20)

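  1. A memory fault handling method, characterized by comprising:
    monitoring fault information of a memory of a server to confirm that the memory has failed;
    obtaining the redundant space of the memory;
    determining whether the redundant space is less than a first threshold;
    in response to the redundant space being not less than the first threshold, obtaining, according to the fault information, a faulty physical memory address and its corresponding virtual memory address;
    isolating the faulty physical memory address through a redundancy mechanism of the memory, and obtaining a new physical memory address, wherein the space of the new physical memory address is equal in size to the space of the faulty physical memory address;
    backing up the data in the faulty physical memory address corresponding to the virtual memory address; and
    mapping the virtual memory address to the new physical memory address, so that the data can be migrated to the new physical memory address.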
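The steps of claim 1 can be illustrated with a minimal Python sketch. It is a toy model under stated assumptions, not the applicant's implementation: `SimulatedMemory`, `handle_fault`, the frame numbers, and the return strings are hypothetical names introduced here, and the redundant space is simply counted as the number of spare frames.

```python
from dataclasses import dataclass, field


@dataclass
class SimulatedMemory:
    """Toy model (hypothetical): frames hold data, page_table maps virtual -> physical."""
    frames: dict[int, bytes]
    page_table: dict[int, int]
    spare_frames: list[int]
    isolated: set[int] = field(default_factory=set)


def handle_fault(mem: SimulatedMemory, faulty_frame: int, first_threshold: int) -> str:
    # Check the remaining redundant space against the first threshold.
    if len(mem.spare_frames) < first_threshold:
        return "insufficient-redundancy"
    # Find the virtual address that currently maps to the faulty frame.
    virtual = next((v for v, p in mem.page_table.items() if p == faulty_frame), None)
    if virtual is None:
        return "unmapped"
    # Isolate the faulty frame and take one spare frame of equal size.
    mem.isolated.add(faulty_frame)
    new_frame = mem.spare_frames.pop()
    # Back up the data before the mapping is changed.
    backup = mem.frames[faulty_frame]
    # Remap the virtual address, then migrate the backed-up data.
    mem.page_table[virtual] = new_frame
    mem.frames[new_frame] = backup
    return "migrated"
```

In this sketch the backup is taken before the page table is updated, so the data survives even if the faulty frame becomes unreadable once it is isolated.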
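  2. The memory fault handling method according to claim 1, characterized in that the monitoring of fault information of a memory of a server comprises:
    monitoring the fault information of the memory through MCA technology, recording the fault information in an interrupt mask control register, and generating a fault log.
  3. The memory fault handling method according to claim 2, characterized in that the monitoring of the fault information of the memory through MCA technology comprises: monitoring, through the MCA technology, fault information of a system bus error, an error correction code error, or a parity error of the memory.
  4. The memory fault handling method according to claim 2 or 3, characterized in that the obtaining, according to the fault information, of a faulty physical memory address and its corresponding virtual memory address comprises:
    parsing the interrupt mask control register to obtain the faulty physical memory address; and
    obtaining, according to the fault log and through a memory management unit, the virtual memory address corresponding to the faulty physical memory address.
  5. The memory fault handling method according to any one of claims 2 to 4, characterized by further comprising:
    after the recording of the fault information in the interrupt mask control register, in response to the fault information being information of a correctable error, instructing a central processing unit to obtain the fault information from the interrupt mask control register and report it based on the MCA technology architecture.
  6. The memory fault handling method according to any one of claims 1 to 5, characterized in that the monitoring of fault information of a memory of a server comprises:
    determining whether the number of pieces of fault information within a first preset time is greater than a second threshold; and
    in response to the number of pieces of fault information within the first preset time being greater than the second threshold, confirming that the memory has failed, and performing the step of obtaining the redundant space of the memory.
  7. The memory fault handling method according to claim 6, characterized in that the number of pieces of fault information is decremented periodically with a second preset time as the period, the second preset time being less than the first preset time.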
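The counting scheme of claims 6 and 7 resembles a leaky counter. The following Python sketch shows only the periodic decrement and the comparison against the second threshold; the first preset time window is not modelled separately. `LeakyErrorCounter`, `decay_period`, the decrement of one per period, and the explicit `now` timestamps are assumptions made for illustration.

```python
class LeakyErrorCounter:
    """Hypothetical leaky counter: +1 per fault, -1 per elapsed decay period."""

    def __init__(self, second_threshold: int, decay_period: float):
        self.second_threshold = second_threshold
        self.decay_period = decay_period  # stands in for the second preset time
        self.count = 0
        self.last_decay = 0.0

    def _decay(self, now: float) -> None:
        # Subtract one for every full decay period since the last decrement.
        periods = int((now - self.last_decay) // self.decay_period)
        if periods > 0:
            self.count = max(0, self.count - periods)
            self.last_decay += periods * self.decay_period

    def record_error(self, now: float) -> bool:
        """Record one fault at time `now`; return True if the memory counts as failed."""
        self._decay(now)
        self.count += 1
        return self.count > self.second_threshold
```

With this design a burst of faults crosses the threshold quickly, while isolated faults spaced further apart than the decay period never accumulate.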
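  8. The memory fault handling method according to any one of claims 1 to 5, characterized in that the monitoring of fault information of a memory of a server comprises:
    in response to the number of pieces of fault information within a first preset time being not greater than the second threshold, confirming that the memory has not failed, and continuing to monitor the fault information of the memory of the server until the number of pieces of fault information within the first preset time is greater than the second threshold, whereupon it is confirmed that the memory has failed and the step of obtaining the redundant space of the memory is performed.
  9. The memory fault handling method according to claim 8, characterized in that the confirming, in response to the number of pieces of fault information within the first preset time being not greater than the second threshold, that the memory has not failed comprises:
    in response to the number of pieces of fault information within the first preset time being not greater than the second threshold, and the number of pieces of fault information being decremented periodically with a second preset time as the period, confirming that the memory has not failed, wherein the second preset time is less than the first preset time.
  10. The memory fault handling method according to claim 8, characterized in that the confirming, in response to the number of pieces of fault information within the first preset time being not greater than the second threshold, that the memory has not failed comprises:
    in response to the number of pieces of fault information within the first preset time being not greater than the second threshold, and the number of pieces of fault information being decremented to zero periodically with a second preset time as the period, confirming that the memory has not failed, wherein the second preset time is less than the first preset time.
  11. The memory fault handling method according to any one of claims 1 to 10, characterized by further comprising:
    after the mapping of the virtual memory address to the new physical memory address, marking the faulty physical memory address.
  12. The memory fault handling method according to claim 11, characterized in that the marking of the faulty physical memory address comprises:
    marking the faulty physical memory address in the operating system kernel.
  13. The memory fault handling method according to claim 12, characterized by further comprising:
    after the marking of the faulty physical memory address, triggering a memory fault alarm.
  14. The memory fault handling method according to any one of claims 1 to 13, characterized in that the monitoring of fault information of a memory of a server to confirm that the memory has failed comprises:
    determining that the memory fault information of the server is correctable error information.
  15. The memory fault handling method according to any one of claims 1 to 14, characterized in that the isolating of the faulty physical memory address through a redundancy mechanism of the memory comprises:
    isolating the faulty physical memory address through post package repair technology.
  16. The memory fault handling method according to claim 15, characterized in that the isolating of the faulty physical memory address through post package repair technology comprises:
    replacing the damaged rows in the memory with redundant rows.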
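The row replacement of claims 15 and 16 can be pictured as a remap table from damaged rows to spare rows. Real post package repair is carried out by the memory controller and firmware inside the DRAM device; the Python sketch below only models the bookkeeping, and `DramBank`, `repair_row`, `resolve`, and the row numbering are hypothetical names and conventions chosen for illustration.

```python
class DramBank:
    """Hypothetical bank model: spare rows are numbered after the normal rows."""

    def __init__(self, rows: int, spare_rows: int):
        self.remap: dict[int, int] = {}
        self.free_spares = list(range(rows, rows + spare_rows))

    def redundant_space(self) -> int:
        # Remaining redundancy, counted in spare rows.
        return len(self.free_spares)

    def repair_row(self, damaged_row: int) -> int:
        # A row that was already repaired keeps its existing spare.
        if damaged_row in self.remap:
            return self.remap[damaged_row]
        if not self.free_spares:
            raise RuntimeError("no spare rows left")
        spare = self.free_spares.pop(0)
        self.remap[damaged_row] = spare
        return spare

    def resolve(self, row: int) -> int:
        # Accesses to a repaired row are redirected to its spare row.
        return self.remap.get(row, row)
```

Because each repair consumes one spare row, `redundant_space()` shrinks monotonically, which is why the method compares it against the first threshold before attempting a repair.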
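  17. The memory fault handling method according to any one of claims 1 to 16, characterized by further comprising:
    after the mapping of the virtual memory address to the new physical memory address so that the data can be migrated to the new physical memory address, returning to perform the steps of monitoring the fault information of the memory of the server to confirm that the memory has failed, obtaining the redundant space of the memory, and determining whether the redundant space is less than the first threshold; and
    in response to the redundant space being less than the first threshold, outputting information prompting replacement of the memory.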
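The loop described in claim 17 can be sketched as repeated repairs that stop with a replacement prompt once the redundant space drops below the first threshold. This is an illustrative Python sketch only; `process_faults`, the message texts, and the assumption that each repair consumes exactly one spare row are hypothetical.

```python
def process_faults(fault_rows: list[int], spare_rows: int, first_threshold: int) -> list[str]:
    """Handle confirmed faults in order; return the messages that would be output."""
    messages = []
    for row in fault_rows:
        # Re-check the redundant space before every repair.
        if spare_rows < first_threshold:
            messages.append("replace memory: redundant space below threshold")
            break
        spare_rows -= 1  # assumption: one spare row is consumed per repair
        messages.append(f"row {row} repaired, {spare_rows} spare rows left")
    return messages
```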
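  18. A memory fault handling apparatus, characterized by comprising:
    a monitoring module, configured to monitor fault information of a memory of a server to confirm that the memory has failed;
    a first obtaining module, configured to obtain the redundant space of the memory;
    a determining module, configured to determine whether the redundant space is less than a first threshold and, in response to the redundant space being not less than the first threshold, trigger a second obtaining module;
    the second obtaining module, configured to obtain, according to the fault information, a faulty physical memory address and its corresponding virtual memory address;
    a redundancy module, configured to isolate the faulty physical memory address through a redundancy mechanism of the memory and obtain a new physical memory address, wherein the space of the new physical memory address is equal in size to the space of the faulty physical memory address;
    a data backup module, configured to back up the data in the faulty physical memory address corresponding to the virtual memory address; and
    a mapping module, configured to map the virtual memory address to the new physical memory address, so that the data can be migrated to the new physical memory address.
  19. A memory fault handling apparatus, characterized by comprising:
    a memory, configured to store computer-readable instructions; and
    a processor, configured to execute the computer-readable instructions to implement the memory fault handling method according to any one of claims 1 to 17.
  20. One or more non-volatile computer-readable storage media storing computer-readable instructions, characterized in that the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the steps of the memory fault handling method according to any one of claims 1 to 17.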
PCT/CN2022/115340 2022-04-08 2022-08-28 Memory fault handling method and apparatus, and computer-readable storage medium WO2023193396A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210362920.2A CN114461436A (zh) 2022-04-08 2022-04-08 Memory fault handling method and apparatus, and computer-readable storage medium
CN202210362920.2 2022-04-08

Publications (1)

Publication Number Publication Date
WO2023193396A1 (zh) 2023-10-12

Family

ID=81418248

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/115340 WO2023193396A1 (zh) 2022-04-08 2022-08-28 Memory fault handling method and apparatus, and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN114461436A (zh)
WO (1) WO2023193396A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118034991A (zh) * 2024-04-11 2024-05-14 北京开源芯片研究院 Memory data access method and apparatus, electronic device, and readable storage medium

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114461436A (zh) * 2022-04-08 2022-05-10 苏州浪潮智能科技有限公司 Memory fault handling method and apparatus, and computer-readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103631721A (zh) * 2012-08-23 2014-03-12 华为技术有限公司 Method and system for isolating bad blocks in memory
CN106133704A (zh) * 2015-01-19 2016-11-16 华为技术有限公司 Memory fault isolation method and apparatus
US10268612B1 (en) * 2016-09-23 2019-04-23 Amazon Technologies, Inc. Hardware controller supporting memory page migration
CN114064333A (zh) * 2020-08-05 2022-02-18 华为技术有限公司 Memory fault handling method and apparatus
CN114461436A (zh) * 2022-04-08 2022-05-10 苏州浪潮智能科技有限公司 Memory fault handling method and apparatus, and computer-readable storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
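CN103197999B (zh) * 2013-03-22 2016-08-03 北京百度网讯科技有限公司 Method and apparatus for automatically locating memory faults
CN109086151A (zh) * 2017-06-13 2018-12-25 中兴通讯股份有限公司 Method and apparatus for isolating memory faults on a server
CN112667422A (zh) * 2019-10-16 2021-04-16 华为技术有限公司 Memory fault handling method and apparatus, computing device, and storage medium
CN112667445B (zh) * 2021-01-12 2022-05-03 长鑫存储技术有限公司 Post-package memory repair method and apparatus, storage medium, and electronic device
CN113282434B (zh) * 2021-07-19 2021-10-29 苏州浪潮智能科技有限公司 Memory repair method based on post package repair technology, and related components
CN113742123A (zh) * 2021-08-20 2021-12-03 新华三技术有限公司合肥分公司 Memory fault information recording method and device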

Also Published As

Publication number Publication date
CN114461436A (zh) 2022-05-10

Similar Documents

Publication Publication Date Title
US9495233B2 (en) Error framework for a microprocesor and system
US10180866B2 (en) Physical memory fault mitigation in a computing environment
US8352779B2 (en) Performing redundant memory hopping
US7058782B2 (en) Method and apparatus for coordinating dynamic memory deallocation with a redundant bit line steering mechanism
US9606889B1 (en) Systems and methods for detecting memory faults in real-time via SMI tests
US7900084B2 (en) Reliable memory for memory controller with multiple channels
US5274646A (en) Excessive error correction control
US7478268B2 (en) Deallocation of memory in a logically-partitioned computer
EP3660681A1 (en) Memory fault detection method and device, and server
TW201346530A (zh) Machine check summary register
US10430267B2 (en) Determine when an error log was created
US20130007507A1 (en) Mechanism for advanced server machine check recovery and associated system software enhancements
WO2023193396A1 (zh) Memory fault handling method and apparatus, and computer-readable storage medium
US11809295B2 (en) Node mode adjustment method for when storage cluster BBU fails and related component
CN115495278B (zh) Exception repair method, device, and storage medium
US9965346B2 (en) Handling repaired memory array elements in a memory of a computer system
WO2024082844A1 (zh) Memory module fault detection apparatus and detection method
US8984333B2 (en) Automatic computer storage medium diagnostics
Kleen Mcelog: Memory error handling in user space
CN116795573A (zh) Method, system, device, and storage medium for handling unrepairable memory errors
TWI777259B (zh) Boot method
CN117950900A (zh) Memory error handling method and computing device
CN117687833A (zh) Method and apparatus for testing data security, and storage medium
US9921906B2 (en) Performing a repair operation in arrays
CN117992286A (zh) Memory fault handling method, system, storage medium, and terminal

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22936324

Country of ref document: EP

Kind code of ref document: A1