WO2024066500A1 - Memory error handling method and device - Google Patents

Memory error handling method and device

Info

Publication number
WO2024066500A1
Authority
WO
WIPO (PCT)
Prior art keywords
memory
computer system
target
memory area
isolation
Prior art date
Application number
PCT/CN2023/101096
Other languages
English (en)
French (fr)
Inventor
买培培
吕洪发
Original Assignee
华为技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2024066500A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance

Definitions

  • the present application relates to the field of computer technology, and in particular to a memory error processing method and device.
  • RAM Random access memory
  • ECC error checking and correction
  • ADDDC adaptive double device data correction
  • performing data migration and memory isolation on a memory area usually consumes a large amount of the computer system's resources, which may prevent the computer system from efficiently executing the other services it is currently running.
  • embodiments of the present application provide at least a memory error handling method and device.
  • it can be determined whether the computer system is in an idle state based on several performance indicators of the computer system within the current time interval. Data migration and memory isolation are performed on the target memory area only when it is determined that the computer system is in an idle state. This can avoid affecting the efficient execution of other services of the computer system due to the performance of data migration and memory isolation on the target memory area.
  • a memory error handling method is provided, which is applied to a computer system including a memory.
  • the method comprises: when it is necessary to perform data migration and memory isolation on a target memory area where a correctable error CE occurs in the memory, several performance indicators of the aforementioned computer system in a current time interval can be first obtained, and whether the aforementioned computer system is in an idle state can be determined based on the several performance indicators; when it is determined that the aforementioned computer system is in an idle state, data migration and memory isolation are performed on the target memory area.
  • the aforementioned several performance indicators may include, but are not limited to, any one or more of the following performance indicators: whether the computer system is running in user mode, processor occupancy, memory bandwidth, forwarding bandwidth, storage bandwidth, and whether a virtual machine that depends on the computer system and is in a busy state is located in the same non-uniform memory access structure (NUMA) as the target memory area.
  • NUMA non-uniform memory access structure
  • the method further includes: obtaining memory error information of the computer system; determining the target memory area and CE mode in the memory where CE occurs according to the memory error information; and determining whether it is necessary to perform data migration and memory isolation on the target memory area according to the CE mode.
  • UCE uncorrected errors
  • determining, according to the CE mode, whether data migration and memory isolation need to be performed on the target memory area includes: when the CE mode belongs to several pre-configured target CE modes, determining that data migration and memory isolation need to be performed on the target memory area.
  • determining, according to the CE mode, whether data migration and memory isolation need to be performed on the target memory area includes: when the CE mode belongs to several pre-configured target CE modes, adding 1 to the frequency of CEs belonging to the target CE modes that have occurred in the target memory area; and when the frequency after the addition reaches a preset threshold, determining that data migration and memory isolation need to be performed on the target memory area.
  • the aforementioned several target CE modes include at least one of the following CE modes: row CE, column CE and bank CE.
  • a memory error processing device which is deployed in a computer system including a memory.
  • the device includes: an indicator acquisition module, which is used to acquire several performance indicators of the computer system in the current time interval when it is necessary to perform data migration and memory isolation on the target memory area of the memory in which a correctable error (CE) has occurred; a state judgment module, which is used to determine whether the computer system is in an idle state based on the several performance indicators, and to trigger the isolation processing module when the computer system is in an idle state; and the isolation processing module, which is used to perform data migration and memory isolation on the target memory area when triggered by the state judgment module.
  • the plurality of performance indicators include any one or more of the following performance indicators: whether the computer system is running in user mode, processor occupancy, memory bandwidth, forwarding bandwidth, storage bandwidth, and whether a virtual machine that depends on the computer system and is in a busy state is located in the same NUMA as the target memory area.
  • the device also includes: an information acquisition module, used to obtain memory error information of the computer system; a fault analysis module, used to determine the target memory area and CE mode where CE occurs in the memory based on the memory error information; and determine whether data migration and memory isolation need to be performed on the target memory area based on the CE mode.
  • the fault analysis module is specifically configured to determine that data migration and memory isolation need to be performed on the target memory area when the CE mode belongs to several pre-configured target CE modes.
  • the fault analysis module is specifically used to increase the frequency of CE belonging to several target CE modes occurring in the target memory area by 1 when the CE mode belongs to several pre-configured target CE modes; when the frequency after performing the addition operation reaches a preset threshold, it is determined that data migration and memory isolation need to be performed on the target memory area.
  • the several target CE modes include at least one of the following CE modes: row CE, column CE, and bank CE.
  • an embodiment of the present application provides a computing device, including a memory and a processor, wherein the memory stores executable code, and the processor executes the executable code to implement the method provided in the first aspect.
  • an embodiment of the present application provides a computer system, comprising a memory and a processor, wherein the memory stores executable code, and the processor executes the executable code to implement the method provided in the first aspect.
  • an embodiment of the present application provides a computer-readable storage medium on which a computer program is stored.
  • when the computer program is executed in a computer, the computer is caused to implement the method provided in the first aspect.
  • a computer program or a computer program product is provided in an embodiment of the present application, wherein the computer program or the computer program product comprises instructions, and when the instructions are executed, the method provided in the first aspect is implemented.
  • a chip is provided in an embodiment of the present application, the chip comprising at least one processor and an interface, wherein the at least one processor determines program instructions or data through the interface; the at least one processor is used to execute the program instructions to implement the method provided in the first aspect.
  • FIG1 is a first schematic diagram of the structure of a computer system provided in an embodiment of the present application.
  • FIG2 is a flow chart of a memory error handling method provided in an embodiment of the present application.
  • FIG3 is a second schematic diagram of the structure of a computer system provided in an embodiment of the present application.
  • FIG4 is a third schematic diagram of the structure of a computer system provided in an embodiment of the present application.
  • FIG5 is a schematic diagram of the structure of a memory error handling device provided in an embodiment of the present application.
  • FIG6 is a schematic diagram of a computing device provided in an embodiment of the present application.
  • errors that occur in the memory of a computer system can usually be divided into two types: CE and UCE.
  • for a CE, various error correction algorithms including ECC can usually be used to correct it.
  • for a UCE, the services executed by the computer system may be unable to accurately access the memory area where the UCE occurred, which may cause other problems and may even directly cause the computer system to stop running.
  • the processor and basic input output system (BIOS) of the computer system can be implemented as corresponding firmware respectively, and the processor can connect several dual inline memory modules (DIMMs) through its memory controller, for example, two DIMMs such as DIMM0 and DIMM1 are connected through a single memory channel.
  • DIMMs dual inline memory modules
  • a single DIMM can include two ranks, such as rank0 and rank1; a single rank can include 18 chips, such as chip 00 to chip 17, and chip 17 can be used as a redundant chip; a single chip can include n+1 logical banks, such as bank 0 to bank n.
  • suppose bank n of chip 00 in rank0 of DIMM0 is determined, based on certain rules, to need data migration and memory isolation because a CE has occurred; the data stored in bank n of chip 00 in rank0 of DIMM0 can then be migrated, through the ADDDC technology, to bank n of chip 17 in rank0 of DIMM1 and bank n of chip 17 in rank0 of DIMM0, and bank n of chip 00 in rank0 of DIMM0 is isolated.
  • the data migrated to bank n of chip 17 in rank 0 of DIMM1 and the data migrated to bank n of chip 17 in rank 0 of DIMM0 can be used to recover the data originally stored in bank n of chip 00 in rank 0 of DIMM0.
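  • as a minimal sketch that only restates the worked example above (it does not model how ADDDC is implemented in hardware), the source bank and its two migration targets can be written as addresses. The names BankAddress, REDUNDANT_CHIP and migration_targets are hypothetical and do not appear in the application.

```python
from typing import NamedTuple, Tuple


class BankAddress(NamedTuple):
    """Location of one logical bank; field names are illustrative assumptions."""
    dimm: int
    rank: int
    chip: int
    bank: int


# In the example, chip 17 of each rank serves as the redundant chip.
REDUNDANT_CHIP = 17


def migration_targets(src: BankAddress, peer_dimm: int) -> Tuple[BankAddress, BankAddress]:
    """Return the two banks that receive the data of the faulty bank.

    Mirrors the example: the same rank and bank index on the redundant chip
    of the peer DIMM and of the DIMM that holds the faulty bank.
    """
    on_peer = BankAddress(peer_dimm, src.rank, REDUNDANT_CHIP, src.bank)
    on_same = BankAddress(src.dimm, src.rank, REDUNDANT_CHIP, src.bank)
    return on_peer, on_same


# Faulty bank: DIMM0, rank0, chip 00, bank n (n = 7 chosen arbitrarily).
faulty = BankAddress(dimm=0, rank=0, chip=0, bank=7)
targets = migration_targets(faulty, peer_dimm=1)
```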
  • ADDDC-MR adaptive double device data correction-multiple region
  • ADC-SR adaptive data correction-single region
  • ADDEC adaptive double device error correction
  • the memory area where CE occurs may also be a rank, a chip, a row belonging to a bank, or a column belonging to a bank, and so on.
  • a memory error handling method and device are provided in an embodiment of the present application.
  • it can be determined whether the computer system is in an idle state based on several performance indicators of the computer system in the current time interval, and data migration and memory isolation are performed on the target memory area only when it is determined that the computer system is in an idle state, so as to avoid affecting the efficient execution of other services of the computer system due to the execution of data migration and memory isolation on the target memory area.
  • FIG2 is a flowchart of a memory error handling method provided in an embodiment of the present specification.
  • the method may be executed by a processor, or by a computing device/computer system that includes a processor; more specifically, the processor or the computing device/computer system may execute a computer program/instructions to implement the method steps shown in FIG2.
  • the aforementioned computing device/computer system may, for example, include but is not limited to a server, a switch, a router, a base station controller, a terminal or a computing acceleration card, etc.
  • the aforementioned server may generally be an all-in-one machine, or it may adopt a layered cloud architecture implemented based on a baseboard management controller (BMC).
  • referring to FIG2, the method may include but is not limited to some or all of the following steps S200 to S210.
  • Step S200 obtaining memory error information of the computer system.
  • the BIOS of the computer system can obtain corresponding memory error information through the memory controller of the processor.
  • the aforementioned memory error information can also be sent by the BIOS of the computer system to the BMC of the computer system, for example.
  • the aforementioned memory error information can also be sent by the BIOS of the computer system to the system management unit of the computer system, for example.
  • the aforementioned system management unit can be an operating system (OS) deployed in the computer system, and more specifically, it can be a functional module (such as a fault analysis module) included in the OS deployed in the computer system, or the system management unit can also be other firmware in the computer system other than the OS deployed therein.
  • OS operating system
  • Step S202 determining a target memory area where CE occurs in the memory of the computer system and a CE mode of the CE that occurs according to the memory error information.
  • the BMC of the computer system can be used to determine the target memory area where CE occurs and the CE mode of the CE that occurs according to the memory error information.
  • the system management unit of the computer system can be used to determine the target memory area where CE occurs and the CE mode of the CE that occurs according to the memory error information.
  • feature analysis can be performed on the memory error information to determine whether the CE that occurs in the target memory area meets the corresponding CE mode; or, machine learning can be used to analyze the memory error information and other data related to the memory operating status to more accurately determine the CE mode of the CE that occurs in the target memory area.
  • CE modes may include row CE, column CE, bank CE, chip CE, and rank CE, etc.
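  • the application does not specify the feature analysis rules, so the following is only a sketch under assumed rules: a group of CE reports is classified by how widely the reported addresses are spread. The record fields, the function name and the extra labels "cell CE" and "unclassified" are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class CERecord:
    """One corrected-error report; all field names are assumptions."""
    rank: int
    chip: int
    bank: int
    row: int
    column: int


def classify_ce_mode(records: Iterable[CERecord]) -> str:
    """Assumed rule set: label the CEs by the smallest unit containing them all."""
    recs = list(records)
    if not recs:
        return "none"
    ranks = {r.rank for r in recs}
    chips = {(r.rank, r.chip) for r in recs}
    banks = {(r.rank, r.chip, r.bank) for r in recs}
    rows = {r.row for r in recs}
    cols = {r.column for r in recs}
    if len(ranks) > 1:
        return "unclassified"   # spans several ranks; outside this sketch
    if len(chips) > 1:
        return "rank CE"        # several chips of one rank
    if len(banks) > 1:
        return "chip CE"        # several banks of one chip
    if len(rows) == 1 and len(cols) == 1:
        return "cell CE"        # hypothetical label for a single address
    if len(rows) == 1:
        return "row CE"         # one row, several columns
    if len(cols) == 1:
        return "column CE"      # one column, several rows
    return "bank CE"            # several rows and columns of one bank
```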
  • Step S204 Determine whether data migration and memory isolation need to be performed on the target memory area according to the CE mode.
  • the BMC of the computer system can determine whether it is necessary to perform data migration and memory isolation on the target memory area according to the CE mode determined in step S202.
  • the system management unit of the computer system can determine whether it is necessary to perform data migration and memory isolation on the target memory area according to the CE mode determined in step S202.
  • when the CE mode determined in step S202 belongs to several pre-configured target CE modes, it can be determined in step S204 that data migration and memory isolation need to be performed on the target memory area; conversely, when the CE mode determined in step S202 does not belong to the pre-configured target CE modes, it can be determined in step S204 that data migration and memory isolation do not need to be performed on the target memory area.
  • alternatively, when the CE mode determined in step S202 belongs to the pre-configured target CE modes, the frequency of CEs belonging to the target CE modes that have occurred in the target memory area can be increased by 1 in step S204. If the frequency after the addition reaches a preset threshold, it is determined that data migration and memory isolation need to be performed on the target memory area; conversely, if the frequency after the addition does not reach the preset threshold, it is determined that data migration and memory isolation do not need to be performed on the target memory area.
  • the aforementioned target CE modes may include but are not limited to: row CE, column CE and bank CE.
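  • the frequency-based variant of step S204 can be sketched as a per-region counter. This is an illustration only: the class name, the region key format and the threshold value 3 are assumptions, since the application only speaks of a "preset threshold".

```python
from collections import Counter

# Target CE modes named in the text; the set is configurable in the application.
TARGET_CE_MODES = {"row CE", "column CE", "bank CE"}


class IsolationDecider:
    """Counts target-mode CEs per memory area and applies a preset threshold."""

    def __init__(self, threshold: int = 3):  # threshold value is an assumption
        self.threshold = threshold
        self.counts = Counter()

    def needs_isolation(self, region: str, ce_mode: str) -> bool:
        # CEs outside the target modes never trigger migration and isolation.
        if ce_mode not in TARGET_CE_MODES:
            return False
        # Add 1 to the frequency for this memory area, then compare.
        self.counts[region] += 1
        return self.counts[region] >= self.threshold


decider = IsolationDecider(threshold=2)
first = decider.needs_isolation("dimm0/rank0/chip00/bank7", "row CE")
second = decider.needs_isolation("dimm0/rank0/chip00/bank7", "column CE")
```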
  • when step S204 determines that data migration and memory isolation need to be performed on the target memory area, step S206 is then executed to obtain several performance indicators of the computer system in the current time interval.
  • Step S208 determining whether the computer system is in an idle state according to a number of performance indicators.
  • the aforementioned step S208 may be implemented by a system management unit of the computer system.
  • the aforementioned performance indicators may include, but are not limited to, any one or more of the following performance indicators: whether the computer system is running in user mode, processor occupancy, memory bandwidth, forwarding bandwidth, storage bandwidth, and whether the virtual machine that depends on the computer system and is in a busy state is located in the same NUMA as the target memory area.
  • memory bandwidth is the product of bus width, bus frequency, and the number of data packets exchanged in a clock cycle; forwarding bandwidth refers to the amount of data that can be transmitted on the line per unit time, and the unit is bps (bit per second); storage bandwidth refers to the amount of data accessed by the memory per unit time, also known as the number of bits or bytes read/written by the memory per unit time.
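  • as a worked instance of the memory bandwidth product, the sketch below assumes a DDR4-3200 channel (64-bit bus, 1600 MHz bus clock, two transfers per clock cycle). These figures and the function name are illustrative and are not taken from the application.

```python
def memory_bandwidth_bytes_per_s(bus_width_bits: int,
                                 bus_clock_hz: int,
                                 transfers_per_cycle: int) -> int:
    """Peak bandwidth = bus width x bus frequency x transfers per clock cycle."""
    return (bus_width_bits // 8) * bus_clock_hz * transfers_per_cycle


# Assumed example: one DDR4-3200 channel.
peak = memory_bandwidth_bytes_per_s(64, 1_600_000_000, 2)
peak_gb_per_s = peak / 1e9  # 25.6 GB/s
```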
  • the business scores corresponding to the remaining performance indicators in the current time interval can be further determined based on pre-configured business rules, and then the weighted sum of each business score is performed to obtain a total score, and then whether the computer system is in an idle state is determined based on the size of the total score.
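  • the weighted-sum decision can be sketched as follows. The scoring rule (linear, higher means less loaded), the weights, the reference values and the idle threshold are all assumptions; the application only states that scores come from pre-configured business rules and that the total score decides the idle state.

```python
from typing import Dict


def business_score(value: float, reference: float) -> float:
    """Assumed rule: 100 at zero load, falling linearly to 0 at the reference value."""
    return max(0.0, 100.0 * (1.0 - value / reference))


def total_score(indicators: Dict[str, float],
                references: Dict[str, float],
                weights: Dict[str, float]) -> float:
    """Weighted sum of the per-indicator business scores."""
    return sum(weights[name] * business_score(indicators[name], references[name])
               for name in weights)


def is_idle_by_score(indicators: Dict[str, float],
                     references: Dict[str, float],
                     weights: Dict[str, float],
                     idle_threshold: float = 60.0) -> bool:
    """Assumed convention: a total score at or above the threshold means idle."""
    return total_score(indicators, references, weights) >= idle_threshold


indicators = {"cpu": 0.25, "mem_bw": 5.0}    # measured in the current interval
references = {"cpu": 1.0, "mem_bw": 10.0}    # hypothetical full-load values
weights = {"cpu": 0.5, "mem_bw": 0.5}        # hypothetical weights
score = total_score(indicators, references, weights)  # 0.5*75 + 0.5*50
```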
  • a virtual machine that relies on the computer system and is in a busy state
  • when performance indicators such as processor occupancy, memory bandwidth, forwarding bandwidth, and storage bandwidth are all less than their respective preset reference values, it is determined that the computer system is in an idle state.
  • the performance indicators of the computer system obtained in the current time interval may not include whether the virtual machines that depend on the computer system and are busy are located in the same NUMA as the target memory area.
  • when the computer system is in an idle state, the computer system should be running in user mode.
  • the virtual machines that rely on the computer system and are in a busy state should be located in different NUMAs from the target memory area.
  • various indicators such as processor occupancy, memory bandwidth, forwarding bandwidth, and storage bandwidth should have relatively small values to ensure that the computer system has sufficient resources to support data migration and memory isolation of the target memory area, thereby avoiding affecting the efficient execution of other services that the computer system needs to execute due to data migration and memory isolation of the target memory area.
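  • the idle conditions listed above can be combined into a single check, sketched below. The field names, units and reference values are hypothetical; the application only requires each bandwidth and occupancy indicator to be below its own preset reference value.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Indicators:
    """Performance indicators for one time interval; names are assumptions."""
    user_mode: bool
    busy_vm_on_target_numa: bool
    cpu_occupancy: float          # fraction, 0.0 to 1.0
    memory_bandwidth: float       # GB/s
    forwarding_bandwidth: float   # Gbit/s
    storage_bandwidth: float      # GB/s


# Hypothetical preset reference values, one per numeric indicator.
REFERENCE: Dict[str, float] = {
    "cpu_occupancy": 0.30,
    "memory_bandwidth": 10.0,
    "forwarding_bandwidth": 5.0,
    "storage_bandwidth": 2.0,
}


def is_idle(ind: Indicators, ref: Dict[str, float] = REFERENCE) -> bool:
    # The system must be running in user mode, and no busy virtual machine
    # may share a NUMA with the target memory area.
    if not ind.user_mode or ind.busy_vm_on_target_numa:
        return False
    # Every numeric indicator must be below its reference value.
    return all(getattr(ind, name) < limit for name, limit in ref.items())
```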
  • when it is determined in step S208 that the computer system is not in an idle state based on several performance indicators of the computer system in the current time interval, the aforementioned steps S206 and S208 can be executed periodically at corresponding time intervals until it is determined that the computer system is in an idle state, and then the following step S210 is executed.
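  • the periodic repetition of steps S206 and S208 can be sketched as a polling loop that defers the isolation (step S210) until the system is idle. The callables, the interval and the optional max_checks limit are assumptions for illustration; the application does not define a retry limit.

```python
import time
from typing import Any, Callable, Optional


def isolate_when_idle(get_indicators: Callable[[], Any],
                      is_idle: Callable[[Any], bool],
                      isolate: Callable[[], None],
                      interval_s: float = 1.0,
                      max_checks: Optional[int] = None,
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    """Repeat S206/S208 every interval; run S210 once the system is idle.

    Returns True if isolation ran, False if max_checks was exhausted first.
    """
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        if is_idle(get_indicators()):   # S206: sample, S208: judge
            isolate()                   # S210: migrate data and isolate
            return True
        sleep(interval_s)               # wait for the next time interval
    return False


# Usage with stand-in callables: CPU occupancy drops below 0.3 on the third sample.
samples = iter([0.9, 0.8, 0.1])
events = []
done = isolate_when_idle(lambda: next(samples),
                         lambda cpu: cpu < 0.3,
                         lambda: events.append("isolated"),
                         interval_s=0.0,
                         sleep=lambda s: None)
```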
  • Step S210 performing data migration and memory isolation on the target memory area.
  • the system management unit of the computer system can, through the BIOS of the computer system, trigger the processor of the computer system to perform data migration and memory isolation on the target memory area.
  • the ADDDC technology can be used to achieve data migration and memory isolation on the target memory area.
  • adaptive double device data correction-multiple region ADDDC-MR
  • adaptive data correction-single region ADC-SR
  • adaptive double device error correction ADDEC
  • other technologies may also be used to achieve data migration and memory isolation on the target memory area.
  • the memory error handling device 50 includes: an indicator acquisition module 501, which is used to obtain several performance indicators of the computer system within the current time interval when it is necessary to perform data migration and memory isolation on the target memory area where CE occurs in the memory; a state judgment module 503, which is used to determine whether the computer system is in an idle state based on the several performance indicators, and trigger the isolation processing module when the computer system is in an idle state; the isolation processing module 505, which is used to perform data migration and memory isolation on the target memory area under the triggering of the state judgment module.
  • the several performance indicators include any one or more of the following performance indicators: whether the computer system is running in user mode, processor occupancy, memory bandwidth, forwarding bandwidth, storage bandwidth, and whether a virtual machine that depends on the computer system and is in a busy state is located in the same non-uniform memory access structure NUMA as the target memory area.
  • the device also includes: an information acquisition module 507, used to obtain memory error information of the computer system; a fault analysis module 509, used to determine the target memory area and CE mode where CE occurs in the memory based on the memory error information; and determine whether it is necessary to perform data migration and memory isolation on the target memory area based on the CE mode.
  • the fault analysis module 509 is used to determine that data migration and memory isolation need to be performed on the target memory area when the CE mode belongs to several pre-configured target CE modes.
  • the fault analysis module 509 is used to, when the CE mode belongs to several pre-configured target CE modes, increase the frequency of CE occurring in the target memory area belonging to the several target CE modes by 1; when the frequency after performing the addition operation reaches a preset threshold, determine that data migration and memory isolation need to be performed on the target memory area.
  • the several target CE modes include at least one of the following CE modes: row CE, column CE and bank CE.
  • the memory error handling device 50 may correspond to executing the method described in the embodiment of the present application, and the aforementioned operations and other operations and/or functions respectively performed by each module in the memory error handling device 50 are respectively for realizing the corresponding processes of each method in Figure 2, which will not be repeated here for the sake of brevity.
  • the indicator acquisition module 501, the state judgment module 503, the isolation processing module 505, the information acquisition module 507 and the fault analysis module 509 included in the device can be implemented by software or by hardware.
  • the indicator acquisition module 501 is taken as an example below to describe how the modules can be implemented.
  • the implementation of the state judgment module 503, the isolation processing module 505, the information acquisition module 507 and the fault analysis module 509 can refer to the implementation of the indicator acquisition module 501.
  • the indicator acquisition module 501 may include code running on a computing instance.
  • the computing instance may include a physical host (computing device), a virtual machine, or a container.
  • the indicator acquisition module 501 can be a device implemented by an application-specific integrated circuit (ASIC) or a programmable logic device (PLD).
  • the PLD can be a complex programmable logical device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL) or any combination thereof.
  • CPLD complex programmable logical device
  • FPGA field-programmable gate array
  • GAL generic array logic
  • the computing device/computer system includes at least a processor and a memory, and a program is stored in the memory.
  • when the processor executes the program, it can implement the units or modules of each step in the method shown in Figure 2.
  • FIG6 is a schematic diagram of the structure of a computing device provided in an embodiment of the present application.
  • the computing device 600 includes at least one processor 601, a memory 602, and a communication interface 603.
  • the processor 601, the memory 602, and the communication interface 603 are connected in communication, and the communication connection can be realized by wired means (such as a bus) or by wireless means.
  • the communication interface 603 is used to receive data (such as write data) sent by other devices; the memory 602 stores computer instructions, and the processor 601 executes the computer instructions to execute the method in the aforementioned method embodiment.
  • the processor 601 may include a central processing unit (CPU), and may also include other general-purpose processors, digital signal processors (DSP), application specific integrated circuits (ASIC), field programmable gate arrays (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
  • DSP digital signal processor
  • ASIC application specific integrated circuit
  • FPGA field programmable gate array
  • the general-purpose processor may be a microprocessor or any conventional processor, etc.
  • the memory 602 may include a read-only memory and a random access memory, and provides instructions and data to the processor 601.
  • the memory 602 may also include a nonvolatile random access memory.
  • the memory 602 may be a volatile memory or a nonvolatile memory, or may include both volatile and nonvolatile memories.
  • the nonvolatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory.
  • the volatile memory may be a random access memory (RAM), which is used as an external cache.
  • RAM random access memory
  • SRAM static RAM
  • DRAM dynamic random access memory
  • SDRAM synchronous DRAM
  • DDR SDRAM double data rate synchronous dynamic random access memory
  • ESDRAM enhanced synchronous dynamic random access memory
  • SLDRAM synchronous link dynamic random access memory
  • DR RAM direct rambus RAM
  • computing device 600 can execute the method shown in Figure 2 in the embodiment of the present application.
  • the detailed description of the implementation of the method is shown above, and for the sake of brevity, it will not be repeated here.
  • a computer-readable storage medium on which a computer program is stored.
  • when the computer program is executed by a processor, the above-mentioned method is implemented.
  • a chip is provided in an embodiment of the present application.
  • the chip includes at least one processor and an interface.
  • the at least one processor determines program instructions or data through the interface; the at least one processor is used to execute the program instructions to implement the method mentioned above.
  • a computer program or a computer program product is provided in an embodiment of the present application.
  • the computer program or the computer program product includes instructions. When the instructions are executed, the computer is caused to execute the above-mentioned method.
  • the steps of the methods or algorithms described in conjunction with the embodiments disclosed herein may be implemented using hardware, a software module executed by a processor, or a combination of the two.
  • the software module may be placed in a random access memory (RAM), a memory, a read-only memory (ROM), an electrically programmable ROM, an electrically erasable programmable ROM, a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

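A memory error handling method, applied to a computer system that includes a memory, comprising: when it is determined that data migration and memory isolation need to be performed on a target memory area of the memory in which a correctable error has occurred, obtaining several performance indicators of the computer system within the current time interval, and determining, based on the performance indicators, whether the computer system is in an idle state; and when the computer system is in an idle state, performing data migration and memory isolation on the target memory area. Because data migration and memory isolation are performed on the target memory area in which the correctable error occurred only after the computer system has been determined to be idle, they do not interfere with the efficient execution of the computer system's other services.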

Description

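Memory error handling method and device
This application claims priority to Chinese patent application No. 202211172016.1, filed on September 26, 2022 and entitled "Memory error handling method and device", the entire contents of which are incorporated herein by reference.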
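Technical field
This application relates to the field of computer technology, and in particular to a memory error handling method and device.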
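Background
Random access memory (RAM), usually referred to simply as memory, is one of the important components of a computer system. When a correctable error (corrected error, CE) occurs in memory, it can be corrected using various error correction algorithms, including error checking and correction (ECC), and data migration and memory isolation can be performed on the memory area where the CE occurred using various technologies, including adaptive double device data correction (ADDDC).
Performing data migration and memory isolation on a memory area usually consumes a large amount of the computer system's resources, which may prevent the computer system from efficiently executing the other services it is currently running.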
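Summary
Embodiments of this application provide at least a memory error handling method and device. When data migration and memory isolation need to be performed on a target memory area in which a CE has occurred, whether the computer system is in an idle state can be judged from several performance indicators of the computer system within the current time interval, and data migration and memory isolation are performed on the target memory area only when the computer system is determined to be idle. This avoids affecting the efficient execution of the computer system's other services.
In a first aspect, a memory error handling method is provided, which is applied to a computer system that includes a memory. The method includes: when data migration and memory isolation need to be performed on a target memory area of the memory in which a correctable error (CE) has occurred, first obtaining several performance indicators of the computer system within the current time interval, and determining, based on these performance indicators, whether the computer system is in an idle state; and when the computer system is determined to be in an idle state, performing data migration and memory isolation on the target memory area.
In this way, when data migration and memory isolation need to be performed on a target memory area in which a CE has occurred, whether the computer system is in an idle state can be judged from several performance indicators of the computer system within the current time interval, and data migration and memory isolation are performed on the target memory area only when the computer system is determined to be idle, which avoids affecting the efficient execution of the computer system's other services.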
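In a possible implementation, the performance indicators may include, but are not limited to, any one or more of the following: whether the computer system is running in user mode, processor occupancy, memory bandwidth, forwarding bandwidth, storage bandwidth, and whether a virtual machine that depends on the computer system and is in a busy state is located in the same non-uniform memory access (NUMA) structure as the target memory area.
In a possible implementation, the method further includes: obtaining memory error information of the computer system; determining, based on the memory error information, the target memory area of the memory in which the CE occurred and the CE mode; and determining, based on the CE mode, whether data migration and memory isolation need to be performed on the target memory area. In this implementation, because not every CE makes it likely that an uncorrectable error (uncorrected errors, UCE) will subsequently occur in the memory area, the occurrence of a CE is not by itself treated as requiring data migration and memory isolation of the memory area where it occurred. This avoids the other problems that would be caused by frequently migrating and isolating memory areas in which CEs occur.
In a possible implementation, determining, based on the CE mode, whether data migration and memory isolation need to be performed on the target memory area includes: when the CE mode belongs to several pre-configured target CE modes, determining that data migration and memory isolation need to be performed on the target memory area.
In a possible implementation, determining, based on the CE mode, whether data migration and memory isolation need to be performed on the target memory area includes: when the CE mode belongs to several pre-configured target CE modes, adding 1 to the frequency of CEs belonging to the target CE modes that have occurred in the target memory area; and when the frequency after the addition reaches a preset threshold, determining that data migration and memory isolation need to be performed on the target memory area.
In a possible implementation, the target CE modes include at least one of the following CE modes: row CE, column CE and bank CE.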
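In a second aspect, a memory error handling device is provided, which is deployed in a computer system that includes a memory. The device includes: an indicator acquisition module, configured to obtain several performance indicators of the computer system within the current time interval when data migration and memory isolation need to be performed on a target memory area of the memory in which a correctable error (CE) has occurred; a state judgment module, configured to determine, based on the performance indicators, whether the computer system is in an idle state, and to trigger the isolation processing module when the computer system is in an idle state; and the isolation processing module, configured to perform data migration and memory isolation on the target memory area when triggered by the state judgment module.
In a possible implementation, the performance indicators include any one or more of the following: whether the computer system is running in user mode, processor occupancy, memory bandwidth, forwarding bandwidth, storage bandwidth, and whether a virtual machine that depends on the computer system and is in a busy state is located in the same NUMA as the target memory area.
In a possible implementation, the device further includes: an information acquisition module, configured to obtain memory error information of the computer system; and a fault analysis module, configured to determine, based on the memory error information, the target memory area of the memory in which the CE occurred and the CE mode, and to determine, based on the CE mode, whether data migration and memory isolation need to be performed on the target memory area.
In a possible implementation, the fault analysis module is specifically configured to determine that data migration and memory isolation need to be performed on the target memory area when the CE mode belongs to several pre-configured target CE modes.
In a possible implementation, the fault analysis module is specifically configured to, when the CE mode belongs to several pre-configured target CE modes, add 1 to the frequency of CEs belonging to the target CE modes that have occurred in the target memory area; and, when the frequency after the addition reaches a preset threshold, determine that data migration and memory isolation need to be performed on the target memory area.
In a possible implementation, the target CE modes include at least one of the following CE modes: row CE, column CE and bank CE.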
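In a third aspect, an embodiment of the present application provides a computing device including a memory and a processor, where the memory stores executable code and the processor executes the executable code to implement the method provided in the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer system including a memory and a processor, where the memory stores executable code and the processor executes the executable code to implement the method provided in the first aspect.
In a fifth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program which, when executed in a computer, causes the computer to implement the method provided in the first aspect.
In a sixth aspect, an embodiment of the present application provides a computer program or computer program product including instructions which, when executed, implement the method provided in the first aspect.
In a seventh aspect, an embodiment of the present application provides a chip including at least one processor and an interface, where the at least one processor determines program instructions or data through the interface, and the at least one processor is configured to execute the program instructions to implement the method provided in the first aspect.
It can be understood that, for the beneficial effects of the second to seventh aspects, reference may be made to the related description of the first aspect; details are not repeated here.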
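Brief Description of the Drawings
FIG. 1 is a first schematic structural diagram of a computer system provided in an embodiment of the present application;
FIG. 2 is a flowchart of a memory error handling method provided in an embodiment of the present application;
FIG. 3 is a second schematic structural diagram of a computer system provided in an embodiment of the present application;
FIG. 4 is a third schematic structural diagram of a computer system provided in an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a memory error handling apparatus provided in an embodiment of the present application;
FIG. 6 is a schematic diagram of a computing device provided in an embodiment of the present application.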
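Detailed Description
The technical solutions of the present application are described in further detail below with reference to the drawings and embodiments.
Errors that occur in the memory of a computer system can usually be divided into two types: CEs and UCEs. A CE can usually be corrected using various error correction algorithms, including ECC. A UCE may cause problems because the services executed by the computer system cannot correctly access the memory area in which the UCE occurred, and may even directly cause the computer system to stop running.
A memory area in which a UCE occurs has often experienced several CEs of particular patterns beforehand. Analysis of a limited data set shows that row CEs account for about 17%, column CEs for about 15.3%, and bank CEs for about 15.7%; the probability that a row CE is followed by a UCE is about 25%, the probability that a column CE is followed by a UCE is about 23.9%, and the probability that a bank CE is followed by a UCE is about 22.6%. Based on these findings, it can be determined that after several CEs of particular patterns occur in a memory area, for example CEs of the row CE, column CE, or bank CE patterns, a UCE may subsequently occur in that memory area. Therefore, once several CEs of particular patterns are found in a memory area, data migration and memory isolation can be performed on that area, so that the services executed by the computer system can still correctly access the data originally stored there and no longer access the area itself. This reduces the frequency of UCEs in memory and improves the availability of the computer system.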
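By way of example, adaptive double device data correction (ADDDC) can be used to perform data migration and memory isolation on a memory area. Referring to the computer system shown in FIG. 1, the processor and the basic input output system (BIOS) of the computer system may each be implemented as corresponding firmware, and the processor may be connected through its memory controller to several dual inline memory modules (DIMMs), for example to two DIMMs, DIMM0 and DIMM1, through a single memory channel. A single DIMM may include, for example, two ranks, rank0 and rank1; a single rank may include, for example, 18 chips, chip 00 through chip 17, of which chip 17 may serve as a redundant chip; and a single chip may include n+1 logical banks, bank 0 through bank n. Suppose that bank n of chip 00 in rank0 of DIMM0 has experienced CEs and, based on certain rules, is determined to require data migration and memory isolation. ADDDC can then be used, for example, to migrate the data stored in bank n of chip 00 in rank0 of DIMM0 to bank n of chip 17 in rank0 of DIMM1 and to bank n of chip 17 in rank0 of DIMM0, and to isolate bank n of chip 00 in rank0 of DIMM0. The data migrated to bank n of chip 17 in rank0 of DIMM1, together with the data migrated to bank n of chip 17 in rank0 of DIMM0, can be used to recover the data originally stored in bank n of chip 00 in rank0 of DIMM0.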
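Although the foregoing describes, by way of example, using ADDDC to perform data migration and memory isolation on a logical bank in which a CE has occurred, it can be understood that other techniques may also be used for this purpose, for example adaptive double device data correction-multiple region (ADDDC-MR), adaptive data correction-single region (ADC-SR), or adaptive double device error correction (ADDEC).
Although the foregoing describes, by way of example, performing data migration and memory isolation on a bank in which a CE has occurred, it can be understood that the memory area in which a CE occurs may also be a rank, a chip, a row of a bank, a column of a bank, and so on.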
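Performing data migration and memory isolation on a memory area consumes a large share of the computer system's resources, and may therefore impair the computer system's efficient execution of other services. In a limited number of experiments, performing data migration and memory isolation through ADDDC was found to significantly affect storage bandwidth, forwarding bandwidth, and the processor's data processing latency: the maximum data input latency reached 710 ms, the maximum data output latency reached 63 ms, processor performance dropped by about 1%, and the sharp rise in processor utilization lasted about 10 ms. It may even cause virtual machines that depend on the computer system to reset, cause database input/output errors, and lead to other problems.
In view of these problems, embodiments of the present application provide a memory error handling method and apparatus. When data migration and memory isolation need to be performed on a target memory area in which a CE has occurred, whether the computer system is in an idle state can be determined from several performance indicators of the computer system within the current time interval, and data migration and memory isolation are performed on the target memory area only when the computer system is determined to be in the idle state. This avoids impairing the computer system's efficient execution of other services.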
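By way of example, FIG. 2 is a flowchart of a memory error handling method provided in an embodiment of the present application. The method may be executed by a processor, or by a computing device or computer system that includes a processor; more specifically, the processor, computing device, or computer system may execute a computer program or instructions to implement the method steps shown in FIG. 2. The computing device or computer system may include, but is not limited to, a server, a switch, a router, a base station controller, a terminal, or a computing accelerator card. The server may be an all-in-one machine, or may adopt a layered cloud architecture implemented on a baseboard management controller (BMC). Referring to FIG. 2, the method may include, but is not limited to, some or all of the following steps S200 to S210.
Step S200: obtain memory error information of the computer system.
When an error occurs in the memory of the computer system, the BIOS of the computer system may, for example, obtain the corresponding memory error information through the memory controller of the processor. Referring to FIG. 3, when the computer system is a server with a layered cloud architecture, the memory error information may further be sent by the BIOS to the BMC of the computer system. Referring to FIG. 4, when the computer system is not a server with a layered cloud architecture, the memory error information may instead be sent by the BIOS to a system management unit of the computer system. The system management unit may be the operating system (OS) deployed in the computer system, more specifically a functional module of that OS (for example, a fault analysis module), or it may be firmware in the computer system other than the deployed OS.
Step S202: determine, from the memory error information, the target memory area of the memory in which a CE has occurred and the CE pattern of that CE.
When the computer system includes a BMC, the BMC may, for example, determine the target memory area and the CE pattern from the memory error information. When the computer system does not include a BMC, the system management unit may do so instead. Specifically, feature analysis may be performed on the memory error information to determine whether the CE in the target memory area matches a given CE pattern; alternatively, machine learning may be applied to the memory error information together with other data related to the running state of the memory, to determine the CE pattern more accurately. CE patterns may include row CE, column CE, bank CE, chip CE, rank CE, and so on.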
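The feature analysis mentioned for step S202 could be sketched roughly as follows. The source does not define the format of the memory error information or the rules that map errors to CE patterns, so the `CERecord` layout, the `classify_ce_pattern` function, and the classification rule itself are hypothetical and serve only as an illustration. The `"cell CE"` label for a single faulty cell is likewise an assumption and is not a pattern named in the source.

```python
from collections import namedtuple

# Hypothetical record layout; the source does not specify the error-info format.
CERecord = namedtuple("CERecord", "dimm rank chip bank row column")


def classify_ce_pattern(records):
    """Classify CE records from a single bank (illustrative rule, an assumption).

    - same row, several columns    -> "row CE"
    - same column, several rows    -> "column CE"
    - several rows and columns     -> "bank CE"
    - a single cell                -> "cell CE" (label not taken from the source)
    """
    if not records:
        return None
    banks = {(r.dimm, r.rank, r.chip, r.bank) for r in records}
    if len(banks) != 1:
        raise ValueError("records must all belong to the same bank")
    rows = {r.row for r in records}
    cols = {r.column for r in records}
    if len(rows) == 1 and len(cols) > 1:
        return "row CE"
    if len(cols) == 1 and len(rows) > 1:
        return "column CE"
    if len(rows) > 1 and len(cols) > 1:
        return "bank CE"
    return "cell CE"
```

A production classifier would likely also weigh timing and error counts, or use the machine learning approach the text mentions; this sketch only shows the shape of a rule-based pass.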
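Step S204: determine, from the CE pattern, whether data migration and memory isolation need to be performed on the target memory area.
When the computer system includes a BMC, the BMC may, for example, make this determination from the CE pattern determined in step S202. When the computer system does not include a BMC, the system management unit may make the determination instead.
In a possible implementation, when the CE pattern determined in step S202 belongs to several preconfigured target CE patterns, step S204 determines that data migration and memory isolation need to be performed on the target memory area; conversely, when the CE pattern does not belong to the preconfigured target CE patterns, step S204 determines that data migration and memory isolation are not needed.
In a possible implementation, when the CE pattern determined in step S202 belongs to several preconfigured target CE patterns, step S204 increments by 1 the count of CEs belonging to the target CE patterns that have occurred in the target memory area. If the incremented count reaches a preset threshold, it is determined that data migration and memory isolation need to be performed on the target memory area; otherwise, it is determined that they are not needed.
The target CE patterns may include, but are not limited to: row CE, column CE, and bank CE.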
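The two variants of step S204 described above can be sketched as one small counter. This is a minimal illustration under assumptions: the class name `IsolationDecider`, the region key, and the default threshold of 3 are hypothetical, since the source only says the threshold is preset.

```python
from collections import Counter

# Target CE patterns named in the source; the set is configurable.
TARGET_CE_PATTERNS = frozenset({"row CE", "column CE", "bank CE"})


class IsolationDecider:
    """Counts target-pattern CEs per memory area and applies a threshold."""

    def __init__(self, threshold=3, target_patterns=TARGET_CE_PATTERNS):
        self.threshold = threshold          # assumed value; "preset" in the source
        self.target_patterns = set(target_patterns)
        self.counts = Counter()             # memory area -> count of target-pattern CEs

    def needs_isolation(self, region, ce_pattern):
        """Return True if migration and isolation are needed for `region`."""
        if ce_pattern not in self.target_patterns:
            return False                    # non-target patterns are not counted
        self.counts[region] += 1            # the "+1" operation of step S204
        return self.counts[region] >= self.threshold
```

Setting `threshold=1` reproduces the first variant, in which a single CE of a target pattern is enough to require isolation.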
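When step S204 determines that data migration and memory isolation need to be performed on the target memory area, the method continues with step S206: obtain several performance indicators of the computer system within the current time interval.
Step S208: determine from the performance indicators whether the computer system is in an idle state.
Step S208 may be implemented by the system management unit of the computer system.
The performance indicators may include, but are not limited to, any one or more of the following: whether the computer system is running in user mode, processor utilization, memory bandwidth, forwarding bandwidth, storage bandwidth, and whether a virtual machine that depends on the computer system and is in a busy state is located in the same NUMA node as the target memory area. Here, memory bandwidth is the product of the bus width, the bus frequency, and the number of data packets exchanged per clock cycle; forwarding bandwidth is the amount of data that can be transmitted on a line per unit time, measured in bps (bits per second); and storage bandwidth is the amount of data accessed by the storage per unit time, that is, the number of bits or bytes read from or written to the storage per unit time.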
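As a worked example of the memory bandwidth definition above (bus width × bus frequency × transfers per clock cycle), the following sketch computes the theoretical peak for one channel. The function name and the sample figures (a 64-bit channel clocked at 1600 MHz with two transfers per clock) are illustrative assumptions, not values from the source.

```python
def memory_bandwidth_bytes_per_s(bus_width_bits, bus_clock_hz, transfers_per_clock):
    """Peak bandwidth = bus width x bus frequency x transfers per clock cycle."""
    return bus_width_bits * bus_clock_hz * transfers_per_clock // 8  # bits -> bytes


# Assumed example: 64-bit channel, 1600 MHz clock, double data rate.
peak = memory_bandwidth_bytes_per_s(64, 1_600_000_000, 2)
print(peak)  # 25600000000 bytes/s, i.e. about 25.6 GB/s
```

The measured memory bandwidth within a time interval can then be compared with this peak to obtain a utilization figure.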
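In a possible implementation, when the computer system is running in user mode and the busy virtual machine that depends on the computer system is located in a different NUMA node from the target memory area, a service score may be determined for each of the remaining performance indicators in the current time interval based on preconfigured service rules. The service scores are then combined in a weighted sum to obtain a total score, and whether the computer system is in the idle state is determined from the total score.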
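The weighted-score variant could look like the sketch below. The source does not give the service rules, the weights, or the cut-off, so everything here is an assumption: each indicator is expressed as a utilization between 0 and 1, its score is taken as `1 - utilization`, and the system counts as idle when the weighted sum reaches `min_score`.

```python
def idle_score(utilization, weights):
    """Weighted sum of per-indicator scores; score = 1 - utilization (assumed rule)."""
    return sum(weights[name] * (1.0 - utilization[name]) for name in weights)


def is_idle_by_score(in_user_mode, busy_vm_on_same_numa,
                     utilization, weights, min_score):
    """Gate on the two boolean indicators, then compare the total score."""
    if not in_user_mode or busy_vm_on_same_numa:
        return False
    return idle_score(utilization, weights) >= min_score


# Hypothetical indicator names, weights and readings.
weights = {"cpu": 0.25, "mem_bw": 0.25, "fwd_bw": 0.25, "storage_bw": 0.25}
sample = {"cpu": 0.5, "mem_bw": 0.5, "fwd_bw": 0.5, "storage_bw": 0.5}
print(is_idle_by_score(True, False, sample, weights, min_score=0.4))
```

Weights let an operator express that, for instance, storage bandwidth matters more than forwarding bandwidth on a database host.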
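In a possible implementation, the computer system is determined to be in the idle state when it is running in user mode, the busy virtual machine that depends on the computer system is located in a different NUMA node from the target memory area, and processor utilization, memory bandwidth, forwarding bandwidth, and storage bandwidth are each below their respective preset reference values.
It should be noted that the computer system may have no virtual machine in a busy state. In that case, the performance indicators obtained for the current time interval may not include whether a busy virtual machine that depends on the computer system is located in the same NUMA node as the target memory area.
In summary, when the computer system is in the idle state, it should be running in user mode, any busy virtual machine that depends on it should be located in a different NUMA node from the target memory area, and processor utilization, memory bandwidth, forwarding bandwidth, and storage bandwidth should all have relatively small values. This ensures that the computer system has enough resources to perform data migration and memory isolation on the target memory area without impairing the efficient execution of its other services.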
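The reference-value variant of the idle check can be sketched as follows. The `Metrics` fields, their units, and the reference values passed by the caller are hypothetical; `busy_vm_on_same_numa=None` models the case noted above in which no busy virtual machine exists and the indicator is simply absent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Metrics:
    """Indicators sampled for one time interval (field names are assumptions)."""
    in_user_mode: bool
    cpu_utilization: float
    memory_bandwidth: float
    forwarding_bandwidth: float
    storage_bandwidth: float
    busy_vm_on_same_numa: Optional[bool] = None  # None: no busy VM exists


def is_idle(metrics, reference):
    """Idle only if both gates pass and every indicator is below its reference."""
    if not metrics.in_user_mode:
        return False
    if metrics.busy_vm_on_same_numa:   # None and False both pass this gate
        return False
    return all(getattr(metrics, name) < limit for name, limit in reference.items())
```

Because `reference` is a plain mapping from field name to limit, an indicator that is not collected on a given platform can be left out of the mapping.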
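When step S208 determines from the performance indicators for the current time interval that the computer system is not in the idle state, steps S206 and S208 may be repeated periodically at the corresponding time interval until the computer system is determined to be in the idle state, at which point step S210 is performed.
Step S210: perform data migration and memory isolation on the target memory area.
By way of example, the system management unit of the computer system may, through the BIOS, trigger the processor to perform data migration and memory isolation on the target memory area. As described above, ADDDC may be used for this purpose, as may adaptive double device data correction-multiple region (ADDDC-MR), adaptive data correction-single region (ADC-SR), adaptive double device error correction (ADDEC), or other techniques.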
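Steps S206 to S210 together form a polling loop, which can be sketched as below. The callables `get_metrics`, `is_idle`, and `migrate_and_isolate` are hypothetical hooks standing in for platform-specific code (for example, the BIOS-triggered ADDDC operation). The `max_checks` bound is an addition for illustration; the source simply repeats until the system is idle.

```python
import time


def isolate_when_idle(get_metrics, is_idle, migrate_and_isolate, region,
                      interval_s=1.0, max_checks=None, sleep=time.sleep):
    """Poll until the system is idle, then migrate and isolate `region`.

    Returns True if isolation was performed, False if `max_checks`
    samples were taken without the system becoming idle.
    """
    checks = 0
    while max_checks is None or checks < max_checks:
        metrics = get_metrics()            # step S206: sample the indicators
        if is_idle(metrics):               # step S208: idle-state decision
            migrate_and_isolate(region)    # step S210: migration and isolation
            return True
        checks += 1
        sleep(interval_s)                  # wait for the next time interval
    return False
```

Passing `sleep` as a parameter keeps the loop testable without real delays; in production the default `time.sleep` is used.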
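Based on the same concept as the foregoing method embodiments, an embodiment of the present application further provides a memory error handling apparatus deployed in a computer system that includes memory. As shown in FIG. 5, the memory error handling apparatus 50 includes: an indicator obtaining module 501, configured to obtain several performance indicators of the computer system within the current time interval when data migration and memory isolation need to be performed on a target memory area of the memory in which a CE has occurred; a state determining module 503, configured to determine from the performance indicators whether the computer system is in an idle state, and to trigger an isolation processing module when the computer system is in the idle state; and the isolation processing module 505, configured to perform data migration and memory isolation on the target memory area when triggered by the state determining module.
In a possible implementation, the performance indicators include any one or more of the following: whether the computer system is running in user mode, processor utilization, memory bandwidth, forwarding bandwidth, storage bandwidth, and whether a virtual machine that depends on the computer system and is in a busy state is located in the same non-uniform memory access (NUMA) node as the target memory area.
In a possible implementation, the apparatus further includes: an information obtaining module 507, configured to obtain memory error information of the computer system; and a fault analysis module 509, configured to determine, from the memory error information, the target memory area of the memory in which a CE has occurred and the CE pattern, and to determine from the CE pattern whether data migration and memory isolation need to be performed on the target memory area.
In a possible implementation, the fault analysis module 509 is configured to determine, when the CE pattern belongs to several preconfigured target CE patterns, that data migration and memory isolation need to be performed on the target memory area.
In a possible implementation, the fault analysis module 509 is configured to: when the CE pattern belongs to several preconfigured target CE patterns, increment by 1 the count of CEs belonging to the target CE patterns that have occurred in the target memory area; and, when the incremented count reaches a preset threshold, determine that data migration and memory isolation need to be performed on the target memory area.
In a possible implementation, the target CE patterns include at least one of the following CE patterns: row CE, column CE, and bank CE.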
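The memory error handling apparatus 50 according to the embodiments of the present application may correspondingly perform the method described in the embodiments of the present application, and the foregoing and other operations and/or functions of the modules in the apparatus 50 are intended to implement the corresponding procedures of the method in FIG. 2. For brevity, details are not repeated here.
The indicator obtaining module 501, state determining module 503, isolation processing module 505, information obtaining module 507, and fault analysis module 509 of the memory error handling apparatus 50 may be implemented in software or in hardware. The indicator obtaining module 501 is used below as an example; the state determining module 503, isolation processing module 505, information obtaining module 507, and fault analysis module 509 may be implemented in a similar way.
As an example of a software functional module, the indicator obtaining module 501 may include code running on a computing instance. The computing instance may be one of a physical host (computing device), a virtual machine, or a container.
As an example of a hardware functional module, the indicator obtaining module 501 may be a device implemented using an application-specific integrated circuit (ASIC) or a programmable logic device (PLD). The PLD may be a complex programmable logical device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.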
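Based on the same concept as the foregoing method embodiments, an embodiment of the present application further provides a computing device and a computer system. The computing device or computer system includes at least a processor and a memory; the memory stores a program, and when the processor executes the program, the steps of the method shown in FIG. 2 can be implemented.
FIG. 6 is a schematic structural diagram of a computing device provided in an embodiment of the present application.
As shown in FIG. 6, the computing device 600 includes at least one processor 601, a memory 602, and a communication interface 603. The processor 601, the memory 602, and the communication interface 603 are communicatively connected, either by wire (for example, a bus) or wirelessly. The communication interface 603 is configured to receive data sent by other devices (for example, data to be written). The memory 602 stores computer instructions, and the processor 601 executes the computer instructions to perform the method in the foregoing method embodiments.
It should be understood that, in the embodiments of the present application, the processor 601 may include a central processing unit (CPU), and may also include other general-purpose processors, digital signal processors (DSPs), application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs) or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components, and the like. A general-purpose processor may be a microprocessor or any conventional processor.
The memory 602 may include read-only memory and random access memory, and provides instructions and data to the processor 601. The memory 602 may also include non-volatile random access memory.
The memory 602 may be volatile memory or non-volatile memory, or may include both. The non-volatile memory may be read-only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically EPROM (EEPROM), or flash memory. The volatile memory may be random access memory (RAM), which is used as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM).
It should be understood that the computing device 600 according to the embodiments of the present application can perform the method shown in FIG. 2. For a detailed description of the method, see above; for brevity, details are not repeated here.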
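An embodiment of the present application provides a computer-readable storage medium storing a computer program; when the computer program is executed by a processor, the method described above is implemented.
An embodiment of the present application provides a chip including at least one processor and an interface, where the at least one processor determines program instructions or data through the interface, and the at least one processor is configured to execute the program instructions to implement the method described above.
An embodiment of the present application provides a computer program or computer program product including instructions which, when executed, cause a computer to perform the method described above.
A person of ordinary skill in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of the examples have been described above in general terms of their functions. Whether these functions are executed in hardware or software depends on the particular application and the design constraints of the technical solution. A person of ordinary skill in the art may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present application.
The steps of the methods or algorithms described in connection with the embodiments disclosed herein may be implemented in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The specific implementations described above further explain the objectives, technical solutions, and beneficial effects of the present application in detail. It should be understood that the foregoing are merely specific implementations of the present application and are not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within the scope of protection of the present application.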

Claims (15)

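  1. A memory error handling method, applied to a computer system that includes memory, characterized by comprising:
    when data migration and memory isolation need to be performed on a target memory area of the memory in which a corrected error (CE) has occurred, obtaining several performance indicators of the computer system within the current time interval;
    determining, from the performance indicators, whether the computer system is in an idle state; and
    when the computer system is in the idle state, performing data migration and memory isolation on the target memory area.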
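  2. The method according to claim 1, characterized in that the performance indicators include any one or more of the following: whether the computer system is running in user mode, processor utilization, memory bandwidth, forwarding bandwidth, storage bandwidth, and whether a virtual machine that depends on the computer system and is in a busy state is located in the same non-uniform memory access (NUMA) node as the target memory area.
  3. The method according to claim 1, characterized in that the method further comprises:
    obtaining memory error information of the computer system;
    determining, from the memory error information, the target memory area of the memory in which a CE has occurred and the CE pattern; and
    determining, from the CE pattern, whether data migration and memory isolation need to be performed on the target memory area.
  4. The method according to claim 3, characterized in that determining from the CE pattern whether data migration and memory isolation need to be performed on the target memory area comprises: when the CE pattern belongs to several preconfigured target CE patterns, determining that data migration and memory isolation need to be performed on the target memory area.
  5. The method according to claim 3, characterized in that determining from the CE pattern whether data migration and memory isolation need to be performed on the target memory area comprises:
    when the CE pattern belongs to several preconfigured target CE patterns, incrementing by 1 the count of CEs belonging to the target CE patterns that have occurred in the target memory area; and
    when the incremented count reaches a preset threshold, determining that data migration and memory isolation need to be performed on the target memory area.
  6. The method according to claim 4 or 5, characterized in that the target CE patterns include at least one of the following CE patterns: row CE, column CE, and bank CE.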
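  7. A memory error handling apparatus, deployed in a computer system that includes memory, characterized by comprising:
    an indicator obtaining module, configured to obtain several performance indicators of the computer system within the current time interval when data migration and memory isolation need to be performed on a target memory area of the memory in which a corrected error (CE) has occurred;
    a state determining module, configured to determine from the performance indicators whether the computer system is in an idle state, and to trigger an isolation processing module when the computer system is in the idle state; and
    the isolation processing module, configured to perform data migration and memory isolation on the target memory area when triggered by the state determining module.
  8. The apparatus according to claim 7, characterized in that the performance indicators include any one or more of the following: whether the computer system is running in user mode, processor utilization, memory bandwidth, forwarding bandwidth, storage bandwidth, and whether a virtual machine that depends on the computer system and is in a busy state is located in the same non-uniform memory access (NUMA) node as the target memory area.
  9. The apparatus according to claim 7, characterized in that the apparatus further comprises:
    an information obtaining module, configured to obtain memory error information of the computer system; and
    a fault analysis module, configured to determine, from the memory error information, the target memory area of the memory in which a CE has occurred and the CE pattern, and to determine from the CE pattern whether data migration and memory isolation need to be performed on the target memory area.
  10. The apparatus according to claim 9, characterized in that the fault analysis module is specifically configured to determine, when the CE pattern belongs to several preconfigured target CE patterns, that data migration and memory isolation need to be performed on the target memory area.
  11. The apparatus according to claim 9, characterized in that the fault analysis module is specifically configured to: when the CE pattern belongs to several preconfigured target CE patterns, increment by 1 the count of CEs belonging to the target CE patterns that have occurred in the target memory area; and, when the incremented count reaches a preset threshold, determine that data migration and memory isolation need to be performed on the target memory area.
  12. The apparatus according to claim 10 or 11, characterized in that the target CE patterns include at least one of the following CE patterns: row CE, column CE, and bank CE.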
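  13. A computing device, comprising a memory and a processor, characterized in that the memory stores executable code, and the processor executes the executable code to implement the method according to any one of claims 1 to 6.
  14. A computer system, comprising a memory and a processor, characterized in that the memory stores executable code, and the processor executes the executable code to implement the method according to any one of claims 1 to 6.
  15. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed in a computer, the computer is caused to perform the method according to any one of claims 1 to 6.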
PCT/CN2023/101096 2022-09-26 2023-06-19 Memory error handling method and apparatus WO2024066500A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211172016.1A CN117806855A (zh) 2022-09-26 2022-09-26 Memory error handling method and apparatus
CN202211172016.1 2022-09-26

Publications (1)

Publication Number Publication Date
WO2024066500A1 true WO2024066500A1 (zh) 2024-04-04

Family

ID=90418696

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/101096 WO2024066500A1 (zh) 2022-09-26 2023-06-19 Memory error handling method and apparatus

Country Status (2)

Country Link
CN (1) CN117806855A (zh)
WO (1) WO2024066500A1 (zh)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
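CN1834928A (zh) * 2005-03-17 2006-09-20 富士通株式会社 Soft error correction method, memory control apparatus and memory system
CN104077375A (zh) * 2014-06-24 2014-10-01 华为技术有限公司 Method for processing an error directory of a node in a CC-NUMA system, and node
US20160307645A1 (en) * 2015-04-20 2016-10-20 Qualcomm Incorporated Method and apparatus for in-system management and repair of semi-conductor memory failure
CN112231128A (zh) * 2020-09-11 2021-01-15 中科可控信息产业有限公司 Memory error handling method and apparatus, computer device and storage medium
CN113868001A (zh) * 2021-09-10 2021-12-31 苏州浪潮智能科技有限公司 Method and system for checking memory repair results, and computer storage medium
CN115016963A (zh) * 2022-05-06 2022-09-06 阿里巴巴(中国)有限公司 Memory page isolation method, memory monitoring system and computer-readable storage medium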

Also Published As

Publication number Publication date
CN117806855A (zh) 2024-04-02

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23869748

Country of ref document: EP

Kind code of ref document: A1