WO2024066500A1 - Memory error processing method and apparatus - Google Patents

Memory error processing method and apparatus

Info

Publication number
WO2024066500A1
Authority
WO
WIPO (PCT)
Prior art keywords
memory
computer system
target
memory area
isolation
Prior art date
Application number
PCT/CN2023/101096
Other languages
English (en)
Chinese (zh)
Inventor
买培培
吕洪发
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2024066500A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance

Definitions

  • the present application relates to the field of computer technology, and in particular to a memory error processing method and device.
  • RAM Random access memory
  • ECC error checking and correction
  • ADDDC adaptive double device data correction
  • the process of performing data migration and memory isolation on the memory area usually occupies a large amount of computer system resources, which may prevent the computer system from efficiently executing the other services it is currently running.
  • At least one memory error handling method and device is provided.
  • it can be determined whether the computer system is in an idle state based on several performance indicators of the computer system within the current time interval. Data migration and memory isolation are performed on the target memory area only when it is determined that the computer system is in an idle state. This can avoid affecting the efficient execution of other services of the computer system due to the performance of data migration and memory isolation on the target memory area.
  • a memory error handling method is provided, which is applied to a computer system including a memory.
  • the method comprises: when it is necessary to perform data migration and memory isolation on a target memory area where a correctable error CE occurs in the memory, several performance indicators of the aforementioned computer system in a current time interval can be first obtained, and whether the aforementioned computer system is in an idle state can be determined based on the several performance indicators; when it is determined that the aforementioned computer system is in an idle state, data migration and memory isolation are performed on the target memory area.
  • the aforementioned several performance indicators may include, but are not limited to, any one or more of the following performance indicators: whether the computer system is running in user mode, processor occupancy, memory bandwidth, forwarding bandwidth, storage bandwidth, and whether a virtual machine that depends on the computer system and is in a busy state is located in the same non-uniform memory access structure (NUMA) as the target memory area.
  • NUMA non-uniform memory access structure
  • the method further includes: obtaining memory error information of the computer system; determining the target memory area and CE mode in the memory where CE occurs according to the memory error information; and determining whether it is necessary to perform data migration and memory isolation on the target memory area according to the CE mode.
  • UCE uncorrected errors
  • determining, according to the CE mode, whether data migration and memory isolation need to be performed on the target memory area includes: when the CE mode belongs to several pre-configured target CE modes, determining that data migration and memory isolation need to be performed on the target memory area.
  • determining, according to the CE mode, whether data migration and memory isolation need to be performed on the target memory area includes: when the CE mode belongs to several pre-configured target CE modes, adding 1 to the frequency of CEs occurring in the target memory area that belong to those target CE modes; and when the frequency after the addition reaches a preset threshold, determining that data migration and memory isolation need to be performed on the target memory area.
  • the aforementioned several target CE modes include at least one of the following CE modes: row CE, column CE and bank CE.
  • a memory error processing device which is deployed in a computer system including a memory.
  • the device includes: an indicator acquisition module, which is used to acquire several performance indicators of the computer system in the current time interval when it is necessary to perform data migration and memory isolation on the target memory area where a correctable error CE occurs in the memory; a state judgment module, which is used to determine whether the computer system is in an idle state based on the several performance indicators, and to trigger the isolation processing module when the computer system is in an idle state; and the isolation processing module, which is used to perform data migration and memory isolation on the target memory area when triggered by the state judgment module.
  • the plurality of performance indicators include any one or more of the following performance indicators: whether the computer system is running in user mode, processor occupancy, memory bandwidth, forwarding bandwidth, storage bandwidth, and whether a virtual machine that depends on the computer system and is in a busy state is located in the same NUMA as the target memory area.
  • the device also includes: an information acquisition module, used to obtain memory error information of the computer system; a fault analysis module, used to determine the target memory area and CE mode where CE occurs in the memory based on the memory error information; and determine whether data migration and memory isolation need to be performed on the target memory area based on the CE mode.
  • the fault analysis module is specifically configured to determine that data migration and memory isolation need to be performed on the target memory area when the CE mode belongs to several pre-configured target CE modes.
  • the fault analysis module is specifically used to increase the frequency of CE belonging to several target CE modes occurring in the target memory area by 1 when the CE mode belongs to several pre-configured target CE modes; when the frequency after performing the addition operation reaches a preset threshold, it is determined that data migration and memory isolation need to be performed on the target memory area.
  • the several target CE modes include at least one of the following CE modes: row CE, column CE, and bank CE.
  • an embodiment of the present application provides a computing device, including a memory and a processor, wherein the memory stores executable code, and the processor executes the executable code to implement the method provided in the first aspect.
  • an embodiment of the present application provides a computer system, comprising a memory and a processor, wherein the memory stores executable code, and the processor executes the executable code to implement the method provided in the first aspect.
  • an embodiment of the present application provides a computer-readable storage medium on which a computer program is stored.
  • when the computer program is executed in a computer, the computer implements the method provided in the first aspect.
  • a computer program or a computer program product is provided in an embodiment of the present application, wherein the computer program or the computer program product comprises instructions, and when the instructions are executed, the method provided in the first aspect is implemented.
  • a chip is provided in an embodiment of the present application, the chip comprising at least one processor and an interface, wherein the at least one processor determines program instructions or data through the interface; the at least one processor is used to execute the program instructions to implement the method provided in the first aspect.
  • FIG1 is a schematic diagram of a computer system provided in an embodiment of the present application.
  • FIG2 is a flow chart of a memory error handling method provided in an embodiment of the present application.
  • FIG3 is a second schematic diagram of the structure of a computer system provided in an embodiment of the present application.
  • FIG4 is a third structural diagram of a computer system provided in an embodiment of the present application.
  • FIG5 is a schematic diagram of the structure of a memory error handling device provided in an embodiment of the present application.
  • FIG6 is a schematic diagram of a computing device provided in an embodiment of the present application.
  • errors that occur in the memory of a computer system are classified as correctable errors (CE) and uncorrected errors (UCE).
  • for a CE, various error correction algorithms, including ECC, can usually be used to correct it.
  • ECC error correction algorithms
  • a UCE may cause the computer system to perform operations that it does not support. The inability to accurately access the memory area where the UCE occurs may cause other problems and may even directly cause the computer system to stop running.
  • ADDDC adaptive double device data correction
  • the processor and basic input output system (BIOS) of the computer system can be implemented as corresponding firmware respectively, and the processor can connect several dual inline memory modules (DIMMs) through its memory controller, for example, two DIMMs such as DIMM0 and DIMM1 are connected through a single memory channel.
  • DIMMs dual inline memory modules
  • a single DIMM can include two ranks, such as rank 0 and rank 1; a single rank can include 18 chips, such as chip 00 to chip 17, and chip 17 can be used as a redundant chip; a single chip can include n+1 logical banks, such as bank 0 to bank n.
  • for example, suppose bank n of chip 00 in rank 0 of DIMM0 is determined, based on certain rules, to need data migration and memory isolation because of a CE.
  • in that case, the data stored in bank n of chip 00 in rank 0 of DIMM0 can be migrated to bank n of chip 17 in rank 0 of DIMM1 and bank n of chip 17 in rank 0 of DIMM0 through the ADDDC technology, and bank n of chip 00 in rank 0 of DIMM0 is isolated.
  • the data migrated to bank n of chip 17 in rank 0 of DIMM1 and the data migrated to bank n of chip 17 in rank 0 of DIMM0 can be used to recover the data originally stored in bank n of chip 00 in rank 0 of DIMM0.
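The addressing in the example above can be sketched in a few lines of Python. This is only an illustrative model of the DIMM/rank/chip/bank hierarchy described in the text; the names `BankAddress`, `REDUNDANT_CHIP` and `migration_targets` are hypothetical, and real ADDDC migration is carried out by the memory controller and firmware, not by software like this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BankAddress:
    """Location of one logical bank in the DIMM/rank/chip/bank hierarchy."""
    dimm: int
    rank: int
    chip: int
    bank: int


# Per the example in the text, chip 17 of a rank serves as the redundant chip.
REDUNDANT_CHIP = 17


def migration_targets(faulty: BankAddress, buddy_dimm: int) -> list:
    """Return the two redundant-chip banks that receive the faulty bank's data.

    Assumption: the targets share the rank and bank index of the faulty bank,
    one on the buddy DIMM and one on the faulty bank's own DIMM.
    """
    return [
        BankAddress(buddy_dimm, faulty.rank, REDUNDANT_CHIP, faulty.bank),
        BankAddress(faulty.dimm, faulty.rank, REDUNDANT_CHIP, faulty.bank),
    ]


# Bank 5 stands in for "bank n" of chip 00 in rank 0 of DIMM0.
faulty = BankAddress(dimm=0, rank=0, chip=0, bank=5)
targets = migration_targets(faulty, buddy_dimm=1)
```

With these inputs, `targets` names bank 5 of chip 17 in rank 0 of DIMM1 and of DIMM0, matching the two destinations in the example.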
  • ADDDC-MR adaptive double device data correction-multiple region
  • ADC-SR adaptive data correction-single region
  • ADDEC adaptive double device error correction
  • the memory area where CE occurs may also be a rank, a chip, a row belonging to a bank, or a column belonging to a bank, and so on.
  • a memory error handling method and device are provided in an embodiment of the present application.
  • it can be determined whether the computer system is in an idle state based on several performance indicators of the computer system in the current time interval, and data migration and memory isolation are performed on the target memory area only when it is determined that the computer system is in an idle state, so as to avoid affecting the efficient execution of other services of the computer system due to the execution of data migration and memory isolation on the target memory area.
  • FIG2 is a flowchart of a memory error handling method provided in an embodiment of the present specification.
  • the method may be executed by a processor, or by a computing device/computer system including a processor; more specifically, the processor, or the computing device/computer system including the processor, may execute a computer program/instructions to implement the method steps shown in FIG2.
  • the aforementioned computing device/computer system may, for example, include but is not limited to a server, a switch, a router, a base station controller, a terminal or a computing acceleration card, etc.
  • the aforementioned server may generally be an all-in-one machine, or the aforementioned server may adopt a layered cloud architecture implemented based on a baseboard management controller (BMC). Please refer to FIG2.
  • the method may include but is not limited to part or all of the following steps S200 to S210.
  • Step S200 obtaining memory error information of the computer system.
  • the BIOS of the computer system can obtain corresponding memory error information through the memory controller of the processor.
  • the aforementioned memory error information can also be sent by the BIOS of the computer system to the BMC of the computer system, for example.
  • the aforementioned memory error information can also be sent by the BIOS of the computer system to the system management unit of the computer system, for example.
  • the aforementioned system management unit can be an operating system (OS) deployed in the computer system, and more specifically, it can be a functional module (such as a fault analysis module) included in the OS deployed in the computer system, or the system management unit can also be other firmware in the computer system other than the OS deployed therein.
  • OS operating system
  • Step S202 determining a target memory area where CE occurs in the memory of the computer system and a CE mode of the CE that occurs according to the memory error information.
  • the BMC of the computer system can be used to determine the target memory area where CE occurs and the CE mode of the CE that occurs according to the memory error information.
  • the system management unit of the computer system can be used to determine the target memory area where CE occurs and the CE mode of the CE that occurs according to the memory error information.
  • feature analysis can be performed on the memory error information to determine whether the CE that occurs in the target memory area meets the corresponding CE mode; or, machine learning can be used to analyze the memory error information and other data related to the memory operating status to more accurately determine the CE mode of the CE that occurs in the target memory area.
  • CE modes may include row CE, column CE, bank CE, chip CE, and rank CE, etc.
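As a rough illustration of the "feature analysis" mentioned in step S202, the sketch below classifies a set of CE records from a single chip by how their bank, row and column addresses are spread. The heuristic, the `CeRecord` fields and the fallback label `"cell CE"` are assumptions made for this example; the source does not specify how CE modes are actually derived.

```python
from collections import namedtuple

# Hypothetical record of one correctable error within a single chip.
CeRecord = namedtuple("CeRecord", "bank row column")


def classify_ce_mode(records):
    """Classify CE records from one chip into a coarse CE mode.

    Assumed heuristic: errors spread over several banks suggest a chip CE;
    within one bank, errors confined to one row or one column suggest a
    row CE or column CE; errors spread over rows and columns suggest a bank CE.
    """
    if not records:
        raise ValueError("at least one CE record is required")
    banks = {r.bank for r in records}
    rows = {r.row for r in records}
    columns = {r.column for r in records}
    if len(banks) > 1:
        return "chip CE"
    if len(rows) == 1 and len(columns) > 1:
        return "row CE"
    if len(columns) == 1 and len(rows) > 1:
        return "column CE"
    if len(rows) > 1 and len(columns) > 1:
        return "bank CE"
    return "cell CE"  # single repeated address; label not taken from the source


row_mode = classify_ce_mode([CeRecord(0, 7, 1), CeRecord(0, 7, 9)])
bank_mode = classify_ce_mode([CeRecord(0, 1, 4), CeRecord(0, 2, 5)])
```

A production analyzer would likely also weigh timestamps and error counts, or use the machine learning approach the text mentions.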
  • Step S204 Determine whether data migration and memory isolation need to be performed on the target memory area according to the CE mode.
  • the BMC of the computer system can determine whether it is necessary to perform data migration and memory isolation on the target memory area according to the CE mode determined in step S202.
  • the system management unit of the computer system can determine whether it is necessary to perform data migration and memory isolation on the target memory area according to the CE mode determined in step S202.
  • when the CE mode determined in step S202 belongs to several pre-configured target CE modes, it can be determined in step S204 that data migration and memory isolation need to be performed on the target memory area; conversely, when the CE mode determined in step S202 does not belong to the several pre-configured target CE modes, it can be determined in step S204 that data migration and memory isolation do not need to be performed on the target memory area.
  • alternatively, when the CE mode determined in step S202 belongs to several pre-configured target CE modes, the frequency of CEs belonging to those target CE modes occurring in the target memory area can be increased by 1 in step S204. If the frequency after the addition reaches a preset threshold, it is determined that data migration and memory isolation need to be performed on the target memory area; conversely, if the frequency after the addition does not reach the preset threshold, it is determined that data migration and memory isolation do not need to be performed on the target memory area.
  • the aforementioned target CE modes may include but are not limited to: row CE, column CE and bank CE.
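The frequency-threshold variant of step S204 can be sketched as a per-region counter. The class name `IsolationDecider`, the region key format and the threshold value are hypothetical; only the target CE modes and the "add 1, then compare with a preset threshold" rule come from the description.

```python
from collections import Counter

# Target CE modes named in the description.
TARGET_CE_MODES = {"row CE", "column CE", "bank CE"}


class IsolationDecider:
    """Counts target-mode CEs per memory region and applies a preset threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.counts = Counter()

    def record(self, region, ce_mode):
        """Return True when the region needs data migration and memory isolation."""
        if ce_mode not in TARGET_CE_MODES:
            return False  # non-target modes are not counted
        self.counts[region] += 1  # the "add 1 to the frequency" step
        return self.counts[region] >= self.threshold


decider = IsolationDecider(threshold=3)
region = "DIMM0/rank0/chip00/bank5"  # hypothetical region identifier
decisions = [decider.record(region, "row CE") for _ in range(3)]
```

Setting `threshold=1` reproduces the simpler variant, in which a single CE in a target mode is enough to require isolation.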
  • when step S204 determines that data migration and memory isolation need to be performed on the target memory area, step S206 is executed next.
  • Step S206 obtaining several performance indicators of the computer system in the current time interval.
  • Step S208 determining whether the computer system is in an idle state according to a number of performance indicators.
  • the aforementioned step S208 may be implemented by a system management unit of the computer system.
  • the aforementioned performance indicators may include, but are not limited to, any one or more of the following performance indicators: whether the computer system is running in user mode, processor occupancy, memory bandwidth, forwarding bandwidth, storage bandwidth, and whether the virtual machine that depends on the computer system and is in a busy state is located in the same NUMA as the target memory area.
  • memory bandwidth is the product of the bus width, the bus frequency, and the number of data packets exchanged in a clock cycle; forwarding bandwidth refers to the amount of data that can be transmitted on the line per unit time, measured in bps (bits per second); storage bandwidth refers to the amount of data accessed by the memory per unit time, that is, the number of bits or bytes read or written by the memory per unit time.
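As a worked example of the memory bandwidth product, the sketch below multiplies bus width, bus frequency and transfers per clock cycle, then converts bits to bytes. The function name and the sample figures (a 64-bit bus clocked at 1600 MHz with two transfers per cycle) are chosen for illustration and are not taken from the source.

```python
def memory_bandwidth_bytes_per_s(bus_width_bits, bus_frequency_hz, transfers_per_cycle):
    """Peak memory bandwidth in bytes per second.

    bandwidth = bus width x bus frequency x transfers per clock cycle,
    divided by 8 to convert bits to bytes.
    """
    return bus_width_bits * bus_frequency_hz * transfers_per_cycle / 8


# 64 bits x 1.6e9 cycles/s x 2 transfers/cycle = 204.8e9 bits/s = 25.6e9 bytes/s
peak = memory_bandwidth_bytes_per_s(64, 1_600_000_000, 2)
```

This is a theoretical peak; the bandwidth actually observed during a time interval, which is what an idle check would compare against a reference value, is normally lower.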
  • the service scores corresponding to the remaining performance indicators in the current time interval can be further determined based on pre-configured service rules; a weighted sum of the service scores is then computed to obtain a total score, and whether the computer system is in an idle state is determined based on the size of the total score.
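The weighted-sum scoring can be sketched as follows. The indicator names, scores, weights and idle threshold are all hypothetical, and the sketch assumes that a lower total score means a less busy system; the source only says that the decision is based on the size of the total score.

```python
def total_service_score(scores, weights):
    """Weighted sum of per-indicator service scores."""
    return sum(weights[name] * score for name, score in scores.items())


def is_idle_by_score(scores, weights, idle_threshold):
    """Assumption: the system counts as idle when the total score is at or below the threshold."""
    return total_service_score(scores, weights) <= idle_threshold


# Hypothetical scores (0 to 100) derived from pre-configured service rules.
scores = {
    "processor_occupancy": 20,
    "memory_bandwidth": 10,
    "forwarding_bandwidth": 0,
    "storage_bandwidth": 30,
}
weights = {
    "processor_occupancy": 4,
    "memory_bandwidth": 3,
    "forwarding_bandwidth": 2,
    "storage_bandwidth": 1,
}
total = total_service_score(scores, weights)  # 4*20 + 3*10 + 2*0 + 1*30
idle = is_idle_by_score(scores, weights, idle_threshold=200)
```

Using integer weights keeps the arithmetic exact in this sketch; fractional weights that sum to 1 would work the same way.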
  • a virtual machine that relies on the computer system and is in a busy state
  • when performance indicators such as processor occupancy, memory bandwidth, forwarding bandwidth, and storage bandwidth are all less than their respective preset reference values, it is determined that the computer system is in an idle state.
  • the performance indicators of the computer system obtained in the current time interval may not include whether the virtual machines that depend on the computer system and are busy are located in the same NUMA as the target memory area.
  • when the computer system is in an idle state, it should be running in user mode.
  • the virtual machines that rely on the computer system and are in a busy state should be located in different NUMAs from the target memory area.
  • various indicators such as processor occupancy, memory bandwidth, forwarding bandwidth, and storage bandwidth should have relatively small values to ensure that the computer system has sufficient resources to support data migration and memory isolation of the target memory area, thereby avoiding affecting the efficient execution of other services that the computer system needs to execute due to data migration and memory isolation of the target memory area.
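The idle conditions described above can be combined into one predicate, sketched below. The `Indicators` fields mirror the performance indicators listed in the description, while the reference values are placeholders invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Indicators:
    """Performance indicators sampled in the current time interval."""
    user_mode: bool
    busy_vm_on_target_numa: bool
    processor_occupancy: float   # fraction, 0.0 to 1.0
    memory_bandwidth: float      # bytes per second
    forwarding_bandwidth: float  # bits per second
    storage_bandwidth: float     # bytes per second


# Hypothetical preset reference values; real ones would be tuned per system.
REFERENCE_VALUES = {
    "processor_occupancy": 0.30,
    "memory_bandwidth": 5e9,
    "forwarding_bandwidth": 1e9,
    "storage_bandwidth": 2e8,
}


def is_idle(ind, reference=REFERENCE_VALUES):
    """Idle means: user mode, no busy VM on the target NUMA, all loads below reference."""
    if not ind.user_mode or ind.busy_vm_on_target_numa:
        return False
    return all(getattr(ind, name) < limit for name, limit in reference.items())


quiet = Indicators(True, False, 0.05, 1e9, 1e8, 1e7)
loaded = Indicators(True, False, 0.80, 1e9, 1e8, 1e7)
```

Checking the two boolean conditions first means a system with a busy virtual machine on the target NUMA is never reported as idle, however low the other indicators are.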
  • step S208 When it is determined in step S208 that the computer system is not in an idle state based on several performance indicators of the computer system in the current time interval, the aforementioned steps S206 and S208 can be periodically executed at corresponding time intervals until it is determined that the computer system is in an idle state, and then the following step S210 is executed.
  • Step S210 performing data migration and memory isolation on the target memory area.
  • the system management unit of the computer system can trigger the processor of the computer system to perform data migration and memory isolation on the target memory area through the BIOS of the computer system.
  • the ADDDC technology can be used to achieve data migration and memory isolation on the target memory area.
  • adaptive double device data correction-multiple region ADDDC-MR
  • adaptive data correction-single region ADC-SR
  • adaptive double device error correction ADDEC
  • other technologies may also be used to achieve data migration and memory isolation on the target memory area.
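The control flow of steps S206 to S210 (sample the indicators, check for idleness, retry periodically, then isolate) can be sketched as a polling loop. All callables are injected, so the sketch makes no claim about how indicators are collected or how isolation is triggered; the function name and the `max_checks` safety limit are additions for this example and are not part of the described method.

```python
import time


def isolate_when_idle(sample_indicators, is_idle, isolate, interval_s,
                      sleep=time.sleep, max_checks=None):
    """Repeat S206/S208 every interval_s seconds; run S210 once the system is idle.

    Returns True if isolation was performed, False if max_checks was exhausted.
    """
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        if is_idle(sample_indicators()):   # S206 + S208
            isolate()                      # S210
            return True
        sleep(interval_s)                  # wait for the next time interval
    return False


# Simulated run: the system is busy for two intervals, then idle.
samples = iter([False, False, True])
actions = []
done = isolate_when_idle(
    sample_indicators=lambda: next(samples),
    is_idle=lambda sample: sample,
    isolate=lambda: actions.append("migrate+isolate"),
    interval_s=0,
    sleep=lambda seconds: None,  # skip real waiting in the simulation
)
```

Passing `sleep` as a parameter keeps the loop testable without real delays; a deployment would use the default `time.sleep` or a scheduler.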
  • the memory error handling device 50 includes: an indicator acquisition module 501, which is used to obtain several performance indicators of the computer system within the current time interval when it is necessary to perform data migration and memory isolation on the target memory area where CE occurs in the memory; a state judgment module 503, which is used to determine whether the computer system is in an idle state based on the several performance indicators, and trigger the isolation processing module when the computer system is in an idle state; the isolation processing module 505, which is used to perform data migration and memory isolation on the target memory area under the triggering of the state judgment module.
  • the several performance indicators include any one or more of the following performance indicators: whether the computer system is running in user mode, processor occupancy, memory bandwidth, forwarding bandwidth, storage bandwidth, and whether a virtual machine that depends on the computer system and is in a busy state is located in the same non-uniform memory access structure NUMA as the target memory area.
  • the device also includes: an information acquisition module 507, used to obtain memory error information of the computer system; a fault analysis module 509, used to determine the target memory area and CE mode where CE occurs in the memory based on the memory error information; and determine whether it is necessary to perform data migration and memory isolation on the target memory area based on the CE mode.
  • the fault analysis module 509 is used to determine that data migration and memory isolation need to be performed on the target memory area when the CE mode belongs to several pre-configured target CE modes.
  • the fault analysis module 509 is used to, when the CE mode belongs to several pre-configured target CE modes, increase the frequency of CE occurring in the target memory area belonging to the several target CE modes by 1; when the frequency after performing the addition operation reaches a preset threshold, determine that data migration and memory isolation need to be performed on the target memory area.
  • the several target CE modes include at least one of the following CE modes: row CE, column CE and bank CE.
  • the memory error handling device 50 may correspond to executing the method described in the embodiment of the present application, and the aforementioned operations and other operations and/or functions respectively performed by each module in the memory error handling device 50 are respectively for realizing the corresponding processes of each method in Figure 2, which will not be repeated here for the sake of brevity.
  • the indicator acquisition module 501, the state judgment module 503, the isolation processing module 505, the information acquisition module 507 and the fault analysis module 509 included in the device can be implemented by software or by hardware.
  • the implementation of the indicator acquisition module 501 is introduced below by taking the indicator acquisition module 501 as an example.
  • the implementation of the state judgment module 503, the isolation processing module 505, the information acquisition module 507 and the fault analysis module 509 can refer to the implementation of the indicator acquisition module 501.
  • the indicator acquisition module 501 may include code running on a computing instance.
  • the computing instance may include a physical host (computing device), a virtual machine, or a container.
  • the indicator acquisition module 501 can be a device implemented by an application-specific integrated circuit (ASIC) or a programmable logic device (PLD).
  • the PLD can be a complex programmable logical device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL) or any combination thereof.
  • CPLD complex programmable logical device
  • FPGA field-programmable gate array
  • GAL generic array logic
  • the computing device/computer system includes at least a processor and a memory, and a program is stored in the memory.
  • when the processor executes the program, it can implement the units or modules that perform each step of the method shown in Figure 2.
  • FIG6 is a schematic diagram of the structure of a computing device provided in an embodiment of the present application.
  • the computing device 600 includes at least one processor 601, a memory 602, and a communication interface 603.
  • the processor 601, the memory 602, and the communication interface 603 are connected in communication, and the communication connection can be realized by wired means (such as a bus) or by wireless means.
  • the communication interface 603 is used to receive data (such as write data) sent by other devices; the memory 602 stores computer instructions, and the processor 601 executes the computer instructions to execute the method in the aforementioned method embodiment.
  • the processor 601 may include a central processing unit (CPU), and the processor 601 may also include other general-purpose processors, digital signal processors (DSP), application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
  • DSP digital signal processor
  • ASIC application specific integrated circuit
  • FPGA field programmable gate array
  • the general-purpose processor may be a microprocessor or any conventional processor, etc.
  • the memory 602 may include a read-only memory and a random access memory, and provides instructions and data to the processor 601.
  • the memory 602 may also include a nonvolatile random access memory.
  • the memory 602 may be a volatile memory or a nonvolatile memory, or may include both volatile and nonvolatile memories.
  • the nonvolatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory.
  • the volatile memory may be a random access memory (RAM), which is used as an external cache.
  • RAM random access memory
  • SRAM static RAM
  • DRAM dynamic random access memory
  • SDRAM synchronous DRAM
  • DDR SDRAM double data rate synchronous dynamic random access memory
  • enhanced SDRAM enhanced synchronous dynamic random access memory
  • SLDRAM synchronous link dynamic random access memory
  • DR RAM direct rambus RAM
  • computing device 600 can execute the method shown in Figure 2 in the embodiment of the present application.
  • the detailed description of the implementation of the method is shown above, and for the sake of brevity, it will not be repeated here.
  • a computer-readable storage medium on which a computer program is stored.
  • when the computer program is executed by a processor, the above-mentioned method is implemented.
  • a chip is provided in an embodiment of the present application.
  • the chip includes at least one processor and an interface.
  • the at least one processor determines program instructions or data through the interface; the at least one processor is used to execute the program instructions to implement the method mentioned above.
  • a computer program or a computer program product is provided in an embodiment of the present application.
  • the computer program or the computer program product includes instructions. When the instructions are executed, the computer is caused to execute the above-mentioned method.
  • the steps of the methods or algorithms described in conjunction with the embodiments disclosed herein may be implemented using hardware, a software module executed by a processor, or a combination of the two.
  • the software module may be placed in a random access memory (RAM), a memory, a read-only memory (ROM), an electrically programmable ROM, an electrically erasable programmable ROM, a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

Disclosed is a memory error processing method applied to a computer system, the computer system comprising a memory. The method comprises: when it is determined that data migration and memory isolation need to be performed on a target memory area of the memory in which a correctable error occurs, acquiring several performance indicators of the computer system in the current time interval, and determining, according to the several performance indicators, whether the computer system is in an idle state; and when the computer system is in the idle state, performing data migration and memory isolation on the target memory area. In this way, data migration and memory isolation are performed on the target memory area in which a correctable error occurs only when it is determined that the computer system is already in an idle state, which avoids affecting the efficient execution of other services by the computer system as a result of performing data migration and memory isolation on the target memory area.
PCT/CN2023/101096 2022-09-26 2023-06-19 Memory error processing method and apparatus WO2024066500A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211172016.1A CN117806855A (zh) 2022-09-26 2022-09-26 Memory error handling method and apparatus
CN202211172016.1 2022-09-26

Publications (1)

Publication Number Publication Date
WO2024066500A1 (fr) 2024-04-04

Family

ID=90418696

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/101096 WO2024066500A1 (fr) 2022-09-26 2023-06-19 Memory error processing method and apparatus

Country Status (2)

Country Link
CN (1) CN117806855A (fr)
WO (1) WO2024066500A1 (fr)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
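CN1834928A (zh) * 2005-03-17 2006-09-20 富士通株式会社 Soft error correction method, storage control device, and storage system
CN104077375A (zh) * 2014-06-24 2014-10-01 华为技术有限公司 Method for processing an error directory of a node in a CC-NUMA system, and node
US20160307645A1 (en) * 2015-04-20 2016-10-20 Qualcomm Incorporated Method and apparatus for in-system management and repair of semi-conductor memory failure
CN112231128A (zh) * 2020-09-11 2021-01-15 中科可控信息产业有限公司 Memory error handling method and apparatus, computer device, and storage medium
CN113868001A (zh) * 2021-09-10 2021-12-31 苏州浪潮智能科技有限公司 Method and system for checking a memory repair result, and computer storage medium
CN115016963A (zh) * 2022-05-06 2022-09-06 阿里巴巴(中国)有限公司 Memory page isolation method, memory monitoring system, and computer-readable storage medium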

Also Published As

Publication number Publication date
CN117806855A (zh) 2024-04-02

Similar Documents

Publication Publication Date Title
US11232848B2 (en) Memory module error tracking
CN108268340B (zh) Method for correcting errors in a memory
US20160055059A1 (en) Memory devices and modules
US9411743B2 (en) Detecting memory corruption
JP6815723B2 (ja) Memory system and operating method thereof
TW202006548A (zh) Storage device and multi-chip system
US11080135B2 (en) Methods and apparatus to perform error detection and/or correction in a memory device
JP2006092537A (ja) Technique for converting a merge buffer system-kill error into a process-kill error
US11960350B2 (en) System and method for error reporting and handling
US20180276161A1 (en) PCIe VIRTUAL SWITCHES AND AN OPERATING METHOD THEREOF
US8261134B2 (en) Error management watchdog timers in a multiprocessor computer
CN103984506B (zh) Method and system for writing data to a flash memory storage device
US11003606B2 (en) DMA-scatter and gather operations for non-contiguous memory
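CN106445720A (zh) Memory error recovery method and apparatus
CN115168088A (zh) Method and apparatus for repairing uncorrectable memory errors
CN115328684A (zh) Memory fault reporting method, BMC, and electronic device
CN115168087A (zh) Method and apparatus for determining the repair resource granularity for a memory fault
WO2024066500A1 (fr) Memory error processing method and apparatus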
US20120017116A1 (en) Memory control device, memory device, and memory control method
EP4280064A1 (fr) Systems and methods for expandable memory error handling
US20220350500A1 (en) Embedded controller and memory to store memory error information
US11755235B2 (en) Increasing random access bandwidth of a DDR memory in a counter application
US9251054B2 (en) Implementing enhanced reliability of systems utilizing dual port DRAM
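CN116483612B (zh) Memory fault handling method and apparatus, computer device, and storage medium
CN116401085A (zh) Memory exception handling method, device, and storage medium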

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23869748

Country of ref document: EP

Kind code of ref document: A1