WO2021135272A1 - Method and system for processing memory abnormalities, electronic device, and storage medium - Google Patents

Method and system for processing memory abnormalities, electronic device, and storage medium

Info

Publication number
WO2021135272A1
Authority
WO
WIPO (PCT)
Prior art keywords
memory
target
delay parameter
read
write
Prior art date
Application number
PCT/CN2020/110362
Other languages
English (en)
French (fr)
Inventor
李双庆
Original Assignee
苏州浪潮智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏州浪潮智能科技有限公司 filed Critical 苏州浪潮智能科技有限公司
Priority to US17/789,953 priority Critical patent/US11977744B2/en
Publication of WO2021135272A1 publication Critical patent/WO2021135272A1/zh

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/1666 Error detection or correction of the data by redundancy in hardware where the redundant component is memory or memory area
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/073 Error or fault processing not based on redundancy, the processing taking place in a memory management context, e.g. virtual memory or cache management
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • G06F11/0793 Remedial or corrective actions
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0614 Improving the reliability of storage systems
    • G06F3/0619 Improving the reliability of storage systems in relation to data integrity, e.g. data losses, bit errors
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0629 Configuration or reconfiguration of storage systems
    • G06F3/0655 Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
    • G06F3/0659 Command handling arrangements, e.g. command buffers, queues, command scheduling
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671 In-line storage system
    • G06F3/0673 Single storage device
    • G06F2201/00 Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F2201/81 Threshold
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C29/00 Checking stores for correct operation; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/02 Detection or location of defective auxiliary circuits, e.g. defective refresh counters
    • G11C29/023 Detection or location of defective auxiliary circuits in clock generator or timing circuitry
    • G11C29/028 Detection or location of defective auxiliary circuits with adaption or trimming of parameters
    • G11C29/04 Detection or location of defective memory elements, e.g. cell construction details, timing of test signals
    • G11C2029/0411 Online error correction

Definitions

  • This application relates to the field of computer technology, and in particular to a method and system for processing memory abnormalities, an electronic device, and a storage medium.
  • Memory errors can be broadly divided into CE (Correct Error, i.e. correctable errors) and UCE (Uncorrect Error, i.e. uncorrectable errors).
  • A CE can be corrected algorithmically, but the correction reduces system performance. Persistent CEs can also turn into UCEs, causing software to read or write wrong data or even bringing the system down.
  • The purpose of this application is to provide a method, a system, an electronic device, and a storage medium for processing memory abnormalities that can reduce the error rate of memory reads and writes.
  • In the method for processing memory abnormalities, the memory delay parameter is the waiting time that the memory controller applies after the target memory module receives a read/write command.
  • Optionally, after the hot add operation is performed on the target memory module, the method further includes:
  • the target memory module receives a new read/write command sent by the memory controller and, after a delay corresponding to the memory delay parameter, feeds back a Ready signal to the memory controller, so that the memory controller performs read/write operations on the target memory module after receiving the Ready signal.
  • Optionally, performing the hot removal operation on the target memory module includes:
  • removing the address space in which the target memory module is located from the operating system by pulling the first GPIO pin low, so as to set the target memory module to an unavailable state.
  • Optionally, performing the hot add operation on the target memory module includes:
  • adding the address space in which the target memory module is located to the operating system by pulling the second GPIO pin low, so as to restore the target memory module to an available state.
  • Optionally, writing the memory delay parameter to the memory controller includes accessing the memory controller over the PECI bus and writing the memory delay parameter to a register of the memory controller.
  • Optionally, the target memory module is a DDR4 memory module or a DDR3 memory module.
  • Optionally, calculating the memory delay parameter includes obtaining the memory initialization code used during startup of the basic input/output system and calculating the memory delay parameter according to that code.
  • This application also provides a system for processing memory abnormalities, which includes:
  • an error count reading module, configured to read the number of memory errors reported for a target memory module in a memory error register;
  • a memory hot removal module, configured to perform a hot removal operation on the target memory module when the number of reported memory errors is greater than a preset value;
  • a parameter setting module, configured to calculate a memory delay parameter and write it to the memory controller, wherein the memory delay parameter is the waiting time that the memory controller applies after the target memory module receives a read/write command;
  • a memory hot add module, configured to perform a hot add operation on the target memory module, so that the memory controller continues to perform read/write operations on the target memory module using the memory delay parameter.
  • the present application also provides a storage medium on which a computer program is stored, and when the computer program is executed, the steps performed by the method for processing memory abnormalities described above are implemented.
  • the present application also provides an electronic device including a memory and a processor, the memory stores a computer program, and the processor invokes the computer program in the memory to implement the steps performed by the method for processing memory exceptions.
  • This application provides a method for processing memory abnormalities, which includes: reading the number of memory errors reported for a target memory module in a memory error register; when the number of reported memory errors is greater than a preset value, performing a hot removal operation on the target memory module; calculating a memory delay parameter and writing it to the memory controller, wherein the memory delay parameter is the waiting time that the memory controller applies after the target memory module receives a read/write command; and performing a hot add operation on the target memory module, so that the memory controller continues to perform read/write operations on the target memory module using the memory delay parameter.
  • The memory delay parameter is the waiting time that the memory controller applies after the target memory module receives a read/write command.
  • In practice, the waiting time that the target memory module actually needs after receiving a read/write command varies with temperature, humidity, and memory state.
  • This application also provides a processing system for memory abnormalities, an electronic device, and a storage medium, which have the above-mentioned beneficial effects, and will not be repeated here.
  • FIG. 1 is a flowchart of a method for processing a memory exception provided by an embodiment of the application
  • FIG. 2 is a structural diagram of a server provided by an embodiment of the application.
  • FIG. 3 is a schematic structural diagram of a processing system for memory exceptions provided by an embodiment of the application.
  • Refer to FIG. 1, which is a flowchart of a method for processing memory abnormalities provided by an embodiment of this application.
  • The memory Error register (i.e. the memory error register) can store the number of memory errors reported for each memory module. Specifically, the number of reported memory errors mentioned in this step may include the number of correctable memory read/write errors.
  • This embodiment does not limit the type of the target memory module; it may be, for example, a DDR4 memory module or a DDR3 memory module.
  • This embodiment can preset a threshold for deciding whether the hot removal operation needs to be performed, namely the preset value in this step. When the number of memory errors reported for the target memory module exceeds the preset value, enough read/write errors have already occurred on that module, and a further increase could bring the system down, so the hot removal operation can be performed at this point.
  • Specifically, the address space in which the target memory module is located can be removed from the operating system by pulling the first GPIO pin low, so as to set the target memory module to an unavailable state.
  • Read/write errors occur on a memory module because the time corresponding to the memory delay parameter in the memory controller is shorter than the waiting time the target memory module actually needs after receiving a read/write command. To avoid further memory read/write errors, this embodiment first sets the target memory module to an unavailable state, then recalculates the memory delay parameter and writes it to the memory controller.
  • The memory delay parameter written to the memory controller in this embodiment is the waiting time that the memory controller applies after the target memory module receives a read/write command.
  • As a feasible implementation, the memory delay parameter may be calculated by obtaining the memory initialization code used during startup of the basic input/output system and calculating the memory delay parameter according to that code.
  • The memory initialization code is the MRC (Memory Reference Code).
  • The basic input/output system (BIOS) contains the memory initialization code.
  • The memory initialization code consists of a series of steps that read and write the memory controller.
  • As another feasible implementation, the memory controller can be accessed over the PECI bus in order to write the memory delay parameter to a register of the memory controller.
  • S104: Perform a hot add operation on the target memory module, so that the memory controller continues to perform read/write operations on the target memory module using the memory delay parameter.
  • This step presupposes that the memory delay parameter has already been written to the memory controller. After the hot add operation is performed on the target memory module, the memory controller continues to perform read/write operations on it using the memory delay parameter.
  • Specifically, when software reads or writes memory, the process may be as follows: the memory controller sends a read/write command to the target memory module over the CMD signal line; after receiving the command, the target memory module waits for the time corresponding to the memory delay parameter and then uses the Ready signal to tell the memory controller that it is ready to be read or written; after receiving the Ready signal, the memory controller reads or writes the data on the memory module over the DATA line.
  • As a feasible implementation, performing the hot add operation on the target memory module in this embodiment may include: adding the address space in which the target memory module is located to the operating system by pulling the second GPIO pin low, so as to restore the target memory module to an available state.
  • In this embodiment, when the number of memory errors reported for the target memory module in the Error register is greater than the preset value, the memory delay parameter is recalculated and written to the memory controller, so that the memory controller can continue to perform read/write operations on the target memory module.
  • The memory delay parameter is the waiting time that the memory controller applies after the target memory module receives a read/write command.
  • In practice, the waiting time that the target memory module actually needs after receiving a read/write command varies with temperature, humidity, and memory state.
  • When the waiting time actually needed grows while the memory delay parameter in the memory controller stays the same, the number of reported memory errors increases. By recalculating and setting the memory delay parameter, this embodiment can reduce memory read/write errors caused by an insufficient waiting time, thereby reducing the error rate of memory reads and writes.
  • As a further description of the embodiment corresponding to FIG. 1, after the hot add operation is performed on the target memory module in S104, the target memory module receives a new read/write command sent by the memory controller and, after a delay corresponding to the memory delay parameter, feeds back the Ready signal to the memory controller, so that the memory controller performs read/write operations on the target memory module after receiving the Ready signal.
  • The flow described above is illustrated below with a practical embodiment. FIG. 2 is a structural diagram of a server provided in an embodiment of this application. This embodiment may include the following steps:
  • Step 1: During BIOS startup, the BIOS sends the action that performs MRC detection and calculates the memory delay parameter to the BMC over the LPC bus. GPIO1 and GPIO2 are configured with the SMI attribute; GPIO1 notifies the OS of a memory hot removal, and GPIO2 notifies the OS of a memory hot add.
  • SMI (System Management Interrupt) is an interrupt on the x86 platform. LPC (Low Pin Count) is a data bus used for communication between the BMC and the PCH. The CPU and the PCH (Platform Controller Hub, the integrated south bridge) are connected through DMI (Direct Media Interface).
  • Step 2: BIOS startup completes and the operating system is entered.
  • Step 3: The BMC periodically reads the memory ERROR register over the PECI bus to detect ECC errors on the DDR4 memory in the server. If ECC errors are present and their number reaches a threshold N, the BMC triggers a hot removal of the channel on which the memory is located; the address space of that memory is removed from the operating system and becomes unavailable.
  • Step 4: The BMC executes the action sent in Step 1 during BIOS startup (MRC detection and calculation of the memory delay parameter) to obtain the memory delay parameter.
  • Step 5: The BMC reads and writes the registers of the memory controller over the PECI bus and sets the memory delay parameter in the memory controller.
  • Step 6: The BMC then triggers a hot add of the channel on which the memory is located, and the address space of that memory is added back to the operating system.
  • This embodiment proposes a scheme for reducing the error rate of memory reads and writes. When external factors change the memory parameter t and memory reads and writes report errors, the BMC can automatically trigger a hot removal of the channel on which the memory is located, calculate the memory delay parameter, and set it in the memory controller. The BMC then triggers a hot add of that channel, and the address space of the memory is added back to the operating system. When software subsequently reads or writes that memory, the new memory delay parameter is used, which keeps the data read and written by software accurate and reduces the error rate of memory reads and writes. This embodiment requires neither a shutdown nor a restart, and it can set memory parameters automatically as environmental conditions change, reducing the error rate when software reads and writes memory.
  • The memory delay parameter differs across machines, memory modules, ambient temperatures, and humidity levels. After a server has been running for some time, changes in temperature and humidity may change the delay that is actually required. If the memory controller keeps using the original memory delay parameter while software under the operating system reads and writes memory, the data read or written may be wrong and errors may be reported. When the parameter t changes and memory reads and writes report errors, the BMC of this application detects that the ECC errors of a DDR4 memory module have reached a threshold and triggers a hot removal of the channel on which that memory is located. The address space of the memory is removed from the operating system and becomes unavailable, so application software under the operating system no longer reads or writes it. The BMC then executes the action sent during BIOS startup (MRC detection and calculation of a new memory delay parameter), obtains the new memory delay parameter, and sets it in the memory controller. Finally the BMC triggers a hot add of that channel, and the address space of the memory is added back to the operating system. When software subsequently reads or writes that memory, the new memory delay parameter is used, which keeps the data read and written by software accurate.
  • Refer to FIG. 3, which is a schematic structural diagram of a system for processing memory abnormalities provided by an embodiment of this application.
  • The system can include:
  • the error count reading module 100, configured to read the number of memory errors reported for the target memory module in the memory Error register;
  • the memory hot removal module 200, configured to perform a hot removal operation on the target memory module when the number of reported memory errors is greater than a preset value;
  • the parameter setting module 300, configured to calculate the memory delay parameter and write it to the memory controller, wherein the memory delay parameter is the waiting time that the memory controller applies after the target memory module receives a read/write command;
  • the memory hot add module 400, configured to perform a hot add operation on the target memory module, so that the memory controller continues to perform read/write operations on the target memory module using the memory delay parameter.
  • In this embodiment, when the number of memory errors reported for the target memory module in the Error register is greater than the preset value, the memory delay parameter is recalculated and written to the memory controller, so that the memory controller can continue to perform read/write operations on the target memory module.
  • The memory delay parameter is the waiting time that the memory controller applies after the target memory module receives a read/write command.
  • In practice, the waiting time that the target memory module actually needs after receiving a read/write command varies with temperature, humidity, and memory state.
  • When the waiting time actually needed grows while the memory delay parameter in the memory controller stays the same, the number of reported memory errors increases. By recalculating and setting the memory delay parameter, this embodiment can reduce memory read/write errors caused by an insufficient waiting time, thereby reducing the error rate of memory reads and writes.
  • Further, the system includes a memory read/write module, configured so that, after the hot add operation is performed on the target memory module, the target memory module receives a new read/write command sent by the memory controller and, after a delay corresponding to the memory delay parameter, feeds back the Ready signal to the memory controller, so that the memory controller performs read/write operations on the target memory module after receiving the Ready signal.
  • Further, the memory hot removal module 200 is specifically configured to remove the address space in which the target memory module is located from the operating system by pulling the first GPIO pin low, so as to set the target memory module to an unavailable state.
  • Further, the memory hot add module 400 is specifically configured to add the address space in which the target memory module is located to the operating system by pulling the second GPIO pin low, so as to restore the target memory module to an available state.
  • Further, the parameter setting module 300 includes:
  • a parameter writing unit, configured to access the memory controller over the PECI bus and write the memory delay parameter to a register of the memory controller.
  • Further, the target memory module is a DDR4 memory module or a DDR3 memory module.
  • Further, the parameter setting module 300 includes:
  • a parameter calculation unit, configured to obtain the memory initialization code used during startup of the basic input/output system and calculate the memory delay parameter according to that code.
  • This application also provides a storage medium on which a computer program is stored; when the computer program is executed, the steps provided in the above embodiments can be implemented.
  • The storage medium may include a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disc, or any other medium that can store program code.
  • This application also provides an electronic device, which may include a memory and a processor; the memory stores a computer program, and when the processor invokes the computer program in the memory, the steps provided in the above embodiments can be implemented.
  • The electronic device may of course also include components such as network interfaces and a power supply.

Abstract

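A method and system for processing memory abnormalities, an electronic device, and a storage medium. The method includes: reading the number of memory errors reported for a target memory module in a memory error register; when the number of reported memory errors is greater than a preset value, performing a hot removal operation on the target memory module; calculating a memory delay parameter and writing it to the memory controller, wherein the memory delay parameter is the waiting time that the memory controller applies after the target memory module receives a read/write command; and performing a hot add operation on the target memory module, so that the memory controller continues to perform read/write operations on the target memory module using the memory delay parameter. This application can therefore reduce the error rate of memory reads and writes.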

Description

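Method and system for processing memory abnormalities, electronic device, and storage medium
This application claims priority to Chinese patent application No. 201911386480.9, filed with the Chinese Patent Office on December 29, 2019 and entitled "Method and system for processing memory abnormalities, electronic device, and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of computer technology, and in particular to a method and system for processing memory abnormalities, an electronic device, and a storage medium.
Background
As computer technology develops, servers are expected to be ever more stable and reliable, while memory runs at ever higher frequencies. The higher the frequency, the more easily signals are disturbed and corrupted, which causes the server to report memory errors and lowers reliability. Memory errors can be broadly divided into two kinds: CE (Correct Error, a correctable error) and UCE (Uncorrect Error, an uncorrectable error). A UCE causes software to read or write wrong data. A CE can be corrected algorithmically, but the correction reduces system performance, and persistent CEs can turn into UCEs, causing software to read or write wrong data or even bringing the system down.
Therefore, how to reduce the error rate of memory reads and writes is a technical problem that those skilled in the art currently need to solve.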
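Summary
The purpose of this application is to provide a method and system for processing memory abnormalities, an electronic device, and a storage medium that can reduce the error rate of memory reads and writes.
To solve the above technical problem, this application provides a method for processing memory abnormalities, which includes:
reading the number of memory errors reported for a target memory module in a memory error register;
when the number of reported memory errors is greater than a preset value, performing a hot removal operation on the target memory module;
calculating a memory delay parameter and writing the memory delay parameter to the memory controller, wherein the memory delay parameter is the waiting time that the memory controller applies after the target memory module receives a read/write command;
performing a hot add operation on the target memory module, so that the memory controller continues to perform read/write operations on the target memory module using the memory delay parameter.
Optionally, after the hot add operation is performed on the target memory module, the method further includes:
the target memory module receiving a new read/write command sent by the memory controller and, after a delay corresponding to the memory delay parameter, feeding back a Ready signal to the memory controller, so that the memory controller performs read/write operations on the target memory module after receiving the Ready signal.
Optionally, performing the hot removal operation on the target memory module includes:
removing the address space in which the target memory module is located from the operating system by pulling the first GPIO pin low, so as to set the target memory module to an unavailable state.
Optionally, performing the hot add operation on the target memory module includes:
adding the address space in which the target memory module is located to the operating system by pulling the second GPIO pin low, so as to restore the target memory module to an available state.
Optionally, writing the memory delay parameter to the memory controller includes:
accessing the memory controller over the PECI bus and writing the memory delay parameter to a register of the memory controller.
Optionally, the target memory module is a DDR4 memory module or a DDR3 memory module.
Optionally, calculating the memory delay parameter includes:
obtaining the memory initialization code used during startup of the basic input/output system, and calculating the memory delay parameter according to the memory initialization code.
This application also provides a system for processing memory abnormalities, which includes:
an error count reading module, configured to read the number of memory errors reported for a target memory module in a memory error register;
a memory hot removal module, configured to perform a hot removal operation on the target memory module when the number of reported memory errors is greater than a preset value;
a parameter setting module, configured to calculate a memory delay parameter and write the memory delay parameter to the memory controller, wherein the memory delay parameter is the waiting time that the memory controller applies after the target memory module receives a read/write command;
a memory hot add module, configured to perform a hot add operation on the target memory module, so that the memory controller continues to perform read/write operations on the target memory module using the memory delay parameter.
This application also provides a storage medium on which a computer program is stored; when the computer program is executed, the steps of the above method for processing memory abnormalities are implemented.
This application also provides an electronic device including a memory and a processor; the memory stores a computer program, and when the processor invokes the computer program in the memory, the steps of the above method for processing memory abnormalities are implemented.
This application provides a method for processing memory abnormalities, which includes: reading the number of memory errors reported for a target memory module in a memory error register; when the number of reported memory errors is greater than a preset value, performing a hot removal operation on the target memory module; calculating a memory delay parameter and writing it to the memory controller, wherein the memory delay parameter is the waiting time that the memory controller applies after the target memory module receives a read/write command; and performing a hot add operation on the target memory module, so that the memory controller continues to perform read/write operations on the target memory module using the memory delay parameter.
When the number of memory errors reported for the target memory module in the memory error register is greater than the preset value, this application recalculates the memory delay parameter and writes the recalculated value to the memory controller, so that the memory controller can continue to perform read/write operations on the target memory module. The memory delay parameter is the waiting time that the memory controller applies after the target memory module receives a read/write command. In practice, however, the waiting time that the target memory module actually needs after receiving a read/write command varies with temperature, humidity, and memory state. When the waiting time actually needed grows while the memory delay parameter in the memory controller stays the same, the number of reported memory errors increases. By recalculating and setting the memory delay parameter, this application can therefore reduce memory read/write errors caused by an insufficient waiting time, and thus reduce the error rate of memory reads and writes. This application also provides a system for processing memory abnormalities, an electronic device, and a storage medium, which have the same beneficial effects; these are not repeated here.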
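The reasoning above can be made concrete with a small numeric sketch. It is only an illustration: the linear temperature drift, the picosecond figures, and the names `TimingModel` and `access_at_risk` are assumptions introduced here, not values or identifiers from the application.

```python
from dataclasses import dataclass


@dataclass
class TimingModel:
    """Toy model: the wait a module really needs grows with temperature."""
    base_wait_ps: int = 10_000        # hypothetical wait needed at or below 25 degrees C
    drift_ps_per_degree: int = 100    # hypothetical extra wait per degree above 25

    def required_wait_ps(self, temperature_c: int) -> int:
        return self.base_wait_ps + self.drift_ps_per_degree * max(0, temperature_c - 25)


def access_at_risk(configured_delay_ps: int, required_wait_ps: int) -> bool:
    """An access is at risk when the controller waits less than the module needs."""
    return configured_delay_ps < required_wait_ps


model = TimingModel()
configured = 11_000  # delay programmed at boot, with a small margin

cool = access_at_risk(configured, model.required_wait_ps(30))   # 10_500 ps needed
hot = access_at_risk(configured, model.required_wait_ps(55))    # 13_000 ps needed

# Recomputing the delay for the current conditions removes the shortfall.
retuned = model.required_wait_ps(55) + 500
after_retune = access_at_risk(retuned, model.required_wait_ps(55))
print(cool, hot, after_retune)  # False True False
```

The point of the sketch is only the comparison in `access_at_risk`: a delay that was adequate when it was programmed can fall short once conditions drift, and recalculating it restores the margin.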
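Brief Description of the Drawings
To describe the technical solutions in the embodiments of this application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of this application, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a flowchart of a method for processing memory abnormalities provided by an embodiment of this application;
FIG. 2 is a structural diagram of a server provided by an embodiment of this application;
FIG. 3 is a schematic structural diagram of a system for processing memory abnormalities provided by an embodiment of this application.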
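Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some rather than all of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this application without creative effort fall within the scope of protection of this application.
Refer to FIG. 1, which is a flowchart of a method for processing memory abnormalities provided by an embodiment of this application.
The specific steps may include:
S101: Read the number of memory errors reported for the target memory module in the memory Error register.
The memory Error register (i.e. the memory error register) can store the number of memory errors reported for each memory module. Specifically, the number of reported memory errors mentioned in this step may include the number of correctable memory read/write errors. This embodiment does not limit the type of the target memory module; it may be, for example, a DDR4 memory module or a DDR3 memory module.
S102: When the number of reported memory errors is greater than a preset value, perform a hot removal operation on the target memory module.
This embodiment can preset a threshold for deciding whether the hot removal operation needs to be performed, namely the preset value in this step. When the number of memory errors reported for the target memory module exceeds the preset value, enough read/write errors have already occurred on that module, and a further increase would bring the system down, so the hot removal operation can be performed on the target memory module at this point.
Specifically, this embodiment can remove the address space in which the target memory module is located from the operating system by pulling the first GPIO pin low, so as to set the target memory module to an unavailable state.
S103: Calculate the memory delay parameter and write it to the memory controller.
Read/write errors occur on a memory module because the time corresponding to the memory delay parameter in the memory controller is shorter than the waiting time the target memory module actually needs after receiving a read/write command. To avoid further memory read/write errors, this embodiment first sets the target memory module to an unavailable state, then recalculates the memory delay parameter and writes it to the memory controller. The memory delay parameter written to the memory controller in this embodiment is the waiting time that the memory controller applies after the target memory module receives a read/write command.
As a feasible implementation, the memory delay parameter may be calculated in this embodiment by obtaining the memory initialization code used during startup of the basic input/output system and calculating the memory delay parameter according to that code. The memory initialization code is the MRC (Memory Reference Code). The basic input/output system (BIOS) contains the memory initialization code, which consists of a series of steps that read and write the memory controller.
As another feasible implementation, this embodiment can access the memory controller over the PECI bus and write the memory delay parameter to a register of the memory controller.
S104: Perform a hot add operation on the target memory module, so that the memory controller continues to perform read/write operations on the target memory module using the memory delay parameter.
This step presupposes that the memory delay parameter has already been written to the memory controller. After the hot add operation is performed on the target memory module, the memory controller continues to perform read/write operations on it using the memory delay parameter. Specifically, when software reads or writes memory, the process may be as follows: the memory controller sends a read/write command to the target memory module over the CMD signal line; after receiving the command, the target memory module waits for the time corresponding to the memory delay parameter and then uses the Ready signal to tell the memory controller that it is ready to be read or written; after receiving the Ready signal, the memory controller reads or writes the data on the memory module over the DATA line.
As a feasible implementation, performing the hot add operation on the target memory module in this embodiment may include: adding the address space in which the target memory module is located to the operating system by pulling the second GPIO pin low, so as to restore the target memory module to an available state.
In this embodiment, when the number of memory errors reported for the target memory module in the Error register is greater than the preset value, the memory delay parameter is recalculated and the recalculated value is written to the memory controller, so that the memory controller can continue to perform read/write operations on the target memory module. The memory delay parameter is the waiting time that the memory controller applies after the target memory module receives a read/write command. In practice, however, the waiting time that the target memory module actually needs after receiving a read/write command varies with temperature, humidity, and memory state. When the waiting time actually needed grows while the memory delay parameter in the memory controller stays the same, the number of reported memory errors increases. By recalculating and setting the memory delay parameter, this embodiment can therefore reduce memory read/write errors caused by an insufficient waiting time, and thus reduce the error rate of memory reads and writes.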
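Steps S101 to S104 can be sketched as a single control function. This is a minimal sketch under stated assumptions: `FakePlatform` and all of its methods are hypothetical stand-ins for the error register, the GPIO-driven hot removal and hot add, and the controller register write; the `+ 2` in `compute_delay` is a placeholder for the real MRC-based calculation, which the application does not spell out.

```python
class FakePlatform:
    """Hypothetical stand-in for the error register, GPIO lines and controller."""

    def __init__(self, error_counts, delay=10):
        self.error_counts = dict(error_counts)      # module id -> reported errors
        self.online = {m: True for m in error_counts}
        self.delay = {m: delay for m in error_counts}
        self.log = []

    def read_error_count(self, module):
        return self.error_counts[module]

    def hot_remove(self, module):
        self.online[module] = False
        self.log.append(("hot_remove", module))

    def compute_delay(self, module):
        # Placeholder for re-running the memory initialization calculation.
        return self.delay[module] + 2

    def write_delay(self, module, value):
        self.delay[module] = value
        self.log.append(("write_delay", module, value))

    def hot_add(self, module):
        self.online[module] = True
        self.log.append(("hot_add", module))


def handle_memory_exception(platform, module, preset):
    """Run S101-S104 once; return True if the module was retuned."""
    count = platform.read_error_count(module)                       # S101
    if count <= preset:                                             # S102 needs count > preset
        return False
    platform.hot_remove(module)                                     # S102
    platform.write_delay(module, platform.compute_delay(module))    # S103
    platform.hot_add(module)                                        # S104
    return True


platform = FakePlatform({"DIMM0": 12, "DIMM1": 3})
retuned0 = handle_memory_exception(platform, "DIMM0", preset=10)
retuned1 = handle_memory_exception(platform, "DIMM1", preset=10)
print(retuned0, retuned1, platform.log)
```

The ordering matters in this sketch: the module is taken offline before its delay is rewritten and brought back only afterwards, so no access uses a half-updated setting.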
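As a further description of the embodiment corresponding to FIG. 1, after the hot add operation is performed on the target memory module in S104, the target memory module receives a new read/write command sent by the memory controller and, after a delay corresponding to the memory delay parameter, feeds back the Ready signal to the memory controller, so that the memory controller performs read/write operations on the target memory module after receiving the Ready signal.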
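The command, wait, Ready, data sequence can be simulated in a few lines. This is a simplified sketch, not the signalling of any real DDR device: time is counted in abstract ticks, `settle_ticks` is an assumed property of the module, and an access whose wait is too short is simply counted as an error.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryModule:
    """Toy module whose data is only stable `settle_ticks` after a command."""
    settle_ticks: int
    cells: dict = field(default_factory=dict)

    def execute(self, op, addr, value, waited_ticks):
        """Apply a command after `waited_ticks`; return (data, ok)."""
        if waited_ticks < self.settle_ticks:
            return None, False            # Ready came too early: transfer error
        if op == "write":
            self.cells[addr] = value
            return None, True
        return self.cells.get(addr), True


@dataclass
class MemoryController:
    delay_ticks: int          # the memory delay parameter
    error_count: int = 0      # stands in for the memory error register

    def transfer(self, module, op, addr, value=None):
        # 1. The command goes out on the CMD line.
        # 2. The module waits `delay_ticks`, then raises Ready.
        # 3. On Ready, the controller moves the data over the DATA line.
        data, ok = module.execute(op, addr, value, waited_ticks=self.delay_ticks)
        if not ok:
            self.error_count += 1
        return data


module = MemoryModule(settle_ticks=4)
controller = MemoryController(delay_ticks=3)    # too short for this module
controller.transfer(module, "write", 0x10, 42)
errors_before = controller.error_count          # the write was counted as an error

controller.delay_ticks = 4                      # recalculated delay parameter
controller.transfer(module, "write", 0x10, 42)
value = controller.transfer(module, "read", 0x10)
print(errors_before, controller.error_count, value)  # 1 1 42
```

Once the delay matches what the module needs, the error count stops growing and the written value reads back correctly.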
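The flow described in the above embodiment is illustrated below with a practical embodiment. Refer to FIG. 2, which is a structural diagram of a server provided by an embodiment of this application. This embodiment may include the following steps:
Step 1: During BIOS startup, the BIOS sends the action that performs MRC detection and calculates the memory delay parameter to the BMC over the LPC bus. GPIO1 and GPIO2 are configured with the SMI attribute; GPIO1 notifies the OS of a memory hot removal, and GPIO2 notifies the OS of a memory hot add.
SMI (System Management Interrupt) is an interrupt on the x86 platform. LPC (Low Pin Count) is a data bus used for communication between the BMC and the PCH. The CPU and the PCH (Platform Controller Hub, the integrated south bridge) are connected through DMI (Direct Media Interface).
Step 2: BIOS startup completes and the operating system is entered.
Step 3: The BMC periodically reads the memory ERROR register over the PECI bus to detect ECC errors on the DDR4 memory in the server. If ECC errors are present and their number reaches a threshold N, the BMC triggers a hot removal of the channel on which the memory is located; the address space of that memory is removed from the operating system and becomes unavailable.
Step 4: The BMC executes the action sent in Step 1 during BIOS startup (MRC detection and calculation of the memory delay parameter) to obtain the memory delay parameter.
Step 5: The BMC reads and writes the registers of the memory controller over the PECI bus and sets the memory delay parameter in the memory controller.
Step 6: The BMC then triggers a hot add of the channel on which the memory is located, and the address space of that memory is added back to the operating system.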
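The six steps can be sketched as a toy BMC that receives its calculation routine at boot and then polls per-channel ECC counts. Everything here is illustrative: the `Bmc` class, `bios_boot`, the dictionary layout of a channel, and the `+ 1` delay rule are assumptions, the LPC, PECI and GPIO transports are not modelled, and resetting the ECC counter after retuning is an assumption the application does not state.

```python
class Bmc:
    """Toy BMC: stores the action BIOS hands over at boot and polls ECC counts."""

    def __init__(self, threshold):
        self.threshold = threshold    # the threshold N from step 3
        self.compute_delay = None     # filled in during BIOS startup (step 1)
        self.events = []

    def register_mrc_action(self, action):
        # Step 1: in the application this action arrives over the LPC bus.
        self.compute_delay = action

    def poll(self, channels):
        """Steps 3 to 6 for one polling period; `channels` maps name -> state."""
        for name, ch in channels.items():
            if ch["ecc_errors"] < self.threshold:
                continue
            ch["online"] = False                            # step 3: hot removal
            self.events.append(("remove", name))
            ch["delay"] = self.compute_delay(ch)            # step 4: recalculate
            self.events.append(("set_delay", name, ch["delay"]))  # step 5: program it
            ch["ecc_errors"] = 0      # assumption: counter restarts after retuning
            ch["online"] = True                             # step 6: hot add
            self.events.append(("add", name))


def bios_boot(bmc):
    """Step 1: BIOS hands the delay calculation to the BMC during startup."""
    bmc.register_mrc_action(lambda channel: channel["delay"] + 1)


bmc = Bmc(threshold=5)
bios_boot(bmc)
# Step 2 (entering the operating system) has no counterpart in this sketch.
channels = {
    "CH0": {"ecc_errors": 5, "delay": 10, "online": True},
    "CH1": {"ecc_errors": 1, "delay": 10, "online": True},
}
bmc.poll(channels)
print(bmc.events)
```

Note that step 3 triggers when the count reaches the threshold N, so the sketch compares with `<` before skipping a channel, whereas S102 in the method above requires the count to exceed the preset value.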
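This embodiment proposes a scheme for reducing the error rate of memory reads and writes. When external factors change the memory parameter t and memory reads and writes report errors, the BMC can automatically trigger a hot removal of the channel on which the memory is located, calculate the memory delay parameter, and set it in the memory controller. The BMC then triggers a hot add of that channel, and the address space of the memory is added back to the operating system. When software subsequently reads or writes that memory, the new memory delay parameter is used, which keeps the data read and written by software accurate and reduces the error rate of memory reads and writes. This embodiment requires neither a shutdown nor a restart, and it can set memory parameters automatically as environmental conditions change, reducing the error rate when software reads and writes memory.
The memory delay parameter differs across machines, memory modules, ambient temperatures, and humidity levels. After a server has been running for some time, changes in temperature and humidity may change the delay that is actually required. If the memory controller keeps using the original memory delay parameter while software under the operating system reads and writes memory, the data read or written may be wrong and errors may be reported. When the parameter t changes and memory reads and writes report errors, the BMC of this application detects that the ECC errors of a DDR4 memory module have reached a threshold and triggers a hot removal of the channel on which that memory is located. The address space of the memory is removed from the operating system and becomes unavailable, so application software under the operating system no longer reads or writes it. The BMC then executes the action sent during BIOS startup (MRC detection and calculation of a new memory delay parameter), obtains the new memory delay parameter, and sets it in the memory controller. Finally the BMC triggers a hot add of that channel, and the address space of the memory is added back to the operating system. When software subsequently reads or writes that memory, the new memory delay parameter is used, which keeps the data read and written by software accurate.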
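Refer to FIG. 3, which is a schematic structural diagram of a memory anomaly processing system provided by an embodiment of this application;
The system may include:
an error count reading module 100, configured to read the number of memory errors of the target memory module from the memory Error register;
a memory hot-remove module 200, configured to perform a hot-remove operation on the target memory module when the number of memory errors is greater than a preset value;
a parameter setting module 300, configured to calculate a memory delay parameter and write the memory delay parameter to the memory controller, where the memory delay parameter is the length of time that the memory controller makes the target memory module wait after receiving a read/write command;
a memory hot-add module 400, configured to perform a hot-add operation on the target memory module, so that the memory controller continues to perform read and write operations on the target memory module using the memory delay parameter.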
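In this embodiment, when the number of memory errors recorded for the target memory module in the Error register is greater than a preset value, the memory delay parameter is recalculated and the recalculated parameter is written to the memory controller, so that the memory controller can continue to perform read and write operations on the target memory module. The memory delay parameter is the length of time that the memory controller makes the target memory module wait after receiving a read/write command. In practice, the time the target memory module needs to wait after receiving a read/write command varies with temperature, humidity, and memory state. When the wait actually required becomes longer while the memory delay parameter in the memory controller stays unchanged, the number of memory errors increases. By recalculating and setting the memory delay parameter, this embodiment reduces memory read/write errors caused by an insufficient wait, and thereby lowers the memory read/write error rate.
Further, the system also includes:
a memory read/write module, configured so that, after the hot-add operation is performed on the target memory module, the target memory module receives a new read/write command sent by the memory controller and, after delaying for the time corresponding to the memory delay parameter, feeds back a Ready signal to the memory controller, so that the memory controller performs read and write operations on the target memory module after receiving the Ready signal.
Further, the memory hot-remove module 200 is specifically configured to pull down the level of a first GPIO pin to remove the address space in which the target memory module is located from the operating system, so that the state of the target memory module is set to unavailable.
Further, the memory hot-add module 400 is specifically configured to pull down the level of a second GPIO pin to add the address space in which the target memory module is located to the operating system, so that the state of the target memory module is restored to available.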
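Further, the parameter setting module 300 includes:
a parameter writing unit, configured to read and write the memory controller over the PECI bus so as to write the memory delay parameter to a register of the memory controller.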
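Writing the parameter by reading and then writing a controller register can be sketched as a read-modify-write with a read-back check. Everything concrete here is an assumption for illustration: `FakePeciBus` is an in-memory stand-in rather than a real PECI transport, and the register address `0x40` and the 8-bit field layout are invented; the source does not specify which register or field holds the delay parameter.

```python
class FakePeciBus:
    """In-memory stand-in for a PECI transport; real access is platform-specific."""

    def __init__(self):
        self.regs = {}

    def read32(self, addr: int) -> int:
        return self.regs.get(addr, 0)

    def write32(self, addr: int, value: int) -> None:
        self.regs[addr] = value & 0xFFFFFFFF


# Hypothetical layout: bits [7:0] of register 0x40 hold the delay parameter.
DELAY_REG = 0x40
DELAY_MASK = 0xFF


def write_delay_param(bus: FakePeciBus, delay: int) -> None:
    """Update only the delay field, leaving the other register bits unchanged."""
    if not 0 <= delay <= DELAY_MASK:
        raise ValueError("delay does not fit in the register field")
    current = bus.read32(DELAY_REG)                 # read the whole register
    updated = (current & ~DELAY_MASK) | delay       # replace the delay field only
    bus.write32(DELAY_REG, updated)
    if bus.read32(DELAY_REG) & DELAY_MASK != delay:  # verify by reading back
        raise IOError("read-back mismatch after writing delay parameter")
```

Reading before writing preserves any unrelated settings that share the register, which is one plausible reason the text speaks of both reading and writing the memory controller.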
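Further, the target memory module is a DDR4 memory module or a DDR3 memory module.
Further, the parameter setting module 300 includes:
a parameter calculation unit, configured to obtain the memory initialization code used during startup of the basic input/output system and to calculate the memory delay parameter according to the memory initialization code.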
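The source does not describe how the memory initialization code derives the delay value. Purely as an assumption, the sketch below uses a simple sweep of the kind often used in memory training: try candidate delays in ascending order and keep the smallest one that passes a test. The names `train_delay`, `passes_test`, and `margin` are hypothetical.

```python
from typing import Callable, Iterable, Optional


def train_delay(
    candidates: Iterable[int],
    passes_test: Callable[[int], bool],
    margin: int = 0,
) -> Optional[int]:
    """Return the smallest passing candidate delay plus a safety margin, or None."""
    for delay in sorted(candidates):
        # `passes_test` stands in for a write/read-back pattern test at this delay.
        if passes_test(delay):
            return delay + margin
    return None  # no candidate produced error-free transfers
```

A nonzero `margin` trades a little latency for tolerance to further drift in temperature or humidity before the next recalculation is triggered.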
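Because the embodiments of the system correspond to the embodiments of the method, refer to the description of the method embodiments for the system embodiments; details are not repeated here.
This application further provides a storage medium on which a computer program is stored; when executed, the computer program can implement the steps provided by the above embodiments. The storage medium may include any medium that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.
This application further provides an electronic device, which may include a memory and a processor. A computer program is stored in the memory, and when the processor invokes the computer program in the memory, the steps provided by the above embodiments can be implemented. The electronic device may of course also include components such as various network interfaces and a power supply.
The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar the embodiments may be referred to one another. Because the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; refer to the description of the method for the relevant details. It should be noted that those of ordinary skill in the art may make improvements and modifications to this application without departing from its principles, and such improvements and modifications also fall within the scope of protection of the claims of this application.
It should also be noted that, in this specification, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another and do not necessarily require or imply any such actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any other variants are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.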

Claims (10)

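  1. A memory anomaly processing method, characterized by comprising:
    reading the number of memory errors of a target memory module from a memory error register;
    when the number of memory errors is greater than a preset value, performing a hot-remove operation on the target memory module;
    calculating a memory delay parameter and writing the memory delay parameter to a memory controller, wherein the memory delay parameter is the length of time that the memory controller makes the target memory module wait after receiving a read/write command;
    performing a hot-add operation on the target memory module, so that the memory controller continues to perform read and write operations on the target memory module using the memory delay parameter.
  2. The processing method according to claim 1, characterized in that, after the hot-add operation is performed on the target memory module, the method further comprises:
    the target memory module receiving a new read/write command sent by the memory controller and, after delaying for the time corresponding to the memory delay parameter, feeding back a Ready signal to the memory controller, so that the memory controller performs read and write operations on the target memory module after receiving the Ready signal.
  3. The processing method according to claim 1, characterized in that performing the hot-remove operation on the target memory module comprises:
    pulling down the level of a first GPIO pin to remove the address space in which the target memory module is located from an operating system, so that the state of the target memory module is set to unavailable.
  4. The processing method according to claim 1, characterized in that performing the hot-add operation on the target memory module comprises:
    pulling down the level of a second GPIO pin to add the address space in which the target memory module is located to the operating system, so that the state of the target memory module is restored to available.
  5. The processing method according to claim 1, characterized in that writing the memory delay parameter to the memory controller comprises:
    reading and writing the memory controller over a PECI bus so as to write the memory delay parameter to a register of the memory controller.
  6. The processing method according to claim 1, characterized in that the target memory module is a DDR4 memory module or a DDR3 memory module.
  7. The processing method according to any one of claims 1 to 6, characterized in that calculating the memory delay parameter comprises:
    obtaining the memory initialization code used during startup of the basic input/output system, and calculating the memory delay parameter according to the memory initialization code.
  8. A memory anomaly processing system, characterized by comprising:
    an error count reading module, configured to read the number of memory errors of a target memory module from a memory error register;
    a memory hot-remove module, configured to perform a hot-remove operation on the target memory module when the number of memory errors is greater than a preset value;
    a parameter setting module, configured to calculate a memory delay parameter and write the memory delay parameter to a memory controller, wherein the memory delay parameter is the length of time that the memory controller makes the target memory module wait after receiving a read/write command;
    a memory hot-add module, configured to perform a hot-add operation on the target memory module, so that the memory controller continues to perform read and write operations on the target memory module using the memory delay parameter.
  9. An electronic device, characterized by comprising a memory and a processor, wherein a computer program is stored in the memory, and the processor, when invoking the computer program in the memory, implements the steps of the memory anomaly processing method according to any one of claims 1 to 7.
  10. A storage medium, characterized in that computer-executable instructions are stored in the storage medium, and the computer-executable instructions, when loaded and executed by a processor, implement the steps of the memory anomaly processing method according to any one of claims 1 to 7.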
PCT/CN2020/110362 2019-12-29 2020-08-21 Memory anomaly processing method and system, electronic device, and storage medium WO2021135272A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/789,953 US11977744B2 (en) 2019-12-29 2020-08-21 Memory anomaly processing method and system, electronic device, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911386480.9A CN111143104A (zh) 2019-12-29 2019-12-29 Memory anomaly processing method and system, electronic device, and storage medium
CN201911386480.9 2019-12-29

Publications (1)

Publication Number Publication Date
WO2021135272A1 true WO2021135272A1 (zh) 2021-07-08

Family

ID=70521412

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/110362 WO2021135272A1 (zh) 2019-12-29 2020-08-21 Memory anomaly processing method and system, electronic device, and storage medium

Country Status (3)

Country Link
US (1) US11977744B2 (zh)
CN (1) CN111143104A (zh)
WO (1) WO2021135272A1 (zh)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
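CN111143104A (zh) * 2019-12-29 2020-05-12 苏州浪潮智能科技有限公司 Memory anomaly processing method and system, electronic device, and storage medium
CN111625387B (zh) * 2020-05-27 2024-03-29 北京金山云网络技术有限公司 Memory error handling method, apparatus, and server
CN113747043B (zh) * 2020-05-29 2023-06-20 Oppo广东移动通信有限公司 Image processor startup method, electronic device, and storage medium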
US20220326887A1 (en) * 2021-04-06 2022-10-13 Micron Technology, Inc. Log management maintenance operation and command
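CN114003295B (zh) * 2021-10-15 2023-08-25 苏州浪潮智能科技有限公司 Memory parameter setting method, system, and apparatus
CN114816822A (zh) * 2022-05-07 2022-07-29 宝德计算机系统股份有限公司 Server management method, apparatus, and system based on memory faults
CN117112452B (zh) * 2023-08-24 2024-04-02 上海合芯数字科技有限公司 Register simulation configuration method and apparatus, computer device, and storage medium
CN116931845B (zh) * 2023-09-18 2023-12-12 新华三信息技术有限公司 Data layout method and apparatus, and electronic device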

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030070055A1 (en) * 2001-09-28 2003-04-10 Johnson Jerome J. Memory latency and bandwidth optimizations
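CN105608034A (zh) * 2015-12-23 2016-05-25 浪潮集团有限公司 Method for automatic hot-plugging of a clump
CN110428856A (zh) * 2019-07-29 2019-11-08 珠海市一微半导体有限公司 Delay parameter optimization method and system for reading and writing DDR memory
CN111143104A (zh) * 2019-12-29 2020-05-12 苏州浪潮智能科技有限公司 Memory anomaly processing method and system, electronic device, and storage medium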

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6370604B1 (en) * 1999-05-07 2002-04-09 Intel Corporation Hot replacement of storage device in serial array of storage devices
US7673090B2 (en) * 2001-12-19 2010-03-02 Intel Corporation Hot plug interface control method and apparatus
US7730470B2 (en) * 2006-02-27 2010-06-01 Oracle America, Inc. Binary code instrumentation to reduce effective memory latency
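KR20090045672A (ko) * 2007-11-02 2009-05-08 주식회사 하이닉스반도체 Delay locked circuit, semiconductor memory device, and operating method thereof
CN101359306B (zh) * 2008-09-26 2012-12-05 华硕电脑股份有限公司 Method for detecting memory adjustment results and computer system thereof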
US8996765B2 (en) * 2011-12-27 2015-03-31 Intel Corporation Methods and apparatus to manage workload memory allocation
CN102637155B (zh) * 2012-01-10 2014-11-05 江苏中科梦兰电子科技有限公司 Method for configuring the data strobe signal delay in DDR3 through training plus correction
US10853311B1 (en) * 2014-07-03 2020-12-01 Pure Storage, Inc. Administration through files in a storage system
US20160019160A1 (en) * 2014-07-17 2016-01-21 Sandisk Enterprise Ip Llc Methods and Systems for Scalable and Distributed Address Mapping Using Non-Volatile Memory Modules
CN106936616B (zh) * 2015-12-31 2020-01-03 伊姆西公司 Backup communication method and apparatus
US10148416B2 (en) * 2016-09-02 2018-12-04 Intel Corporation Signal phase optimization in memory interface training
CN106445720A (zh) * 2016-10-11 2017-02-22 郑州云海信息技术有限公司 Memory error recovery method and apparatus
CN107301103A (zh) * 2017-06-22 2017-10-27 济南浪潮高新科技投资发展有限公司 Method and apparatus for adjusting memory parameters of a domestically produced processor

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ANONYMOUS: "Huawei KunLun open architecture machine allows In-Memory applications to continue", 11 April 2018 (2018-04-11), XP055828297, Retrieved from the Internet <URL:https://cloud.tencent.com/developer/news/176970> *

Also Published As

Publication number Publication date
CN111143104A (zh) 2020-05-12
US11977744B2 (en) 2024-05-07
US20220391103A1 (en) 2022-12-08

Similar Documents

Publication Publication Date Title
WO2021135272A1 (zh) Memory anomaly processing method and system, electronic device, and storage medium
WO2020177493A1 (zh) Memory error handling method and apparatus
US7945815B2 (en) System and method for managing memory errors in an information handling system
US9389937B2 (en) Managing faulty memory pages in a computing system
EP3132449B1 (en) Method, apparatus and system for handling data error events with memory controller
US11132314B2 (en) System and method to reduce host interrupts for non-critical errors
US10713128B2 (en) Error recovery in volatile memory regions
US20090150721A1 (en) Utilizing A Potentially Unreliable Memory Module For Memory Mirroring In A Computing System
US11360847B2 (en) Memory scrub system
US8301992B2 (en) System and apparatus for error-correcting register files
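CN111984487A (zh) Method and apparatus for off-machine recording of faulty hardware locations
TW202040361A (zh) Server and method for controlling an error event logging function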
US10635554B2 (en) System and method for BIOS to ensure UCNA errors are available for correlation
WO2022021854A1 (zh) Controller upgrade method for a storage system and related apparatus
EP4280064A1 (en) Systems and methods for expandable memory error handling
WO2023206963A1 (zh) Data processing method and system, and related components
US20240028729A1 (en) Bmc ras offload driver update via a bios update release
TWI665606B (zh) Test system for a data storage device and test method for a data storage device
US10592329B2 (en) Method and electronic device for continuing executing procedure being aborted from physical address where error occurs
CN116483612B (zh) Memory fault handling method and apparatus, computer device, and storage medium
WO2024066500A1 (zh) Memory error handling method and apparatus
US20240012651A1 (en) Enhanced service operating system capabilities through embedded controller system health state tracking
TWI777259B (zh) Boot method
CN112099980A (zh) Server and method for controlling an error event logging function
CN107451035B (zh) Method for providing error status data for a computer apparatus

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20908900

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20908900

Country of ref document: EP

Kind code of ref document: A1