WO2024082844A1 - Memory bar fault detection device and detection method - Google Patents

Memory bar fault detection device and detection method

Info

Publication number
WO2024082844A1
Authority
WO
WIPO (PCT)
Prior art keywords
memory
latch
management controller
fault detection
detection device
Prior art date
Application number
PCT/CN2023/116850
Other languages
English (en)
French (fr)
Inventor
王为
Original Assignee
超聚变数字技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 超聚变数字技术有限公司
Publication of WO2024082844A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation, the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/073 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation, the processing taking place on a specific hardware platform or in a specific software environment, in a memory management context, e.g. virtual memory or cache management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793 Remedial or corrective actions
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C29/00 Checking stores for correct operation; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/04 Detection or location of defective memory elements, e.g. cell construction details, timing of test signals
    • G11C29/08 Functional testing, e.g. testing during refresh, power-on self testing [POST] or distributed testing
    • G11C29/12 Built-in arrangements for testing, e.g. built-in self testing [BIST] or interconnection details
    • G11C29/44 Indication or identification of errors, e.g. for repair

Definitions

  • the present application relates to the field of memory fault detection of computing devices, and in particular to a memory bar fault detection method of a computing device.
  • RAM Random Access Memory
  • CPU Central Processing Unit
  • as memory architectures evolve, the memory process node keeps shrinking and the operating frequency keeps rising, so that more storage cells fit into the same silicon area and can be read and written faster.
  • to control heat, the operating voltage is forced to be continuously reduced, which makes the reliability of the memory medium itself continue to decrease and the failure rate continue to rise.
  • memory failure has become one of the most common sources of failure in the field of computing devices.
  • when the memory fails, it can in serious cases cause the computing device to restart abnormally or crash. Therefore, it is particularly important to detect the health status of the memory bar. Monitoring it allows risky memory to be identified and handled in advance, reducing the risk of computing device system crashes.
  • the memory stick is a passive device and cannot determine its own health status.
  • at present, the identification of the health status of the memory stick depends on the BMC log records of the out-of-band management system. However, this identification method is not intuitive enough.
  • on the one hand, the operation and maintenance personnel cannot detect memory failures in time; on the other hand, there is a possibility of replacing the wrong memory stick during maintenance.
  • Memory failures can be divided into correctable errors (Correctable Error, CE, hereinafter referred to as CE) and uncorrectable errors (UnCorrectable Error, UCE, hereinafter referred to as UCE).
  • CE Correctable Error
  • UCE UnCorrectable Error
  • the embodiments of the present application provide a memory stick fault detection device and method, which improve the timeliness and accuracy of memory stick fault detection, facilitate timely identification of faulty memory sticks and replacement thereof, and reduce the risk of computing device downtime.
  • a fault detection device for a memory bar comprising a memory management controller and a latch, the input end of the memory management controller is connected to the memory bar, the output end of the memory management controller is connected to an input end of the latch; the memory management controller is used to detect the correctable errors CE of the memory bar and determine the number of correctable errors CE of the memory bar. When the number of correctable errors CE of the memory bar exceeds a preset threshold value, the memory management controller sends a latch signal to the latch.
  • the technical solution performs fault detection on a memory stick through a memory fault detection device, and sets a memory management controller and a latch in the fault detection device, so that the fault information of the memory stick can be acquired by the memory management controller, and the number of correctable memory errors CE is determined from the fault information.
  • when the number of correctable memory errors CE exceeds the preset threshold, the memory management controller sends a latch signal to the latch, and the latch signal is used to indicate that the number of correctable errors CE of the memory stick has reached a critical value, and the memory stick needs to be repaired or replaced in time.
  • the latch signal sent by the memory management controller to the latch allows the CE error of the memory stick to be discovered in time, thereby enhancing the timeliness of memory stick fault detection and reducing the risk of computing device downtime.
  • the memory management controller also includes a counter and a processor, wherein the counter is used to count the number of correctable errors CE, and the processor is used to compare the number of correctable errors CE counted by the counter with a preset threshold value, and send a latch signal to the latch when the number of CE is greater than its preset threshold value.
  • the memory management controller further includes a counter and a processor, the counter counts the number of correctable errors CE of the memory stick, and the processor compares the number with a preset threshold value, so that the statistics of the number of correctable errors CE of the memory stick are more accurate.
  • the memory management controller includes a BIOS chip and a baseboard management controller BMC.
  • the fault information of the memory bar is obtained through the BIOS chip, and the number of correctable errors CE is determined according to the fault information of the memory bar, wherein the fault information of the memory bar includes the number of correctable errors CE;
  • the BIOS chip can also send the number of correctable errors CE to the baseboard management controller BMC;
  • the baseboard management controller BMC receives the number of correctable errors CE sent by the BIOS chip, and compares the number of correctable errors CE with a preset threshold value, and sends a latch signal to the latch when the number of CEs is greater than the preset threshold value.
  • in this way, the BIOS chip and the baseboard management controller BMC of the computing device itself can be used to implement the function of the memory management controller, further reducing the cost of the memory bar fault detection device.
  • the fault detection device further includes a memory, which is disposed in a memory bar or a memory management controller, and is used to store and record fault information of the memory bar.
  • the memory bar fault information is not easily lost, thereby facilitating the maintenance of the memory bar.
  • when there are multiple memory bars, the memory management controller determines the target faulty memory bar whose number of correctable errors CE exceeds the threshold value, and sends a latch signal to the latch corresponding to the target faulty memory bar; wherein the target faulty memory bar is one or more of the multiple memory bars.
  • each latch is connected to each memory stick.
  • the memory management controller can detect faults of multiple memory sticks, so that when any memory stick fails, the memory management controller can send a latch signal to the latch corresponding to the memory stick, thereby detecting faults of multiple memory sticks.
  • the fault detection device further includes a status indicator connected to the output end of the latch; the status indicator is used to: when the latch receives the latch signal sent by the memory management controller, receive the high level output by the latch and send out an alarm signal.
  • the fault display of the memory stick is made more intuitive.
  • the status indicator is an indicator light and/or a buzzer.
  • the fault information of the memory stick is represented by a light signal or a sound signal, which further improves the intuitiveness and effectiveness of the memory stick fault display.
  • the memory stick further includes a PCB circuit board, and the status indicator is arranged on the PCB circuit board or on a slot of the computing device, and the slot is used to install the memory stick.
  • the computing device is a server.
  • the fault detection device can be applied to a server, thereby improving the timeliness and accuracy of server memory fault detection and reducing the risk of server downtime.
  • an embodiment of the present application provides a memory bar fault detection method, which is used in a memory bar fault detection device, wherein the memory bar fault detection device includes a memory management controller, a latch, and a status indicator, wherein an input end of the memory management controller is connected to the memory bar, an output end of the memory management controller is connected to an input end of a latch, and an output end of the latch is connected to a status indicator, wherein the method includes the following steps:
  • the memory fault detection device obtains fault information of the memory bar, and determines the number of correctable errors CE of the memory bar according to the fault information of the memory bar;
  • the memory fault detection device determines the faulty memory stick that exceeds the threshold value and sends a latch signal to the latch corresponding to the faulty memory stick; the latch signal is used to: cause the status indicator to send an alarm signal.
  • the technical solution through the memory stick fault detection method, enables the fault information of the memory stick to be obtained, and the number of correctable memory errors CE is determined from the fault information.
  • when the number of correctable memory errors CE exceeds a preset threshold, a latch signal is sent to the latch, and the latch signal is used to indicate that the number of correctable errors CE of the memory stick has reached a critical value, and the memory stick needs to be repaired or replaced in time.
  • through the latch signal, the CE error of the memory stick can be discovered in time, thereby enhancing the timeliness of memory stick fault detection and reducing the risk of computing device downtime.
  • an embodiment of the present application provides a computer-readable storage medium, in which at least one computer program is stored.
  • the computer program is loaded and executed by a processor to implement the memory bar fault detection method as described in the second aspect above.
  • an embodiment of the present application provides a computer program product or a computer program, the computer program product or the computer program including computer instructions, the computer instructions being stored in a computer-readable storage medium.
  • a processor of a terminal reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the terminal executes the memory fault prediction method provided in various optional implementations of the second aspect above.
  • FIG. 1 is a schematic diagram of memory fault detection in the related art.
  • FIG. 2 is a schematic diagram of a memory bar fault detection device provided in an embodiment of the present application.
  • FIG. 3 is a schematic diagram of another memory bar fault detection device provided in an embodiment of the present application.
  • FIG. 4 is a connection diagram for fault detection of a plurality of memory bars provided in an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a detection circuit provided in an embodiment of the present application.
  • FIG. 6 is a schematic diagram of installing a status indicator provided in an embodiment of the present application.
  • FIG. 7 is a flow chart of a memory bar fault detection method provided in an embodiment of the present application.
  • CPU Central processing unit (CPU) is the computing and control core of a computing device and the final execution unit for information processing and program execution.
  • Register A high-speed storage component inside the CPU with limited storage capacity, used to temporarily store instructions, data, and addresses.
  • BIOS Basic Input Output System, is the firmware that performs hardware initialization during the power-on phase and provides runtime services for the operating system. BIOS is mostly stored on flash memory chips to facilitate BIOS updates.
  • BMC Baseboard Management Controller, which can upgrade the firmware of the device, manage the operation status of the device, and troubleshoot faults when the computing device is not turned on.
  • the baseboard management controller can maintain the program code in the memory of the computing device, including upgrading or restoring it.
  • the baseboard management controller can also control the power circuit or clock circuit in the computing device.
  • each computer device may include multiple memory bars (Dimm), each memory bar has two memory ranks, which are located on two sides of the memory, for example, the two memory ranks are memory rank 0 and memory rank 1.
  • Multiple memory chips can be configured on each memory rank for storing data.
  • the memory chip can also be called a memory particle (device).
  • the memory chip can be a dynamic random access memory (DRAM), a static random access memory (SRAM), etc.
  • Each memory chip can be divided into multiple storage arrays (bank). When the memory chip stores data, the data is written into a storage array in bits.
  • multiple storage arrays can be grouped into a storage array group (bankgroup), wherein the number of storage arrays in each storage array group can be the same or different.
  • the storage array is composed of a large number of storage cells, which are arranged in a two-dimensional matrix form. As long as the row (row) and column (column) on the storage array are specified, a storage cell can be located on the storage array.
  • the minimum unit of memory failure is the storage cell on the storage array. In other words, memory failure includes at least one of: Dimm failure, Rank failure, Device failure, Bank failure, Row failure/Column failure, Cell failure and Bit failure.
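The hierarchy described above (Dimm, Rank, Device, BankGroup, Bank, row/column) can be sketched as a small record type. This is only an illustrative sketch: the names `FaultType`, `FaultLocation`, `FaultRecord`, and `count_ce_per_dimm` are hypothetical and do not come from the application, which does not specify any data layout.

```python
from dataclasses import dataclass
from enum import Enum


class FaultType(Enum):
    """Fault classes named in the text: correctable and uncorrectable errors."""
    CE = "correctable"
    UCE = "uncorrectable"


@dataclass(frozen=True)
class FaultLocation:
    """Hypothetical address of a failing cell, from coarsest to finest level."""
    dimm: int
    rank: int
    device: int
    bank_group: int
    bank: int
    row: int
    column: int


@dataclass(frozen=True)
class FaultRecord:
    """One recorded fault: where it happened and which class it belongs to."""
    location: FaultLocation
    fault_type: FaultType


def count_ce_per_dimm(records):
    """Return a dict mapping each Dimm index to its number of CE records."""
    counts = {}
    for record in records:
        if record.fault_type is FaultType.CE:
            dimm = record.location.dimm
            counts[dimm] = counts.get(dimm, 0) + 1
    return counts
```

Keeping the location as an immutable value makes it possible to compare two records for the same cell with `==`, which is one plausible way to tell a repeated fault from a new one.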
  • Latch A storage unit circuit that is sensitive to pulse levels: it changes state under the action of a specific input pulse level and temporarily stores the signal to maintain a certain level state.
  • the computing device in the embodiment of the present application takes a server as an example.
  • the existing fault detection method for the server memory stick can be specifically shown in Figure 1.
  • the processor 1 detects the CE of the memory stick. Specifically, the processor 1 will record the corresponding fault information (including the fault address, the faulty Dimm, the faulty Rank, Device, Bank Group, Bank, row/column, Cell fault, bit fault and other information) into the register 2 of the processor.
  • the register 2 counts the number of memory CEs. When it exceeds the threshold, the processor 1 triggers an interrupt.
  • the BIOS chip 3 responds to the interrupt and collects all the memory-related fault information recorded in the register 2 in the interrupt service program and reports it to the server out-of-band management system BMC.
  • the out-of-band management system BMC displays the memory error alarm of the corresponding slot of the faulty memory stick on its interface based on the above fault information.
  • the above method requires the operation and maintenance personnel to pay attention to the BMC interface in real time, and it cannot display the error status of the memory bar very intuitively.
  • if the operation and maintenance personnel do not notice the prompt for the faulty memory bar on the BMC interface in time, they cannot replace the faulty memory bar or handle it accordingly, and the server system will be at risk of downtime.
  • in addition, because the operation and maintenance personnel cannot observe the error status of the memory bar intuitively, there is a certain risk of replacing the wrong memory bar.
  • an embodiment of the present application provides a device for detecting a memory bar fault.
  • when a memory bar fault is detected, the status indicator in the detection device sends an alarm signal, so that the fault state of the memory bar can be observed intuitively.
  • Figure 2 is a schematic diagram of the architecture of the memory bar fault detection device provided by this embodiment.
  • the memory stick fault detection device 10 provided in the embodiment of the present application includes a memory management controller 13, a latch 11, and a status indicator 12, wherein the input end of the memory management controller 13 is connected to the memory stick, the output end is connected to the input end of the latch 11, and the output end of the latch 11 is connected to the status indicator 12.
  • the memory management controller 13 is used to read the CE information of the memory bar and perform related actions.
  • the memory management controller 13 is used to detect the correctable error CE (Correctable Error) of the memory bar. Specifically, when the memory bar stores data, the central processor inside the server has a certain error correction capability for the memory data. Under normal circumstances, when the number of CEs in the memory bar is within a certain range, the data stored in the memory bar can be considered acceptable to the server, and there is no risk of server downtime. However, when the number of CEs in the memory bar exceeds the preset threshold value, a UCE is considered likely to arise among these memory CEs, and once the memory generates a UCE, the server will restart abnormally or go down.
  • the preset threshold value can be set to 6000.
  • the preset threshold value is a pre-set value and can be set in the memory management controller 13 according to the capacity of the specific memory bar.
  • the memory management controller 13 is used to record the corresponding fault information (including fault address, fault type (including CE and UCE), faulty Dimm, faulty Rank, Device, BankGroup, Bank, row/column, Cell fault, bit fault, etc.) into the memory (not shown in the figure) in the memory management controller 13 when the CE fault in the memory is triggered due to being accessed.
  • the memory may be arranged in the memory management controller 13, and the memory is used to store and record the fault information of the memory bar.
  • the memory may be any one of a register, a DRAM (dynamic random access memory), and an SRAM (static random access memory).
  • the memory may also be arranged in the memory stick 20, in which case the memory management controller 13 is further used to directly access the memory in the memory stick to obtain the corresponding memory fault information.
  • the memory management controller 13 is also used to count the number of CEs of the memory bar recorded in the memory and compare that number with its preset threshold. When the number of CEs counted by the memory management controller 13 is greater than the preset threshold, the memory management controller 13 sends a latch signal to the latch 11, so that the latch 11 continuously outputs a high level.
  • a counter and a processor are provided in the memory management controller 13.
  • the counter is used to respond to the command of the processor and count the number of CEs of the memory bars in the memory.
  • the processor is used to send a command to the counter to make it count the number of CEs in the memory bar.
  • the processor is also used to compare the number of CEs counted by the counter with its preset threshold value. When the number of CEs counted by the counter is greater than its preset threshold value, the data counted by the counter is cleared and a latch signal is sent to the latch 11.
  • the preset threshold value can be set in advance according to the working environment and memory capacity of the specific memory module.
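The counter and processor arrangement described above can be sketched as follows. This is a minimal software model under stated assumptions, not the application's implementation: `CeCounter`, `ThresholdProcessor`, and the `send_latch_signal` callback are hypothetical names, and the real device would be hardware or firmware rather than Python objects.

```python
class CeCounter:
    """Counts correctable errors (CE) on command; can be cleared."""

    def __init__(self):
        self.count = 0

    def record(self, n=1):
        """Add n newly observed correctable errors to the running total."""
        self.count += n

    def clear(self):
        """Reset the running total to zero."""
        self.count = 0


class ThresholdProcessor:
    """Compares the counter value with a preset threshold value."""

    def __init__(self, counter, threshold, send_latch_signal):
        self.counter = counter
        self.threshold = threshold
        # Hypothetical callback standing in for the wire to the latch.
        self.send_latch_signal = send_latch_signal

    def check(self):
        """Clear the counter and signal the latch if the count exceeds the threshold."""
        if self.counter.count > self.threshold:
            self.counter.clear()
            self.send_latch_signal()
            return True
        return False
```

Note that the comparison is strictly "greater than", matching the text: a count equal to the threshold does not trigger the latch signal.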
  • the memory management controller 13 may include a BIOS chip 14 and a BMC 4. See FIG. 3: the input end of the BIOS chip 14 is connected to the memory bar 20, the BIOS chip 14 is connected to the BMC 4, the output end of the BMC 4 is connected to the input end of the latch 11, and the output end of the latch 11 is connected to the status indicator 12. It should be noted that the BIOS chip 14 and the BMC 4 may be the BIOS chip and the BMC of the computing device itself, which can save costs.
  • the BIOS chip 14 is used to store the BIOS program, which runs on the BIOS chip 14; the BIOS chip 14 is also used to obtain fault information in the memory bar 20 and to count the number of CEs in the memory bar.
  • the memory may be arranged in the BIOS chip 14, and the memory is used to store and record the fault information of the memory bar.
  • the memory may be any one of a register, a DRAM (dynamic random access memory), and an SRAM (static random access memory).
  • alternatively, the memory is arranged on the memory stick 20, and the BIOS chip 14 is also used to directly access the memory in the memory stick to obtain the corresponding memory fault information.
  • BMC4 is used to receive all memory fault information (including the number of CEs of the memory sticks) reported by the BIOS chip 14, and compare the number of CEs of the memory sticks counted by the BIOS chip 14 with its preset threshold. When the number of CEs of the memory sticks is greater than its preset threshold, BMC4 sends an interrupt signal to the BIOS chip 14 and sends a latch signal to the latch 11, so that the latch 11 continues to output a high level.
  • the BIOS chip 14 is also used to respond to the interrupt signal sent by the BMC 4, stop counting the number of CEs of the memory stick and clear the previous statistical data.
  • the preset threshold value can be set specifically according to the working environment and memory capacity of the specific memory stick.
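The division of work between the BIOS chip 14 and the BMC 4 can be sketched as two cooperating objects. This is a hedged illustration only: the class and method names (`BiosChip`, `Bmc`, `handle_interrupt`, `receive_report`, `latch_set`) are hypothetical, and the actual reporting channel between BIOS and BMC is not specified in the text.

```python
class BiosChip:
    """Models the BIOS side: counts CEs and reports the count to the BMC."""

    def __init__(self):
        self.ce_count = 0
        self.counting = True

    def on_correctable_error(self):
        """Count one CE, unless counting was stopped by an interrupt."""
        if self.counting:
            self.ce_count += 1

    def report(self):
        """Return the current CE count, as it would be reported to the BMC."""
        return self.ce_count

    def handle_interrupt(self):
        """Respond to the BMC interrupt: stop counting and clear the statistics."""
        self.counting = False
        self.ce_count = 0


class Bmc:
    """Models the BMC side: compares the reported count with the threshold."""

    def __init__(self, threshold, latch_set):
        self.threshold = threshold
        # Hypothetical callback standing in for the latch signal line.
        self.latch_set = latch_set

    def receive_report(self, bios):
        """Interrupt the BIOS and signal the latch if the count exceeds the threshold."""
        if bios.report() > self.threshold:
            bios.handle_interrupt()
            self.latch_set()
            return True
        return False
```

After the interrupt, this sketch leaves counting stopped, as the text describes; how and when counting resumes (for example after the memory stick is replaced) is not stated in the source and is left out here.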
  • the memory management controller 13 can also be a BIOS chip 14 or BMC4.
  • the BIOS chip 14 or BMC4 independently executes the acquisition of memory bar fault information, the counting of the number of memory bar CEs, and the comparison of the number of CEs with a preset threshold value, and sends a latch signal to the latch.
  • the execution principle is the same as the aforementioned embodiment.
  • the latch 11 is used to continuously output a high level after receiving a latch signal sent by the memory management controller 13, so that the state indicator 12 sends an alarm signal.
  • multiple memory sticks and multiple memories can be set inside the server, and each memory is used to store fault information of the memory stick corresponding thereto.
  • multiple latches can be set, and the multiple latches correspond one-to-one to the multiple memory sticks.
  • the multiple latches are connected to multiple status indicators.
  • VDD is the power supply voltage of the latch 11; its value depends on the type of latch 11 selected.
  • the latch 11 can be selected as an RS latch, 74L373, etc., as long as the function of latching the level can be realized, and the specific type of the latch 11 is not specifically limited here.
  • when the memory stick is operating normally, the output terminal ALERT_n, which connects the memory management controller 13 to the latch 11, maintains a high level signal. At this time, the latch 11 outputs an invalid signal.
  • when the memory management controller 13 detects that the number of CE failures of the memory stick exceeds the preset threshold value, the memory stick is considered faulty.
  • the output terminal ALERT_n of the memory management controller 13 then outputs a low level. After receiving the low level signal, the latch 11 continuously outputs a high level, thereby causing the status indicator to work and send out an alarm signal.
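The latching behavior described above, where a low level on ALERT_n sets the latch and the output then stays high, can be modeled in a few lines. This is a behavioral sketch only: `AlertLatch` and `drive` are hypothetical names, logic levels are represented as booleans (True for high), and no reset path is modeled because the text does not describe one.

```python
class AlertLatch:
    """Behavioral model of a latch set by an active-low ALERT_n input."""

    def __init__(self):
        # Output starts low, i.e. the "invalid" (inactive) signal in the text.
        self.q = False

    def drive(self, alert_n):
        """Apply one ALERT_n level (True = high, False = low) and return the output."""
        if not alert_n:
            # A low level on ALERT_n sets the latch.
            self.q = True
        # Once set, the output stays high even if ALERT_n returns high.
        return self.q
```

The point of the latch is visible in the last case: even a brief low pulse on ALERT_n keeps the status indicator on, so the alarm does not depend on the controller holding the line low.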
  • the status indicator 12 may be any one of an indicator light, an LCD display, a digital tube, a buzzer, or other devices with a warning function, which is not particularly limited here.
  • FIG. 6 illustrates a status indicator installed on a memory stick.
  • the memory stick 20 is connected to the memory management controller 13 and is detected by the memory management controller 13.
  • the memory stick 20 includes a memory chip 16, a PCB board 17 and pins 18.
  • the memory chip 16 is used to store data
  • the pins 18 are used to provide connections between each memory chip 16 and the memory management controller 13 and other components on the server, and the pins 18 are also used to fix the memory bar 20 inside the server.
  • the PCB board 17 is used to achieve the connection between the memory chip 16 and the memory management controller 13 and other components of the server, and the PCB board 17 is also used to provide support and fixation for the memory chip 16.
  • the status indicator 12 is disposed on the memory stick 20. On the one hand, it is convenient to intuitively observe the fault information of the memory stick. On the other hand, the status indicator 12 is directly disposed on the PCB board 17 of the memory stick, and the power supply on the PCB board 17 can be directly used to provide power supply for the status indicator 12.
  • the status indicator 12 can also be set on a plug-in board connected to the memory stick 20 (not shown in the figure). As long as it is convenient to find the location of the faulty memory stick, there is no specific restriction on the installation position of the status indicator 12.
  • with the memory stick detection device 10 provided in the above embodiment, when the memory management controller 13 detects a memory stick failure, the specific faulty memory stick can be identified visually through the alarm signal emitted by the status indicator and handled promptly, thereby eliminating the risk of incorrect replacement and reducing the risk of server system downtime.
  • the present application also provides a method for detecting a memory bar fault, which is applied to the fault detection device provided in the above device embodiment.
  • the method includes:
  • S11: the memory fault detection device obtains fault information of the memory bar, and determines the number of correctable errors CE of the memory bar according to the fault information of the memory bar;
  • S12: the memory fault detection device determines the faulty memory bar whose number of correctable errors CE exceeds the threshold value, and sends a latch signal to the latch corresponding to the faulty memory bar; the latch signal is used to: cause the status indicator to send an alarm signal.
  • the memory bar fault detection device further includes a memory management controller, a latch, and a status indicator, the input end of the memory management controller is connected to the memory bar, the output end of the memory management controller is connected to the input end of a latch, the output end of the latch is connected to a status indicator, and the memory management controller performs the above steps S11-S12.
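The two method steps can be sketched end to end for several memory bars, each with its own latch. This is an assumption-laden sketch: the `fault_info` layout (a dict keyed by slot name with a `"ce_count"` field), the slot names, and both function names are hypothetical; the threshold of 6000 is the example value given earlier in the text.

```python
def find_faulty_bars(fault_info, threshold):
    """Steps S11 and S12 (first half): read CE counts and pick bars over the threshold.

    fault_info maps a slot name to a dict that contains a "ce_count" entry
    (hypothetical layout; the text only says the fault information includes
    the number of correctable errors).
    """
    ce_counts = {slot: info["ce_count"] for slot, info in fault_info.items()}
    return sorted(slot for slot, count in ce_counts.items() if count > threshold)


def signal_latches(faulty_slots, latches):
    """Step S12 (second half): set the latch that corresponds to each faulty bar.

    latches maps a slot name to a boolean output level; True means the latch
    is set and its status indicator is alarming.
    """
    for slot in faulty_slots:
        latches[slot] = True
    return latches
```

A possible use, with three hypothetical slots:

```python
fault_info = {
    "DIMM000": {"ce_count": 12},
    "DIMM010": {"ce_count": 6001},
    "DIMM020": {"ce_count": 6000},
}
latches = {slot: False for slot in fault_info}
signal_latches(find_faulty_bars(fault_info, 6000), latches)
# Only the latch for DIMM010 is set; a count equal to the threshold does not trigger.
```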
  • the memory management controller can be selected from any one of a processor, a BIOS chip, a DSP chip, and a CPLD.
  • the memory management controller may further include a counter and a processor.
  • the memory management controller may further include a BIOS chip and a baseboard management controller BMC.
  • the memory stick fault detection device described in the aforementioned embodiment can be widely used in different computing scenarios, such as but not limited to supercomputers, HPC (High Performance Computing), intensive computing servers and the like.
  • a computer-readable storage medium is also provided, which is used to store at least one instruction, at least one program, code set or instruction set, and the at least one instruction, the at least one program, the code set or instruction set is loaded and executed by a processor to implement all or part of the steps in the above-mentioned memory bar fault detection method.
  • the computer-readable storage medium can be a read-only memory (ROM), a random access memory (RAM), a compact disc read-only memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, etc.
  • a computer program product or a computer program is also provided, the computer program product or the computer program comprising computer instructions, the computer instructions being stored in a computer-readable storage medium.
  • a processor of a computing device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computing device executes all or part of the steps of the method shown in any embodiment of FIG. 7 above.
  • the method shown in the embodiments of the present application can be implemented as computer program instructions encoded in a machine-readable format on a computer-readable storage medium or encoded on other non-transitory media or products.
  • the computing device can be a server or a personal computer (PC).
  • PC personal computer

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Techniques For Improving Reliability Of Storages (AREA)

Abstract

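This application discloses a fault detection device and method for memory modules, relating to the field of memory fault detection for computing devices. When a memory fault occurs, the faulty memory module can be identified promptly, which improves the accuracy and timeliness of memory fault detection. The fault detection device includes a memory management controller and a latch. An input of the memory management controller is connected to the memory module, and an output of the memory management controller is connected to one input of the latch. The memory management controller is configured to detect correctable errors (CEs) of the memory module and determine the number of CEs of the memory module; when the number of CEs of the memory module exceeds a preset threshold, the memory management controller sends a latch signal to the latch.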

Description

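Memory module fault detection device and detection method
This application claims priority to Chinese patent application No. 202211274554.1, entitled "Memory module fault detection device and detection method", filed with the Chinese Patent Office on October 18, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of memory fault detection for computing devices, and in particular to a fault detection device and method for a memory module of a computing device.
Background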
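A memory module (Random Access Memory), also known as random access memory, is internal memory that exchanges data directly with the processor (CPU) and typically serves as temporary data storage for the operating system or other running programs. During operation, information can be written to or read from any specified address of a memory module at any time, and the memory module is the most heavily used and highest-value component in a computing device. However, as memory architectures evolve, memory process nodes keep shrinking and operating frequencies keep rising in order to fit more storage cells into the same silicon area and to read and write them faster; to control heat, operating voltages have had to keep falling. As a result, the reliability of the memory medium itself keeps declining and failure rates keep rising. Memory faults are now one of the most common sources of failure in computing devices, and a severe memory fault can cause a computing device to restart abnormally or crash. Detecting the health status of memory modules is therefore especially important: monitoring makes it possible to identify and handle at-risk memory in advance and to reduce the risk of a system crash.
A memory module is a passive component and cannot determine its own health status. At present, identifying the health status of a memory module relies on the log records of the out-of-band management system (BMC). This approach is not intuitive: on the one hand, operation and maintenance personnel cannot discover memory faults promptly; on the other hand, the wrong memory module may be replaced during maintenance. Memory faults can be divided into correctable errors (Correctable Error, hereinafter CE) and uncorrectable errors (UnCorrectable Error, hereinafter UCE). A CE can be corrected by the central processing unit and does not affect normal operation of the system. When the number of CEs of a memory module reaches a certain value, however, there is considered to be a risk of a UCE. A UCE cannot be corrected by the processor and causes the server to restart abnormally or crash. How to detect the CEs of memory promptly and accurately has therefore become an urgent problem.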
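Summary
The embodiments of this application provide a fault detection device and method for memory modules that improve the timeliness and accuracy of memory module fault detection, help identify and replace a faulty memory module promptly, and reduce the risk of the computing device crashing.
To achieve the above objective, this application adopts the following technical solutions: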
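In a first aspect, a fault detection device for a memory module is provided. The fault detection device includes a memory management controller and a latch. An input of the memory management controller is connected to the memory module, and an output of the memory management controller is connected to one input of the latch. The memory management controller is configured to detect correctable errors (CEs) of the memory module and determine the number of CEs of the memory module; when the number of CEs of the memory module exceeds a preset threshold, the memory management controller sends a latch signal to the latch.
In this technical solution, the memory module is checked for faults by the fault detection device. Because the fault detection device includes a memory management controller and a latch, the memory management controller can obtain the fault information of the memory module and determine the number of CEs from it. When the number of CEs exceeds the preset threshold, the memory management controller sends a latch signal to the latch. The latch signal indicates that the number of CEs of the memory module has reached a critical value and that the memory module needs to be repaired or replaced promptly. The latch signal sent by the memory management controller allows CEs of the memory module to be discovered promptly, which improves the timeliness of memory module fault detection and reduces the risk of the computing device crashing.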
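In a possible implementation, the memory management controller further includes a counter and a processor. The counter is configured to count the CEs, and the processor is configured to compare the number of CEs counted by the counter with the preset threshold and, when the number of CEs is greater than the preset threshold, to send the latch signal to the latch.
In this possible implementation, the memory management controller further includes a counter and a processor: the counter counts the CEs of the memory module and the processor compares that count with the preset threshold, which makes the count of CEs of the memory module more accurate.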
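In a possible implementation, the memory management controller includes a BIOS chip and a baseboard management controller (BMC). The BIOS chip obtains the fault information of the memory module and determines the number of CEs from it, where the fault information of the memory module includes the number of CEs. The BIOS chip may also send the number of CEs to the BMC. The BMC receives the number of CEs sent by the BIOS chip, compares it with the preset threshold, and, when the number of CEs is greater than the preset threshold, sends the latch signal to the latch.
This possible implementation gives another way of building the memory management controller: the BIOS chip and BMC that the computing device already has can implement the functions of the memory management controller, which further reduces the cost of the memory module fault detection device.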
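In a possible implementation, the fault detection device further includes a storage unit arranged in the memory module or in the memory management controller. The storage unit is configured to store and record the fault information of the memory module.
In this possible implementation, arranging a storage unit for the fault information in the memory module or in the memory management controller makes the fault information of the memory module less likely to be lost and makes the memory module easier to maintain.
In a possible implementation, there are multiple memory modules and multiple latches in one-to-one correspondence with the memory modules. When the number of CEs exceeds the preset threshold, the memory management controller determines the target faulty memory module that exceeds the threshold and sends the latch signal to the latch corresponding to that target faulty memory module. The target faulty memory module is one or more of the multiple memory modules.
In this possible implementation, when there are multiple memory modules, multiple latches are provided, one connected to each memory module, so that the memory management controller can detect faults on all of them: when any memory module fails, the memory management controller can send the latch signal to the latch corresponding to that memory module.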
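In a possible implementation, the fault detection device further includes a status indicator connected to the output of the latch. The status indicator is configured to receive the high level output by the latch and issue an alarm signal when the latch has received the latch signal sent by the memory management controller.
In this possible implementation, providing a status indicator in the fault detection device makes the display of memory module faults more intuitive.
In a possible implementation, the status indicator is an indicator light and/or a buzzer.
In this possible implementation, making the status indicator an indicator light or a buzzer lets the fault information of the memory module be conveyed by a light signal or a sound signal, which further improves the intuitiveness and effectiveness of the fault display.
In a possible implementation, the memory module further includes a PCB circuit board, and the status indicator is arranged on the PCB circuit board or on a slot of the computing device, the slot being used to install the memory module.
In this possible implementation, arranging the status indicator on the circuit board of the memory module or on the slot in which the memory module is installed makes the status indicator easier to install and maintain.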
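In a possible implementation, the computing device is a server.
In this possible implementation, the fault detection device can be applied to a server, which improves the timeliness and accuracy of server memory fault detection and reduces the risk of the server crashing.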
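In a second aspect, an embodiment of this application provides a memory module fault detection method for use in a memory module fault detection device. The memory module fault detection device includes a memory management controller, a latch, and a status indicator. An input of the memory management controller is connected to the memory module, an output of the memory management controller is connected to an input of a latch, and an output of the latch is connected to a status indicator. The method includes the following steps:
S11: the memory fault detection device obtains the fault information of the memory module and, according to the fault information of the memory module, determines the number of correctable errors (CEs) of the memory module;
S12: when the number of CEs of the memory module is greater than a preset threshold, the memory fault detection device determines the faulty memory module that exceeds the threshold and sends a latch signal to the latch corresponding to the faulty memory module; the latch signal is used to cause the status indicator to issue an alarm signal.
In this technical solution, the memory module fault detection method allows the fault information of the memory module to be obtained and the number of CEs to be determined from it. When the number of CEs exceeds the preset threshold, a latch signal is sent to the latch. The latch signal indicates that the number of CEs of the memory module has reached a critical value and that the memory module needs to be repaired or replaced promptly. The latch signal allows CEs of the memory module to be discovered promptly, which improves the timeliness of memory module fault detection and reduces the risk of the computing device crashing.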
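In a third aspect, an embodiment of this application provides a computer-readable storage medium storing at least one computer program, the computer program being loaded and executed by a processor to implement the memory module fault detection method of the second aspect.
In a fourth aspect, an embodiment of this application provides a computer program product or computer program that includes computer instructions stored in a computer-readable storage medium. A processor of a terminal reads the computer instructions from the computer-readable storage medium and executes them, so that the terminal performs the memory module fault detection method provided in the various optional implementations of the second aspect.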
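Brief Description of the Drawings
To explain the solutions in the embodiments of this application or in the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below.
FIG. 1 is a schematic diagram of memory fault detection in the related art;
FIG. 2 is a schematic diagram of a memory module fault detection device provided by an embodiment of this application;
FIG. 3 is a schematic diagram of another memory module fault detection device provided by an embodiment of this application;
FIG. 4 is a schematic connection diagram for fault detection of multiple memory modules provided by an embodiment of this application;
FIG. 5 is a schematic diagram of a detection circuit provided by an embodiment of this application;
FIG. 6 is a schematic diagram of the installation of a status indicator provided by an embodiment of this application;
FIG. 7 is a flowchart of a memory module fault detection method provided by an embodiment of this application.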
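Detailed Description
The terms used in this section serve only to explain specific embodiments of this application and are not intended to limit this application. The embodiments of this application are described in detail below with reference to the drawings.
The terms used in this application are introduced first.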
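CPU: central processing unit, the computing and control core of a computing device and the final execution unit for information processing and program execution.
Register: a high-speed storage component of limited capacity inside the CPU, used to temporarily hold instructions, data, and addresses.
BIOS: Basic Input Output System, firmware that performs hardware initialization during power-on startup and provides runtime services to the operating system. The BIOS is usually stored on a flash memory chip so that it can be updated easily.
BMC: Baseboard Management Controller, which can upgrade the firmware of a device, manage its operating state, and troubleshoot it while the computing device is powered off. The BMC can maintain the program code in the memory of the computing device, including upgrading or restoring it. The BMC can also control the power supply circuit or clock circuit in the computing device.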
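Row fault: a correctable error (CE) fault or an uncorrectable error (UCE) fault occurring in a row (Row) of memory. The physical granularities of memory, from largest to smallest, are: Dimm, Rank, Device, Bank, Row/Column, Cell, Bit. They are related as follows. Each computing device may include multiple memory modules (Dimm). Each memory module has two ranks (rank), one on each side of the module, for example rank 0 (rank0) and rank 1 (rank1). Each rank may carry multiple memory chips (Chip) for storing data; a memory chip may also be called a memory device (device) and may be dynamic random access memory (DRAM), static random access memory (SRAM), or the like. Each memory chip may be divided into multiple banks (bank); when a memory chip stores data, the data is written into a bank in units of bits (bit). Multiple banks may also be grouped into a bank group (bankgroup), and the number of banks in each bank group may be the same or different. A bank consists of a large number of storage cells (cell) arranged as a two-dimensional matrix; specifying a row (row) and a column (column) of the bank locates one storage cell. The smallest unit in which memory can fail is a storage cell of a bank. In other words, a memory fault includes at least one of: a Dimm fault, a Rank fault, a Device fault, a Bank fault, a Row fault/Column fault, a Cell fault, and a Bit fault.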
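The granularity hierarchy above maps naturally onto a fault record keyed by location. The following is a minimal sketch, assuming a record layout of our own choosing; the class name `FaultRecord`, its field names, and the helper `ce_count_per_dimm` are illustrative and do not appear in the application.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultRecord:
    """One logged memory fault, located by the granularities listed above.

    All field names are hypothetical; the application only lists the
    granularities, not a concrete record format.
    """
    dimm: int          # memory module (Dimm) slot index
    rank: int          # rank on the module
    device: int        # memory chip (device) on the rank
    bank_group: int    # bank group inside the chip
    bank: int          # bank inside the bank group
    row: int           # row inside the bank
    column: int        # column inside the bank
    kind: str = "CE"   # "CE" (correctable) or "UCE" (uncorrectable)


def ce_count_per_dimm(records):
    """Count correctable errors per memory module, ignoring UCE records."""
    return Counter(r.dimm for r in records if r.kind == "CE")


records = [
    FaultRecord(dimm=0, rank=0, device=3, bank_group=1, bank=2, row=100, column=7),
    FaultRecord(dimm=0, rank=1, device=3, bank_group=1, bank=2, row=100, column=9),
    FaultRecord(dimm=2, rank=0, device=1, bank_group=0, bank=0, row=5, column=5, kind="UCE"),
]
counts = ce_count_per_dimm(records)
```

Grouping by `dimm` is what the detection device ultimately needs, since the alarm is raised per memory module rather than per cell.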
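Latch: a storage cell circuit that is sensitive to pulse levels. It changes state in response to a specific input pulse level and holds the signal so as to maintain a given output level.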
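In the embodiments of this application, a server is taken as an example of the computing device. An existing fault detection method for server memory modules is shown in FIG. 1. Processor 1 detects CEs of the memory module. Specifically, processor 1 records the corresponding fault information (including the fault address, the faulty Dimm, the faulty Rank, Device, BankGroup, Bank, row/column, Cell fault, bit fault, and other information) in register 2 of the processor, and register 2 counts the memory CEs. When the count exceeds the threshold, processor 1 triggers an interrupt. BIOS chip 3 responds to this interrupt and, in the interrupt service routine, collects all memory-related fault information recorded in register 2 and reports it to the server's out-of-band management system (BMC). Based on this fault information, the BMC displays a memory error alarm for the slot of the faulty memory module in its interface.
However, this method requires operation and maintenance personnel to watch the BMC interface in real time, and it does not display the error state of the memory module intuitively. On the one hand, if the personnel do not notice the BMC's warning about the faulty memory module in time, the module cannot be replaced or otherwise handled promptly, and the server system is at risk of crashing. On the other hand, because the personnel cannot observe the error state of the memory module directly, there is some risk of replacing the wrong memory module.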
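On this basis, an embodiment of this application provides a memory module fault detection device. When the processor detects a memory fault, a status indicator in the detection device issues an alarm signal, so that the fault state of the memory module can be observed directly. FIG. 2 is a schematic architecture diagram of the memory module fault detection device provided by this embodiment.
Referring to FIG. 2, the memory module fault detection device 10 provided by this embodiment includes memory management controller 13, latch 11, and status indicator 12. The input of memory management controller 13 is connected to the memory module and its output is connected to the input of latch 11; the output of latch 11 is connected to status indicator 12.
Memory management controller 13 is configured to read the CE information of the memory module and perform the related actions.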
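Memory management controller 13 is configured to detect correctable errors (CEs) of the memory module. Specifically, when the memory module stores data, the central processing unit inside the server has a certain ability to correct errors in memory data. As long as the number of CEs in the memory module stays within a certain range, the data stored in the memory module is therefore considered acceptable to the server, and the server is not at risk of crashing. When the number of CEs in the memory module exceeds the preset threshold, however, some of these CEs are considered likely to develop into UCEs, and once a UCE occurs the server will restart abnormally or crash. Operation and maintenance personnel therefore need to locate and promptly replace any memory module at risk of a UCE. For example, when the storage capacity of the memory module is 1 billion bits, the preset threshold may be set to 6000. The preset threshold is a value set in advance and can be configured in memory management controller 13 according to the capacity of the specific memory module.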
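The example figure above (a threshold of 6000 for a module of 1 billion bits) can be turned into a small calculation. The sketch below assumes, purely for illustration, that the threshold scales linearly with capacity; the application only says that the threshold is set according to the capacity of the module, so the linear rule and the name `scaled_threshold` are assumptions rather than the applicant's method.

```python
# Reference point taken from the text: 6000 CEs for a module of 1e9 bits.
REFERENCE_CAPACITY_BITS = 10**9
REFERENCE_THRESHOLD = 6000


def scaled_threshold(capacity_bits: int) -> int:
    """Return a CE threshold proportional to module capacity.

    ASSUMPTION: linear scaling is a hypothetical rule chosen for this sketch.
    Integer arithmetic avoids floating-point rounding.
    """
    if capacity_bits <= 0:
        raise ValueError("capacity_bits must be positive")
    return capacity_bits * REFERENCE_THRESHOLD // REFERENCE_CAPACITY_BITS


threshold_1g = scaled_threshold(10**9)
threshold_8g = scaled_threshold(8 * 10**9)
```

In practice the threshold would also depend on the operating environment, as the description notes later, so a fixed per-capacity formula is only a starting point.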
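When a CE fault in memory is triggered by an access, memory management controller 13 records the corresponding fault information (including the fault address, the fault type (CE or UCE), the faulty Dimm, the faulty Rank, Device, BankGroup, Bank, row/column, Cell fault, bit fault, and other information) in a storage unit (not shown in the figures) in memory management controller 13.
The storage unit may be arranged in memory management controller 13 and is configured to store and record the fault information of the memory. Optionally, the storage unit may be any one of a register, DRAM (dynamic random access memory), or SRAM (static random access memory).
Optionally, the storage unit may instead be arranged in memory module 20, in which case memory management controller 13 is further configured to access the storage unit in the memory module directly to obtain the corresponding memory fault information.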
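Memory management controller 13 is further configured to count the CEs of the memory module recorded in the storage unit and compare the count with the preset threshold. When the counted number of CEs is greater than the preset threshold, memory management controller 13 sends a latch signal to latch 11, so that latch 11 outputs a high level continuously.
Specifically, memory management controller 13 contains a counter and a processor (not shown in the figures). The counter responds to commands from the processor and counts the CEs of the memory module recorded in the storage unit.
The processor sends commands to the counter to make it count the CEs of the memory module. The processor also compares the number of CEs counted by the counter with the preset threshold; when the counted number is greater than the preset threshold, it clears the counter and sends the latch signal to latch 11.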
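The counter and processor interaction described above can be sketched in a few lines. This is a simplified software model, not the hardware design: the class name `CeMonitor` and the callback `send_latch_signal` are hypothetical, and the counter and processor are folded into one object for brevity.

```python
class CeMonitor:
    """Models the counter plus the processor's threshold comparison."""

    def __init__(self, threshold, send_latch_signal):
        self.threshold = threshold                  # preset threshold
        self.send_latch_signal = send_latch_signal  # hypothetical callback to the latch
        self.count = 0                              # the counter

    def record_ce(self):
        """Count one CE; on exceeding the threshold, clear and signal.

        Returns True if the latch signal was sent for this CE.
        """
        self.count += 1
        if self.count > self.threshold:   # strictly greater, as in the text
            self.count = 0                # clear the counter
            self.send_latch_signal()      # notify the latch
            return True
        return False


signals = []
monitor = CeMonitor(threshold=3, send_latch_signal=lambda: signals.append("LATCH"))
results = [monitor.record_ce() for _ in range(5)]
```

With a threshold of 3, the fourth CE is the first to exceed it, so the signal fires once and the fifth CE starts a fresh count.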
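It should be noted that the preset threshold can be set in advance according to the operating environment and capacity of the specific memory module.
In one embodiment, memory management controller 13 may include BIOS chip 14 and BMC 4. Referring to FIG. 3, the input of BIOS chip 14 is connected to memory module 20, BIOS chip 14 is connected to BMC 4, the output of BMC 4 is connected to the input of latch 11, and the output of latch 11 is connected to status indicator 12. It should be noted that BIOS chip 14 and BMC 4 may be the BIOS chip and BMC that the computing device already has, which saves cost.
BIOS chip 14 stores the BIOS program, which runs on BIOS chip 14. BIOS chip 14 is also configured to obtain the fault information of memory module 20 and count the CEs of the memory module.
The storage unit may be arranged in BIOS chip 14 and is configured to store and record the fault information of the memory. Optionally, the storage unit may be any one of a register, DRAM (dynamic random access memory), or SRAM (static random access memory).
Optionally, the storage unit is arranged on memory module 20, in which case BIOS chip 14 is further configured to access the storage unit in the memory module directly to obtain the corresponding memory fault information.
BMC 4 is configured to receive all memory fault information reported by BIOS chip 14 (including the number of CEs of the memory module) and compare the number of CEs counted by BIOS chip 14 with the preset threshold. When the number of CEs is greater than the preset threshold, BMC 4 sends an interrupt signal to BIOS chip 14 and sends a latch signal to latch 11, so that latch 11 outputs a high level continuously.
BIOS chip 14 is further configured to respond to the interrupt signal sent by BMC 4 by stopping the counting of CEs of the memory module and clearing the previous count.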
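The division of work between the BIOS chip and the BMC can be illustrated with two small cooperating objects. This is a sketch under assumptions: the class and method names (`Bios`, `Bmc`, `on_ce`, `receive_report`, `on_interrupt`) are invented for illustration and do not correspond to any real BIOS or BMC interface.

```python
class Bios:
    """Counts CEs and reports the count; stops and clears on interrupt."""

    def __init__(self):
        self.ce_count = 0
        self.counting = True

    def on_ce(self):
        # Count a CE only while counting is enabled.
        if self.counting:
            self.ce_count += 1

    def report(self):
        # Value reported to the BMC.
        return self.ce_count

    def on_interrupt(self):
        # Response to the BMC interrupt: stop counting, clear the count.
        self.counting = False
        self.ce_count = 0


class Bmc:
    """Compares the reported count with the threshold and drives the latch."""

    def __init__(self, threshold, bios, latch_signals):
        self.threshold = threshold
        self.bios = bios
        self.latch_signals = latch_signals  # hypothetical stand-in for the latch input

    def receive_report(self, ce_count):
        if ce_count > self.threshold:
            self.bios.on_interrupt()            # interrupt signal to the BIOS chip
            self.latch_signals.append("LATCH")  # latch signal to the latch
            return True
        return False


latch_signals = []
bios = Bios()
bmc = Bmc(threshold=2, bios=bios, latch_signals=latch_signals)
for _ in range(3):
    bios.on_ce()
fired = bmc.receive_report(bios.report())
```

Keeping the comparison in the BMC means the alarm path stays under out-of-band control, which matches the stated goal of reusing components the computing device already has.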
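It should be noted that the preset threshold can be set according to the operating environment and capacity of the specific memory module.
In other embodiments, memory management controller 13 may be BIOS chip 14 alone or BMC 4 alone. In that case BIOS chip 14 or BMC 4 by itself obtains the fault information of the memory module, counts the CEs, compares the number of CEs with the preset threshold, and sends the latch signal to the latch; the operating principle is the same as in the preceding embodiment.
Latch 11 is configured to output a high level continuously after receiving the latch signal sent by memory management controller 13, so that status indicator 12 issues an alarm signal.
In one embodiment, referring to FIG. 4, the server may contain multiple memory modules and multiple storage units, each storage unit storing the fault information of its corresponding memory module. To make it easier to monitor the health status of the different memory modules, multiple latches may be provided in one-to-one correspondence with the memory modules, and the latches are connected to multiple status indicators. When the number of CE faults that memory management controller 13 has counted for any memory module is greater than the preset threshold, memory management controller 13 uses the memory fault information stored in the storage units to find the faulty memory module that exceeds the threshold and sends the latch signal to the latch corresponding to that faulty memory module.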
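The one-to-one mapping between memory modules and latches can be sketched as a lookup keyed by slot. The slot names, the dictionary layout, and the function names below are illustrative assumptions; a single shared threshold is used for simplicity, although the text allows a per-module threshold.

```python
def modules_over_threshold(ce_counts, threshold):
    """Return the slots whose CE count is strictly above the threshold."""
    return sorted(slot for slot, n in ce_counts.items() if n > threshold)


def send_latch_signals(ce_counts, threshold, latched):
    """Set the latch of every faulty module; other latches stay untouched."""
    targets = modules_over_threshold(ce_counts, threshold)
    for slot in targets:
        latched[slot] = True  # latch signal to the latch paired with this slot
    return targets


# Hypothetical per-slot CE counts read from the storage units.
ce_counts = {"DIMM0": 12, "DIMM1": 7000, "DIMM2": 6001, "DIMM3": 6000}
latched = {slot: False for slot in ce_counts}
targets = send_latch_signals(ce_counts, threshold=6000, latched=latched)
```

A module sitting exactly at the threshold (`DIMM3` here) is not flagged, because the description requires the count to be greater than the threshold.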
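Referring to FIG. 5, which is a schematic diagram of the detection circuit, the two inputs of latch 11 are connected to VDD and to an output of memory management controller 13, respectively. The output of latch 11 is connected to one end of status indicator 12, and the other end of status indicator 12 is connected to ground (Vss). VDD is the supply voltage of latch 11, and its value depends on the type of latch chosen. Optionally, latch 11 may be an R-S latch, a 74L373, or the like; any type that can latch a level is suitable, and the specific type of latch 11 is not limited here.
Specifically, while the memory module is operating normally, the output ALERT_n of memory management controller 13 that is connected to latch 11 holds a high level, and latch 11 outputs an inactive signal. When memory management controller 13 detects that the number of CE faults of the memory module exceeds the preset threshold, the memory module is considered faulty, and output ALERT_n of memory management controller 13 goes low. After receiving this low-level signal, latch 11 outputs a high level continuously, so that the status indicator operates and issues an alarm signal.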
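The latching behaviour on the active-low `ALERT_n` line can be modelled as a tiny state machine. This is a behavioural sketch only: the class `AlarmLatch` is hypothetical, and because the text does not describe how the latch is reset, no reset path is modelled.

```python
class AlarmLatch:
    """Behavioural model: output goes high on a low ALERT_n and stays high."""

    def __init__(self):
        self.output_high = False  # inactive output while the module is healthy

    def sample(self, alert_n: bool) -> bool:
        """Sample ALERT_n (True = high level, False = low level)."""
        if not alert_n:               # active-low: a low level sets the latch
            self.output_high = True
        return self.output_high      # once set, the output is held


latch = AlarmLatch()
# ALERT_n is high in normal operation, pulses low once, then returns high.
trace = [latch.sample(level) for level in (True, True, False, True, True)]
```

The held output is the point of using a latch: the indicator keeps alarming after `ALERT_n` returns high, so a brief pulse is not missed by maintenance personnel.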
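Optionally, status indicator 12 may be any one of an indicator light, an LCD display, a digital tube, or a buzzer, or another component that serves as a warning; no particular limitation is imposed here.
Referring to FIG. 6, which is a schematic diagram of a status indicator installed on a memory module, memory module 20 in FIG. 6 is connected to memory management controller 13 and is checked by it. Memory module 20 includes memory chips 16, PCB board 17, and pins 18.
Memory chips 16 store data. Pins 18 connect the memory chips 16 to memory management controller 13 and to other components of the server, and also fix memory module 20 inside the server. PCB board 17 connects memory chips 16 to memory management controller 13 and to other server components, and also supports and fixes memory chips 16.
In one embodiment of this application, status indicator 12 is arranged on memory module 20. On the one hand, this makes the fault information of the memory module easy to observe directly; on the other hand, arranging status indicator 12 directly on PCB board 17 of the memory module allows the power supply on PCB board 17 to power status indicator 12.
Optionally, status indicator 12 may instead be arranged on a plug-in board connected to memory module 20 (not shown in the figures). Any position that makes the faulty memory module easy to locate is suitable, and the installation position of status indicator 12 is not limited here.
With the memory module detection device 10 provided by the above embodiments, when memory management controller 13 detects a memory module fault, the alarm signal issued by the status indicator makes the specific faulty memory module directly observable, so that it can be handled promptly. This eliminates the risk of replacing the wrong module and reduces the risk of the server system crashing.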
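An embodiment of this application further provides a memory module fault detection method applied to the fault detection device provided by the above device embodiments. Referring to FIG. 7, the method includes:
S11: the memory fault detection device obtains the fault information of the memory module and, according to the fault information of the memory module, determines the number of correctable errors (CEs) of the memory module;
S12: when the number of CEs of the memory module is greater than a preset threshold, the memory fault detection device determines the faulty memory module that exceeds the threshold and sends a latch signal to the latch corresponding to the faulty memory module; the latch signal is used to cause the status indicator to issue an alarm signal.
In one embodiment, the memory module fault detection device further includes a memory management controller, a latch, and a status indicator. An input of the memory management controller is connected to the memory module, an output of the memory management controller is connected to an input of a latch, and an output of the latch is connected to a status indicator. Steps S11 and S12 are performed by the memory management controller.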
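Steps S11 and S12 can be read as a two-stage pipeline from raw fault information to an alarm. The sketch below assumes that fault information arrives as `(slot, kind)` pairs and that a latch is a boolean per slot; both representations and the function names are hypothetical simplifications.

```python
from collections import Counter


def s11_count_ce(fault_info):
    """S11: derive the CE count of each memory module from fault information."""
    return Counter(slot for slot, kind in fault_info if kind == "CE")


def s12_latch_faulty(ce_counts, threshold, latches):
    """S12: find modules above the threshold and set their latches."""
    faulty = sorted(slot for slot, n in ce_counts.items() if n > threshold)
    for slot in faulty:
        latches[slot] = True  # a set latch drives that module's status indicator
    return faulty


# Hypothetical fault log: three CEs on DIMM0, one CE and one UCE on DIMM1.
fault_info = [("DIMM0", "CE")] * 3 + [("DIMM1", "CE"), ("DIMM1", "UCE")]
latches = {"DIMM0": False, "DIMM1": False}
faulty = s12_latch_faulty(s11_count_ce(fault_info), threshold=2, latches=latches)
indicator_on = dict(latches)  # the indicator follows the latch output
```

Only CE records are counted in S11, so the UCE entry on `DIMM1` does not contribute to its count in this sketch.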
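In one implementation, the memory management controller may be any one of a processor, a BIOS chip, a DSP chip, or a CPLD.
In one implementation, the memory management controller may further include a counter and a processor.
In one implementation, the memory management controller may further include a BIOS chip and a baseboard management controller (BMC).
The memory module fault detection device described in the foregoing embodiments can be widely used in different computing scenarios, including but not limited to supercomputers, HPC (High Performance Computing) clusters, and compute-intensive servers.
In an exemplary embodiment, a computer-readable storage medium is also provided, which stores at least one instruction, at least one program, a code set, or an instruction set; the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement all or part of the steps of the above memory module fault detection method. For example, the computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a compact disc read-only memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, or the like.
In an exemplary embodiment, a computer program product or computer program is also provided, which includes computer instructions stored in a computer-readable storage medium. A processor of a computing device reads the computer instructions from the computer-readable storage medium and executes them, so that the computing device performs all or part of the steps of the method shown in any embodiment of FIG. 7 above.
In some embodiments, the method shown in the embodiments of this application may be implemented as computer program instructions encoded in a machine-readable format on a computer-readable storage medium, or encoded on other non-transitory media or articles of manufacture.
In specific applications, the computing device may be a server or a personal computer (PC).
It should be understood that the other functional components of the computing device are not the core inventive point of this application and are therefore not described further here. The above are only preferred embodiments of this application. It should be noted that a person of ordinary skill in the art may make several improvements and refinements without departing from the principles of this application, and these improvements and refinements shall also be regarded as falling within the scope of protection of this application.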

Claims (10)

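  1. A fault detection device for a memory module, for use in a computing device, characterized in that the fault detection device includes a memory management controller and a latch, an input of the memory management controller is connected to the memory module, and an output of the memory management controller is connected to one input of the latch; the memory management controller is configured to detect correctable errors (CEs) of the memory module and determine the number of CEs of the memory module, and, when the number of CEs of the memory module exceeds a preset threshold, the memory management controller sends a latch signal to the latch.
  2. The fault detection device according to claim 1, characterized in that the memory management controller includes a counter and a processor, wherein the counter is configured to count the CEs, and the processor is configured to compare the number of CEs counted by the counter with the preset threshold and, when the number of CEs is greater than the preset threshold, to send the latch signal to the latch.
  3. The fault detection device according to claim 1, characterized in that the memory management controller includes a BIOS chip and a baseboard management controller (BMC), the BIOS chip being configured to obtain fault information of the memory module and determine the number of CEs according to the fault information of the memory module, wherein the fault information of the memory module includes the number of CEs;
    the BIOS chip is further configured to send the number of CEs to the BMC;
    the BMC is configured to receive the number of CEs sent by the BIOS chip, compare the number of CEs with the preset threshold, and, when the number of CEs is greater than the preset threshold, send the latch signal to the latch.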
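  4. The fault detection device according to any one of claims 1-3, characterized in that the fault detection device further includes a storage unit arranged in the memory module or in the memory management controller, the storage unit being configured to store and record the fault information of the memory module.
  5. The fault detection device according to any one of claims 1-4, characterized in that there are multiple memory modules and multiple latches in one-to-one correspondence with the multiple memory modules; when the number of CEs exceeds the preset threshold, the memory management controller determines the target faulty memory module that exceeds the threshold and sends the latch signal to the latch corresponding to the target faulty memory module; wherein the target faulty memory module is one or more of the multiple memory modules.
  6. The fault detection device according to any one of claims 1-5, characterized in that the fault detection device further includes a status indicator connected to the output of the latch;
    the status indicator is configured to receive the high level output by the latch and issue an alarm signal when the latch has received the latch signal sent by the memory management controller.
  7. The fault detection device according to claim 6, characterized in that the status indicator is an indicator light and/or a buzzer.
  8. The fault detection device according to any one of claims 1-7, characterized in that the memory module further includes a PCB circuit board, and the status indicator is arranged on the PCB circuit board or on a slot of the computing device, the slot being used to install the memory module.
  9. The fault detection device according to any one of claims 1-8, characterized in that the computing device is a server.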
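  10. A memory module fault detection method for use in a memory module fault detection device, the memory module fault detection device including a memory management controller, a latch, and a status indicator, an input of the memory management controller being connected to the memory module, an output of the memory management controller being connected to an input of a latch, and an output of the latch being connected to a status indicator, characterized in that the method includes the following steps:
    S11: the memory fault detection device obtains fault information of the memory module and, according to the fault information of the memory module, determines the number of correctable errors (CEs) of the memory module;
    S12: when the number of CEs of the memory module is greater than a preset threshold, the memory fault detection device determines the faulty memory module that exceeds the threshold and sends a latch signal to the latch corresponding to the faulty memory module; the latch signal is used to cause the status indicator to issue an alarm signal.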
PCT/CN2023/116850 2022-10-18 2023-09-04 Memory module fault detection device and detection method WO2024082844A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211274554.1A CN115480947A (zh) 2022-10-18 2022-10-18 Memory module fault detection device and detection method
CN202211274554.1 2022-10-18

Publications (1)

Publication Number Publication Date
WO2024082844A1 true WO2024082844A1 (zh) 2024-04-25

Family

ID=84395451

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/116850 WO2024082844A1 (zh) 2022-10-18 2023-09-04 Memory module fault detection device and detection method

Country Status (2)

Country Link
CN (1) CN115480947A (zh)
WO (1) WO2024082844A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115480947A (zh) * 2022-10-18 2022-12-16 超聚变数字技术有限公司 Memory module fault detection device and detection method
CN117076186B (zh) * 2023-10-17 2024-02-09 苏州元脑智能科技有限公司 Memory fault detection method, system, apparatus, medium, and server

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6075929A (en) * 1996-06-05 2000-06-13 Compaq Computer Corporation Prefetching data in response to a read transaction for which the requesting device relinquishes control of the data bus while awaiting data requested in the transaction
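CN102411532A (zh) * 2011-12-31 2012-04-11 曙光信息产业股份有限公司 Computer fault prompting method and apparatus, and computer
CN113535509A (zh) * 2021-06-10 2021-10-22 中国长城科技集团股份有限公司 Memory module abnormality detection method, apparatus, and BMC
CN114064333A (zh) * 2020-08-05 2022-02-18 华为技术有限公司 Memory fault handling method and apparatus
CN114090316A (zh) * 2021-11-15 2022-02-25 北京字节跳动网络技术有限公司 Memory fault handling method and apparatus, storage medium, and electronic device
CN115480947A (zh) * 2022-10-18 2022-12-16 超聚变数字技术有限公司 Memory module fault detection device and detection method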

Also Published As

Publication number Publication date
CN115480947A (zh) 2022-12-16


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23878839

Country of ref document: EP

Kind code of ref document: A1