CN115480947A - Memory bank fault detection device and detection method - Google Patents

Memory bank fault detection device and detection method

Publication number
CN115480947A
Authority
CN
China
Prior art keywords
memory
memory bank
latch
management controller
fault
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211274554.1A
Other languages
Chinese (zh)
Inventor
王为
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
XFusion Digital Technologies Co Ltd
Original Assignee
XFusion Digital Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by XFusion Digital Technologies Co Ltd filed Critical XFusion Digital Technologies Co Ltd
Priority to CN202211274554.1A priority Critical patent/CN115480947A/en
Publication of CN115480947A publication Critical patent/CN115480947A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation, the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/073 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation, the processing taking place on a specific hardware platform or in a specific software environment, in a memory management context, e.g. virtual memory or cache management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793 Remedial or corrective actions
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C29/00 Checking stores for correct operation; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/04 Detection or location of defective memory elements, e.g. cell constructional details, timing of test signals
    • G11C29/08 Functional testing, e.g. testing during refresh, power-on self testing [POST] or distributed testing
    • G11C29/12 Built-in arrangements for testing, e.g. built-in self testing [BIST] or interconnection details
    • G11C29/44 Indication or identification of errors, e.g. for repair

Abstract

The application discloses a fault detection device and method for a memory bank, and relates to the field of memory fault detection of computing equipment, so that the failed memory bank can be determined in time when the memory fails, and the accuracy and timeliness of memory fault detection are improved. The fault detection device comprises a memory management controller and a latch, wherein the input end of the memory management controller is connected with the memory bank, and the output end of the memory management controller is connected with one input end of the latch; the memory management controller is configured to detect correctable errors CE of a memory bank and determine the number of correctable errors CE of the memory bank, and send a latch signal to the latch when the number of correctable errors CE of the memory bank exceeds a preset threshold.

Description

Memory bank fault detection device and detection method
Technical Field
The present application relates to the field of memory failure detection of computing devices, and in particular, to a failure detection method for a memory bank of a computing device.
Background
A memory bank, that is, a Random Access Memory (RAM) module, is internal memory that exchanges data directly with the processor (CPU) and typically serves as temporary data storage for the operating system or other running programs. In operation, a memory bank can be written to or read from at any designated address at any time, and it is among the most heavily used and most valuable devices in computing equipment. As memory architecture has evolved, however, process nodes have kept shrinking and operating frequencies have kept rising in order to fit more storage cells into the same silicon area and to read and write faster, while operating voltages have been lowered continuously to control heat generation. As a result, the reliability of the memory medium keeps decreasing and its failure rate keeps increasing, so that memory faults are now one of the most common fault sources in the field of computing devices. A memory fault can cause the computing device to restart abnormally or, in serious cases, to crash. Monitoring the health state of the memory bank is therefore particularly important: it allows at-risk memory to be identified and handled in advance, which reduces the risk of the computing device system crashing.
The memory bank is a passive device and cannot judge its own health state. At present, identifying the health state of a memory bank relies on the log records of the out-of-band management system (BMC). This approach is not intuitive enough: on the one hand, operation and maintenance personnel may not discover a memory fault in time; on the other hand, the wrong memory bank may be replaced during operation and maintenance.
Memory faults can be classified into Correctable Errors (CE) and UnCorrectable Errors (UCE). A CE can be corrected by the CPU and does not affect the normal operation of the system; however, when the number of CEs of a memory bank reaches a certain value, the memory bank is considered to be at risk of producing a UCE. A UCE is a fault that the processor cannot correct, and it may cause an abnormal restart or downtime of the server. How to detect the correctable errors CE of the memory timely and accurately has therefore become a problem that urgently needs to be solved.
Disclosure of Invention
The embodiment of the application provides a fault detection device and method for a memory bank, which improve the timeliness and accuracy of fault detection of the memory bank, are favorable for timely determining and replacing the failed memory bank, and reduce the risk of downtime of computing equipment.
In order to achieve the purpose, the following technical scheme is adopted in the application:
In a first aspect, a fault detection apparatus for a memory bank is provided, where the fault detection apparatus includes a memory management controller and a latch, an input end of the memory management controller is connected to the memory bank, and an output end of the memory management controller is connected to one input end of the latch; the memory management controller is used for detecting the correctable errors CE of the memory bank and determining the number of the correctable errors CE of the memory bank, and under the condition that the number of the correctable errors CE of the memory bank exceeds a preset threshold value, the memory management controller sends a latch signal to the latch.
In this technical solution, the memory bank is checked by a fault detection device that contains a memory management controller and a latch. The memory management controller acquires the fault information of the memory bank and determines from it the number of correctable errors CE. When that number exceeds a preset threshold, the memory management controller sends a latch signal to the latch. The latch signal indicates that the number of correctable errors CE in the memory bank has reached a critical value and that the memory bank needs to be repaired or replaced in time. Because of the latch signal, CE errors in the memory bank are discovered in time, which improves the timeliness of memory bank fault detection and reduces the risk of the computing device crashing.
In a possible implementation manner, the memory management controller further includes a counter and a processor, where the counter is configured to count the number of correctable errors CE, and the processor is configured to compare the number of correctable errors CE counted by the counter with a preset threshold, and send a latch signal to the latch when the number of CEs is greater than the preset threshold.
In this possible implementation manner, the memory management controller further includes a counter and a processor, the counter counts the number of correctable errors CE in the memory bank, and the processor compares the number with a preset threshold value, so that the count of the number of correctable errors CE in the memory bank is more accurate.
In a possible implementation manner, the memory management controller includes a BIOS chip and a baseboard management controller BMC. The BIOS chip acquires the fault information of the memory bank, which includes the number of correctable errors CE, determines the number of correctable errors CE from it, and sends that number to the baseboard management controller BMC. The baseboard management controller BMC receives the number of correctable errors CE sent by the BIOS chip, compares it with the preset threshold, and sends a latch signal to the latch when the number of CEs is greater than the preset threshold.
In the possible implementation manner, another configuration manner of the memory management controller is further provided, that is, the BIOS chip and the BMC of the computing device itself may be used to implement the function of the memory management controller, so that the cost of the memory bank fault detection apparatus is further reduced.
In a possible implementation manner, the fault detection apparatus further includes a memory, where the memory is disposed in the memory bank or the memory management controller, and the memory is used for storing and recording fault information of the memory bank.
In this possible implementation manner, a memory for storing the failure information of the memory bank is further disposed in the memory bank or the memory management controller, so that the failure information of the memory bank is not easily lost, and the maintenance of the memory bank is facilitated.
In a possible implementation manner, there are a plurality of memory banks and a plurality of latches, and the latches correspond to the memory banks one to one. Under the condition that the number of correctable errors CE exceeds a preset threshold value, the memory management controller determines a target failure memory bank exceeding the threshold value and sends a latch signal to the latch corresponding to the target failure memory bank; wherein the target failing memory bank is one or more of the plurality of memory banks.
In this possible implementation manner, when a plurality of memory banks are provided, a plurality of latches are provided and each latch corresponds to one memory bank, so that the memory management controller can perform fault detection on all of the memory banks. When any one memory bank has a fault, the memory management controller can send a latch signal to the latch corresponding to that memory bank.
In a possible implementation manner, the fault detection apparatus further includes a status indicator, and the status indicator is connected to the output end of the latch; the status indicator is configured to receive the high level output by the latch and emit an alarm signal when the latch receives a latch signal sent by the memory management controller.
In this possible implementation, the fault display of the memory bank is made more intuitive by providing the status indicator in the fault detection device.
In one possible implementation, the status indicator is an indicator light and/or a buzzer.
In this possible implementation, the state indicator is further set as an indicator light or a buzzer, so that the fault information of the memory bank is represented by a light signal or a sound signal, and the intuitiveness and the effectiveness of fault display of the memory bank are further improved.
In a possible implementation manner, the memory bank further includes a PCB circuit board, and the status indicator is disposed on the PCB circuit board or on a slot of the computing device, where the slot is used for installing the memory bank.
In this possible implementation, the status indicator is disposed on the circuit board of the memory bank or on the slot in which the memory bank is mounted, which makes the status indicator easy to install and maintain.
In one possible implementation, the computing device is a server.
In the possible implementation mode, the fault detection device can be applied to the server, so that the timeliness and the accuracy of the fault detection of the server memory are improved, and the downtime risk of the server is reduced.
In a second aspect, an embodiment of the present application provides a method for detecting a failure of a memory bank, where the method is used in a memory bank failure detection device, where the memory bank failure detection device includes a memory management controller, a latch, and a status indicator, an input end of the memory management controller is connected to the memory bank, an output end of the memory management controller is connected to an input end of the latch, and an output end of the latch is connected to the status indicator, where the method includes the following steps:
S11: the memory bank fault detection device acquires fault information of the memory bank, and determines the number of correctable errors CE of the memory bank according to the fault information of the memory bank;
S12: under the condition that the number of the correctable errors CE of the memory bank is larger than a preset threshold value, the memory bank fault detection device determines the fault memory bank exceeding the threshold value and sends a latch signal to the latch corresponding to the fault memory bank; the latch signal is used to cause the status indicator to emit an alarm signal.
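Steps S11 and S12 can be sketched as follows. This is only a minimal illustration in Python under stated assumptions, not the claimed implementation: the record format (a slot name paired with a fault type), the function name `detect_faulty_banks`, the `Latch` class, and the sample threshold are all hypothetical.

```python
from collections import Counter


class Latch:
    """Hypothetical latch model: once it receives a latch signal, its output stays high."""

    def __init__(self):
        self.output_high = False

    def receive_latch_signal(self):
        self.output_high = True


def detect_faulty_banks(fault_info, latches, threshold):
    """S11: count CEs per memory bank. S12: latch every bank whose count exceeds the threshold."""
    # S11: only correctable errors (CE) are counted; UCE records are ignored here.
    ce_counts = Counter(slot for slot, fault_type in fault_info if fault_type == "CE")
    # S12: find the banks over the threshold and signal their latches.
    faulty = sorted(slot for slot, count in ce_counts.items() if count > threshold)
    for slot in faulty:
        latches[slot].receive_latch_signal()
    return faulty


# Assumed sample data: DIMM1 reports 4 CEs, DIMM0 reports 1 CE and 1 UCE.
fault_info = [("DIMM1", "CE")] * 4 + [("DIMM0", "CE"), ("DIMM0", "UCE")]
latches = {"DIMM0": Latch(), "DIMM1": Latch()}
faulty = detect_faulty_banks(fault_info, latches, threshold=3)
```

With this sample data only the latch for `DIMM1` is set, so only its status indicator would raise an alarm.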
In this technical solution, the memory bank fault detection method acquires the fault information of the memory bank and determines from it the number of correctable errors CE. When that number exceeds a preset threshold, a latch signal is sent to the latch. The latch signal indicates that the number of correctable errors CE in the memory bank has reached a critical value and that the memory bank needs to be repaired or replaced in time. Because of the latch signal, CE errors in the memory bank are discovered in time, which improves the timeliness of memory bank fault detection and reduces the risk of downtime of the computing device.
In a third aspect, the present application provides a computer-readable storage medium, where at least one computer program is stored, where the computer program is loaded and executed by a processor to implement the memory bank fault detection method according to the second aspect.
In a fourth aspect, embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the terminal reads the computer instructions from the computer-readable storage medium and executes them, so that the terminal executes the memory bank fault detection method provided in the various optional implementation manners of the second aspect.
Drawings
In order to more clearly illustrate the embodiments or prior art solutions of the present invention, the drawings that are needed in the description of the embodiments or prior art will be briefly described below.
FIG. 1 is a schematic diagram of memory fault detection in the related art;
FIG. 2 is a schematic diagram of a memory bank fault detection apparatus according to an embodiment of the present application;
FIG. 3 is a schematic diagram of another memory bank fault detection apparatus according to an embodiment of the present application;
FIG. 4 is a connection diagram for fault detection of a plurality of memory banks according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a detection circuit according to an embodiment of the present application;
FIG. 6 is a schematic view of a status indicator installation according to an embodiment of the present application;
FIG. 7 is a flowchart of a memory bank fault detection method according to an embodiment of the present application.
Detailed Description
The terminology used in the description of the embodiments section of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application, which will be described in detail below with reference to the accompanying drawings.
First, terms referred to in the present application will be described.
CPU: the central processing unit is the operation and control core of the computing device, and is the final execution unit for information processing and program operation.
Register: registers are high-speed storage elements of limited capacity inside the CPU that are used to temporarily store instructions, data, and addresses.
BIOS: the Basic Input Output System (BIOS) is firmware that performs hardware initialization during the power-on startup phase and provides runtime services for the operating system. The BIOS is stored in a flash memory chip, which makes the BIOS convenient to update.
BMC: a Baseboard Management Controller (BMC) can upgrade the firmware of a computing device, manage the operating state of the device, and troubleshoot faults even when the computing device is not powered on. The baseboard management controller can perform maintenance, including upgrades or restores, on program code in memory within the computing device. The baseboard management controller may also control power circuits, clock circuits, and the like within the computing device.
Row fault: a correctable error (CE) fault or an uncorrectable error (UCE) fault occurring in a row (Row) of the memory. The physical granularities of memory, from largest to smallest, are: Dimm, Rank, Device, Bank, Row/Column, Cell, Bit. They relate to one another as follows. Each computing device may include a plurality of memory banks (Dimm), and each memory bank has two memory ranks (Rank) located on its two sides, for example rank 0 and rank 1. Each memory chip (Device) may be a Dynamic Random Access Memory (DRAM), a Static Random Access Memory (SRAM), or the like, and may be divided into a plurality of memory arrays (Bank); when a memory chip stores data, the data is written into one memory array in units of bits (Bit). In addition, the memory arrays may be grouped into memory array groups (Bank group), where the number of memory arrays in each group may be the same or different. A memory array is composed of a large number of memory cells (Cell) arranged as a two-dimensional matrix, so that specifying a row (Row) and a column (Column) locates one memory cell on the array. The minimum unit of a memory failure is the memory cell on the memory array. That is, memory failures include at least one of a Dimm fault, a Rank fault, a Device fault, a Bank fault, a Row fault/Column fault, a Cell fault, and a Bit fault.
Latch: a storage unit circuit that is sensitive to pulse level. It changes state under the action of a specific input pulse level and temporarily stores a signal so as to hold a certain level state.
In the embodiments of the present application, a server is taken as an example of the computing device. A conventional fault detection method for a server memory bank is described with reference to FIG. 1. A processor 1 detects the CEs of a memory bank. Specifically, the processor 1 records the corresponding fault information (including the fault address and the Dimm, Rank, Device, Bank group, Bank, Row/Column, Cell, and Bit of the fault) in a register 2 of the processor, and the register 2 counts the number of memory CEs. When the number of memory CEs exceeds a threshold, the processor 1 triggers an interrupt. A BIOS chip 3 responds to the interrupt, collects in an interrupt service routine all memory-related fault information recorded in the register 2, and reports the collected fault information to the out-of-band management system BMC. The out-of-band management system BMC then displays, on its interface, a memory error alarm for the slot corresponding to the faulty memory bank.
However, the above method requires the operation and maintenance personnel to watch the interface of the BMC in real time, and it cannot display the error state of the memory bank intuitively. On the one hand, if the operation and maintenance personnel do not notice the BMC interface prompt for the faulty memory bank in time, the faulty memory bank cannot be replaced or otherwise handled in time, and the server system is at risk of downtime. On the other hand, because the operation and maintenance personnel cannot observe the error state of the memory bank directly, there is a certain risk of replacing the wrong memory bank.
Based on this, the embodiment of the application provides a device for detecting a memory bank fault. With this detection device, when a memory fault is detected, the status indicator in the detection device sends out an alarm signal, so that the fault state of the memory bank can be observed visually. Fig. 2 is a schematic diagram of an architecture of the memory bank fault detection apparatus provided in this embodiment.
Referring to fig. 2, a memory bank failure detection apparatus 10 according to an embodiment of the present invention includes a memory management controller 13, a latch 11, and a status indicator 12, wherein an input terminal of the memory management controller 13 is connected to a memory bank, an output terminal of the memory management controller is connected to an input terminal of the latch 11, and an output terminal of the latch 11 is connected to the status indicator 12.
The memory management controller 13 is configured to read the CE information of the memory bank and execute related actions.
The memory management controller 13 is configured to detect correctable errors (CE) of a memory bank. Specifically, because the central processing unit inside the server has a certain error correction capability for memory data, when the number of CEs in a memory bank is within a certain range, the data stored in the memory bank can be considered acceptable to the server, and the server is not at risk of downtime. However, when the number of CEs in the memory bank exceeds the preset threshold, the memory bank is considered likely to generate a UCE, and once a UCE is generated, the server restarts abnormally or goes down. Operation and maintenance personnel therefore need to find and replace memory banks at risk of generating a UCE in time. For example, when the storage capacity of the memory bank is 10 megabit, the preset threshold may be set to 6000. The preset threshold is a value set in advance in the memory management controller 13 according to the capacity of the specific memory bank.
When a CE fault in the memory is triggered by an access, the memory management controller 13 records the corresponding fault information (including the fault address, the fault type (CE or UCE), and the Dimm, Rank, Device, Bank group, Bank, Row/Column, Cell, and Bit of the fault) in a memory (not shown in the figure) in the memory management controller 13.
A memory may be provided in the memory management controller 13, the memory is used for storing and recording the fault information of the memory, and optionally, the memory may be any one of a register, a DRAM (dynamic random access memory), and an SRAM (static random access memory).
Optionally, the memory may also be disposed in the memory bank 20, and the memory management controller 13 is further configured to directly access the memory in the memory bank to obtain the corresponding memory failure information.
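The fault information described above could be modelled as one record per fault, kept in a simple log. The sketch below is a hypothetical illustration in Python: the class names `FaultRecord` and `FaultLog`, the field types, and the sample values are assumptions, and the application does not specify any storage layout.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class FaultRecord:
    """One fault entry; the fields mirror the items listed in the text (layout is assumed)."""
    address: int
    fault_type: str  # "CE" or "UCE"
    dimm: int
    rank: int
    device: int
    bank_group: int
    bank: int
    row: int
    column: int


@dataclass
class FaultLog:
    """Hypothetical stand-in for the memory that stores the fault information."""
    records: List[FaultRecord] = field(default_factory=list)

    def record(self, entry: FaultRecord) -> None:
        self.records.append(entry)

    def ce_count(self, dimm: int) -> int:
        """Number of correctable errors logged for one memory bank (Dimm)."""
        return sum(1 for r in self.records if r.dimm == dimm and r.fault_type == "CE")


# Assumed sample entries: two faults on Dimm 0 (one CE, one UCE) and one CE on Dimm 1.
log = FaultLog()
log.record(FaultRecord(0x1F00, "CE", dimm=0, rank=1, device=3, bank_group=0, bank=2, row=17, column=5))
log.record(FaultRecord(0x2A40, "UCE", dimm=0, rank=0, device=1, bank_group=1, bank=0, row=8, column=9))
log.record(FaultRecord(0x3B80, "CE", dimm=1, rank=0, device=0, bank_group=0, bank=1, row=4, column=2))
```

Keeping the Dimm identifier in every record is what would later allow the controller to look up which memory bank exceeded the threshold.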
The memory management controller 13 is further configured to count the number of CEs of the memory banks in the memory, compare the number of CEs of the memory banks with a preset threshold, and when the counted number of CEs of the memory banks is greater than the preset threshold, the memory management controller 13 sends a latch signal to the latch 11, so that the latch 11 continuously outputs a high level.
Specifically, the memory management controller 13 is provided with a counter and a processor (not shown in the figure), and the counter is used for counting the number of CEs of memory banks in the memory in response to a command of the processor.
The processor is used for sending a command to the counter so that the counter counts the number of CEs of the memory bank. The processor is also used for comparing the number of CEs counted by the counter with a preset threshold value and, when the counted number of CEs is larger than the preset threshold value, clearing the data counted by the counter and sending a latch signal to the latch 11.
It should be noted that the preset threshold may be preset according to the working environment and the memory capacity of the specific memory bank.
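The division of work between the counter and the processor can be illustrated with a short sketch. It is written in Python under stated assumptions: the class name `MemoryManagementController`, the callback `send_latch_signal`, and the threshold value are hypothetical, and real hardware would implement this in firmware or logic rather than in Python.

```python
class MemoryManagementController:
    """Hypothetical model: a CE counter plus the comparison performed by the processor."""

    def __init__(self, threshold, send_latch_signal):
        self.threshold = threshold                  # preset threshold
        self.send_latch_signal = send_latch_signal  # callable that drives the latch
        self.ce_count = 0                           # the counter

    def on_correctable_error(self):
        """Count one CE, then compare the count with the preset threshold."""
        self.ce_count += 1
        if self.ce_count > self.threshold:
            self.ce_count = 0         # clear the data counted by the counter
            self.send_latch_signal()  # send the latch signal to the latch


# Assumed usage: a threshold of 3, so the fourth CE triggers the latch signal.
signals = []
controller = MemoryManagementController(threshold=3, send_latch_signal=lambda: signals.append("latch"))
for _ in range(4):
    controller.on_correctable_error()
```

The comparison is strictly "greater than", matching the text, so a count equal to the threshold does not yet trigger the latch.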
In one embodiment, the memory management controller 13 may include: the BIOS chip 14 and the BMC4, referring to FIG. 3, the input end of the BIOS chip 14 is connected to the memory bank 20, the BIOS chip 14 is connected to the BMC4, the output end of the BMC4 is connected to the input end of the latch 11, and the output end of the latch 11 is connected to the status indicator 12. It should be noted that the BIOS chip 14 and the BMC4 may be the BIOS chip and the BMC of the computing device itself, which may save cost.
The BIOS chip 14 is used for storing a BIOS program, and the BIOS program runs on the BIOS chip 14; the BIOS chip 14 is further configured to obtain failure information in the memory bank 20, and count the number of CEs in the memory bank.
The memory may be disposed in the BIOS chip 14, and the memory is used for storing and recording the fault information of the memory. Alternatively, the memory may be any one of a register, a DRAM (dynamic random access memory), and an SRAM (static random access memory).
Optionally, the memory is disposed on the memory bank 20, and the BIOS chip 14 is further configured to directly access the memory in the memory bank to obtain corresponding memory failure information.
The BMC4 is configured to receive all memory failure information (including the number of CEs of memory banks) reported by the BIOS chip 14, compare the number of CEs of the memory banks counted by the BIOS chip 14 with a preset threshold, and when the number of CEs of the memory banks is greater than the preset threshold, the BMC4 sends an interrupt signal to the BIOS chip 14 and sends a latch signal to the latch 11, so that the latch 11 continuously outputs a high level.
The BIOS chip 14 is further configured to stop a statistical action on the number of CEs in the memory bank and clear the previous statistical data in response to an interrupt signal sent by the BMC 4.
It should be noted that the preset threshold may be specifically set according to the working environment and the memory capacity of the specific memory bank.
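The interaction between the BIOS chip 14 and the BMC4 in this embodiment can be sketched as a small exchange between two objects. The Python below is a hypothetical model: the class and method names (`BiosChip`, `Bmc`, `receive_report`, `on_interrupt`) and the threshold are assumptions made for illustration, not interfaces of any real BIOS or BMC firmware.

```python
class Latch:
    """Hypothetical latch: once signalled, the output stays high."""

    def __init__(self):
        self.output_high = False

    def receive_latch_signal(self):
        self.output_high = True


class BiosChip:
    """Counts CEs of the memory bank and reports the count to the BMC."""

    def __init__(self):
        self.ce_count = 0
        self.counting = True

    def on_correctable_error(self):
        if self.counting:
            self.ce_count += 1

    def report(self):
        return self.ce_count

    def on_interrupt(self):
        # Stop the statistical action and clear the previous statistics.
        self.counting = False
        self.ce_count = 0


class Bmc:
    """Compares the reported CE count with the preset threshold."""

    def __init__(self, threshold, latch):
        self.threshold = threshold
        self.latch = latch

    def receive_report(self, bios):
        if bios.report() > self.threshold:
            bios.on_interrupt()                # interrupt signal to the BIOS chip
            self.latch.receive_latch_signal()  # latch signal to the latch
            return True
        return False


# Assumed usage: 5 CEs against a threshold of 4.
latch = Latch()
bios = BiosChip()
bmc = Bmc(threshold=4, latch=latch)
for _ in range(5):
    bios.on_correctable_error()
alarm = bmc.receive_report(bios)
```

The text does not say when the BIOS chip resumes counting after the interrupt, so the sketch leaves counting stopped.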
In other embodiments, the memory management controller 13 may also be the BIOS chip 14 or the BMC4, and at this time, the BIOS chip 14 or the BMC4 separately performs the obtaining of the failure information of the memory bank, the counting of the CE number of the memory bank, and the comparison between the CE number and the preset threshold value, and sends the latch signal to the latch, and the execution principle is the same as that of the foregoing embodiments.
The latch 11 is configured to continuously output a high level after receiving a latch signal sent by the memory management controller 13, so that the status indicator 12 sends an alarm signal.
In an embodiment, referring to FIG. 4, a plurality of memory banks and a plurality of memories may be arranged inside the server, and each memory is used for storing the fault information of its corresponding memory bank. To facilitate monitoring the health states of the different memory banks, a plurality of latches may be provided in one-to-one correspondence with the plurality of memory banks, and the plurality of latches are connected to a plurality of status indicators. When the number of CE faults of any memory bank counted by the memory management controller 13 is greater than the preset threshold, the memory management controller 13 uses the stored memory fault information to find the faulty memory bank that exceeds the threshold and sends a latch signal to the latch corresponding to that memory bank.
Referring to fig. 5, which is a schematic diagram of the detection circuit, the two input terminals of the latch 11 are connected to VDD and to an output terminal of the memory management controller 13, respectively. The output terminal of the latch 11 is connected to one end of the status indicator 12, and the other end of the status indicator 12 is grounded (Vss). VDD is the supply voltage of the latch 11, and its value depends on the type of latch selected. Optionally, the latch 11 may be an R-S latch, a 74LS373, or the like; any device that can latch a level may be used, and the specific type of the latch 11 is not limited herein.
Specifically, in the normal operating state of the memory bank, the output terminal ALERT_n of the memory management controller 13 that is connected to the latch 11 holds a high level, and the latch 11 outputs an invalid signal. When the memory management controller 13 detects that the number of CEs of the memory bank exceeds the preset threshold, it determines that the memory bank is faulty and drives ALERT_n low. After receiving the low-level signal, the latch 11 continuously outputs a high level, so that the status indicator operates and emits an alarm signal.
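The level behaviour described above (an active-low ALERT_n input and a latch output that stays high once triggered) can be illustrated with a small behavioural model. The class name and the sampled trace are assumptions for the example. The source does not describe how the latch is reset, so the model has no reset path.

```python
class ActiveLowSetLatch:
    """Behavioural model of latch 11 driven by an active-low ALERT_n signal."""

    def __init__(self) -> None:
        self.q = False  # False = invalid (low) output, True = high output

    def sample(self, alert_n: bool) -> bool:
        # A low ALERT_n sets the latch; a later high ALERT_n does not clear it.
        if not alert_n:
            self.q = True
        return self.q


latch = ActiveLowSetLatch()

# Hypothetical ALERT_n trace: high while healthy, one low pulse when the
# threshold is exceeded, then high again.
alert_n_trace = [True, True, False, True, True]
outputs = [latch.sample(level) for level in alert_n_trace]
```

The point of the latch is visible in the last two samples: the output remains high even though ALERT_n has returned high, so the status indicator keeps signalling.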
Optionally, the status indicator 12 may be an indicator light, an LCD display, a nixie tube, a buzzer, or any other device with a warning function, and is not limited herein.
Referring to fig. 6, which is a schematic diagram of the status indicator installed on the memory bank, the memory bank 20 is connected to the memory management controller 13 and is monitored by it. The memory bank 20 includes memory chips 16, a PCB 17, and pins 18.
The memory chips 16 are used for storing data, the pins 18 are used for providing connection between each memory chip 16 and the memory management controller 13 and other components on the server, and the pins 18 are also used for fixing the memory bank 20 inside the server. The PCB 17 is used to connect the memory chip 16 with the memory management controller 13 and other components of the server, and the PCB 17 is also used to support and fix the memory chip 16.
In an embodiment of the present invention, the status indicator 12 is disposed on the memory bank 20. On the one hand, this makes the fault information of the memory bank easy to observe visually; on the other hand, because the status indicator 12 sits directly on the PCB 17 of the memory bank, the power supply on the PCB 17 can power the status indicator 12 directly.
Alternatively, the status indicator 12 may be disposed on a board (not shown) connected to the memory bank 20, as long as the location of the failed memory bank can be conveniently found, and the installation location of the status indicator 12 is not particularly limited herein.
With the memory bank detection device 10 provided in the above embodiment, when the memory management controller 13 detects a memory bank fault, the alarm signal emitted by the status indicator lets the specific faulty memory bank be identified visually and handled in time, which avoids the risk of replacing the wrong memory bank and reduces the risk of server system downtime.
An embodiment of the present invention further provides a memory bank fault detection method, which is applied to the fault detection device provided in the device embodiment. As shown in fig. 7, the method includes:
S11: the memory bank fault detection device acquires the fault information of the memory bank and determines the number of correctable errors (CEs) of the memory bank according to that fault information;
S12: when the number of CEs of the memory bank is greater than a preset threshold, the memory bank fault detection device determines the faulty memory bank that exceeded the threshold and sends a latch signal to the latch corresponding to the faulty memory bank; the latch signal is used to cause the status indicator to emit an alarm signal.
In one embodiment, the memory bank fault detection device includes a memory management controller, a latch, and a status indicator, wherein an input terminal of the memory management controller is connected to the memory bank, an output terminal of the memory management controller is connected to an input terminal of the latch, an output terminal of the latch is connected to the status indicator, and the memory management controller executes steps S11 to S12.
In one implementation, the memory management controller may be selected from any one of a processor, a BIOS chip, a DSP chip, and a CPLD.
In one implementation, the memory management controller may further include a counter and a processor.
In one implementation, the memory management controller may further include a BIOS chip and a baseboard management controller BMC.
The memory bank fault detection apparatus described in the foregoing embodiment can be widely applied to different computing scenarios, including but not limited to supercomputers, high-performance computing (HPC), and dense computing servers.
In an exemplary embodiment, a computer readable storage medium is also provided, which stores at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by a processor to implement all or part of the steps of the above-mentioned memory bank fault detection method. For example, the computer-readable storage medium may be a read-only memory (ROM), a Random Access Memory (RAM), a compact disc-read-only memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program product or a computer program is also provided, which comprises computer instructions stored in a computer-readable storage medium. A processor of the computing device reads the computer instructions from the computer-readable storage medium and executes them, causing the computing device to perform all or part of the steps of the method of the embodiment shown in fig. 7 above.
In some embodiments, the methods illustrated in the embodiments of the present application may be implemented as computer program instructions encoded on a computer-readable storage medium in a machine-readable format or encoded on other non-transitory media or articles of manufacture.
In a specific application, the computing device may be a server or a Personal Computer (PC).
It should be understood that the other functions of the corresponding computing device are not core to the invention of the present application and are not described in detail herein. The above is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make a number of modifications and refinements without departing from the principle of the present invention, and these modifications and refinements should also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A memory bank fault detection device for use in a computing device, characterized in that the fault detection device comprises a memory management controller and a latch, wherein an input end of the memory management controller is connected to the memory bank, and an output end of the memory management controller is connected to one input end of the latch; the memory management controller is configured to detect correctable errors (CEs) of the memory bank, determine the number of CEs of the memory bank, and send a latch signal to the latch when the number of CEs of the memory bank exceeds a preset threshold.
2. The apparatus according to claim 1, wherein the memory management controller comprises a counter and a processor, wherein the counter is configured to count the number of correctable errors CE, and the processor is configured to compare the number of correctable errors CE counted by the counter with the preset threshold, and send a latch signal to the latch if the number of CEs is greater than the preset threshold.
3. The apparatus according to claim 1, wherein the memory management controller includes a BIOS chip and a baseboard management controller (BMC); the BIOS chip is configured to obtain fault information of the memory bank and determine the number of correctable errors (CEs) according to the fault information of the memory bank, where the fault information of the memory bank includes the number of CEs;
the BIOS chip is further configured to send the number of CEs to the BMC;
the BMC is configured to receive the number of CEs sent by the BIOS chip, compare the number of CEs with the preset threshold, and send a latch signal to the latch when the number of CEs is greater than the preset threshold.
4. The failure detection device according to any one of claims 1 to 3, further comprising a memory, wherein the memory is disposed in the memory bank or the memory management controller, and the memory is configured to store and record failure information of the memory bank.
5. The apparatus according to any of claims 1-4, wherein there are a plurality of memory banks and a plurality of latches, the latches corresponding one-to-one to the memory banks; when the number of correctable errors (CEs) exceeds the preset threshold, the memory management controller determines a target faulty memory bank that exceeded the threshold and sends a latch signal to the latch corresponding to the target faulty memory bank; wherein the target faulty memory bank is one or more of the plurality of memory banks.
6. The fault detection device according to any one of claims 1-5, further comprising a status indicator connected to an output of the latch;
the status indicator is configured to receive the high level output by the latch and emit an alarm signal when the latch has received the latch signal sent by the memory management controller.
7. The fault detection device according to claim 6, wherein the status indicator is an indicator light and/or a buzzer.
8. The device of any of claims 1-7, wherein the memory bank further comprises a PCB, the status indicator being disposed on the PCB or on a slot of the computing device, the slot being configured to mount the memory bank.
9. The fault detection device of any one of claims 1-8, wherein the computing device is a server.
10. A memory bank fault detection method for use in a memory bank fault detection device, wherein the memory bank fault detection device comprises a memory management controller, a latch, and a status indicator, the input end of the memory management controller is connected to the memory bank, the output end of the memory management controller is connected to the input end of the latch, and the output end of the latch is connected to the status indicator, the method being characterized by comprising the following steps:
S11: the memory bank fault detection device acquires the fault information of the memory bank and determines the number of correctable errors (CEs) of the memory bank according to that fault information;
S12: when the number of CEs of the memory bank is greater than a preset threshold, the memory bank fault detection device determines the faulty memory bank that exceeded the threshold and sends a latch signal to the latch corresponding to the faulty memory bank; the latch signal is used to cause the status indicator to emit an alarm signal.
CN202211274554.1A 2022-10-18 2022-10-18 Memory bank fault detection device and detection method Pending CN115480947A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211274554.1A CN115480947A (en) 2022-10-18 2022-10-18 Memory bank fault detection device and detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211274554.1A CN115480947A (en) 2022-10-18 2022-10-18 Memory bank fault detection device and detection method

Publications (1)

Publication Number Publication Date
CN115480947A true CN115480947A (en) 2022-12-16

Family

ID=84395451

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211274554.1A Pending CN115480947A (en) 2022-10-18 2022-10-18 Memory bank fault detection device and detection method

Country Status (1)

Country Link
CN (1) CN115480947A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117076186A (en) * 2023-10-17 2023-11-17 苏州元脑智能科技有限公司 Memory fault detection method, system, device, medium and server
CN117076186B (en) * 2023-10-17 2024-02-09 苏州元脑智能科技有限公司 Memory fault detection method, system, device, medium and server

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination