CN115525467A - Memory fault detection method, device and medium - Google Patents

Publication number
CN115525467A
Authority
CN
China
Prior art keywords
memory
fault
information
reason
failure
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211296522.1A
Other languages
Chinese (zh)
Inventor
李文佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd filed Critical Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202211296522.1A priority Critical patent/CN115525467A/en
Publication of CN115525467A publication Critical patent/CN115525467A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0766 Error or fault reporting or storing

Abstract

The application discloses a memory fault detection method, device and medium, relating to the technical field of servers. The method locates a faulty memory and detects the fault cause, addressing the problem that existing ways of determining the fault cause are cumbersome and hard to carry out. From the memory information, the operating condition of each memory can be determined from the data at specific positions, and the fault cause can thus be determined. Fault information specific to each memory is then returned to inform operation and maintenance personnel of the memory's fault condition. The personnel no longer need to determine the fault cause by manually retrieving the BIOS log, which greatly simplifies memory fault detection, lowers the skill required of the personnel, and eases practical implementation.

Description

Memory fault detection method, device and medium
Technical Field
The present application relates to the field of server technologies, and in particular, to a method, an apparatus, and a medium for detecting a memory fault.
Background
A memory, also called internal memory, temporarily stores the operation data of the Central Processing Unit (CPU) and the data exchanged with external storage such as a hard disk, and is one of the important components of a server. The memory also determines whether the server can operate stably: if the memory fails, the server may boot with no display output, crash, or show other problems. Therefore, when a memory fails, quickly finding the fault cause and locating the faulty memory is especially important for maintenance personnel to resolve the fault in time and keep the server running normally.
At present, a light-emitting diode (LED) is disposed near each memory card slot on the server motherboard to indicate whether the memory in that slot has a fault. When the Operating System (OS) of the server detects that a memory is unavailable, it reports an error and keeps the LED at the corresponding slot steadily lit to locate the fault. However, this approach cannot distinguish the specific fault cause. To determine the cause, the server is currently powered off, the CMOS is cleared with a jumper cap, and the server is powered on again and booted into the Basic Input Output System (BIOS) setup mode, that is, BIOS Setup, where operation and maintenance personnel inspect the memory information to locate the faulty memory and determine the fault cause.
CMOS: an abbreviation of Complementary Metal Oxide Semiconductor. The term refers to a technology for manufacturing large-scale integrated circuit chips, or to chips manufactured with that technology; here it denotes a readable and writable Random Access Memory (RAM) chip on the computer motherboard. Because it is readable and writable, it is used on the motherboard to store the data produced when the BIOS configures the computer's hardware parameters.
Therefore, a need exists in the art for a method for detecting a memory fault, which solves the problem that the method for determining the cause of the fault is complicated and is not easy to implement when detecting the memory fault.
Disclosure of Invention
The present application aims to provide a memory fault detection method, device and medium thereof, so as to solve the problem that the determination method for the fault reason is complicated and is not easy to implement when the memory fault detection is performed at present.
In order to solve the above technical problem, the present application provides a memory fault detection method, applied to a server in which a baseboard management controller (BMC) is communicatively connected to the BIOS, the method including:
obtaining the memory information produced by the BIOS self-check;
locating the faulty memory according to the memory information and determining the fault cause;
and returning corresponding fault information according to the faulty memory and the fault cause.
Preferably, the fault causes include memory damage, memory not in place, and memory training failure.
Preferably, a memory fault lamp connected to the BMC is disposed at each memory card slot;
correspondingly, returning corresponding fault information according to the faulty memory and the fault cause includes:
if the fault cause is memory damage, controlling the memory fault lamp corresponding to the faulty memory to flash in a first flashing mode;
if the fault cause is memory not in place, controlling the memory fault lamp corresponding to the faulty memory to flash in a second flashing mode;
if the fault cause is memory training failure, controlling the memory fault lamp corresponding to the faulty memory to flash in a third flashing mode; wherein the first, second, and third flashing modes differ from one another.
Preferably, the server includes a fault cause indicator lamp connected to the BMC and a fault location lamp disposed at each memory card slot, wherein the fault cause indicator lamp has at least three different colors, each corresponding to a different fault cause;
correspondingly, returning corresponding fault information according to the faulty memory and the fault cause includes:
controlling the fault location lamp corresponding to the faulty memory to be steadily lit, and controlling the fault cause indicator lamp of the color corresponding to the fault cause to be steadily lit.
Preferably, memory fault lamps connected to the BMC are arranged in groups, one group at each memory card slot, and the lamps in each group have at least three different colors, each corresponding to a different fault cause;
correspondingly, returning corresponding fault information according to the faulty memory and the fault cause includes:
controlling the memory fault lamp whose color corresponds to the fault cause, within the memory fault lamp group corresponding to the faulty memory, to be steadily lit.
Preferably, the method further includes:
when a faulty memory is determined to exist, sending the corresponding memory information to a storage device for storage.
Preferably, sending the corresponding memory information to a storage device for storage includes:
if the fault cause of the faulty memory is memory damage, storing the memory information in a first storage;
if the fault cause of the faulty memory is memory not in place, storing the memory information in a second storage;
and if the fault cause of the faulty memory is memory training failure, storing the memory information in a third storage.
In order to solve the above technical problem, the present application further provides a memory fault detection device, including:
an information acquisition module, configured to obtain the memory information produced by the BIOS self-check;
a fault determination module, configured to locate the faulty memory according to the memory information and determine the fault cause;
and a fault return module, configured to return corresponding fault information according to the faulty memory and the fault cause.
Preferably, the device further includes:
a storage module, configured to send the corresponding memory information to a storage device for storage when a faulty memory is determined to exist.
In order to solve the above technical problem, the present application further provides a memory fault detection device, including:
a memory for storing a computer program;
a processor, configured to implement the steps of the above memory fault detection method when executing the computer program.
In order to solve the above technical problem, the present application further provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the steps of the memory fault detection method are implemented.
According to the memory fault detection method provided by the application, once the BMC of the server is communicatively connected to the BIOS, the memory information produced by the BIOS self-check can be obtained. From the memory information, the operating condition of each memory can be determined from the data at specific positions, and the fault cause can thus be determined. Fault information specific to each memory is then returned to inform operation and maintenance personnel of the memory's fault condition. The personnel no longer need to determine the fault cause by manually retrieving the BIOS log, which greatly simplifies memory fault detection, lowers the skill required of the personnel, and eases practical implementation.
The memory fault detection device and the computer readable storage medium provided by the application correspond to the method, and the effects are the same as those of the method.
Drawings
In order to more clearly illustrate the embodiments of the present application, the drawings required for the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained by those skilled in the art without inventive effort.
FIG. 1 is a flowchart of a memory fault detection method according to the present invention;
FIG. 2 is a block diagram of a memory failure detection circuit according to the present invention;
FIG. 3 is a flowchart of another memory fault detection method provided by the present invention;
FIG. 4 is a flow chart of a memory fault detection method incorporating the circuit of FIG. 2 according to the present invention;
FIG. 5 is a diagram illustrating a memory failure detection apparatus according to the present invention;
FIG. 6 is a structural diagram of another memory failure detection apparatus provided in the present invention.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without any creative effort belong to the protection scope of the present application.
The core of the application is to provide a memory fault detection method, a device and a medium thereof.
In order that those skilled in the art will better understand the disclosure, the following detailed description will be given with reference to the accompanying drawings.
In the practical application of the current server, the server is usually configured with a plurality of memories, when an operating system of the server detects that a memory is unavailable, an error message is returned, and further, an indicator lamp arranged at a memory card slot is controlled to be powered on to emit light. However, the operating system can only know which memory is unavailable, and cannot know the reason why the memory is unavailable, and at present, the specific reason for the memory failure is mostly determined by inquiring the memory information of the BIOS.
The BIOS is a set of programs fixed in a Read-Only Memory (ROM) chip on the computer motherboard; its main function is to provide the lowest-level, most direct hardware setup and control for the computer. It holds the computer's most important basic input/output routines, the power-on self-check program, and the system bootstrap program, and it can read and write system configuration from the CMOS; the configuration concerning the server memory is the memory information mentioned above. Operation and maintenance personnel can power off the server, clear the CMOS with a jumper cap, enter BIOS Setup, and locate the faulty memory and determine the fault cause by inspecting the memory information.
This procedure places relatively high demands on operation and maintenance personnel, who need some knowledge of jumper caps and the BIOS; the steps are cumbersome and inefficient, which hinders quick recovery of the server memory. The present application therefore provides a memory fault detection method, applied to a server in which a Baseboard Management Controller (BMC) is communicatively connected to the BIOS, including the following steps:
S11: Obtain the memory information produced by the BIOS self-check.
By its nature, the BIOS performs a self-check after the server is powered on. The memory-related portion of the boot information produced by the self-check, that is, the memory information, reflects the current memory condition of the server.
S12: Locate the faulty memory according to the memory information and determine the fault cause.
S13: Return corresponding fault information according to the faulty memory and the fault cause.
The fault information includes the location of the faulty memory and the fault cause, so that operation and maintenance personnel can locate the faulty memory and distinguish between different memory fault causes.
Further, the fault cause that can be determined according to the memory information includes, but is not limited to: memory corruption, memory not in place, and memory not trained.
It is easy to understand that determining the memory fault cause from the memory information in this way does not differ in essence from the current practice in which operation and maintenance personnel obtain the memory information and determine the cause manually: in both cases the determination relies on the data in specific data bits of the memory information. For example, when the memory is not in place, the corresponding data bits of the memory information contain no data, and when the memory fails training, the corresponding data bits show the corresponding abnormal information.
Similarly, the location of the faulty memory can be determined from the data bits of the memory information, using the correspondence between each memory and the abnormal data bits; this is not described further in this embodiment.
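As a minimal sketch of steps S12 and S13, the following Python assumes a purely hypothetical layout in which the memory information carries one status byte per memory card slot. The status codes, the names `FaultCause` and `locate_faults`, and the byte layout are illustrative assumptions; the source does not specify the actual data bits.

```python
from enum import Enum
from typing import Dict, List, Tuple


class FaultCause(Enum):
    DAMAGED = "memory damaged"
    NOT_IN_PLACE = "memory not in place"
    TRAINING_FAILED = "memory training failed"


# Hypothetical status codes (assumption, not taken from the source).
STATUS_OK = 0x01
STATUS_ABSENT = 0x00            # no data in the slot's data bits
STATUS_TRAINING_FAILED = 0x02
STATUS_DAMAGED = 0x03

_CAUSE_BY_STATUS: Dict[int, FaultCause] = {
    STATUS_ABSENT: FaultCause.NOT_IN_PLACE,
    STATUS_TRAINING_FAILED: FaultCause.TRAINING_FAILED,
    STATUS_DAMAGED: FaultCause.DAMAGED,
}


def locate_faults(memory_info: bytes) -> List[Tuple[int, FaultCause]]:
    """Return (slot index, cause) for every slot with a known fault status.

    The slot index doubles as the location information; status bytes that
    are not in the table (including STATUS_OK) are skipped in this sketch.
    """
    faults = []
    for slot, status in enumerate(memory_info):
        cause = _CAUSE_BY_STATUS.get(status)
        if cause is not None:
            faults.append((slot, cause))
    return faults
```

For instance, `locate_faults(bytes([0x01, 0x00, 0x02, 0x03]))` reports slot 1 as not in place, slot 2 as failing training, and slot 3 as damaged, while slot 0 is healthy.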
It should be noted that although three memory fault causes are given, the present application is not limited to detecting only these three. The above is merely a preferred embodiment, chosen because the three causes cover most memory fault scenarios; in practical applications, the fault cause may be determined in finer detail from the memory information acquired by the BMC.
It should be further noted that the method must be applied to a server in which the BMC is communicatively connected to the BIOS, because at present there is no communication connection between the BMC and the BIOS that is used directly for transmitting memory information. The Intelligent Platform Management Interface (IPMI) is a standard management specification commonly used by server system clusters. It can be used for interaction between the BMC and other components, for example the BIOS/Unified Extensible Firmware Interface (UEFI) and the CPU, to monitor physical health characteristics of the server such as temperature, voltage, fan operating state, and power state, and to enable autonomous management of the server system cluster. Therefore, the communication connection that carries the memory information from the BIOS to the BMC is preferably implemented through IPMI.
In addition, as for the fault information, the present application is not limited to the representation form of the fault information, and may be determined according to the corresponding implementation subject. For example, if the fault information is fed back to the operation and maintenance personnel through the LED lamp, the fault information should be a light signal or a control signal for controlling on and off of different LED lamps. Similarly, the fault information can also be a sound signal used by matching with a buzzer, and a text signal, an image signal and the like used by matching with a management platform, a mobile terminal and the like for unified management of the operation and maintenance personnel on the server.
According to the memory fault detection method provided by the application, the BMC can acquire the memory information of the BIOS through communication connection established in modes of IPMI and the like, and then positioning of a fault memory and determination of a memory fault reason are carried out according to the memory information.
Considering that existing servers already have a memory fault display lamp (an LED) at each memory card slot, light-signal prompts implemented with LEDs are readily accepted by those skilled in the art, make the faulty memory easy to locate, and are simple and inexpensive to implement. This embodiment therefore provides a possible implementation:
Memory fault lamps connected to the BMC are arranged in groups, one group at each memory card slot. The lamps in each group have at least three different colors, each corresponding to a different fault cause. In one possible application scenario, a group has three colors: a red LED for memory damage, a yellow LED for memory not in place, and a blue LED for memory training failure.
Correspondingly, the step of returning corresponding fault information according to the fault memory and the fault reason comprises the following steps:
and controlling the memory fault lamps of the color corresponding to the memory fault reason in the memory fault lamp group corresponding to the fault memory to be always on.
As can be seen from the above, this embodiment in effect increases the number of memory fault display lamps at each memory card slot. Each slot has three lamps of different colors (red, yellow, and blue), and the on/off state of these LEDs shows the fault cause directly to operation and maintenance personnel (red: memory damage; yellow: memory not in place; blue: memory training failure).
Therefore, in the application scenario of the present embodiment, the fault information is also different colors of light emitted by the different color LED lamps.
The mode that the LEDs with different colors are used for distinguishing different fault reasons of the memory has the advantages that the mode is visual and convenient, and operation and maintenance personnel can quickly locate the fault memory and determine the fault reasons only by checking the color and the position of the fault lamp of the memory which emits light.
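The per-slot lamp group can be sketched as a simple lookup from fault cause to color. The cause strings, the function name `lamp_group_states`, and the dictionary-based lamp model are assumptions made for illustration; real hardware would drive GPIO or similar outputs from the BMC.

```python
from typing import Dict, Tuple

# Color assignment taken from the example scenario in the text.
COLOR_BY_CAUSE: Dict[str, str] = {
    "damaged": "red",
    "not_in_place": "yellow",
    "training_failed": "blue",
}


def lamp_group_states(faults: Dict[int, str], slot_count: int) -> Dict[Tuple[int, str], bool]:
    """Map (slot, color) to True when that lamp should be steadily lit.

    `faults` maps a faulty slot index to its cause; every other lamp is off.
    """
    states = {
        (slot, color): False
        for slot in range(slot_count)
        for color in COLOR_BY_CAUSE.values()
    }
    for slot, cause in faults.items():
        states[(slot, COLOR_BY_CAUSE[cause])] = True
    return states
```

With four slots and a damaged memory in slot 2, only the lamp `(2, "red")` is lit, so both the position and the color carry information.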
However, the above approach requires some modification of the server: more memory fault lamps and the associated circuits must be added. Although fault location and cause display are very intuitive, the change to the server is relatively large. For this reason, this embodiment also provides another preferred scheme that uses the existing single memory fault lamp (the LED, generally red, that is disposed at each memory card slot and lights when the memory fails):
As shown in FIG. 2, the server includes a BIOS 21, a BMC 22, a fault location lamp module 23, and a fault cause indicator lamp module 24.
The BIOS 21 and the BMC 22 are communicatively connected through IPMI. The fault location lamp module 23 consists of the LEDs already disposed at each memory card slot in existing servers; it is used to locate the faulty memory, is connected to the BMC 22, and has its on/off state controlled by the BMC 22. The fault cause indicator lamp module 24 can be placed anywhere on the server that operation and maintenance personnel can conveniently see; it has at least three colors (red: LED-R, yellow: LED-Y, blue: LED-B), each corresponding to a different fault cause. The BMC 22 is connected to each switch of the fault cause indicator lamp module 24 and controls the switches so that the module lights the LED of the specified color to indicate the fault cause.
Based on the circuit structure shown in FIG. 2, after the BMC 22 determines the faulty memory and the fault cause from the memory information acquired from the BIOS 21, it controls the fault location lamp module 23 to light the LED at the faulty memory's card slot to locate the fault, and controls the fault cause indicator lamp module 24 to light the LED of the specified color (red, yellow, or blue) to indicate the fault cause of that memory.
Compared with the previous scheme, the preferred scheme of this embodiment requires smaller changes to the server and is easier to implement, while still showing operation and maintenance personnel the location and cause of the current memory fault intuitively and conveniently.
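The control logic for the circuit of FIG. 2 can be sketched as two independent outputs: which slot LEDs to light, and which cause LEDs to light. The function name `fig2_outputs` and the cause strings are hypothetical; the LED labels follow the text.

```python
from typing import Dict, Set, Tuple

# LED labels from the fault cause indicator lamp module in the text.
CAUSE_LED: Dict[str, str] = {
    "damaged": "LED-R",
    "not_in_place": "LED-Y",
    "training_failed": "LED-B",
}


def fig2_outputs(faults: Dict[int, str]) -> Tuple[Set[int], Set[str]]:
    """Return (slots whose location LED is lit, cause LEDs that are lit).

    `faults` maps each faulty slot index to its cause.
    """
    locator_on = set(faults)                              # module 23
    cause_on = {CAUSE_LED[c] for c in faults.values()}    # module 24
    return locator_on, cause_on
```

One consequence of sharing a single cause indicator, visible in this sketch, is that when two slots fail for different causes the outputs alone do not say which cause belongs to which slot; the source does not describe how that case is handled.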
Both preferred schemes above require some extension of the LEDs that existing servers use to display memory faults. This embodiment therefore further provides a preferred scheme in which the location of the faulty memory and the fault cause are both indicated using only the LED already disposed at each memory card slot in existing servers:
the step of returning corresponding fault information according to the fault memory and the fault reason comprises the following steps:
if the failure reason is that the memory is damaged, controlling a memory failure lamp corresponding to the failed memory to flash in a first flashing mode;
if the fault reason is that the memory is not in place, controlling a memory fault lamp corresponding to the fault memory to flash in a second flash mode;
if the fault reason is that the memory fails to pass the training, controlling a memory fault lamp corresponding to the fault memory to flash in a third flashing mode; wherein the first blinking manner, the second blinking manner, and the third blinking manner are different.
It is easy to see that a flashing mode can be any combination of LED flashing characteristics, such as the flashing frequency and the duration of each flash, so any number of distinct flashing modes can be obtained. In practical applications, three suitable and distinct flashing modes can be selected as the first, second, and third flashing modes according to actual needs. The more the three modes differ, the more accurately and quickly operation and maintenance personnel can identify the fault cause, and the less effort they need to memorize which flashing mode corresponds to which cause.
In addition, the fault memory is positioned by the position of the memory card slot where the flickering LED lamp is positioned, which is the same as the existing scheme.
The advantages of a preferred scheme provided by the embodiment are as follows: the structure of the existing server does not need to be changed, the determination of the fault reason can be realized by utilizing the original fault display lamp, the realization is simple, meanwhile, the method is more favorable for being directly applied to the server which is delivered from a factory or put into use, and the universality of the method is improved.
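A flashing mode can be modeled, as one possibility, by an on-duration and an off-duration. The timings below, the cause strings, and the function `led_is_on` are illustrative assumptions; the source only requires that the three modes differ.

```python
from typing import Dict, Tuple

# Hypothetical patterns: (on duration in ms, off duration in ms).
BLINK_PATTERN: Dict[str, Tuple[int, int]] = {
    "damaged": (100, 100),           # first mode: fast blink
    "not_in_place": (500, 500),      # second mode: slow blink
    "training_failed": (200, 1800),  # third mode: short flash every 2 s
}


def led_is_on(cause: str, elapsed_ms: int) -> bool:
    """Return whether the slot LED is lit `elapsed_ms` after blinking starts."""
    on_ms, off_ms = BLINK_PATTERN[cause]
    return elapsed_ms % (on_ms + off_ms) < on_ms
```

A periodic timer in the BMC firmware could call `led_is_on` for each faulty slot and set the LED accordingly. Choosing patterns that differ in both period and duty cycle makes them easier to tell apart by eye.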
In practical applications, after the memory fault detection method of the above embodiment has determined and returned a specific memory fault cause, situations may arise in which the operation and maintenance personnel are not near the server, or later need to analyze the overall working condition of the server over a period of time. For such scenarios, this embodiment provides a preferred implementation; as shown in FIG. 3, the method further includes:
S14: When a faulty memory is determined to exist, send the corresponding memory information to a storage device for storage.
As described in the above embodiment, the memory fault cause is determined mainly from the information in specific data bits of the memory information. Therefore, once a memory fault is determined, the information in those data bits can be copied and sent to the storage device, preferably together with information such as the location of the faulty memory and the fault time, so that operation and maintenance personnel can conveniently retrieve it later for server memory fault analysis.
Further, on the basis of the above example, this embodiment provides a preferred implementation in which step S14 specifically includes:
S14-a: If the fault cause of the faulty memory is memory damage, store the memory information in the first storage.
S14-b: If the fault cause of the faulty memory is memory not in place, store the memory information in the second storage.
S14-c: If the fault cause of the faulty memory is memory training failure, store the memory information in the third storage.
The first storage, the second storage and the third storage are different, namely, the memory information corresponding to different failure reasons is stored separately, and when operation and maintenance personnel need to perform failure analysis, the memory information of the specified type can be obtained quickly.
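Steps S14-a to S14-c amount to routing each record to a store selected by its fault cause. The sketch below uses in-memory lists as stand-ins for the three storages; the `FaultRecord` fields and the `FaultStores` class are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FaultRecord:
    slot: int          # location of the faulty memory
    cause: str         # "damaged", "not_in_place", or "training_failed"
    raw_bits: bytes    # copy of the relevant data bits of the memory information
    fault_time: float  # time of the fault, e.g. seconds since the epoch


class FaultStores:
    """Three separate stores, one per fault cause (stand-ins for storages 1 to 3)."""

    def __init__(self) -> None:
        self._stores: Dict[str, List[FaultRecord]] = {
            "damaged": [],          # first storage
            "not_in_place": [],     # second storage
            "training_failed": [],  # third storage
        }

    def save(self, record: FaultRecord) -> None:
        """Append the record to the store matching its cause."""
        self._stores[record.cause].append(record)

    def by_cause(self, cause: str) -> List[FaultRecord]:
        """Return a copy of all records stored for one cause."""
        return list(self._stores[cause])
```

Keeping the stores separate means a query for one fault type never has to scan records of the other types.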
In addition, LEDs can be arranged at the three storages, using the same form of indication as for the different memory fault causes, so that operation and maintenance personnel can quickly locate each storage.
Illustratively, a red LED lamp is disposed at the first storage, a yellow LED lamp is disposed at the second storage, and a blue LED lamp is disposed at the third storage. Or the LED lamp arranged at the first storage flickers in a first flickering mode, the LED lamp arranged at the second storage flickers in a second flickering mode, and the LED lamp arranged at the third storage flickers in a third flickering mode.
In addition, the preferred scheme of this embodiment retains the memory information of the faulty memory so that operation and maintenance personnel can retrieve it later for analysis.
Because retained information occupies storage space, data clearing logic can be set on the basis of the above embodiment to delete the stored memory information according to a certain rule and release storage resources in time.
It is easy to see that once operation and maintenance personnel have accessed the storage through an external device and retrieved the memory information, that information can be regarded as expired data. The clearing logic can then delete it to release storage space for fault memory information generated later, which keeps storage resources available and helps the storage serve its function over the long term.
In another possible implementation, memory information whose storage time exceeds a preset threshold may also be regarded as expired data and deleted to release the storage space of the storage.
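The two expiry rules (already retrieved, or stored longer than a threshold) can be combined in one clearing pass. The `StoredInfo` fields, the function `purge_expired`, and the use of seconds as the time unit are assumptions for this sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StoredInfo:
    slot: int
    stored_at: float        # time the record was stored, in seconds
    accessed: bool = False  # set once an external device has retrieved it


def purge_expired(records: List[StoredInfo], now: float, max_age_s: float) -> List[StoredInfo]:
    """Return only the records that are still valid.

    A record is treated as expired if it has already been retrieved or if
    it has been stored for longer than `max_age_s` seconds.
    """
    return [
        r for r in records
        if not r.accessed and (now - r.stored_at) <= max_age_s
    ]
```

The caller replaces its record list with the returned one, which releases the space held by expired entries.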
Further, in the above embodiment, when an external device accesses the storage device, the device that sends the access request is preferably authenticated. This protects the data and also avoids accidental deletion, since the clearing logic described above deletes memory information once it has been accessed.
This embodiment does not limit how identity verification is implemented. A trusted external device should hold a pre-assigned identity token and complete identity verification before accessing the storage device. In general, only a trusted device that is actually retrieving the memory information can pass verification, which keeps the stored memory information secure and prevents data from being deleted by mistake because some other device accessed it.
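Since the embodiment leaves the verification scheme open, the following is only one possible sketch: a pre-assigned token is compared in constant time before a record is handed out and marked as accessed. The device identifier, token value, and record layout are hypothetical.

```python
import hmac
from typing import Dict, Optional

# Hypothetical table of pre-assigned identity tokens, keyed by device identifier.
TRUSTED_TOKENS: Dict[str, str] = {"ops-laptop-01": "example-token-value"}


def is_trusted(device_id: str, presented_token: str) -> bool:
    """Check the presented token against the pre-assigned one in constant time."""
    expected = TRUSTED_TOKENS.get(device_id)
    if expected is None:
        return False
    return hmac.compare_digest(expected.encode(), presented_token.encode())


def read_record(device_id: str, presented_token: str, record: dict) -> Optional[dict]:
    """Return the record only to a verified device, and only then mark it as accessed."""
    if not is_trusted(device_id, presented_token):
        return None  # unverified access must not trigger the clearing logic
    record["accessed"] = True
    return record
```

Because the accessed flag is set only after verification succeeds, a request from an unverified device cannot cause the record to be treated as expired.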
In the preferred scheme of this embodiment, the memory information of the faulty memory is sent to a storage device and kept there, which makes it easier for operation and maintenance personnel to analyze memory faults and manage the server. The memory information is also stored separately according to fault cause, so that personnel can readily obtain the information they need. After the stored memory information has been accessed through an external device, the expired data can be cleared and the storage space released, which supports long-term use of the storage device. To protect the data, external devices that access the storage device are verified, and only verified devices can access the stored memory information, which ensures security and avoids accidental deletion.
To describe the memory fault detection method of the present application more clearly and specifically, an example is given below with reference to the circuit structure shown in Fig. 2. The method is shown in Fig. 4 and includes:
S41: The BIOS performs a self-test to acquire the memory information of the server.
S42: The BMC acquires the memory information from the BIOS through IPMI.
S43: The BMC locates the faulty memory according to the memory information and controls the fault location lamp module to light the LED corresponding to the faulty memory.
S44: The BMC determines from the memory information whether the memory is damaged; if so, it goes to step S45.
S45: The BMC controls the fault cause indicator lamp module to light the red lamp.
S46: The BMC determines from the memory information whether the memory is in place; if not, it goes to step S47.
S47: The BMC controls the fault cause indicator lamp module to light the yellow lamp.
S48: The BMC determines from the memory information whether the memory has passed training; if not, it goes to step S49.
S49: The BMC controls the fault cause indicator lamp module to light the blue lamp.
Here, the determination of the different causes of failure is parallel, so steps S44, S46 and S48 are also performed in parallel.
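The decision flow of steps S43 to S49 can be sketched as follows. This is an illustration under stated assumptions, not firmware: the `SlotInfo` fields stand in for whatever the BMC reads over IPMI, the thread pool merely models the parallel evaluation of S44, S46 and S48, and the returned mapping stands in for driving the lamp modules.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class SlotInfo:
    """Per-slot memory information (hypothetical fields)."""
    slot: str
    damaged: bool
    present: bool
    trained: bool


# Each check returns True when its fault cause applies; the key is the lamp color.
CHECKS: Dict[str, Callable[[SlotInfo], bool]] = {
    "red": lambda s: s.damaged,          # S44 -> S45
    "yellow": lambda s: not s.present,   # S46 -> S47
    "blue": lambda s: not s.trained,     # S48 -> S49
}


def cause_colors(info: SlotInfo) -> List[str]:
    """Evaluate the three cause checks concurrently and collect the colors to light."""
    with ThreadPoolExecutor(max_workers=len(CHECKS)) as pool:
        futures = {color: pool.submit(check, info) for color, check in CHECKS.items()}
    return [color for color, future in futures.items() if future.result()]


def handle(infos: List[SlotInfo]) -> Dict[str, List[str]]:
    """Map each faulty slot (whose location LED would be lit in S43) to its cause colors."""
    result: Dict[str, List[str]] = {}
    for info in infos:
        colors = cause_colors(info)
        if colors:
            result[info.slot] = colors
    return result
```

A slot with no applicable cause is omitted from the result, which corresponds to leaving its location LED off.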
Since this embodiment is only a combined example of the embodiments above, its principles and beneficial effects are the same as theirs and are not repeated here.
The foregoing embodiments describe the memory fault detection method in detail. The present application also provides corresponding embodiments of a memory fault detection apparatus. The apparatus embodiments are described from two perspectives: functional modules and hardware.
From the perspective of functional modules, this embodiment provides a memory fault detection apparatus which, as shown in Fig. 5, includes:
an information obtaining module 51, configured to obtain the memory information produced by the BIOS self-test;
a fault judgment module 52, configured to locate the faulty memory according to the memory information and determine the fault cause;
a fault returning module 53, configured to return corresponding fault information according to the faulty memory and the fault cause.
Preferably, the apparatus further includes:
a storage module, configured to send the corresponding memory information to a storage device for storage when a faulty memory is determined to exist.
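The module decomposition above can be pictured with the sketch below. The class names mirror modules 51 to 53 and the optional storage module, but the status strings, method names, and the injected reader function are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical representation: slot name -> status ("ok", "damaged", "absent", "untrained").
MemoryInfo = Dict[str, str]


@dataclass
class Fault:
    slot: str
    cause: str


class InformationObtainingModule:
    """Module 51: obtains the memory information through an injected reader."""
    def __init__(self, read_bios_info: Callable[[], MemoryInfo]):
        self._read = read_bios_info

    def obtain(self) -> MemoryInfo:
        return self._read()


class FaultJudgmentModule:
    """Module 52: locates faulty memory and determines the cause."""
    def judge(self, info: MemoryInfo) -> List[Fault]:
        return [Fault(slot, status) for slot, status in info.items() if status != "ok"]


class FaultReturningModule:
    """Module 53: returns fault information (plain strings in this sketch)."""
    def report(self, faults: List[Fault]) -> List[str]:
        return [f"{f.slot}: {f.cause}" for f in faults]


class StorageModule:
    """Optional module: keeps the memory information of faulty slots."""
    def __init__(self) -> None:
        self.saved: List[MemoryInfo] = []

    def store(self, info: MemoryInfo, faults: List[Fault]) -> None:
        if faults:
            self.saved.append({f.slot: info[f.slot] for f in faults})


def run_detection(obtain: InformationObtainingModule, judge: FaultJudgmentModule,
                  report: FaultReturningModule,
                  storage: Optional[StorageModule] = None) -> List[str]:
    """Chain the modules in the order obtain, judge, optional store, return."""
    info = obtain.obtain()
    faults = judge.judge(info)
    if storage is not None:
        storage.store(info, faults)
    return report.report(faults)
```

Injecting the reader keeps the sketch independent of any particular BIOS or IPMI interface.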
Since the apparatus embodiments correspond to the method embodiments, refer to the description of the method embodiments for details, which are not repeated here.
In the memory fault detection apparatus provided by this embodiment, the information obtaining module obtains the memory information produced by the BIOS self-test, the fault judgment module locates the faulty memory and determines the fault cause from that information, and the fault returning module returns the corresponding fault information. The cause of the memory fault is thus determined automatically. Compared with the current practice of manually reading memory information from the BIOS and analyzing the fault cause, the determination process is simpler and the fault cause is presented more intuitively. From the returned fault information, operation and maintenance personnel can quickly locate the faulty memory and identify the fault cause. The demands placed on those personnel are also reduced, which better fits practical needs and helps keep the server running stably.
Fig. 6 is a structural diagram of a memory fault detection apparatus according to another embodiment of the present application. As shown in Fig. 6, the memory fault detection apparatus includes: a memory 60 for storing a computer program;
a processor 61, configured to implement the steps of the memory fault detection method of the above embodiments when executing the computer program.
The memory fault detection apparatus provided in this embodiment may include, but is not limited to, a smart phone, a tablet computer, a notebook computer, or a desktop computer.
The processor 61 may include one or more processing cores, for example a 4-core processor or an 8-core processor. The processor 61 may be implemented in hardware using at least one of a Digital Signal Processor (DSP), a Field-Programmable Gate Array (FPGA), and a Programmable Logic Array (PLA). The processor 61 may also include a main processor and a coprocessor: the main processor, also called a Central Processing Unit (CPU), processes data in the awake state, while the coprocessor is a low-power processor that processes data in the standby state. In some embodiments, the processor 61 may be integrated with a Graphics Processing Unit (GPU), which is responsible for rendering and drawing the content to be shown on the display screen. In some embodiments, the processor 61 may also include an Artificial Intelligence (AI) processor for handling computational operations related to machine learning.
The memory 60 may include one or more computer-readable storage media, which may be non-transitory. The memory 60 may also include high-speed random access memory as well as non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In this embodiment, the memory 60 at least stores a computer program 601 which, after being loaded and executed by the processor 61, can implement the relevant steps of the memory fault detection method disclosed in any of the foregoing embodiments. In addition, the resources stored in the memory 60 may also include an operating system 602, data 603, and the like, stored transiently or persistently. The operating system 602 may include Windows, Unix, Linux, and the like. The data 603 may include, but is not limited to, data involved in the memory fault detection method.
In some embodiments, a memory failure detection apparatus may further include a display 62, an input/output interface 63, a communication interface 64, a power supply 65, and a communication bus 66.
Those skilled in the art will appreciate that the configuration shown in Fig. 6 does not limit the memory fault detection apparatus, which may include more or fewer components than those shown.
The memory fault detection apparatus provided by this embodiment of the application includes a memory and a processor. When the processor executes the program stored in the memory, the memory fault detection method described above can be implemented.
In the memory fault detection apparatus provided by this embodiment, the processor executes the computer program stored in the memory to obtain the memory information produced by the BIOS self-test, determine from that information whether any memory has failed and, if so, locate the faulty memory, obtain the fault cause, and return the corresponding fault information to alert operation and maintenance personnel. Compared with the current practice of manually reading memory information from the BIOS and analyzing the fault cause, the apparatus is simpler and more intuitive. From the returned fault information, operation and maintenance personnel can quickly locate the faulty memory and determine the fault cause. The demands placed on those personnel are also reduced, which better fits practical needs and helps keep the server running stably.
Finally, the application also provides a corresponding embodiment of the computer readable storage medium. The computer-readable storage medium has stored thereon a computer program which, when being executed by a processor, carries out the steps as set forth in the above-mentioned method embodiments.
It is to be understood that if the method in the above embodiments is implemented in the form of software functional units and sold or used as a stand-alone product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product that is stored in a storage medium and performs all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
When the computer program stored in the computer-readable storage medium provided by this embodiment is executed, the memory information produced by the BIOS self-test can be obtained, and whether a memory fault exists can be determined from that information. If a fault exists, the faulty memory is located and the fault cause is determined from specific data bits of the memory information. The detection process requires no manual acquisition or analysis of memory information, and the returned fault information allows the faulty memory and the fault cause to be identified intuitively. Compared with the current manual analysis, the process is simpler and more efficient, places lower demands on operation and maintenance personnel, is easier to apply in practical engineering, and helps keep the server running stably.
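The application does not give the layout of the data bits from which the fault cause is determined. Purely as an illustration, the sketch below assumes a status byte in which bits 0, 1 and 2 signal the three fault causes; the bit positions and all names are hypothetical.

```python
from enum import IntFlag
from typing import List


class FaultBits(IntFlag):
    """Hypothetical bit assignment; the real layout is not given in the application."""
    DAMAGED = 0x01          # memory damaged
    NOT_PRESENT = 0x02      # memory not in place
    TRAINING_FAILED = 0x04  # memory failed training


def decode_fault_bits(status_byte: int) -> List[str]:
    """Return the names of the fault causes whose bits are set in the status byte."""
    flags = FaultBits(status_byte & 0x07)  # ignore bits outside the assumed layout
    return [member.name for member in FaultBits if member in flags]
```

Under this assumed layout a status byte of zero means that no fault cause is flagged for the slot.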
The memory fault detection method, apparatus, and medium provided by the present application have been described in detail above. The embodiments in this specification are described progressively: each embodiment focuses on its differences from the others, and the parts that are the same or similar can be cross-referenced. Because the apparatus disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is brief, and the relevant points can be found in the description of the method. It should be noted that those skilled in the art may make improvements and modifications to the present application without departing from its principle, and such improvements and modifications also fall within the scope of the claims of the present application.
It is further noted that, in this specification, relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional like elements in the process, method, article, or apparatus that comprises the element.

Claims (10)

1. A memory fault detection method, characterized in that a communication connection exists between a BMC and a BIOS, the method comprising:
obtaining the memory information produced by the BIOS self-test;
locating a faulty memory according to the memory information and determining the fault cause;
returning corresponding fault information according to the faulty memory and the fault cause.
2. The memory fault detection method according to claim 1, wherein the fault cause comprises memory damage, memory not in place, and memory failing training.
3. The memory fault detection method according to claim 2, wherein a memory fault lamp connected to the BMC is disposed at each memory card slot;
correspondingly, the returning of corresponding fault information according to the faulty memory and the fault cause comprises:
if the fault cause is memory damage, controlling the memory fault lamp corresponding to the faulty memory to flash in a first flashing mode;
if the fault cause is memory not in place, controlling the memory fault lamp corresponding to the faulty memory to flash in a second flashing mode;
if the fault cause is memory failing training, controlling the memory fault lamp corresponding to the faulty memory to flash in a third flashing mode; wherein the first, second, and third flashing modes are different from one another.
4. The memory fault detection method according to claim 2, wherein a server includes a fault cause indicator lamp connected to the BMC and fault location lamps disposed at each memory card slot, and the fault cause indicator lamp has at least three different colors corresponding respectively to different fault causes;
correspondingly, the returning of corresponding fault information according to the faulty memory and the fault cause comprises:
controlling the fault location lamp corresponding to the faulty memory to stay on, and controlling the fault cause indicator lamp of the color corresponding to the fault cause to stay on.
5. The memory fault detection method according to claim 2, wherein a group of memory fault lamps connected to the BMC is disposed at each memory card slot, and the memory fault lamps in each group have at least three different colors corresponding respectively to different fault causes;
correspondingly, the returning of corresponding fault information according to the faulty memory and the fault cause comprises:
controlling, in the memory fault lamp group corresponding to the faulty memory, the memory fault lamp of the color corresponding to the fault cause to stay on.
6. The memory fault detection method according to claim 2, further comprising:
when a faulty memory is determined to exist, sending the corresponding memory information to a storage device for storage.
7. The memory fault detection method according to claim 6, wherein the sending of the corresponding memory information to a storage device for storage comprises:
if the fault cause corresponding to the faulty memory is memory damage, storing the memory information in a first storage device;
if the fault cause corresponding to the faulty memory is memory not in place, storing the memory information in a second storage device;
if the fault cause corresponding to the faulty memory is memory failing training, storing the memory information in a third storage device.
8. A memory fault detection apparatus, comprising:
an information obtaining module, configured to obtain the memory information produced by the BIOS self-test;
a fault judgment module, configured to locate a faulty memory according to the memory information and determine the fault cause;
a fault returning module, configured to return corresponding fault information according to the faulty memory and the fault cause.
9. A memory fault detection apparatus, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the memory fault detection method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements the steps of the memory fault detection method according to any one of claims 1 to 7.
CN202211296522.1A 2022-10-21 2022-10-21 Memory fault detection method, device and medium Pending CN115525467A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211296522.1A CN115525467A (en) 2022-10-21 2022-10-21 Memory fault detection method, device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211296522.1A CN115525467A (en) 2022-10-21 2022-10-21 Memory fault detection method, device and medium

Publications (1)

Publication Number Publication Date
CN115525467A true CN115525467A (en) 2022-12-27

Family

ID=84704204

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211296522.1A Pending CN115525467A (en) 2022-10-21 2022-10-21 Memory fault detection method, device and medium

Country Status (1)

Country Link
CN (1) CN115525467A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination