CN117234771A

CN117234771A - Fault memory positioning method, system, device, computer equipment and storage medium

Info

Publication number: CN117234771A
Application number: CN202311157880.9A
Authority: CN
Inventors: 张瑜
Original assignee: Suzhou Inspur Intelligent Technology Co Ltd
Current assignee: Suzhou Inspur Intelligent Technology Co Ltd
Priority date: 2023-09-08
Filing date: 2023-09-08
Publication date: 2023-12-15

Abstract

The invention relates to the technical field of servers, and discloses a fault memory positioning method, a system, a device, computer equipment and a storage medium, wherein the method comprises the following steps: under the condition that a memory fault signal is received and the server is in a shutdown state, the write-protection state of the power management chip register is released; determining candidate fault memories according to the memory fault signals; writing a disable command into a power management chip register of the candidate fault memory, wherein the disable command is used for not powering on the candidate fault memory when the server enters a starting state; under the condition that the server enters a starting state, acquiring log information of candidate fault memories; and determining the target fault memory and the slot position information of the target fault memory from the candidate fault memories according to the log information. The invention solves the problems that the abnormal information which can accurately locate the fault memory is difficult to obtain and the PMIC fault memory cannot be located.

Description

Fault memory positioning method, system, device, computer equipment and storage medium

Technical Field

The present invention relates to the field of server technologies, and in particular, to a method, a system, an apparatus, a computer device, and a storage medium for locating a fault memory.

Background

To meet increasing performance requirements, servers adopt high-performance CPUs (Central Processing Units), which require large amounts of memory, so the probability of a memory failure in a server is relatively high.

Memory can fail in many ways. One such fault is an abnormality of the memory's PMIC (Power Management Integrated Circuit, power management chip), which causes the server to shut down; once this fault occurs, the server cannot be used normally. Existing methods for identifying a memory with a PMIC fault cannot directly locate the position of the memory whose PMIC is abnormal. Moreover, because the PMIC abnormality shuts the server down, it is difficult to obtain abnormality information that can accurately locate the faulty memory. The server can neither locate the PMIC-fault memory on its own nor recover from it on its own, so resolving a PMIC abnormality takes a long time and poses great challenges for server maintenance.

Therefore, the prior art has the problems that abnormal information capable of accurately positioning the fault memory is difficult to obtain and the PMIC fault memory cannot be positioned.

Disclosure of Invention

In view of the above, the present invention provides a method, a system, a device, a computer device and a storage medium for locating a faulty memory, so as to solve the problem that in the prior art, it is difficult to obtain abnormal information capable of accurately locating the faulty memory, and it is impossible to locate the faulty memory of the PMIC.

In a first aspect, the present invention provides a method for locating a failed memory, where the method includes:

under the condition that a memory fault signal is received and the server is in a shutdown state, the write-protection state of the power management chip register is released;

determining candidate fault memories according to the memory fault signals;

writing a disable command into a power management chip register of the candidate fault memory, wherein the disable command is used for not powering on the candidate fault memory when the server enters a starting state;

under the condition that the server enters a starting state, acquiring log information of candidate fault memories;

and determining the target fault memory and the slot position information of the target fault memory from the candidate fault memories according to the log information.

According to the fault memory location method provided by this embodiment, when a memory fault signal is received and the server is in a shutdown state, the write-protection state of the power management chip register is released, which makes it possible to write a disable command into the power management chip register of a candidate fault memory. The candidate fault memories are determined according to the memory fault signal, and a disable command is written into the power management chip register of each candidate fault memory, so that the candidate fault memories do not affect the startup of the server. When the server enters a starting state, log information of the candidate fault memories is acquired, and the target fault memory and its slot position information are determined according to the log information, so that the specific memory slot of the target fault memory is located directly. This solves the problem that abnormality information capable of accurately locating the fault memory is difficult to obtain and that the PMIC fault memory cannot be located.

In an alternative embodiment, determining the target fault memory and the slot information of the target fault memory from the candidate fault memories according to the log information includes:

judging whether first type memory error reporting information exists in the log information;

under the condition that the first type of memory error reporting information exists, determining a candidate fault memory corresponding to the first type of memory error reporting information as a target fault memory, and determining slot position information of the target fault memory according to log information.

In the embodiment, the candidate fault memory corresponding to the first type of memory error reporting information in the log information is determined as the target fault memory, and the slot position information of the target fault memory is determined.

In an alternative embodiment, after determining the slot information of the target fault memory according to the log information, the method further includes:

taking a candidate fault memory corresponding to the second type memory error reporting information in the log information as a normal memory;

and taking the candidate fault memory corresponding to the third type of memory error reporting information in the log information as a fault memory to be analyzed, wherein the fault memory to be analyzed is a candidate fault memory except the target fault memory and the normal memory.

In this embodiment, the candidate fault memory corresponding to the second type of memory error reporting information in the log information is used as a normal memory, and the candidate fault memory corresponding to the third type of memory error reporting information is used as a fault memory to be analyzed, so that the invention can determine multiple types of fault memories, and the application range of the invention is expanded.

In an alternative embodiment, releasing the write-protected state of the power management chip register includes:

controlling a target main board to perform alternating current power-down;

after the preset time, the target main board is controlled to carry out alternating current power-on, and the write-protection state of the power management chip register is released.

In this embodiment, the target motherboard is controlled to perform ac power down, and after a preset time elapses, the target motherboard is controlled to perform ac power up, so as to complete the release of the write-protection state of the power management chip register, so that the disable command is conveniently written into the power management chip register of the candidate fault memory subsequently, and the influence of the candidate fault memory on the normal startup of the server is avoided.

In a second aspect, the present invention provides a fault memory location system, the system comprising: a complex programmable device, a baseboard management controller, a platform path controller, and a central processing unit;

The complex programmable device is used for receiving the memory fault signal and determining whether the server is in a shutdown state or not;

the complex programmable device is connected with the baseboard management controller and is used for transmitting the memory fault signal to the baseboard management controller;

the baseboard management controller is used for removing the write protection state of the power management chip register, and determining candidate fault memories according to the memory fault signals;

the baseboard management controller is connected with the platform path controller and is used for sending a first message to the platform path controller, which then forwards the first message to the central processing unit, wherein the first message is used for determining candidate fault memories;

the CPU is connected with the memory and is used for writing a disable command into a power management chip register of the candidate fault memory, wherein the candidate fault memory is contained in the memory, and the disable command is used for not powering on the candidate fault memory when the server enters a starting state;

the baseboard management controller controls the server to perform direct current power-on, so that the server enters a starting state, log information of the candidate fault memory is obtained, and the target fault memory and the slot position information of the target fault memory are determined from the candidate fault memory according to the log information.

According to the fault memory location system provided by this embodiment, the baseboard management controller releases the write-protection state of the power management chip register, which makes it possible to write a disable command into the power management chip register of a candidate fault memory. The baseboard management controller determines the candidate fault memories according to the memory fault signal, and the central processing unit writes a disable command into the power management chip register of each candidate fault memory, so that the candidate fault memories do not affect the startup of the server. When the server enters a starting state, the baseboard management controller acquires log information of the candidate fault memories and determines the target fault memory and its slot position information according to the log information, so that the specific memory slot of the target fault memory is located directly. This solves the problem that abnormality information capable of accurately locating the fault memory is difficult to obtain and that the PMIC fault memory cannot be located.

In an alternative embodiment, the baseboard management controller is connected to the central processing unit through a first link;

the baseboard management controller is used for judging whether first type memory error reporting information exists in the log information;

under the condition that first type memory error reporting information exists in the log information, the baseboard management controller is used for determining candidate fault memories corresponding to the first type memory error reporting information as target fault memories, acquiring relevant information of a power management chip register in the target fault memories from the central processing unit through a first link, determining slot position information of the target fault memories according to the log information, and generating alarm information according to the relevant information and the slot position information;

The baseboard management controller is used for judging whether second type memory error reporting information exists in the log information;

under the condition that the second type of memory error reporting information exists in the log information, the baseboard management controller is used for taking the candidate fault memory corresponding to the second type of memory error reporting information as a normal memory;

the baseboard management controller is used for judging whether the third type of memory error reporting information exists in the log information;

and under the condition that the third type of memory error reporting information exists in the log information, the baseboard management controller is used for taking the candidate fault memory corresponding to the third type of memory error reporting information in the log information as a fault memory to be analyzed, wherein the fault memory to be analyzed is a candidate fault memory except the target fault memory and the normal memory.

In this embodiment, the baseboard management controller determines the candidate fault memory corresponding to the first type of memory error reporting information in the log information as the target fault memory and determines the slot position information of the target fault memory. This is simple and efficient, locates the specific memory slot directly, and makes it convenient for maintenance personnel to replace the target fault memory. The baseboard management controller takes the candidate fault memory corresponding to the second type of memory error reporting information in the log information as a normal memory and takes the candidate fault memory corresponding to the third type of memory error reporting information as a fault memory to be analyzed, so that the invention can identify multiple types of fault memories, which broadens its range of application.

In an alternative embodiment, the complex programmable device is connected to the baseboard management controller via a second link, and the baseboard management controller is connected to the platform path controller via a second link;

the complex programmable device transmits a memory fault signal to the baseboard management controller through a second link;

the baseboard management controller sends a first message to the platform path controller over a second link.

In this embodiment, the complex programmable device transmits a memory failure signal to the baseboard management controller through the second link, so that the baseboard management controller can determine the candidate failure memory. The baseboard management controller sends a first message to the platform path controller through a second link, so that the platform path controller determines which memories are candidate fault memories, and the central processing unit is conveniently informed of the fault memories.

In an alternative embodiment, the central processor is connected to the memory through a third link;

and the central processing unit writes a disabling command into the power management chip register of the candidate fault memory through a third link.

In this embodiment, the central processor writes the disable command into the power management chip register of the candidate fault memory through the third link, so as to avoid the influence of the candidate fault memory on the startup of the server.

In a third aspect, the present invention provides a fault memory location device, the device comprising:

the releasing module is used for releasing the write-protection state of the power management chip register under the condition that the memory fault signal is received and the server is in a shutdown state;

the first determining module is used for determining candidate fault memories according to the memory fault signals;

the writing module is used for writing a disable command into a power management chip register of the candidate fault memory, wherein the disable command is used for not powering on the candidate fault memory when the server enters a starting state;

the acquisition module is used for acquiring log information of the candidate fault memory under the condition that the server enters a starting state;

and the second determining module is used for determining the target fault memory and the slot position information of the target fault memory from the candidate fault memories according to the log information.

In a fourth aspect, the present invention provides a computer device comprising a memory and a processor that are communicatively connected to each other, wherein the memory stores computer instructions, and the processor executes the computer instructions so as to perform the fault memory location method of the first aspect or any of the embodiments corresponding thereto.

In a fifth aspect, the present invention provides a computer readable storage medium having stored thereon computer instructions for causing a computer to perform the fault memory location method of the first aspect or any of the embodiments corresponding thereto.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.

FIG. 1 is a diagram illustrating a relationship between an abnormality alert signal and a memory of a power management chip according to an embodiment of the present invention;

FIG. 2 is a flow chart of a method for locating a failed memory according to an embodiment of the invention;

FIG. 3 is a schematic diagram of a fault memory location system according to an embodiment of the present invention;

FIG. 4 is a flowchart of a method for determining, locating and alerting PMIC abnormal memory according to an embodiment of the present invention;

FIG. 5 is a block diagram of a failed memory location device according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of a hardware structure of a computer device according to an embodiment of the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Each channel of the central processing unit (CPU) may be connected to a plurality of memories, for example two memories, and the 4 memories of every two channels share one power management chip (PMIC) abnormality alert signal. As shown in FIG. 1, channel A of CPU0 is connected to the first and second memories of CPU0 channel A, channel B of CPU0 is connected to the first and second memories of CPU0 channel B, and these four memories share one power management chip abnormality alert signal (PWRGD_FAIL_CPU0_AB), which is sent to the complex programmable logic device (Complex Programmable Logic Device, CPLD). If all four memories are normal, the power management chip abnormality alert signal is at high level; if any one of the four memories has a PMIC abnormality, the faulty memory pulls the signal down to low level. Because the four memories share one power management chip abnormality alert signal, it is impossible to tell from the signal alone which memory has the PMIC abnormality.
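The shared, active-low behavior of the alert signal can be illustrated with a minimal Python sketch. The DIMM labels and the function name `shared_alert_level` are hypothetical and chosen only for illustration; the sketch models the logic level, not the electrical circuit.

```python
from typing import Dict

# Hypothetical labels for the four memories on channels A and B of CPU0.
GROUP_CPU0_AB = [
    "CPU0_CHA_DIMM0",
    "CPU0_CHA_DIMM1",
    "CPU0_CHB_DIMM0",
    "CPU0_CHB_DIMM1",
]


def shared_alert_level(pmic_ok: Dict[str, bool]) -> int:
    """Return 1 (high) if every PMIC in the group is healthy, else 0 (low)."""
    return 1 if all(pmic_ok.values()) else 0


healthy = {dimm: True for dimm in GROUP_CPU0_AB}
fault_a = dict(healthy, CPU0_CHA_DIMM1=False)  # one memory on channel A fails
fault_b = dict(healthy, CPU0_CHB_DIMM0=False)  # a different memory fails

level_healthy = shared_alert_level(healthy)
level_fault_a = shared_alert_level(fault_a)
level_fault_b = shared_alert_level(fault_b)
```

Both fault cases yield the same low level, which is why the signal by itself cannot identify the faulty slot.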

Based on this, an embodiment of the invention provides a fault memory location method. Without affecting the original functions of the motherboard, the motherboard CPLD actively notifies the BMC (Baseboard Management Controller) when it detects that the power management chip abnormality alert signal has been pulled low, and cooperates with the BMC to determine the fault memory with the PMIC abnormality and to report the corresponding slot position information. This solves the problem that the abnormal memory slot cannot be located directly, which in turn prevents direct debugging (Debug) and makes maintenance in the server room difficult.

According to an embodiment of the present invention, an embodiment of a fault memory location method is provided. It should be noted that the steps shown in the flowcharts of the drawings may be performed in a server device having data processing capability, and that although a logical sequence is shown in the flowcharts, in some cases the steps shown or described may be performed in a different order.

In this embodiment, a fault memory location method is provided, which may be used in the above server device, and fig. 2 is a flowchart of a fault memory location method according to an embodiment of the present invention, as shown in fig. 2, where the flowchart includes the following steps:

Step S201, under the condition that the memory fault signal is received and the server is in a shutdown state, the write protection state of the power management chip register is released.

Specifically, during normal operation, the BIOS (Basic Input Output System) sends an enable command to each DIMM (Dual In-line Memory Module), the memory is powered up normally, and the BIOS sets the power management chip registers (PMIC registers) of all DIMMs to a write-protected state.

When the CPLD detects that the power management chip abnormality alert signal (PWRGD_FAIL, which may be simply referred to as the PWR_FAIL signal) is low, the server is powered off. At the same time, because the PMIC registers of the DIMMs are set to a write-protected state, the PMIC abnormality that pulls the PWR_FAIL signal low cannot be cleared, so the server may be unable to power on. Therefore, it is necessary to release the write-protected state of the power management chip register.

Take as an example the case where each CPU channel can be connected to two memories. After the server is powered on with direct current (DC), if the server has two CPUs, then in the normal state the two CPUs correspond to 8 memory PWR_FAIL signals, all of which are at high level. The present invention refers to a low-level PWR_FAIL signal as a memory fault signal. After the CPLD detects the memory fault signal in real time, it waits 5 seconds and confirms that its power sequence has reached the S5 state. The S5 state is one state of the power sequence, namely the state in which the standby (STBY) power is connected to the alternating current (Alternating Current, AC) power supply, and it indicates that the server is in a shutdown state.
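The detect, wait, and confirm sequence performed by the CPLD can be sketched in Python as follows. Real CPLD logic would be written in a hardware description language; this is only a behavioral model, and the callables `read_pwr_fail`, `read_power_state`, and `sleep` are hypothetical hooks introduced so that the logic can be exercised without hardware.

```python
from typing import Callable, List

WAIT_SECONDS = 5  # wait time stated in the text


def cpld_should_report(
    read_pwr_fail: Callable[[], int],
    read_power_state: Callable[[], str],
    sleep: Callable[[float], None],
) -> bool:
    """Return True when a low PWR_FAIL is followed by a confirmed S5 state."""
    if read_pwr_fail() != 0:
        # High level: no memory fault signal, nothing to report.
        return False
    sleep(WAIT_SECONDS)  # let the power sequence settle before checking it
    return read_power_state() == "S5"


# Exercise the logic with fake inputs; the waits are recorded instead of slept.
waits: List[float] = []
reported_on_fault = cpld_should_report(lambda: 0, lambda: "S5", waits.append)
reported_when_healthy = cpld_should_report(lambda: 1, lambda: "S5", waits.append)
```

Passing the sleep function as a parameter keeps the sketch testable; a real deployment would use a hardware timer.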

The CPLD confirms the memory PMIC abnormality and reports it to the BMC, and the BMC executes the subsequent Debug logic, specifically as follows. When the BMC receives the memory PMIC abnormality information reported by the CPLD, the PMIC status register of the memory is in a write-protected state. The abnormal PMIC state therefore cannot be modified, and after the main board is restarted the sequence from memory PMIC abnormality to shutdown into S5 would occur again. For this reason, the BMC releases the write-protected state of the power management chip register.

It should be noted that the part of the Debug logic that uses the CPLD is a simple read flow. It places few demands on the CPLD and does not significantly affect the original chip selection and use. In addition, the invention designs a Debug FW (firmware). In the normal state, the Debug FW does not affect the normal operation of the BMC FW; once a memory PMIC abnormality occurs, the priority of the relevant Debug FW operations is raised to the highest, so that the abnormality can be located and resolved as soon as possible.

Step S202, according to the memory fault signal, determining candidate fault memories.

Specifically, take as an example the case where each CPU channel can be connected to two memories: the memory fault signal is a low-level PWR_FAIL signal, and 4 memories share one PWR_FAIL signal.

A received memory fault signal indicates that at least one of the 4 memories sharing the corresponding PWR_FAIL signal has a PMIC abnormality; that memory is the target fault memory. Therefore, the 4 memories that share the PWR_FAIL signal corresponding to the memory fault signal are taken as candidate fault memories, and at least one target fault memory exists among these 4 candidate fault memories.
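As a rough illustration of step S202, the following Python sketch expands the name of a low PWR_FAIL signal into its four candidate fault memories. The signal name pattern follows the PWRGD_FAIL_CPU0_AB example; the DIMM label format, the zero-based DIMM index, and the function name `candidate_dimms` are assumptions made for this sketch.

```python
import re
from typing import List

DIMMS_PER_CHANNEL = 2  # the two-memories-per-channel example from the text


def candidate_dimms(signal_name: str) -> List[str]:
    """Expand a shared signal name into the labels of the DIMMs that share it."""
    match = re.fullmatch(r"PWRGD_FAIL_CPU(\d+)_([A-H])([A-H])", signal_name)
    if match is None:
        raise ValueError(f"unrecognized signal name: {signal_name}")
    cpu, first_channel, second_channel = match.groups()
    # Assumed label format: CPU<n>_CH<channel>_DIMM<index>.
    return [
        f"CPU{cpu}_CH{channel}_DIMM{index}"
        for channel in (first_channel, second_channel)
        for index in range(DIMMS_PER_CHANNEL)
    ]


candidates = candidate_dimms("PWRGD_FAIL_CPU0_AB")
```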

And step S203, writing a disable command into a power management chip register of the candidate fault memory, wherein the disable command is used for not powering on the candidate fault memory when the server enters a starting state.

Specifically, take as an example the case where each CPU channel can be connected to two memories. Since one memory fault signal indicates that at least one target fault memory exists among the 4 candidate fault memories, the BMC sends the information of the candidate fault memories to the platform path controller (Platform Controller Hub, PCH), the platform path controller informs the CPU, and the CPU writes a disable (Disable) command into the power management chip registers of all the candidate fault memories. When the server is DC powered on and enters the starting state, the candidate fault memories into which the disable command has been written are not powered on, which prevents the abnormality from recurring.
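The interaction between write protection and the disable command in step S203 can be modeled with a small Python sketch. The `PmicRegister` class is a hypothetical, simplified stand-in with two flags; it does not reflect the actual PMIC register layout, which the text does not specify.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class PmicRegister:
    """Hypothetical, simplified view of one DIMM's PMIC register."""
    write_protected: bool = True  # set by the BIOS during normal operation
    enabled: bool = True          # cleared by the disable command


def write_disable(registers: Dict[str, PmicRegister], candidates: Iterable[str]) -> None:
    """Write the disable command to the PMIC register of every candidate."""
    for dimm in candidates:
        register = registers[dimm]
        if register.write_protected:
            raise PermissionError(f"{dimm}: PMIC register is write-protected")
        register.enabled = False


def dimms_to_power_on(registers: Dict[str, PmicRegister]) -> List[str]:
    """List the DIMMs that would still be powered at the next DC power-on."""
    return sorted(dimm for dimm, reg in registers.items() if reg.enabled)


group = ["CPU0_CHA_DIMM0", "CPU0_CHA_DIMM1", "CPU0_CHB_DIMM0", "CPU0_CHB_DIMM1"]
# Write protection is assumed to have been released already (step S201).
registers = {d: PmicRegister(write_protected=False) for d in group + ["CPU0_CHC_DIMM0"]}
write_disable(registers, group)
powered = dimms_to_power_on(registers)
```

In this model only the memory outside the candidate group remains enabled, and a write attempted while protection is still set fails, which mirrors why step S201 has to come first.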

Step S204, under the condition that the server enters a starting state, acquiring log information of the candidate fault memory.

Specifically, the server system is powered on with DC, the server enters the starting state, and the BMC acquires and checks the received PMIC Disable log information.

Step S205, determining the target fault memory and the slot position information of the target fault memory from the candidate fault memories according to the log information.

Specifically, the acquired log information is analyzed, the target fault memory is determined from the candidate fault memories, and the slot position information of the target fault memory is determined. Alarm information may also be generated to notify server maintenance personnel that the memory needs to be replaced.

In some alternative embodiments, determining the target fault memory and the slot information of the target fault memory from the candidate fault memories according to the log information includes:

Specifically, the BMC checks the received PMIC Disable log information, and determines whether there is first type memory error reporting information in the log information, where the first type memory error reporting information is, for example: 0X46/0X05 related information.

If first-type memory error reporting information is present in the log information, the candidate fault memory corresponding to it is the target fault memory with the PMIC abnormality. Finally, the BMC collects all the log information and determines the slot position information of the target fault memory from it, so that maintenance personnel can replace the target fault memory and research and development personnel can investigate the problem further.

In some alternative embodiments, after determining the slot information of the target fault memory from the log information, the method further comprises:

Specifically, if second-type memory error reporting information appears in the log information, for example memory error reporting information starting with 0X0A, the candidate fault memory corresponding to the second-type memory error reporting information is a normal memory.

The invention refers to other types of memory error reporting information except the first type of memory error reporting information and the second type of memory error reporting information as third type of memory error reporting information. If the error reporting information of the third type of memory appears in the log information, the candidate fault memory corresponding to the error reporting information of the third type of memory is used as the fault memory to be analyzed. The fault memory to be analyzed needs to rely on the CPU memory related register information collected by the BMC through JTAG (Joint Test Action Group, standard test access port and boundary scan structure) for further Debug, and meanwhile, JTAG of the BMC can collect the memory PMIC FW version to confirm whether the memory fault signal is caused by using the old version FW.
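The three-way classification described above can be sketched in Python. The log entry format (a slot label paired with a code string) is an assumption, as is matching by string prefix; the prefixes 0X46/0X05 and 0X0A come from the text, while the remaining codes in the sample log are invented placeholders.

```python
from typing import Dict, List, Tuple

FIRST_TYPE_PREFIX = "0X46/0X05"  # first-type example from the text
SECOND_TYPE_PREFIX = "0X0A"      # second-type example from the text


def classify(entries: List[Tuple[str, str]]) -> Dict[str, str]:
    """Map each slot label to 'target', 'normal', or 'to_analyze'."""
    result: Dict[str, str] = {}
    for slot, code in entries:
        code = code.upper()
        if code.startswith(FIRST_TYPE_PREFIX):
            result[slot] = "target"      # PMIC-abnormal memory
        elif code.startswith(SECOND_TYPE_PREFIX):
            result[slot] = "normal"
        else:
            result[slot] = "to_analyze"  # third type: needs further Debug
    return result


# Hypothetical log for the four candidates; codes other than the two
# prefixes above are placeholders, not real error codes.
sample_log = [
    ("CPU0_CHA_DIMM0", "0X0A/0X01"),
    ("CPU0_CHA_DIMM1", "0X46/0X05"),
    ("CPU0_CHB_DIMM0", "0X0A/0X01"),
    ("CPU0_CHB_DIMM1", "0X13/0X02"),
]
verdicts = classify(sample_log)
target_slots = [slot for slot, verdict in verdicts.items() if verdict == "target"]
```

Because each log entry carries its own slot label in this model, the slot position information of the target fault memory falls out of the classification directly.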

In some alternative embodiments, releasing the write-protected state of the power management chip registers includes:

controlling a target main board to perform alternating current power-down;

Specifically, the write-protection state of the power management chip (PMIC) register may be released using an alternating current cycle (AC Cycle), which includes AC power-down followed by AC power-up.

The target main board is the server motherboard. The BMC issues a command that makes the corresponding power chip perform an AC power-down operation on the target main board. To ensure a normal power-on sequence, after a preset time (for example, 10 s) the BMC issues a command that makes the corresponding power chip perform an AC power-up operation on the target main board, completing the AC Cycle; at this point the write-protection state of the power management chip register is released.
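The AC Cycle can be sketched in Python as a down, wait, up sequence. The `Motherboard` class and its method names are hypothetical stand-ins for the commands the BMC sends to the power chip, and the rule that protection is cleared on AC power-up simply encodes the behavior stated in the text.

```python
from typing import Callable, List

PRESET_SECONDS = 10  # example preset time from the text


class Motherboard:
    """Hypothetical stand-in for the target main board as seen by the BMC."""

    def __init__(self) -> None:
        self.ac_on = True
        self.pmic_write_protected = True  # set by the BIOS in normal operation
        self.events: List[str] = []

    def ac_power_down(self) -> None:
        self.ac_on = False
        self.events.append("ac_down")

    def ac_power_up(self) -> None:
        self.ac_on = True
        # Per the text, write protection is released once the AC Cycle completes.
        self.pmic_write_protected = False
        self.events.append("ac_up")


def ac_cycle(
    board: Motherboard,
    sleep: Callable[[float], None],
    preset_seconds: float = PRESET_SECONDS,
) -> None:
    """AC power-down, wait for the preset time, then AC power-up."""
    board.ac_power_down()
    sleep(preset_seconds)
    board.ac_power_up()


board = Motherboard()
waits: List[float] = []
ac_cycle(board, waits.append)  # record the wait instead of actually sleeping
```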

In this embodiment, a fault memory location system is provided, which may be deployed in the server device described above, and the system includes: a complex programmable device, a baseboard management controller, a platform path controller, and a central processing unit;

Specifically, this embodiment is described with reference to FIG. 3. The fault memory location system includes a complex programmable device (CPLD), a baseboard management controller (BMC), a platform path controller (PCH), and a central processing unit (CPU); there may be more than one central processing unit, for example central processing unit 0 and central processing unit 1. Take as an example the case where each CPU is connected to 16 memories and every 4 memories share one power management chip abnormality alert signal (PWRGD_FAIL, which may be simply referred to as the PWR_FAIL signal). The signals are named PWRGD_FAIL_CPUx_AB-GH, denoting the PWR_FAIL signals of the memories connected to CPU0 and CPU1 on the main board. The two CPUs together correspond to 8 PWR_FAIL signals, all of which are connected to the CPLD, and the CPLD monitors whether any PWR_FAIL signal is triggered (the normal state is high level). For example, the power management chip abnormality alert signal used by channel A of CPU0, to which the memory CPU0_CHA_DIMM1 is connected, may be named PWRGD_FAIL_CPU0_A-GH.
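The signal count in this example can be checked with a short Python sketch. It assumes that the suffix range "AB-GH" stands for the four channel pairs AB, CD, EF, and GH, which is an interpretation rather than something the text states outright; the function name `pwr_fail_signal_names` is hypothetical.

```python
from typing import List

MEMORIES_PER_CPU = 16    # example from the text
MEMORIES_PER_SIGNAL = 4  # four memories share one PWR_FAIL signal
# Assumed reading of the "AB-GH" suffix range as four channel pairs.
CHANNEL_PAIRS = ["AB", "CD", "EF", "GH"]


def pwr_fail_signal_names(cpu_count: int) -> List[str]:
    """Enumerate the PWR_FAIL signal names monitored by the CPLD."""
    signals_per_cpu = MEMORIES_PER_CPU // MEMORIES_PER_SIGNAL
    pairs = CHANNEL_PAIRS[:signals_per_cpu]
    return [
        f"PWRGD_FAIL_CPU{cpu}_{pair}"
        for cpu in range(cpu_count)
        for pair in pairs
    ]


signal_names = pwr_fail_signal_names(cpu_count=2)
```

With 16 memories per CPU and 4 memories per signal, each CPU contributes 4 signals, so two CPUs give the 8 signals mentioned in the text.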

The CPLD acquires the memory fault signal in real time. After acquiring it, the CPLD waits 5 seconds and then confirms that its power sequence is in the S5 state. The S5 state is one of the power sequence states, in which the standby (STBY) power supply is connected to the alternating current (AC) power supply; it indicates that the server is in a shutdown state.
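The CPLD check described above can be modeled in a few lines. This is a behavioral sketch only: a real CPLD implements this in programmable logic, and the names `fault_confirmed`, `pwr_fail_is_high` and `power_state` are hypothetical. The 5-second wait, the high-level normal state and the S5 condition come from the description.

```python
import time
from typing import Callable


def fault_confirmed(pwr_fail_is_high: Callable[[], bool],
                    power_state: Callable[[], str],
                    settle_s: float = 5.0,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Return True when a memory fault should be reported to the BMC.

    PWR_FAIL is normally high; a low level is the memory fault signal.
    After seeing it, wait settle_s seconds and require the S5 state.
    """
    if pwr_fail_is_high():
        return False              # normal state, nothing to report
    sleep(settle_s)               # wait 5 seconds before checking the sequence
    return power_state() == "S5"  # S5 means the server is shut down
```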

The CPLD is connected with the BMC. After confirming that a memory has a PMIC abnormality, the CPLD reports it to the BMC; that is, it transmits the memory fault signal to the baseboard management controller.

The BMC is used for releasing the write protection state of the power management chip register. Taking the case where each CPU channel can connect two memories as an example, the memory fault signal is a low-level PWR_FAIL signal, and 4 memories share one PWR_FAIL signal. A received memory fault signal therefore indicates that at least one of the 4 memories sharing the corresponding PWR_FAIL signal has a PMIC abnormality; that memory is the target fault memory. The BMC accordingly takes the 4 memories that share the PWR_FAIL signal corresponding to the memory fault signal as the candidate fault memories.
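The mapping from a triggered PWR_FAIL signal to its 4 candidate fault memories can be sketched as below. The per-signal name format `PWRGD_FAIL_CPU<n>_<two channel letters>` (for example `PWRGD_FAIL_CPU0_AB`) is an assumption made for this illustration; the description only gives the range form PWRGD_FAIL_CPUx_AB-GH. The slot name format follows CPU0_CHA_DIMM1 from the description, and the function name `candidate_dimms` is hypothetical.

```python
import re

# Assumed per-signal naming: one signal covers two channels, e.g. "AB".
SIGNAL_RE = re.compile(r"^PWRGD_FAIL_CPU(\d+)_([A-H])([A-H])$")
DIMMS_PER_CHANNEL = 2  # each channel can connect two memories


def candidate_dimms(signal_name: str) -> list[str]:
    """List the 4 memory slots that share the given PWR_FAIL signal."""
    m = SIGNAL_RE.match(signal_name)
    if m is None:
        raise ValueError(f"unrecognized PWR_FAIL signal name: {signal_name!r}")
    cpu, first, second = m.groups()
    return [
        f"CPU{cpu}_CH{channel}_DIMM{index}"
        for channel in (first, second)
        for index in range(1, DIMMS_PER_CHANNEL + 1)
    ]
```

With two channels per signal and two memories per channel, each signal yields exactly 4 candidates, matching the sharing ratio in the description.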

The BMC is connected to the PCH and sends a first message to the PCH, the first message including which memories are candidate failed memories. The PCH passes a first message to the CPU through BIOS software, informing the CPU which memories are candidate failed memories.

The CPU is connected with the memory, and writes a Disable command into the power management chip registers of all the candidate fault memories.

The BMC powers on the server system in DC mode so that the server enters the power-on state, and the BMC acquires and checks the received PMIC Disable log information. By analyzing the acquired log information, the BMC determines the target fault memory from the candidate fault memories and determines the slot information of the target fault memory.

It should be noted that, through the connections between the motherboard BMC and key devices such as the CPLD and the PCH, the motherboard CPLD obtains the state of the memory PWR_FAIL signal and cooperates with the BMC to carry out the Debug flow and logic for analyzing and resolving memory PMIC abnormality alarms. This helps solve the existing problems that this kind of Bug cannot be located directly and that machine room maintenance personnel cannot directly service this kind of abnormality.

In some alternative embodiments, the baseboard management controller is connected to the central processing unit via a first link;

Specifically, the Baseboard Management Controller (BMC) checks the received PMIC Disable log information, and determines whether there is first type memory error reporting information in the log information, where the first type memory error reporting information is, for example: 0X46/0X05 related information.

If first type memory error reporting information is present in the log information, the candidate fault memory corresponding to it is the target fault memory with the PMIC abnormality. The BMC acquires the related abnormal register information, namely the related information of the power management chip register in the target fault memory, from the central processing unit through the first link, determines the slot information of the target fault memory according to the log information, generates alarm information according to the related information and the slot information, and reports the alarm information on the BMC, so that maintenance personnel can replace the target fault memory and research personnel can investigate the problem further. The BMC records all the collected log information. The first link is, for example, JTAG (Joint Test Action Group, the standard test access port and boundary-scan architecture). The JTAG bus is a dedicated debug bus connecting the CPU and the PCH, and through it the BMC can obtain the values of the memory PMIC register and of the CPU memory controller registers for parsing.

The BMC determines whether second type memory error reporting information exists in the log information, for example 0X0A related information. If second type memory error reporting information occurs in the log information, the candidate fault memory corresponding to it is a normal memory.

The invention refers to memory error reporting information other than the first type and the second type as third type memory error reporting information. The BMC judges whether third type memory error reporting information exists in the log information; if it does, the candidate fault memory corresponding to it is taken as a fault memory to be analyzed. The fault memory to be analyzed needs to be debugged further using the CPU memory related register information that the BMC collects through JTAG. The BMC can also collect the memory PMIC firmware (FW) version through JTAG to confirm whether the memory fault signal was caused by an old FW version.
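The three-way classification of log entries can be illustrated with the sketch below. The codes are treated as opaque strings exactly as they appear in the text: 0X46/0X05 for the first type, and 0X0A for the second type (the latter is taken from the flow description of fig. 4). The log entry shape `(slot, code)` and the names `Verdict`, `classify` and `classify_log` are hypothetical; the actual BMC log format is not given.

```python
from enum import Enum


class Verdict(Enum):
    TARGET_FAULT = "target fault memory"
    NORMAL = "normal memory"
    TO_BE_ANALYZED = "fault memory to be analyzed"


FIRST_TYPE_CODES = {"0X46/0X05"}  # PMIC abnormality
SECOND_TYPE_CODES = {"0X0A"}      # memory is normal


def classify(code: str) -> Verdict:
    """Map one memory error code to a verdict."""
    code = code.strip().upper()
    if code in FIRST_TYPE_CODES:
        return Verdict.TARGET_FAULT
    if code in SECOND_TYPE_CODES:
        return Verdict.NORMAL
    return Verdict.TO_BE_ANALYZED  # third type: anything else


def classify_log(entries: list[tuple[str, str]]) -> dict[str, Verdict]:
    """Classify (slot, code) log entries; slot names double as slot info."""
    return {slot: classify(code) for slot, code in entries}
```

Because each entry carries its slot name, the slot information of the target fault memory falls out of the same pass over the log.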

In some alternative embodiments, the complex programmable device is coupled to the baseboard management controller via a second link, the baseboard management controller being coupled to the platform path controller via a second link;

Specifically, as shown in fig. 3, the complex programmable device (CPLD) is connected to the baseboard management controller (BMC) through a second link, and the baseboard management controller (BMC) is connected to the platform path controller (PCH) through a second link. The second link is, for example, I2C (Inter-Integrated Circuit, an integrated circuit bus, also referred to as IIC). The invention names the second link between the CPLD and the BMC I2C_BMC_CPLD, and the second link between the BMC and the PCH I2C_BMC_PCH. I2C_BMC_CPLD denotes the I2C bus between the motherboard BMC and the CPLD; through the CPLD, the BMC can acquire abnormal information of the memory as well as the power-on sequence information of the motherboard. I2C_BMC_PCH denotes the I2C bus between the BMC and the PCH; through the PCH, the BMC can acquire abnormal information from the memory state registers in the CPU, and the bus also serves as the channel for reading and writing CPU registers.

The complex programmable device communicates a memory failure signal to the baseboard management controller via a second link, such as: the CPLD transmits the memory fault signal to the BMC through the I2C_BMC_CPLD. The baseboard management controller sends a first message to the platform path controller over a second link, for example: the BMC sends a first message to the PCH via the I2C_BMC_PCH.

In some alternative embodiments, the central processor is coupled to the memory via a third link;

Specifically, the CPU is connected to the memory through a third link. As shown in fig. 3, CPU0 is connected to its memories through the third link, and CPU1 is connected to its memories through the third link. The third link is, for example, I3C (Improved Inter Integrated Circuit, an improved integrated circuit bus, also referred to as IIIC). Through the I3C bus that connects the motherboard CPU to the memory, the CPU can acquire abnormal information of the memory and can also read and write the memory PMIC register.

Taking channel A of CPU0 as an example, the invention names the third link of channel A of CPU0 I3C_CPU0_CHA-H, and channel A of CPU0 is connected to CPU0_CHA_DIMM1 and CPU0_CHA_DIMM2 through I3C_CPU0_CHA-H.

The central processor writes a Disable command to the power management chip register of the candidate fault memory through the I3C, for example: CPU0 writes a disable command to candidate failed memory CPU0_CHA-B_DIMM1 through I3C_CPU0_CHA-H of channel A.

In this embodiment, a method for determining, locating and alarming abnormal memory of PMIC is provided, which can solve the same problems and produce the same technical effects as the above steps S201 to S205, and as shown in fig. 4, the flow of the method includes the following steps:

the CPLD detects that a memory PWR_FAIL signal has gone low;

the CPLD monitors the server power sequence until it reaches the S5 state (server shutdown state);

the CPLD transmits this information to the BMC through I2C;

the BMC records the information in the system log and controls the motherboard to power down;

after 10 s, the BMC controls the motherboard to power on, completing the AC Cycle, and the PMIC write protection state is released;

the BMC informs the PCH of the possible abnormal positions through I2C, and the PCH passes them to the CPU through BIOS software;

the CPU writes Disable into the PMIC of each suspected memory through I3C, so that the memory is not powered on at startup;

the BMC controls the system to start up; the system can start normally, and the PMIC abnormal memory reports an error;

the BMC checks whether the memory error reported is 0X46/0X05; if so, the memory is confirmed to be the PMIC abnormal memory, the BMC collects the related abnormal register information from the CPU through JTAG, reports alarm information on the BMC, and records all collected log information;

if not, the BMC checks whether the memory error reported is 0X0A; if so, the memory reporting this type of error is judged to be a normal memory;

if not, the error is of another type and is left for further analysis, so that a non-PMIC abnormality does not affect normal startup.
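The ordering of the flow in fig. 4 can be sketched as an orchestration function. This is an illustration of the sequence only: every hardware action is an injected, hypothetical hook (`ac_cycle`, `candidates_for`, `write_disable`, `dc_power_on`, `read_log`), and none of these names comes from the description. The sketch assumes the fault has already been confirmed in the S5 state.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Hooks:
    """Hypothetical hooks standing in for BMC, PCH/BIOS and CPU actions."""
    ac_cycle: Callable[[], None]                 # power down, wait, power on
    candidates_for: Callable[[str], list[str]]   # signal -> candidate slots
    write_disable: Callable[[str], None]         # CPU writes Disable over I3C
    dc_power_on: Callable[[], None]              # BMC starts the system
    read_log: Callable[[], dict[str, str]]       # slot -> reported error code


def run_flow(signal: str, hooks: Hooks) -> dict[str, str]:
    """Run the steps in order and return the error code per candidate slot."""
    hooks.ac_cycle()                             # releases PMIC write protection
    candidates = hooks.candidates_for(signal)
    for slot in candidates:
        hooks.write_disable(slot)                # memory stays unpowered at start
    hooks.dc_power_on()
    log = hooks.read_log()
    # Only candidate memories are of interest when reading the log.
    return {slot: log[slot] for slot in candidates if slot in log}
```

The returned codes would then be checked against 0X46/0X05 and 0X0A as in the last steps of the flow.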

In this embodiment, a method for determining, locating and alarming a PMIC abnormal memory is provided. Without affecting the original functions of the motherboard, when the CPLD detects that a memory PWR_FAIL signal is low it actively notifies the BMC, which locates and analyzes the cause of the memory PMIC fault and provides the corresponding slot information. This improves the accuracy and completeness with which this type of problem is located, and this type of fault (Bug) is handled automatically with alarm information reported. It effectively solves the prior-art problem that the abnormal memory slot cannot be located directly, so that Debug and machine room maintenance cannot be carried out directly.

In this embodiment, a fault memory location device is further provided. The device is used to implement the foregoing embodiments and preferred embodiments, and what has already been described is not repeated here. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. While the device described in the following embodiments is preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.

The present embodiment provides a fault memory location device, as shown in fig. 5, including:

A releasing module 501, configured to release the write protection state of the power management chip register when the memory failure signal is received and the server is in a shutdown state;

a first determining module 502, configured to determine a candidate fault memory according to the memory fault signal;

a writing module 503, configured to write a disable command to a power management chip register of the candidate fault memory, where the disable command is configured to not power up the candidate fault memory when the server enters a power-on state;

an obtaining module 504, configured to obtain log information of the candidate fault memory when the server enters a power-on state;

the second determining module 505 is configured to determine, according to the log information, the target fault memory and slot information of the target fault memory from the candidate fault memories.

In some alternative embodiments, the second determining module 505 includes:

the judging unit is used for judging whether first type memory error reporting information exists in the log information;

the determining unit is used for determining the candidate fault memory corresponding to the first type memory error reporting information as the target fault memory under the condition that the first type memory error reporting information exists, and determining the slot position information of the target fault memory according to the log information.

In some alternative embodiments, the second determining module 505 includes:

the first unit is used for taking the candidate fault memory corresponding to the second type memory error reporting information in the log information as a normal memory;

and the second unit is used for taking the candidate fault memory corresponding to the error reporting information of the third type memory in the log information as the fault memory to be analyzed, wherein the fault memory to be analyzed is a candidate fault memory except the target fault memory and the normal memory.

In some alternative embodiments, the releasing module 501 includes:

the first control unit is used for controlling the target mainboard to perform alternating current power-down;

and the second control unit is used for controlling the target mainboard to carry out alternating current power-on after the preset time so as to finish the removal of the write-protection state of the power management chip register.

Further functional descriptions of the above respective modules and units are the same as those of the above corresponding embodiments, and are not repeated here.

The fault memory location device in this embodiment is presented in the form of functional units, where a unit refers to an ASIC (Application Specific Integrated Circuit), a processor and memory executing one or more software or firmware programs, and/or other devices that can provide the above functions.

The embodiment of the invention also provides computer equipment, which is provided with the fault memory positioning device shown in the figure 5.

Referring to fig. 6, fig. 6 is a schematic structural diagram of a computer device according to an alternative embodiment of the present invention. As shown in fig. 6, the computer device includes: one or more processors 10, memory 20, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components are communicatively coupled to each other using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executed within the computer device, including instructions stored in or on memory to display graphical information of the GUI on an external input/output device, such as a display device coupled to the interface. In some alternative embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories. Also, multiple computer devices may be connected, each providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). One processor 10 is illustrated in fig. 6.

The processor 10 may be a central processor, a network processor, or a combination thereof. The processor 10 may further include a hardware chip. The hardware chip may be an application specific integrated circuit, a programmable logic device, or a combination thereof. The programmable logic device may be a complex programmable logic device, a field programmable gate array, generic array logic, or any combination thereof.

The memory 20 stores instructions executable by the at least one processor 10, so that the at least one processor 10 performs the method shown in the above embodiments.

The memory 20 may include a storage program area that may store an operating system, at least one application program required for functions, and a storage data area; the storage data area may store data created according to the use of the computer device, etc. In addition, the memory 20 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some alternative embodiments, memory 20 may optionally include memory located remotely from processor 10, which may be connected to the computer device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

Memory 20 may include volatile memory, such as random access memory; the memory may also include non-volatile memory, such as flash memory, hard disk, or solid state disk; the memory 20 may also comprise a combination of the above types of memories.

The computer device also includes a communication interface 30 for the computer device to communicate with other devices or communication networks.

The embodiments of the present invention also provide a computer readable storage medium. The method according to the embodiments of the present invention described above may be implemented in hardware or firmware, or as computer code that can be recorded on a storage medium, or as computer code that is originally stored in a remote storage medium or a non-transitory machine readable storage medium, downloaded through a network, and stored in a local storage medium, so that the method described herein can be processed by such software stored on a storage medium using a general purpose computer, a special purpose processor, or programmable or special purpose hardware. The storage medium can be a magnetic disk, an optical disk, a read-only memory, a random access memory, a flash memory, a hard disk, a solid state disk or the like; further, the storage medium may also comprise a combination of memories of the kinds described above. It will be appreciated that a computer, processor, microprocessor controller or programmable hardware includes a storage element that can store or receive software or computer code that, when accessed and executed by the computer, processor or hardware, implements the methods illustrated by the above embodiments.

Although embodiments of the present invention have been described in connection with the accompanying drawings, various modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope of the invention as defined by the appended claims.

Claims

1. The fault memory positioning method is characterized by comprising the following steps:

under the condition that a memory fault signal is received and the server is in a shutdown state, releasing the write-protection state of the power management chip register;

determining candidate fault memories according to the memory fault signals;

writing a disable command into the power management chip register of the candidate fault memory, wherein the disable command is used for not powering on the candidate fault memory when the server enters a starting state;

under the condition that the server enters a starting state, acquiring log information of the candidate fault memory;

and determining a target fault memory and slot position information of the target fault memory from the candidate fault memories according to the log information.

2. The method of claim 1, wherein determining, from the candidate failed memories, the target failed memory and slot information of the target failed memory based on the log information, comprises:

judging whether first type memory error reporting information exists in the log information;

and under the condition that the first type of memory error reporting information exists, determining the candidate fault memory corresponding to the first type of memory error reporting information as the target fault memory, and determining the slot position information of the target fault memory according to the log information.

3. The method of claim 2, wherein after said determining said slot information for said target failed memory from said log information, said method further comprises:

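taking the candidate fault memory corresponding to the second type memory error reporting information in the log information as a normal memory;

and taking the candidate fault memory corresponding to third type memory error reporting information in the log information as a fault memory to be analyzed, wherein the fault memory to be analyzed is a candidate fault memory other than the target fault memory and the normal memory.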

4. A method according to any one of claims 1 to 3, wherein said releasing the write-protected state of the power management chip register comprises:

controlling a target mainboard to perform alternating current power-down;

and after the preset time, controlling the target mainboard to carry out alternating current power-on, and completing releasing the write-protection state of the power management chip register.

5. A fault memory location system, the system comprising: a complex programmable device, a baseboard management controller, a platform path controller, and a central processing unit;

the baseboard management controller is connected with the platform path controller and is used for sending a first message to the platform path controller and then forwarding the first message to the central processing unit by the platform path controller, wherein the first message is used for determining the candidate fault memory;

the CPU is connected with the memory and is used for writing a disable command into the power management chip register of the candidate fault memory, wherein the candidate fault memory is contained in the memory, and the disable command is used for not powering on the candidate fault memory when the server enters a starting state;

The baseboard management controller controls the server to perform direct current power-on, so that the server enters a starting state, log information of the candidate fault memory is obtained, and a target fault memory and slot position information of the target fault memory are determined from the candidate fault memory according to the log information.

6. The system of claim 5, wherein the baseboard management controller is connected to the central processing unit via a first link;

when the first type of memory error reporting information exists in the log information, the baseboard management controller is used for determining that the candidate fault memory corresponding to the first type of memory error reporting information is the target fault memory, acquiring related information of the power management chip register in the target fault memory from the central processing unit through the first link, determining the slot information of the target fault memory according to the log information, and generating alarm information according to the related information and the slot information;

When the second type of memory error reporting information exists in the log information, the baseboard management controller is used for taking the candidate fault memory corresponding to the second type of memory error reporting information as a normal memory;

the baseboard management controller is used for judging whether third type memory error reporting information exists in the log information;

7. The system of claim 5, wherein the complex programmable device is coupled to the baseboard management controller via a second link, the baseboard management controller being coupled to the platform path controller via the second link;

the complex programmable device transmits the memory fault signal to the baseboard management controller through the second link;

the baseboard management controller sends the first message to the platform path controller through the second link.

8. The system of claim 5, wherein the central processor is coupled to the memory via a third link;

and the central processing unit writes the disable command into the power management chip register of the candidate fault memory through the third link.

9. A fault memory location device, the device comprising:

the releasing module is used for releasing the write-protection state of the power management chip register under the condition that a memory fault signal is received and the server is in a shutdown state;

the first determining module is used for determining candidate fault memories according to the memory fault signal;

the writing module is used for writing a disable command into the power management chip register of the candidate fault memory, wherein the disable command is used for not powering on the candidate fault memory when the server enters a starting state;

the obtaining module is used for obtaining log information of the candidate fault memory under the condition that the server enters a starting state;

and the second determining module is used for determining a target fault memory and slot position information of the target fault memory from the candidate fault memories according to the log information.

10. A computer device, comprising:

a memory and a processor, the memory and the processor being communicatively connected to each other, the memory having stored therein computer instructions, the processor executing the computer instructions to perform the method of fault memory location of any of claims 1 to 4.

11. A computer readable storage medium having stored thereon computer instructions for causing a computer to perform the fault memory location method of any one of claims 1 to 4.