CN115904773A - Memory fault information collection method and device and storage medium - Google Patents


Info

Publication number
CN115904773A
Authority
CN
China
Prior art keywords
fault
memory
band controller
server
identifier
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211304380.9A
Other languages
Chinese (zh)
Inventor
张殿生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
XFusion Digital Technologies Co Ltd
Original Assignee
XFusion Digital Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by XFusion Digital Technologies Co Ltd filed Critical XFusion Digital Technologies Co Ltd
Priority to CN202211304380.9A priority Critical patent/CN115904773A/en
Publication of CN115904773A publication Critical patent/CN115904773A/en
Pending legal-status Critical Current

Landscapes

  • Retry When Errors Occur (AREA)

Abstract

The embodiments of the application provide a memory fault information collection method, a memory fault information collection device, and a storage medium, which relate to the field of data storage and can prevent fault information from going unreported. The method is applied to a server that includes a memory, an out-of-band controller, and an in-band controller, and comprises the following steps: while the server is running, the in-band controller performs fault detection on the memory and sends the detected fault information to the out-of-band controller; while the server is restarting, if the restart is a cold restart, the in-band controller determines whether newly added fault information exists in the memory, where newly added fault information is fault information that is not stored in the out-of-band controller; and if newly added fault information exists in the memory, the in-band controller sends it to the out-of-band controller.

Description

Memory fault information collection method and device and storage medium
Technical Field
The embodiments of the application relate to the field of data storage, and in particular to a memory fault information collection method, a memory fault information collection device, and a storage medium.
Background
As internet services grow more complex, the failure rate of server memory keeps rising, so how to collect memory fault information has become a key concern in the field.
Existing memory fault information collection methods cannot collect fault information that was recorded in the memory before the server hosting the memory underwent a cold restart, so part of the fault information in the memory is missed.
Disclosure of Invention
The embodiments of the application provide a memory fault information collection method, a memory fault information collection device, and a storage medium, which can prevent fault information from going unreported.
In order to achieve the above purpose, the embodiment of the present application adopts the following technical solutions:
In a first aspect, an embodiment of the present application provides a memory fault information collection method. The method is applied to a server that includes a memory, an out-of-band controller, and an in-band controller, and includes: while the server is running, the in-band controller performs fault detection on the memory and sends the detected fault information to the out-of-band controller; while the server is restarting, if the restart is a cold restart, the in-band controller determines whether newly added fault information exists in the memory, where newly added fault information is fault information that is not stored in the out-of-band controller; and if newly added fault information exists in the memory, the in-band controller sends it to the out-of-band controller.
In the memory fault information collection method provided by the embodiment of the application, the in-band controller performs fault detection on the memory while the server is running and sends the detected fault information to the out-of-band controller. While the server is restarting, if the restart is a cold restart, the in-band controller determines the newly added fault information in the memory and sends it to the out-of-band controller so that the out-of-band controller collects it, which prevents fault information from going unreported.
In a possible implementation manner, the memory includes a persistent memory DCPMM.
When the memory in the server is a DCPMM, the fault information stored in the memory is not reset even if the server undergoes a cold restart, so newly added fault information is not lost. The in-band controller can then determine the newly added fault information from the DCPMM and send it to the out-of-band controller, which collects it. This solves the problem of fault information going unreported.
In a possible implementation manner, the step in which the in-band controller determines, during a restart of the server, whether newly added fault information exists in the memory if the restart is a cold restart includes: during the restart of the server, if the restart is a cold restart and the asynchronous refresh (ADR) function of the DCPMM is enabled, the in-band controller determines whether newly added fault information exists in the memory.
In the memory fault information collection method provided by the embodiment of the application, during a restart of the server, if the restart is a cold restart and the ADR function of the DCPMM is enabled, the in-band controller determines the newly added fault information in the memory and sends it to the out-of-band controller so that the out-of-band controller collects it. This prevents the in-band controller from executing the method when the server cold restarts with the ADR function of the DCPMM disabled (in that case the cold restart is a normal cold restart rather than one caused by a downtime fault), which saves computing resources of the in-band controller.
In one possible implementation manner, during a restart of the server, if the ADR function of the DCPMM is disabled and the restart type of the server is a downtime restart, the in-band controller determines whether newly added fault information exists in the memory.
In the memory fault information collection method provided by the embodiment of the application, during a restart of the server, if the ADR function of the DCPMM is disabled (that is, on a hot restart triggered by downtime), the in-band controller determines the newly added fault information in the memory and sends it to the out-of-band controller so that the out-of-band controller collects it, which prevents fault information from being missed when the server goes down and hot restarts.
In a possible implementation manner, during a restart of the server, the method further includes: the in-band controller obtains the enablement state of the ADR function from the DCPMM, where the state is either enabled or disabled.
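The restart conditions described in the preceding implementation manners can be sketched as a small decision function. This is only an illustrative sketch: the names `RestartType` and `should_check_for_new_faults` are hypothetical, and the result for combinations the text does not discuss (for example, a hot restart with ADR enabled) is an assumption.

```python
from enum import Enum


class RestartType(Enum):
    COLD = "cold"
    HOT = "hot"


def should_check_for_new_faults(restart: RestartType,
                                adr_enabled: bool,
                                caused_by_downtime: bool) -> bool:
    """Decide whether the in-band controller scans the memory for new faults."""
    # Case from the text: cold restart with the DCPMM ADR function enabled.
    if restart is RestartType.COLD and adr_enabled:
        return True
    # Case from the text: ADR disabled and the restart was caused by downtime.
    if not adr_enabled and caused_by_downtime:
        return True
    # Assumption: every other combination skips the check, which matches the
    # stated goal of saving in-band computing resources on a normal cold
    # restart with ADR disabled.
    return False
```

The function takes the ADR state as an argument because, per the text, the in-band controller reads that state from the DCPMM during the restart.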
In a possible implementation manner, the fault information includes a fault identifier, where the fault identifier is used to indicate a sequence of occurrence of a memory fault; the memory failure is a failure that has occurred in the memory of the server.
In a possible implementation manner, the step in which the in-band controller performs fault detection on the memory while the server is running and sends the detected fault information to the out-of-band controller includes: while the server is running, when a non-downtime fault occurs in the memory, the in-band controller sends the non-downtime fault information to the out-of-band controller, so that the out-of-band controller stores the fault identifier contained in the non-downtime fault information.
In the memory fault information collection method provided by the embodiment of the application, while the server is running, when a non-downtime fault occurs in the memory, the in-band controller sends the non-downtime fault information directly to the out-of-band controller, so that the out-of-band controller completes collection of the non-downtime fault. This improves the efficiency with which the out-of-band controller collects memory faults.
In one possible implementation manner, the determining, by the in-band controller, whether newly added fault information exists in the memory includes: the in-band controller obtains a first fault identifier from the out-of-band controller, where the first fault identifier is the fault identifier of a memory fault stored in the out-of-band controller; the in-band controller obtains a second fault identifier from a fault register of the memory, where the second fault identifier is the fault identifier of a memory fault that has occurred in the memory; and the in-band controller determines, according to the first fault identifier and the second fault identifier, whether newly added fault information is stored in the memory.
Because the DCPMM is a nonvolatile memory, the fault information stored in its fault register is not lost whether the server hosting the DCPMM undergoes a cold restart or a hot restart, or is running. The second fault identifier can therefore be obtained from the fault register even after a cold restart. The newly added fault information is then determined from the second fault identifier and the first fault identifier stored in the out-of-band controller, and the out-of-band controller collects it. This solves the problem of fault information going unreported.
In a possible implementation manner, the first fault identifier is the fault identifier of the most recent memory fault among the memory faults backed up by the out-of-band controller, and the second fault identifier is the fault identifier of the most recent memory fault that has occurred in the memory.
Compared with a scheme in which the out-of-band controller stores the fault identifiers of all collected memory faults, the memory fault information collection method provided by the embodiment of the application stores only the fault identifier of the memory fault most recently collected by the out-of-band controller, which reduces the storage resources occupied in the out-of-band controller.
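A minimal sketch of the comparison between the first and second fault identifiers follows. It assumes that fault identifiers are sequence numbers that increase in order of occurrence, and that `None` stands for "no identifier recorded yet". The function name `has_new_fault` is hypothetical, and the text itself does not prescribe a comparison rule.

```python
from typing import Optional


def has_new_fault(first_id: Optional[int], second_id: Optional[int]) -> bool:
    """Compare the latest identifier held out-of-band (first_id) with the
    latest identifier in the memory fault register (second_id)."""
    if second_id is None:
        # Assumption: an empty fault register means nothing to report.
        return False
    if first_id is None:
        # Assumption: nothing has been collected yet, so any fault is new.
        return True
    # Assumption: a larger sequence number means a later fault.
    return second_id > first_id
```

Storing only the latest identifier on each side keeps this check to a single comparison, which is consistent with the storage saving the text describes.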
In a second aspect, a memory fault information collection apparatus is provided. In one example, the apparatus may be an in-band controller. The apparatus includes a detection unit, a transceiver unit, and a determining unit. The detection unit is configured to perform fault detection on the memory while the server is running; the transceiver unit is configured to send the detected fault information to the out-of-band controller; the determining unit is configured to determine, during a restart of the server, whether newly added fault information exists in the memory if the restart is a cold restart, where newly added fault information is fault information that is not stored in the out-of-band controller; the transceiver unit is further configured to send the newly added fault information to the out-of-band controller if the newly added fault information is stored in the memory.
In a possible implementation manner, the memory includes a persistent memory DCPMM.
In a possible implementation manner, the determining unit is specifically configured to determine whether new failure information exists in the memory if the start mode of the server is a cold restart and the asynchronous refresh ADR function of the DCPMM is turned on in the process of restarting the server.
In a possible implementation manner, the determining unit is further configured to determine whether new failure information exists in the memory during a restart of the server if the ADR function of the DCPMM is turned off and the restart type of the server is a downtime restart.
In a possible implementation manner, the transceiver unit is configured to obtain the enablement state of the ADR function from the DCPMM during a restart of the server, where the state is either enabled or disabled.
In a possible implementation manner, the fault information includes a fault identifier, where the fault identifier is used to indicate a sequence of occurrence of the memory fault; the memory failure is a failure that has occurred in the memory of the server.
In a possible implementation manner, the transceiver unit is configured to send non-downtime fault information to the out-of-band controller when a non-downtime fault occurs in the memory during the operation of the server, so that the out-of-band controller stores a fault identifier in the non-downtime fault information.
In a possible implementation manner, the transceiver unit is configured to obtain a first fault identifier from an out-of-band controller; the first fault identification is a fault identification of a memory fault stored in the out-of-band controller; the receiving and sending unit is further configured to obtain a second fault identifier from a fault register of the memory; the second fault identifier is a fault identifier of a memory fault occurring in the memory; the determining unit is configured to determine whether the memory stores new failure information according to the first failure identifier and the second failure identifier.
In a possible implementation manner, the first fault identifier is the fault identifier of the most recent memory fault among the memory faults backed up by the out-of-band controller, and the second fault identifier is the fault identifier of the most recent memory fault that has occurred in the memory.
In a third aspect, a server is provided, including: memory, in-band controller and out-of-band controller. Wherein the in-band controller cooperates with the memory and the out-of-band controller to perform any one of the methods provided by the first aspect.
In a fourth aspect, a computer device is provided, including a processor and a memory, where the processor is connected to the memory. The memory is used for storing computer-executable instructions, and the processor executes the computer-executable instructions stored in the memory, thereby implementing any one of the methods provided by the first aspect. In one example, the processor here may be an in-band controller.
In a fifth aspect, a chip is provided, for example, the chip is a chip including an in-band controller, and the chip includes: a processor and interface circuitry; the interface circuit is used for receiving the code instruction and transmitting the code instruction to the processor; a processor for executing code instructions to perform any of the methods provided by the first aspect above.
In a sixth aspect, a computer-readable storage medium is provided, which stores computer-executable instructions that, when executed on a computer, cause the computer to perform any one of the methods provided by the first aspect.
In a seventh aspect, a computer program product is provided, which comprises computer executable instructions, when the computer executable instructions are executed on a computer, the computer is caused to perform any one of the methods provided by the first aspect.
For technical effects brought by any one of the design manners in the second aspect to the seventh aspect, reference may be made to the technical effects brought by different design manners in the first aspect, and details are not repeated here.
Drawings
Fig. 1 is an architecture diagram of a server according to an embodiment of the present application;
Fig. 2 is a first schematic flowchart of a method for collecting memory fault information according to an embodiment of the present application;
Fig. 3 is a schematic flowchart of a method for collecting memory fault information during server operation according to an embodiment of the present application;
Fig. 4 is a second schematic flowchart of a method for collecting memory fault information according to an embodiment of the present application;
Fig. 5 is a third schematic flowchart of a method for collecting memory fault information according to an embodiment of the present application;
Fig. 6 is a fourth schematic flowchart of a method for collecting memory fault information according to an embodiment of the present application;
Fig. 7 is a schematic diagram of a memory fault information collection apparatus according to an embodiment of the present application.
Detailed Description
The term "and/or" herein merely describes an association between associated objects and means that three relationships may exist. For example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone.
The terms "first" and "second," and the like, in the description and in the claims of the embodiments of the present application are used for distinguishing between different objects and not for describing a particular order of the objects. For example, the first fault identifier and the second fault identifier are used to distinguish different fault identifiers, rather than describing a specific order of the fault identifiers.
In the embodiments of the present application, the words "exemplary" or "such as" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "such as" is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, use of the word "exemplary" or "such as" is intended to present related concepts in a concrete fashion.
In the description of the embodiments of the present application, the meaning of "a plurality" means two or more unless otherwise specified. For example, a plurality of processing units refers to two or more processing units; a plurality of systems refers to two or more systems.
First, some concepts related to a memory failure information collection method, device and storage medium provided by the embodiments of the present application are explained.
Cold restart: the process in which an electronic device restarts after its power supply has been cut off. A cold restart executes the hardware self-test flow, in which the memory in the electronic device is initialized, so data stored in volatile memory is reset.
Hot restart: the process in which an electronic device restarts without being powered off. A hot restart does not execute the hardware self-test flow, so data stored in volatile memory is not reset.
Persistent memory (DCPMM): a nonvolatile, or persistent, memory that comprises a volatile memory area and a nonvolatile memory area. When the electronic device hosting the DCPMM undergoes a cold restart, the electronic device no longer supplies power to the DCPMM. The DCPMM is then powered by its own power supply and stores the data held in the volatile memory area (volatile data for short) into the nonvolatile memory area, which makes the volatile data persistent. This is called the asynchronous refresh (ADR) process of the DCPMM.
Fault register in memory: stores memory fault information. Memory fault information may include uncorrectable error (UCE) information and correctable error (CE) information. If the server hardware detects a UCE, the server may go down and restart. If the server hardware detects a CE, it may use part of its resources to repair the CE itself. When the number of CEs in the memory becomes too large (for example, greater than or equal to a preset value) and self-repair is no longer possible, a UCE may be generated, which causes the server to go down and restart.
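The relationship between CE counts, the preset value, and UCEs can be illustrated with a small classification function. This is a simplified sketch: the function name, the returned labels, and the threshold used in the usage example are hypothetical, and real hardware behavior is more nuanced than this three-way split.

```python
def expected_outcome(uce_detected: bool, ce_count: int, ce_threshold: int) -> str:
    """Classify the likely server behavior for the current error state."""
    # A detected UCE, or a CE count at or above the preset value (at which
    # point self-repair is assumed to fail and a UCE is generated), may lead
    # to the server going down and restarting.
    if uce_detected or ce_count >= ce_threshold:
        return "downtime_restart"
    # Below the preset value, the hardware may repair CEs on its own.
    if ce_count > 0:
        return "self_repair"
    return "healthy"
```

For instance, with a hypothetical preset value of 100, a count of 12 CEs would be classified as `self_repair`, while a count of 100 would be classified as `downtime_restart`.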
In the following examples, an embodiment of the present application provides a memory fault information collection method, which includes: while the server is running, the in-band controller performs fault detection on the memory and sends the detected fault information to the out-of-band controller; while the server is restarting, if the restart is a cold restart, the in-band controller determines the newly added fault information in the memory and sends it to the out-of-band controller so that the out-of-band controller collects it, which prevents fault information from going unreported.
Fig. 1 is a schematic diagram of a system architecture according to an embodiment of the present disclosure. The system architecture diagram is illustrative of a computer device. Referring to fig. 1, a hardware portion of the computer device mainly includes an in-band controller, an out-of-band controller, and a memory, and a software portion mainly includes an out-of-band management module, processor firmware, and an Operating System (OS) management unit. The out-of-band management module is located in an out-of-band controller, the OS management unit is located in an in-band controller, and the processor firmware may be located in the in-band controller (as shown in fig. 1), or the processor firmware may also be located in a firmware chip (not shown in fig. 1) outside the in-band controller, and the in-band controller may be a Central Processing Unit (CPU). The out-of-band management module may be a management unit of the non-service module. For example, the out-of-band management module may be completely independent of the operating system of the computer device and may communicate with the BIOS and OS (or OS management unit) through an out-of-band management interface of the computer device.
For example, the out-of-band management module may include a monitoring management unit outside the computer device, a management system in a management chip outside the in-band controller, a baseboard management controller (BMC), an intelligent management unit (IMU), a system management module (SMM), and the like. It should be noted that the specific form of the out-of-band management module is not limited in the embodiments of the present application, and the above description is only an example. In the following embodiments, the out-of-band management module is described using the BMC as an example.
Illustratively, the processor Firmware (also referred to as a processor Firmware program) may be Firmware, BIOS, management Engine (ME), microcode, or the like. It should be noted that, the specific form of the processor firmware in the embodiments of the present application is not limited, and the above is only an exemplary description. In the following embodiments, only the processor firmware is taken as an example of the BIOS.
It should be noted that the in-band controller described in the following embodiments performs certain steps (such as the following S110-S130), which can be understood as follows: the in-band controller calls the processor firmware to perform this step.
The memory, also called internal memory or main memory, includes volatile memory and nonvolatile memory. It is installed in memory slots on the motherboard of the computer device and communicates with the memory controller through a memory channel. The memory has a fault register, which stores fault information when a fault occurs in the memory.
It should be noted that the system architecture and the application scenario described in the embodiment of the present application are for more clearly illustrating the technical solution of the embodiment of the present application, and do not constitute a limitation to the technical solution provided in the embodiment of the present application, and as a person having ordinary skill in the art knows that along with the evolution of the system architecture and the appearance of a new service scenario, the technical solution provided in the embodiment of the present application is also applicable to similar technical problems.
The embodiment of the application provides a memory fault information collection method, which is applied to a server, wherein the server comprises a memory, an out-of-band controller and an in-band controller; as shown in fig. 2, the method includes: S110-S130.
S110, in the running process of the server, the in-band controller detects the fault of the memory and sends the detected fault information to the out-of-band controller.
The memory includes dynamic random access memory (DRAM) or DCPMM.
The fault information includes a fault identifier, which indicates the order in which memory faults occurred; a memory fault is a fault that has occurred in the memory of the server.
It should be noted that the fault identifier may be a fault sequence number of the memory fault, or may be a timestamp of when the memory fault occurred; the specific representation of the fault identifier is not specifically limited in the present application.
The specific implementation of S110, as shown in Fig. 3, includes S110a-S110e.
S110a, while the server is running, the in-band controller acquires non-downtime fault information when a non-downtime fault occurs in the memory.
It should be understood that, when a non-downtime fault occurs in the memory of the server in the operating phase, the in-band controller may directly acquire the non-downtime fault information.
It should be noted that the non-downtime fault is a memory fault that does not cause the downtime of the server; the memory failure is a failure that has occurred in the memory in the server. The non-downtime fault may be a read-write timeout fault in the CE fault.
S110b, the in-band controller sends the non-downtime fault information to the out-of-band controller.
It should be noted that, in essence, the in-band controller sends the detected fault information to the out-of-band controller so that the out-of-band controller collects it. In one implementation, the out-of-band controller collects the detected fault information in order to analyze and output it, so that the user can learn of and handle the fault promptly.
S110c, judging whether a first fault identifier exists locally by the out-of-band controller.
The first fault identifier is a fault identifier of a memory fault stored in the out-of-band controller, that is, the out-of-band controller stores the fault identifier in the collected fault information locally; the locally stored fault identifications are collectively referred to as first fault identifications.
When the first fault identifier exists in the out-of-band controller, the server hosting the out-of-band controller is not executing the memory fault information collection method provided in the embodiment of the present application for the first time, and the out-of-band controller executes the following S110d.
When the first fault identifier does not exist in the out-of-band controller, the server hosting the out-of-band controller is executing the memory fault information collection method provided in the embodiment of the present application for the first time, and the out-of-band controller executes the following S110e.
S110d, the out-of-band controller updates the first fault identifier according to the fault identifier of the non-downtime fault.
In a specific implementation of S110d, the out-of-band controller stores the fault identifier of the non-downtime fault in addition to the existing first fault identifier, so that the first fault identifier includes the fault identifier of the non-downtime fault.
For example, assume that the fault identifier is a fault sequence number that increases in the order in which faults occur. The out-of-band controller has already stored fault sequence number 1 and fault sequence number 2, so both are first fault identifiers, and the fault sequence number of the non-downtime fault is 3. The out-of-band controller then stores fault sequence number 3 in addition to fault sequence numbers 1 and 2, so that the locally stored fault sequence numbers are 1-3; at this point, the first fault identifier is fault sequence numbers 1-3.
Optionally, when the first fault identifier is the fault identifier of the most recent memory fault among the memory faults stored in the out-of-band controller, the specific implementation of S110d is: the out-of-band controller updates the first fault identifier to the fault identifier of the non-downtime fault.
For example, assume that the fault identifier is a fault sequence number that increases in the order in which faults occur. The out-of-band controller has collected fault 1 and fault 2, where fault sequence number 1 of fault 1 is smaller than fault sequence number 2 of fault 2, so fault sequence number 2 is determined as the first fault identifier and stored in the out-of-band controller. The fault sequence number of the non-downtime fault is 3, so the out-of-band controller updates the first fault identifier from fault sequence number 2 to fault sequence number 3; at this point, the first fault identifier is fault sequence number 3.
S110e, the out-of-band controller stores the fault identifier of the non-downtime fault.
For example, assume that the fault identifier is a fault sequence number that increases in the order in which faults occur. If the fault sequence number of the non-downtime fault is 3, the out-of-band controller stores fault sequence number 3 locally; at this point, the first fault identifier is fault sequence number 3.
In the method for collecting the memory fault information provided by the embodiment of the application, in the operation stage of the server, when the non-downtime fault occurs in the memory, the in-band controller directly sends the non-downtime fault information to the out-of-band controller, so that the out-of-band controller finishes collecting the non-downtime fault, and the collection efficiency of the out-of-band controller on the memory fault is improved.
S120, in the process of restarting the server, if the starting mode of the server is cold restart, the in-band controller determines whether newly added fault information exists in the memory.
A cold restart of the server includes: a normal cold restart of the server (e.g., a user manually powering off the server) and a cold restart of the server caused by a downtime fault.
The newly added fault information is fault information that is not stored in the out-of-band controller; that is, among the fault information of faults that have occurred in the memory, it is the fault information that the out-of-band controller has not collected, and whose fault identifier is therefore not stored in the out-of-band controller.
It should be noted that the newly added fault information includes at least one of the sequence number, the occurrence timestamp, and the occurrence location of the newly added fault, and may further include other fault information.
S130, if the memory stores the newly added fault information, the in-band controller sends the newly added fault information to the out-of-band controller.
It should be noted that, the in-band controller sends the newly added fault information to the out-of-band controller, so that the out-of-band controller collects the newly added fault information and locally stores the fault identifier of the newly added fault information.
According to the memory fault information collection method provided by the embodiment of the application, during operation of the server, the in-band controller performs fault detection on the memory and sends the detected fault information to the out-of-band controller; in the process of restarting the server, if the starting mode of the server is cold restart, the in-band controller determines the newly added fault information in the memory and sends it to the out-of-band controller so that the out-of-band controller collects it, thereby avoiding missed reporting of fault information.
Based on the memory fault information collection method described in S120-S130, the embodiments of the present application provide two specific embodiments of the method, namely Example 1 and Example 2 below:
Example 1
The embodiment of the application provides a memory fault information collection method, which is applied to the process of cold restart of a server; as shown in fig. 4, the method includes: S310-S360.
S310, the in-band controller acquires a first fault identifier from the out-of-band controller.
The first fault identifier is a fault identifier of a memory fault stored in the out-of-band controller; that is, the fault identifiers of all memory faults stored in the out-of-band controller are first fault identifiers.
For example, assume that fault 1 and fault 2 occur in the memory and that the out-of-band controller of the server where the memory is located has collected the information of faults 1-2, where the fault identifier of fault 1 is fault sequence number 1 and the fault identifier of fault 2 is fault sequence number 2; at this point, fault sequence numbers 1 and 2 are both first fault identifiers.
S310 may be implemented in either of two ways. In one implementation, the in-band controller sends the out-of-band controller a request to obtain the first fault identifier, and the out-of-band controller sends the first fault identifier to the in-band controller in response to the request. In another implementation, the out-of-band controller actively sends the first fault identifier to the in-band controller.
It should be noted that, the first fault flag stored in the out-of-band controller is persistent data, and even if a server where the out-of-band controller is located is in a cold restart, the stored first fault flag still exists in the out-of-band controller.
S320, the in-band controller obtains a second fault identifier from a fault register of the memory.
The memory is DCPMM.
The fault register is used to store the fault information of a fault as soon as the fault occurs in the memory; that is, after the memory fails, the fault information is first stored by the fault register in the memory and then collected by the out-of-band controller, where the fault information includes a fault sequence number, a fault occurrence timestamp, and a fault occurrence location. Accordingly, the fault information stored in the fault register includes either only fault information that has been collected by the out-of-band controller, or both fault information that has been collected by the out-of-band controller and fault information that has not.
It should be understood that the fault information not collected by the out-of-band controller is fault information that was missed because the server restarted after a downtime fault of the memory, before the out-of-band controller had collected it.
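The contents of the fault register described above can be modeled as a list of records. The sketch below is only illustrative: the `FaultRecord` type, its field names, and the sample values are assumptions and do not reflect the actual register layout of a DCPMM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultRecord:
    """One entry in a hypothetical model of the memory's fault register."""
    seq: int        # fault sequence number, increasing with occurrence order
    timestamp: int  # fault occurrence timestamp (illustrative unit)
    location: str   # fault occurrence location (illustrative format)


# Register contents after faults 1-4; all values are made up for illustration.
fault_register = [
    FaultRecord(1, 1000, "dimm0"),
    FaultRecord(2, 1010, "dimm0"),
    FaultRecord(3, 1020, "dimm1"),
    FaultRecord(4, 1030, "dimm1"),
]

# Sequence numbers the out-of-band controller has already collected (faults 1-2).
collected_seqs = {1, 2}

# The register holds both collected and not-yet-collected fault information.
collected = [r for r in fault_register if r.seq in collected_seqs]
uncollected = [r for r in fault_register if r.seq not in collected_seqs]

print([r.seq for r in collected])
print([r.seq for r in uncollected])
```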
The second fault identifier is a fault identifier of a memory fault occurring in the memory; that is, the fault identifiers of all memory faults stored in the fault register are the second fault identifiers.
For example, based on the example in S310, it is assumed that the memory has faults 3 and 4 after faults 1 and 2; then, the fault register stores the information of the fault 3 and the information of the fault 4 on the basis of the information of the fault 1 and the information of the fault 2; wherein, the fault identification of the fault 3 is a fault serial number 3 and the fault identification of the fault 4 is a fault serial number 4. At this time, the fault serial numbers 1-4 are all the second fault identifications.
S330, the in-band controller determines whether the memory has newly added fault information according to the first fault identifier and the second fault identifier.
It should be noted that the newly added fault flag is a fault flag indicating a fault that has occurred in the memory and is not collected in the out-of-band controller. That is to say, the newly added failure flag is a failure flag of a memory failure that is not stored in the out-of-band controller. Because the first failure identifier is an identifier of a memory failure collected by the out-of-band controller, and the second failure identifier is an identifier of a memory failure occurring in the server, the newly added failure identifier is a difference set of the second failure identifier and the first failure identifier.
For example, based on the example in S320 above, assume that neither fault 3 nor fault 4 has been collected by the out-of-band controller. At this point, the first fault identifiers include fault sequence number 1 and fault sequence number 2, and the second fault identifiers include fault sequence numbers 1, 2, 3, and 4; the in-band controller then determines the difference set of the second fault identifiers and the first fault identifiers (i.e., fault sequence number 3 and fault sequence number 4) as the newly added fault identifiers.
When the set of newly added fault identifiers is empty, it is determined that no newly added fault information exists in the memory, and the in-band controller executes an end action so that the server continues to start.
When the set of newly added fault identifiers is not empty, it is determined that newly added fault information exists in the memory, and the following S340 is executed.
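The determination in S330 amounts to a set difference followed by an emptiness check. A minimal sketch, assuming the identifiers are integer sequence numbers; the function name `new_fault_ids` is hypothetical:

```python
def new_fault_ids(first_ids, second_ids):
    """Return identifiers found in the fault register but not in the out-of-band controller."""
    return sorted(set(second_ids) - set(first_ids))


first_ids = [1, 2]         # stored in the out-of-band controller (S310)
second_ids = [1, 2, 3, 4]  # read from the fault register (S320)

added = new_fault_ids(first_ids, second_ids)
if added:
    # Non-empty difference: proceed to S340 and send the matching fault information.
    print("newly added fault identifiers:", added)
else:
    # Empty difference: nothing to report, the server continues to start.
    print("no newly added fault information")
```

Because the comparison is a set difference, it does not rely on the identifiers being ordered, only on their being unique per fault.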
S340, the in-band controller sends new failure information represented by the new failure identification to the out-band controller.
It should be noted that the newly added fault information includes at least one of the sequence number, the occurrence timestamp, and the occurrence location of the newly added fault, and may further include other fault information.
S350, the out-of-band controller receives the newly added fault information sent by the in-band controller.
Receiving the newly added fault information means, in effect, that the out-of-band controller has collected it. In one implementation, the out-of-band controller collects the newly added fault information in order to analyze it and output it, so that a user can learn of and handle the fault in time.
S360, the out-of-band controller stores the newly added fault information.
Storing the newly added fault information means, in essence, that the out-of-band controller stores the fault identifier in the newly added fault information (i.e., the newly added fault identifier) in addition to the first fault identifiers; at this point, the first fault identifiers include the newly added fault identifier.
For example, based on the example in S330 above, the out-of-band controller additionally stores fault sequence number 3 and fault sequence number 4 on top of the stored fault sequence numbers 1 and 2; at this point, the first fault identifiers include fault sequence numbers 1, 2, 3, and 4.
Because the DCPMM is a nonvolatile memory, the fault information stored in its fault register is not lost regardless of whether the server is cold restarted, hot restarted, or running; since the second fault identifier is obtained from the fault register, it can be obtained even after a cold restart of the server. The newly added fault information is then determined based on the second fault identifier and the first fault identifier stored in the out-of-band controller, and the out-of-band controller collects it; the problem of missed reporting of fault information is thereby solved.
Example 2
Because the probability of the memory failure is high, a large number of first failure identifiers may exist in the out-of-band controller, so that the occupancy rate of the storage resource of the out-of-band controller is high, and based on this, the embodiment of the present application provides another memory failure information collection method, as shown in fig. 5, the method includes: S410-S460.
S410, the in-band controller acquires a first fault identifier from the out-of-band controller.
The first failure flag is a failure flag of a memory failure occurring at the latest time among the memory failures stored in the out-of-band controller, that is, the first failure flag is a failure flag of a memory failure collected by the out-of-band controller at the latest time from the current time, where the memory failure is a failure that has occurred in the memory of the server.
It should be noted that the fault identifier is used to characterize the order in which memory faults occur. The fault identifier may be the fault sequence number of the memory fault or the fault occurrence timestamp of the memory fault; the specific form of the first fault identifier is not specifically limited in the present application.
For example, assume that fault 1 and fault 2 occur in the memory and that the out-of-band controller of the server where the memory is located has collected faults 1-2, where the fault identifier of fault 1 is fault sequence number 1, the fault identifier of fault 2 is fault sequence number 2, and the fault sequence number increases in the order in which faults occur. At this point, the larger of fault sequence number 1 and fault sequence number 2, namely fault sequence number 2, is the first fault identifier.
It should be noted that, the data stored in the out-of-band controller is persistent data, and even if the server where the out-of-band controller is located is in a cold restart, the data stored in the out-of-band controller still exists in the out-of-band controller.
S410 may be implemented in either of two ways. In one implementation, the in-band controller sends the out-of-band controller a request to obtain the first fault identifier, and the out-of-band controller sends the first fault identifier to the in-band controller in response to the request. In another implementation, the out-of-band controller actively sends the first fault identifier to the in-band controller.
S420, the in-band controller acquires a second fault identifier from a fault register of the memory.
The memory is DCPMM.
It should be noted that, the role of the fault register in S420 is consistent with the role of the fault register in S320, and for specific description of the role of the fault register in S420, reference may be made to the description of S320, which is not described herein again.
The second fault identifier is a fault identifier of a memory fault with the latest occurrence time in the memory; that is to say, the second failure flag is a failure flag indicating that a failure whose occurrence time is closest to the current time occurs in the memory.
For example, based on the example in S410, assume that faults 3 and 4 occur in the memory after faults 1 and 2, where the fault identifiers of faults 1, 2, 3, and 4 are fault sequence numbers 1, 2, 3, and 4 respectively, and the fault sequence number increases in the order in which faults occur. At this point, the larger of fault sequence number 3 and fault sequence number 4, namely fault sequence number 4, is the second fault identifier.
S430, the in-band controller determines whether the memory has newly added fault information according to the first fault identifier and the second fault identifier.
It should be noted that the newly added fault identifier is the fault identifier of a fault that has occurred in the memory and has not been collected by the out-of-band controller. That is, the memory fault represented by the newly added fault identifier occurred later than the memory fault represented by the first fault identifier.
S430 is specifically implemented as follows: according to the difference between the second fault identifier and the first fault identifier, the fault identifiers in the fault register of memory faults that occurred later than the first fault are determined as newly added fault identifiers, where the first fault is the memory fault represented by the first fault identifier. Specifically:
When the second fault identifier and the first fault identifier are a second fault sequence number and a first fault sequence number respectively, the fault sequence numbers in the fault register of the memory that are greater than the first fault sequence number and less than or equal to the second fault sequence number are taken as the newly added fault sequence numbers (i.e., the newly added fault identifiers).
For example, assume that neither fault 3 nor fault 4 has been collected by the out-of-band controller, and that the fault sequence numbers of faults 1, 2, 3, and 4 are 1, 2, 3, and 4 respectively, so that the first fault identifier is fault sequence number 2 and the second fault identifier is fault sequence number 4. The difference between fault sequence number 4 and fault sequence number 2 is 2; that is, fault sequence number 3 of fault 3 and fault sequence number 4 of fault 4 are determined as the newly added fault sequence numbers (i.e., the newly added fault identifiers).
And the newly added fault identification is used for representing the newly added fault information.
When the set of newly added fault identifiers is empty, it is determined that no newly added fault information exists in the memory, and the in-band controller executes an end action so that the server continues to start.
When the set of newly added fault identifiers is not empty, it is determined that newly added fault information exists in the memory, and the in-band controller executes the following S440.
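When only the latest identifiers are compared, S430 reduces to selecting the register entries whose sequence numbers lie in the half-open range above the first sequence number and up to the second. A minimal sketch, assuming integer sequence numbers; the function name `new_fault_seqs` is hypothetical:

```python
def new_fault_seqs(first_seq, second_seq, register_seqs):
    """Sequence numbers in the fault register with first_seq < seq <= second_seq."""
    return sorted(s for s in register_seqs if first_seq < s <= second_seq)


register_seqs = [1, 2, 3, 4]     # sequence numbers held in the fault register
first_seq = 2                    # latest fault known to the out-of-band controller
second_seq = max(register_seqs)  # latest fault that occurred in the memory

added = new_fault_seqs(first_seq, second_seq, register_seqs)
print(added)
```

If the first and second sequence numbers are equal, the range is empty and the in-band controller simply lets the server continue to start.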
S440, the in-band controller sends newly added fault information represented by the newly added fault identification to the out-of-band controller.
S450, the out-of-band controller receives the newly added fault information sent by the in-band controller.
It should be noted that the implementation manners of the above S440-S450 are consistent with the implementation manners of the above S340-S350, and for the specific description of the S440-S450, reference may be made to the related description of the above S340-S350, and details are not described herein again.
S460, the out-of-band controller updates the first fault identifier to the second fault identifier.
For example, based on the example in S430 above, the out-of-band controller updates the currently stored fault sequence number 2 to fault sequence number 4. At this point, the first fault identifier stored in the out-of-band controller is fault sequence number 4.
Compared with a scheme in which the fault identifiers of all collected memory faults are stored in the out-of-band controller, the memory fault information collection method provided by this embodiment stores only the fault identifier of the memory fault most recently collected by the out-of-band controller, thereby reducing the occupation of the out-of-band controller's storage resources.
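The steps S410-S460 can be put together in a short sketch that runs two consecutive cold restarts against the same register contents. The class `OutOfBandController`, its methods, and the function `on_cold_restart` are hypothetical names chosen for illustration; for brevity the sketch triggers the S460 update from the same function, whereas in the method it is the out-of-band controller that performs the update.

```python
class OutOfBandController:
    """Hypothetical stand-in that persists only the latest collected sequence number."""

    def __init__(self, first_seq):
        self.first_seq = first_seq  # survives restarts (persistent data)
        self.collected = []         # fault sequence numbers received so far

    def receive(self, seqs):
        # S450: collect the newly added fault information.
        self.collected.extend(seqs)

    def update_first(self, second_seq):
        # S460: replace the stored identifier instead of appending to a list.
        self.first_seq = second_seq


def on_cold_restart(oob, register_seqs):
    """Sketch of S410-S460 for one cold restart; returns the newly added sequence numbers."""
    if not register_seqs:
        return []
    first_seq = oob.first_seq        # S410
    second_seq = max(register_seqs)  # S420
    added = sorted(s for s in register_seqs if first_seq < s <= second_seq)  # S430
    if added:
        oob.receive(added)           # S440 and S450
        oob.update_first(second_seq) # S460
    return added


oob = OutOfBandController(first_seq=2)
first_restart = on_cold_restart(oob, [1, 2, 3, 4])
second_restart = on_cold_restart(oob, [1, 2, 3, 4])
print(first_restart, second_restart, oob.first_seq)
```

After the first restart the stored identifier has advanced, so the second restart finds nothing new and no fault is reported twice, while the out-of-band controller still holds a single identifier.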
Optionally, with reference to fig. 2, an embodiment of the present application provides another memory fault information collection method, where the method is applied to a server that includes a memory, an out-of-band controller, and an in-band controller; as shown in fig. 6, the method includes the following steps.
S610, in the operation process of the server, the in-band controller detects the fault of the memory and sends the detected fault information to the out-of-band controller.
The memory is DCPMM.
It should be noted that the implementation manner of S610 is consistent with the implementation manner of S110, and for the specific description of S610, reference may be made to the related description of S110, which is not described herein again.
S620, the in-band controller acquires the on state of the ADR function of the DCPMM from the DCPMM.
It should be noted that S620 may be implemented in either of two ways. In one implementation, the in-band controller sends the DCPMM a request to obtain the on state of its ADR function, and the DCPMM sends the on state of its ADR function to the in-band controller in response to the request. In another implementation, the DCPMM actively sends the on state of its ADR function to the in-band controller. The embodiment of the present application does not limit the specific manner in which the in-band controller acquires the on state of the ADR function of the DCPMM.
The on state of the ADR function of the DCPMM is either on or off.
It should be noted that when the ADR function of the DCPMM is on and a downtime fault causes a hot restart of the server, the ADR function switches the restart mode of the server from hot restart to cold restart; that is, when the ADR function of the DCPMM is on, the server only ever undergoes cold restarts. When the ADR function of the DCPMM is off, it does not switch the restart mode of the server.
When the ADR function of the DCPMM is in the on state, the in-band controller performs S630 and S650 described below.
When the ADR function of the DCPMM is in the off state, the in-band controller performs the following S640-S650.
S630, in the process of restarting the server, if the starting mode of the server is cold restarting, the in-band controller determines whether the memory has newly added failure information.
It should be understood that, when the ADR function of the DCPMM is on, a cold restart of the server includes: a normal cold restart of the server (e.g., a user manually powering off the server) and a cold restart of the server caused by a downtime fault (e.g., a memory UCE).
It should be noted that, the implementation manner of determining whether the memory has the new failure information in S630 refers to the specific implementation manner of S310 to S330 or S410 to S430, and details thereof are not described here.
S640, in the process of restarting the server, if the restart type of the server is a downtime restart, the in-band controller determines whether newly added fault information exists in the memory.
It should be noted that, when the server is down and restarted in the case that the ADR function of the DCPMM is in the off state, the startup mode of the server is hot restart.
It should be noted that, the implementation manner of determining whether the memory has the new failure information in S640 refers to the specific implementation manner of S310 to S330 or S410 to S430, and is not described herein again.
Optionally, in the process of restarting the server, if the ADR function of the memory DCPMM is turned off and the restart type of the server is non-downtime restart, the in-band controller executes an end action to continue restarting the server.
S650, if newly added fault information is stored in the memory, the in-band controller sends the newly added fault information to the out-of-band controller.
The implementation manner in which the in-band controller in S650 sends the new failure information to the out-of-band controller refers to the specific implementation manner of S340-S360 or S440-S460, which is not described herein again.
After the in-band controller completes the above-mentioned S650, the in-band controller executes an end operation to continue the startup of the server.
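The branching among S620-S650 can be summarized as a single decision function. This is a sketch under stated assumptions: the function name and the string values `"cold"`, `"hot"`, `"downtime"`, and `"non-downtime"` are illustrative encodings, not values defined by the application.

```python
def should_check_for_new_faults(adr_on, start_mode, restart_type):
    """Decide whether the in-band controller looks for newly added fault information.

    adr_on:       on state of the DCPMM's ADR function, as obtained in S620
    start_mode:   "cold" or "hot"
    restart_type: "downtime" or "non-downtime"
    """
    if adr_on:
        # S630: with ADR on, downtime restarts have been switched to cold restarts,
        # so every cold restart is checked.
        return start_mode == "cold"
    # S640: with ADR off, only a downtime restart triggers the check;
    # a non-downtime restart goes straight to the end action.
    return restart_type == "downtime"


print(should_check_for_new_faults(True, "cold", "non-downtime"))
print(should_check_for_new_faults(False, "hot", "downtime"))
print(should_check_for_new_faults(False, "cold", "non-downtime"))
```

When the function returns true, the in-band controller goes on to determine the newly added fault information and, if any exists, sends it to the out-of-band controller as in S650.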
Optionally, the memory failure information collection methods corresponding to S310-S360 and S410-S460 may also be applied to the operation stage of the server.
In the memory fault information collection method provided by this embodiment, during a restart of the server, the in-band controller determines the newly added fault information in the memory and sends it to the out-of-band controller for collection in two cases: when the ADR function of the DCPMM is on and the starting mode of the server is cold restart, or when the ADR function of the DCPMM is off and the restart type of the server is a downtime restart. As a result, when the ADR function of the DCPMM is off and the server undergoes a normal cold restart, the in-band controller does not execute the collection method, which saves the computing resources of the in-band controller.
The scheme provided by the embodiments of the application has been described above mainly from the perspective of the method. To implement the above functions, the memory fault information collection device (e.g., the in-band controller) includes a hardware structure and/or a software module corresponding to each function. Those skilled in the art will readily appreciate that the illustrative units and algorithm steps described in connection with the embodiments disclosed herein may be implemented as hardware or as a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the particular application and the design constraints of the solution. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementations should not be considered to go beyond the scope of the present application.
According to the method, the memory failure information collection device (such as an in-band controller) is exemplarily divided into the functional modules, for example, the memory failure information collection device may include the functional modules corresponding to the functional partitions, or may integrate two or more functions into one processing module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. It should be noted that, in the embodiment of the present application, the division of the module is schematic, and is only one logic function division, and there may be another division manner in actual implementation.
Fig. 7 is a schematic structural diagram of a memory fault information collection device; the memory fault information collection device includes: a checking unit 701, a transceiver unit 702, and a determining unit 703.
The checking unit 701 is configured to perform fault detection on the memory during operation of the server; the transceiver unit 702 is configured to send the detected fault information to the out-of-band controller; for example, step S110 in the above-described method embodiment is performed.
The determining unit 703 is configured to determine whether new failure information exists in the memory if the start mode of the server is a cold restart mode in the process of restarting the server; for example, step S120 in the above-described method embodiment is performed.
The transceiver unit 702 is further configured to send the newly added failure information to the out-of-band controller if the newly added failure information is stored in the memory; for example, step S130 in the above-described method embodiment is performed.
Optionally, the determining unit 703 is configured to, in the process of restarting the server, determine whether newly added fault information exists in the memory if the starting mode of the server is cold restart and the asynchronous DRAM refresh (ADR) function of the DCPMM is on; for example, step S630 in the above-described method embodiment is performed.
Optionally, the determining unit 703 is configured to, in the process of restarting the server, determine whether newly added fault information exists in the memory if the ADR function of the DCPMM is off and the restart type of the server is a downtime restart; for example, step S640 in the above-described method embodiment is performed.
Optionally, the transceiver unit 702 is further configured to obtain the on state of the ADR function from the DCPMM during the server restart process; for example, step S620 in the above-described method embodiment is performed.
Optionally, the transceiver unit 702 is configured to send non-downtime fault information to the out-of-band controller when a non-downtime fault occurs in the memory during operation of the server; for example, step S110b in the above-described method embodiment is performed.
Optionally, the transceiver unit 702 is configured to obtain a first fault identifier from the out-of-band controller; for example, step S310 or S410 in the above-described method embodiment is performed.
The transceiver unit 702 is further configured to obtain a second fault identifier from a fault register of the memory; for example, step S320 or S420 in the above-described method embodiment is performed.
The determining unit 703 is configured to determine whether the memory stores newly added failure information according to the first failure identifier and the second failure identifier; for example, step S330 or S430 in the above-described method embodiment is performed.
For the detailed description of the above alternative modes, reference may be made to the foregoing method embodiments, which are not described herein again. In addition, for explanation and description of beneficial effects of any of the memory fault information collection devices provided above, reference may be made to the corresponding method embodiments described above, and details are not repeated.
The embodiment of the application also provides a computer device, which includes a memory, an in-band controller, and an out-of-band controller. The embodiment of the application does not limit the specific form of the computer device. For example, the computer device may be a terminal device or a network device. The terminal device may be referred to as a terminal, user equipment (UE), terminal device, access terminal, subscriber unit, subscriber station, mobile station, remote terminal, mobile device, user terminal, wireless communication device, user agent, or user device. The terminal device may be a mobile phone, an augmented reality (AR) device, a virtual reality (VR) device, a tablet computer, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (PDA), or the like. The network device may be a server or the like. The server may be one physical or logical server, or two or more physical or logical servers that share different responsibilities and cooperate with each other to realize the functions of the server.
The embodiment of the application also provides a server, which comprises a memory, an out-of-band controller and an in-band controller, wherein the memory, the out-of-band controller and the in-band controller are coupled; the memory is used to store computer program code, and the in-band controller cooperates with the memory and the out-of-band controller to perform any of the methods provided in fig. 2-6 above.
An embodiment of the present application further provides a computer device, including: the device comprises a processor and a memory, wherein the processor is connected with the memory. The memory is used for storing computer-executable instructions, and the processor executes the computer-executable instructions stored by the memory, thereby implementing any one of the methods provided in fig. 2-6 above. In one example, the computer device may be a server. In one example, the processor may be an in-band processor.
Embodiments of the present application further provide a computer-readable storage medium, on which a computer program is stored, and when the computer program runs on a computer, the computer is caused to execute the method performed by any one of the computer devices provided above.
For the explanation and the description of the beneficial effects of any of the computer-readable storage media provided above, reference may be made to the corresponding embodiments described above, and details are not repeated here.
An embodiment of the present application further provides a chip, for example, a chip containing processor firmware. The chip integrates a control circuit and one or more ports for implementing the functions of the computer device described above. Optionally, for the functions supported by the chip, reference may be made to the foregoing description; details are not repeated here. Those skilled in the art will appreciate that all or some of the steps of the above embodiments may be implemented by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium. The storage medium may be a read-only memory, a random access memory, or the like. The processing unit or processor may be a central processing unit, a general-purpose processor, an application-specific integrated circuit (ASIC), a digital signal processor (DSP), a field-programmable gate array (FPGA) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof.
The embodiments of the present application also provide a computer program product containing instructions which, when executed on a computer, cause the computer to perform any one of the methods in the above embodiments. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the present application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, hard disk, or magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., an SSD), among others.
It should be noted that the devices provided in the embodiments of the present application for storing computer instructions or computer programs, such as, but not limited to, the above memories, computer-readable storage media, and communication chips, are all non-volatile.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented using a software program, the embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a magnetic tape), an optical medium (e.g., a digital video disc (DVD)), or a semiconductor medium (e.g., a solid state drive (SSD)), among others.
Through the above description of the embodiments, it is clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be completed by different functional modules according to needs, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the above described functions. For the specific working processes of the system, the apparatus and the unit described above, reference may be made to the corresponding processes in the foregoing method embodiments, and details are not described here again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division into modules or units is only a logical division, and other divisions are possible in an actual implementation; for instance, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as a stand-alone product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the present application in essence, or the part that contributes to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform all or some of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes a flash memory, a removable hard disk, a read-only memory, a random access memory, a magnetic disk, an optical disc, or the like.
The above description is only an embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions within the technical scope disclosed in the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A memory fault information collection method, applied to a server, wherein the server comprises a memory, an out-of-band controller and an in-band controller, and the method comprises:
during operation of the server, the in-band controller performs fault detection on the memory and sends detected fault information to the out-of-band controller;
during a restart of the server, if the startup mode of the server is a cold restart, the in-band controller determines whether newly added fault information exists in the memory, wherein the newly added fault information is fault information that does not exist in the out-of-band controller; and
if the newly added fault information exists in the memory, the in-band controller sends the newly added fault information to the out-of-band controller.
2. The method of claim 1, wherein
the memory comprises a persistent memory (DCPMM).
3. The method of claim 2, wherein the determining, by the in-band controller, whether newly added fault information exists in the memory if the startup mode of the server is a cold restart during the restart of the server comprises:
during the restart of the server, if the startup mode of the server is a cold restart and the asynchronous DRAM refresh (ADR) function of the DCPMM is enabled, the in-band controller determines whether newly added fault information exists in the memory.
4. The method of claim 2 or 3, wherein the method further comprises:
during the restart of the server, if the ADR function of the DCPMM is disabled and the restart type of the server is a crash (downtime) restart, the in-band controller determines whether newly added fault information exists in the memory.
5. The method of any one of claims 2 to 4, wherein during the restart of the server, the method further comprises:
the in-band controller obtains the enabled state of the ADR function from the DCPMM, wherein the enabled state of the ADR function is either enabled or disabled.
6. The method of any one of claims 1 to 5, wherein
the fault information comprises a fault identifier, the fault identifier indicates the order in which memory faults occurred, and a memory fault is a fault that occurs in the memory of the server.
7. The method of claim 6, wherein the performing, by the in-band controller, fault detection on the memory and sending detected fault information to the out-of-band controller during operation of the server comprises:
during operation of the server, when a non-crash (non-downtime) fault occurs in the memory, the in-band controller sends information about the non-crash fault to the out-of-band controller, so that the out-of-band controller stores the fault identifier contained in that information.
8. The method of claim 6 or 7, wherein the determining, by the in-band controller, whether newly added fault information exists in the memory comprises:
the in-band controller obtains a first fault identifier from the out-of-band controller, wherein the first fault identifier is a fault identifier of a memory fault stored in the out-of-band controller;
the in-band controller obtains a second fault identifier from a fault register of the memory, wherein the second fault identifier is a fault identifier of a memory fault that has occurred in the memory; and
the in-band controller determines, according to the first fault identifier and the second fault identifier, whether newly added fault information exists in the memory.
9. The method of claim 8, wherein
the first fault identifier is the fault identifier of the most recent memory fault among the memory faults stored by the out-of-band controller; and
the second fault identifier is the fault identifier of the most recent memory fault that has occurred in the memory.
10. A server comprising a memory, an out-of-band controller, and an in-band controller, the memory, the out-of-band controller, and the in-band controller being coupled; the in-band controller cooperates with the memory and the out-of-band controller to perform the method of any one of claims 1 to 9.
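The restart-time check recited in claims 1, 3, 4, 8 and 9 can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the claimed implementation: the class and function names (`RestartType`, `FaultRecord`, `OutOfBandController`, `Memory`, `should_check`, `find_new_faults`, `on_restart`) are hypothetical, fault identifiers are assumed to be integers that increase in the order the faults occur, and any combination of restart type and ADR state not named in claims 3 and 4 is assumed to skip the check.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class RestartType(Enum):
    WARM = "warm"
    COLD = "cold"
    CRASH = "crash"  # the "downtime restart" of claim 4


@dataclass
class FaultRecord:
    fault_id: int  # assumed: larger value means the fault occurred later
    description: str


@dataclass
class OutOfBandController:
    """Hypothetical stand-in for the out-of-band controller."""
    stored: List[FaultRecord] = field(default_factory=list)

    def latest_fault_id(self) -> Optional[int]:
        # "First fault identifier": latest fault already stored out of band.
        return max((r.fault_id for r in self.stored), default=None)

    def receive(self, record: FaultRecord) -> None:
        self.stored.append(record)


@dataclass
class Memory:
    """Hypothetical stand-in for the memory and its fault register."""
    fault_register: List[FaultRecord] = field(default_factory=list)
    adr_enabled: bool = True


def should_check(restart: RestartType, adr_enabled: bool) -> bool:
    # Claim 3: cold restart with the ADR function enabled.
    if restart is RestartType.COLD and adr_enabled:
        return True
    # Claim 4: crash restart with the ADR function disabled.
    if restart is RestartType.CRASH and not adr_enabled:
        return True
    # Assumption: all other combinations skip the check.
    return False


def find_new_faults(memory: Memory, oob: OutOfBandController) -> List[FaultRecord]:
    first_id = oob.latest_fault_id()
    # "Second fault identifier": latest fault recorded in the fault register.
    second_id = max((r.fault_id for r in memory.fault_register), default=None)
    if second_id is None:
        return []  # the register holds no faults at all
    if first_id is not None and second_id <= first_id:
        return []  # the out-of-band controller is already up to date
    new = [r for r in memory.fault_register
           if first_id is None or r.fault_id > first_id]
    return sorted(new, key=lambda r: r.fault_id)


def on_restart(restart: RestartType, memory: Memory,
               oob: OutOfBandController) -> int:
    """Run the restart-time check; return the number of records sent."""
    if not should_check(restart, memory.adr_enabled):
        return 0
    new = find_new_faults(memory, oob)
    for record in new:
        oob.receive(record)
    return len(new)
```

Comparing only the latest identifier on each side, as in claim 9, lets the in-band controller decide with two reads whether anything is missing, before it walks the fault register to collect the missing records.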
CN202211304380.9A 2022-10-24 2022-10-24 Memory fault information collection method and device and storage medium Pending CN115904773A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211304380.9A CN115904773A (en) 2022-10-24 2022-10-24 Memory fault information collection method and device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211304380.9A CN115904773A (en) 2022-10-24 2022-10-24 Memory fault information collection method and device and storage medium

Publications (1)

Publication Number Publication Date
CN115904773A true CN115904773A (en) 2023-04-04

Family

ID=86471841

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211304380.9A Pending CN115904773A (en) 2022-10-24 2022-10-24 Memory fault information collection method and device and storage medium

Country Status (1)

Country Link
CN (1) CN115904773A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116796229A (en) * 2023-06-21 2023-09-22 北京优特捷信息技术有限公司 Equipment fault detection method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
US8910172B2 (en) Application resource switchover systems and methods
US8245077B2 (en) Failover method and computer system
CN102394914A (en) Cluster brain-split processing method and device
CN102508734B (en) Operating system recovery method and intelligent equipment
CN105607973B (en) Method, device and system for processing equipment fault in virtual machine system
US12050778B2 (en) Data restoration method and related device
CN114116280B (en) Interactive BMC self-recovery method, system, terminal and storage medium
JPH086910A (en) Cluster type computer system
US20090138757A1 (en) Failure recovery method in cluster system
CN112631820A (en) Fault recovery method and device of software system
US20060212754A1 (en) Multiprocessor system
CN108512753B (en) A method and device for message transmission in a cluster file system
CN116724297A (en) Fault processing method, device and system
CN115904773A (en) Memory fault information collection method and device and storage medium
CN111342986B (en) Distributed node management method and device, distributed system and storage medium
CN102073523A (en) Method and device for implementing software version synchronization
CN116668269A (en) Arbitration method, device and system for dual-activity data center
CN106972963B (en) Service module starting control method and starting control method after crash restart
WO2022218346A1 (en) Fault processing method and apparatus
CN116668335A (en) Cluster service processing method, server and system
CN115391106A (en) Method, system and device for pooling backup resources
CN115794456A (en) PCIe link repair method and device and computing equipment
CN107480004B (en) Fault recovery method and device and computer equipment
CN111782515A (en) Web application state detection method and device, server and storage medium
CN118796515B (en) Application and message queue high-availability connection service system, method, device and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination