CN115509786A - Method, device, equipment and medium for reporting fault - Google Patents


Info

Publication number
CN115509786A
Authority
CN
China
Prior art keywords
fault
memory
reporting
register
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211190582.5A
Other languages
Chinese (zh)
Inventor
王跃
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd filed Critical Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202211190582.5A priority Critical patent/CN115509786A/en
Publication of CN115509786A publication Critical patent/CN115509786A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793 Remedial or corrective actions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08 Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F11/1008 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices
    • G06F11/1044 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices with specific ECC/EDC distribution

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application discloses a method, an apparatus, a device and a medium for reporting faults, and relates to the field of computer technology. The method is applied to a CPU (central processing unit) with a memory control chip and comprises the following steps: reading data from the data register corresponding to each memory channel through the register offset address; judging whether the value of the ECC error counter register in each data register is 0; if not, determining that the memory channel has no fault; if so, determining that the memory channel has a fault and reporting the fault in log form through the BMC (baseboard management controller), wherein faults are divided into memory UCE faults and memory CE alarms. The data of each memory channel is read by indirect addressing, whether a fault has occurred is determined from the value of the ECC error counter register, and a detected fault is reported in log form through the BMC. The server can thus actively report memory ECC faults, the fault type can be determined from the log, and the fault can be repaired promptly.

Description

Method, device, equipment and medium for reporting fault
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a medium for reporting a fault.
Background
At present, domestically produced Xinchuang (IT application innovation) servers are being vigorously promoted in China, and such servers are developing rapidly. The Phytium (FT) S2500 processor, which the CPU manufacturer developed on the ARMv8 architecture to support two-socket servers, is a first-generation product, and this CPU (Central Processing Unit) lacks a RAS (Reliability, Availability and Serviceability) fault-tolerance mechanism for actively raising memory fault alarms. Memory ECC (Error Checking and Correcting) faults fall mainly into two types. The first is the memory CE alarm, a correctable error: the memory controller of the CPU corrects the error and upper-layer software does not perceive it. The second is the memory UCE alarm, an uncorrectable error: its occurrence means that the kernel of the upper-layer OS crashes, causing the host to go down or hang. However, because the ECC check function is integrated into the CPU of the server as a memory control chip, the server cannot actively report memory ECC faults, so the fault type cannot be determined and maintenance cannot be performed in time.
In view of the above problems, how to actively report the memory ECC fault is a problem that those skilled in the art endeavor to solve.
Disclosure of Invention
The purpose of the application is to provide a method, an apparatus, a device and a medium for reporting faults, so that memory ECC faults are actively reported in log form, the fault type can be determined from the log, and faults can be repaired promptly.
In order to solve the above technical problem, the present application provides a method for reporting a fault, which is applied to a CPU having a memory control chip, and includes:
reading data from the data register corresponding to each memory channel through the register offset address;
judging whether the value of the ECC error counter register in each data register is 0;
if not, determining that the memory channel has no fault;
if yes, determining that the memory channel has a fault, and reporting the fault in log form through the BMC, wherein faults are divided into memory UCE faults and memory CE alarms.
Preferably, reading data from the data register corresponding to each memory channel through the register offset address includes:
writing the register offset address into the address register, and setting the signal that represents the read operation of each memory channel to 1;
reading data from the data register according to the read operation.
Preferably, reading data from the data register according to the read operation includes:
reading data according to the memory base address and the offset address of each memory channel.
Preferably, reporting the fault in log form through the BMC includes:
judging whether the offset address is greater than 0;
if yes, determining that a memory UCE fault is triggered, and reporting the memory UCE fault in log form;
if not, ending.
Preferably, before judging whether the offset address is greater than 0, the method further includes:
performing address conversion on the registers corresponding to the memory channels according to a preset relationship, wherein the preset relationship is the correspondence between the preset register address and the CPU.
Preferably, reporting the memory UCE fault in log form includes:
reporting, in log form, the memory UCE fault at the location CPUn_dimm_n.
Preferably, after the fault is reported in log form through the BMC, the method further includes:
outputting prompt information indicating that the fault was reported successfully.
In order to solve the above technical problem, the present application further provides a device for reporting a fault, which is applied to a CPU having a memory control chip, and includes:
the first reading module is used for reading data from the data register corresponding to each memory channel through the register offset address;
the first judging module is used for judging whether the value of the ECC error counter register in each data register is 0;
if not, a first determining module is triggered, which is used for determining that the memory channel has no fault;
if yes, a second determining module is triggered, which is used for determining that the memory channel has a fault and reporting the fault in log form through the BMC, wherein faults are divided into memory UCE faults and memory CE alarms.
In addition, the device further comprises the following modules.
For reading data from the data register corresponding to each memory channel through the register offset address:
the writing module is used for writing the register offset address into the address register and setting the signal that represents the read operation of each memory channel to 1;
the second reading module is used for reading data from the data register according to the read operation.
For reading data from the data register according to the read operation:
the third reading module is used for reading data according to the memory base address and the offset address of each memory channel.
For reporting the fault in log form through the BMC:
the second judging module is used for judging whether the offset address is greater than 0;
if yes, a third determining module is triggered, which determines that a memory UCE fault is triggered and reports the memory UCE fault in log form; if not, the process ends.
For the step before judging whether the offset address is greater than 0:
the address conversion module is used for performing address conversion on the registers corresponding to the memory channels according to a preset relationship, wherein the preset relationship is the correspondence between the preset register address and the CPU.
For reporting the memory UCE fault in log form:
the reporting module is used for reporting, in log form, the memory UCE fault at the location CPUn_dimm_n.
For the step after the fault is reported in log form through the BMC:
the output module is used for outputting prompt information indicating that the fault was reported successfully.
In order to solve the above technical problem, the present application further provides a device for reporting a fault, including:
a memory for storing a computer program;
a processor for implementing the steps of the above method for reporting a fault when executing the computer program.
To solve the above technical problem, the present application further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the steps of the above method for reporting a fault.
The method for reporting a fault provided by the application is applied to a CPU (central processing unit) with a memory control chip and comprises the following steps: reading data from the data register corresponding to each memory channel through the register offset address; judging whether the value of the ECC error counter register in each data register is 0; if not, determining that the memory channel has no fault; if so, determining that the memory channel has a fault and reporting the fault in log form through the BMC, wherein faults are divided into memory UCE faults and memory CE alarms. The data of each memory channel is read by indirect addressing, whether a fault has occurred is determined from the value of the ECC error counter register, and a detected fault is reported in log form through the BMC. The server thus actively reports memory ECC faults in log form, the fault type can be determined from the log, and the fault can be repaired promptly.
The application also provides an apparatus, a device and a medium for reporting faults, which have the same effects as above.
Drawings
In order to more clearly illustrate the embodiments of the present application, the drawings needed for the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings can be obtained by those skilled in the art without inventive effort.
Fig. 1 is a flowchart of a method for reporting a fault according to an embodiment of the present application;
Fig. 2 is a structural diagram of an apparatus for reporting a fault according to an embodiment of the present disclosure;
Fig. 3 is a block diagram of a device for reporting a fault according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only some embodiments of the present application, and not all embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without any creative effort belong to the protection scope of the present application.
The core of the application is to provide a method, an apparatus, a device and a medium for reporting faults, which can actively report memory ECC faults in log form so that the fault type can be determined from the log and the fault repaired promptly.
RAS is mainly implemented through the MCA mechanism and the AER mechanism. The MCA (Machine Check Architecture) mechanism can report, and possibly repair, errors of the system bus, ECC, parity, cache and so on, identify the fault source, and record the fault information in an MC Bank. Through the MCA mechanism, correctable and uncorrectable errors inside the CPU can be reported and recorded, and correctable hardware errors can be corrected. For uncorrectable errors, a warm restart is typically performed. The scope of the MCA covers all modules in the processor: core, uncore and IIO (via IOMCA). The AER (IIO Advanced Error Reporting) mechanism is responsible for detecting, recording and signaling errors of the sub-modules under the IIO modules; its scope covers all sub-modules under the IIO modules, such as the PCIe interfaces, DMI, the core logic of the IIO modules, and Intel VT-d.
Within RAS, MCA is the mechanism for detecting hardware errors (here, "machine" means hardware), such as system bus errors and ECC errors. It is implemented by a number of MSRs (Model Specific Registers), which are divided into two parts: one for configuration and the other for describing the hardware errors that occur. When the CPU detects an uncorrectable MCE (Machine Check Error), it triggers a Machine Check Exception. Software registers a handler for this exception; the handler collects the MCE error information by reading the MSRs and then restarts the system. Because an MCE can be fatal, the CPU may restart directly without any chance to complete the MCE handler; an uncorrectable MCE triggered inside the MCE handler may likewise cause a direct system restart. The CPU can also detect correctable MCEs, and when the number of correctable MCEs exceeds a certain threshold, a CMCI (Corrected Machine Check Error Interrupt) is triggered, which software can capture and handle. CMCI was added after MCA as an enhancement; before it, software could only learn of correctable MCEs by polling the related MSRs. The MCA manages errors in banks, and a set of global registers controls whether the MCA is enabled. Each bank is associated with one class of error source, such as CPU, memory, cache or chipset, and each bank can be controlled individually so that software can handle each bank in a specific way. Since the MCA samples errors in time windows, more than one error may be found at the end of a sample while only one interrupt or exception is triggered, so software must poll all banks during handling to ensure that every error is processed.
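The need to poll every bank, rather than stopping at the first error found, can be illustrated with a small simulation. This is only a sketch: the bank status words are made-up values, and the bit positions used for "valid" (bit 63) and "uncorrected" (bit 61) are an assumption modeled on the x86 machine-check status register layout, not something stated in this application.

```python
# Simulated MCA bank scan. Bit positions are assumptions for illustration.
VAL = 1 << 63   # status word holds a valid error record
UC = 1 << 61    # the recorded error was not corrected by hardware

def poll_banks(status_words):
    """Scan every bank and return (bank_index, kind) for each valid record."""
    found = []
    for bank, status in enumerate(status_words):
        if status & VAL:
            found.append((bank, "UCE" if status & UC else "CE"))
    return found

# One interrupt may cover several errors, so all banks are inspected.
banks = [0, VAL, 0, VAL | UC]   # hypothetical snapshot of four banks
events = poll_banks(banks)
print(events)
```

Scanning the full list means that the correctable error in bank 1 is not lost just because bank 3 also logged an error in the same sampling window.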
The RAS features of the memory mainly include Single Device Data Correction (SDDC), Double Device Data Correction (DDDC), and Adaptive Double DRAM Device Correction (ADDDC). These features are built on ECC.
The PCIe specification defines two levels of error reporting. The first is basic functionality that all PCIe devices must support. The second is optional and uses a dedicated set of registers to provide more detailed error information so that software can locate errors and analyze their causes; it is called Advanced Error Reporting (AER). The basic error reporting mechanism uses two sets of configuration registers (in configuration space): the PCI-compatible Registers and the additional PCI Express Capability Registers provided by the PCIe bus. The advanced error reporting mechanism (AER) uses a further dedicated set of configuration registers (in configuration space). AER provides more error information and helps software locate the error source and analyze the cause. Errors on the PCIe bus are classified into Correctable Errors and Uncorrectable Errors. Correctable errors can be recognized automatically by hardware and corrected or recovered automatically. Uncorrectable errors are further classified as Non-Fatal and Fatal. Non-fatal errors are typically handled directly by Device Specific Software (the device driver); the Link is recoverable, and data on the link may even be recovered without loss. Fatal errors can only be handled by System Software and generally require operations such as a reset, so data on the link is inevitably lost. The PCIe bus has three error reporting modes: Completion, which returns error information to the Requester through a status bit in the Completion; Poisoned Packet (also known as Error Forwarding), which informs the receiver that the Data Payload of the current TLP is corrupted; and Error Message, which reports the error information to the host.
ECC is a technology applied to memory modules that implements "error checking and correction". Before ECC appeared, another technique, parity, was most commonly used in memory. In a digital circuit the smallest unit of data is the "bit", which is also the smallest unit in memory; it represents the high and low signal levels of data as "1" and "0". Eight consecutive bits form a byte. In memory without parity, each byte has only 8 bits, and if one bit is stored incorrectly the stored data changes and the application may fail. In memory with parity, one extra bit is added to each byte (8 bits) for error detection. For example, if a byte stores the value (1, 0, 1, 0, 1, 0, 1, 1), the bits are summed (1+0+1+0+1+0+1+1 = 5). If the sum is odd, the check bit is defined as 1 for even parity and 0 otherwise; for odd parity the opposite holds. When the CPU reads the stored data back, it sums the 8 data bits again and checks whether the result is consistent with the check bit. If the two differ, an error has occurred. However, parity has a drawback: when an error is found, it cannot necessarily determine which bit is wrong, and so cannot necessarily correct it. The main function of memory with parity is therefore only to "find errors" rather than to correct them.
The characteristics of ECC are as follows: ECC can tolerate errors in memory and correct them, so that the system keeps operating normally without being interrupted by the error; ECC is capable of automatic correction, and can detect and correct error bits that parity cannot detect.
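The parity rule described above can be worked through in a few lines. The sketch below is illustrative only: the function names and the sample byte (eight bits containing five ones) are chosen for the example. It shows that a single flipped bit is detected, while two flipped bits cancel out and go unnoticed, which is the limitation that motivates ECC.

```python
def parity_bit(data_bits, even=True):
    """Return the check bit for a sequence of 0/1 data bits."""
    odd_count = sum(data_bits) % 2      # 1 when the number of ones is odd
    return odd_count if even else odd_count ^ 1

def check(data_bits, stored_parity, even=True):
    """True when the stored check bit still matches the data bits."""
    return parity_bit(data_bits, even) == stored_parity

byte = [1, 0, 1, 0, 1, 0, 1, 1]         # five ones, so the sum is odd
p = parity_bit(byte)                    # even parity: check bit is 1

one_flip = byte.copy()
one_flip[3] ^= 1                        # single-bit error: detected

two_flips = one_flip.copy()
two_flips[6] ^= 1                       # second error hides the first

print(p, check(one_flip, p), check(two_flips, p))
```

Parity reports only that the stored byte no longer matches its check bit; it carries no information about which bit changed.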
In order that those skilled in the art will better understand the disclosure, the following detailed description is given with reference to the accompanying drawings.
First, the method is applied to domestically produced servers, mainly to the Phytium S2500 CPU used in such servers. A memory control chip is integrated in the Phytium S2500 CPU and implements the functions related to ECC checking. The CPU is typically provided with 8 memory channels, named LUM0, LUM1, LUM2, LUM3, LUM4, LUM5, LUM6 and LUM7.
Fig. 1 is a flowchart of a method for reporting a fault according to an embodiment of the present disclosure. As shown in fig. 1, the method for reporting a fault is applied to a CPU having a memory control chip, and includes:
S10: reading data from the data register corresponding to each memory channel through the register offset address.
S11: judging whether the value of the ECC error counter register in each data register is 0.
If not, the process proceeds to step S12: determining that the memory channel has no fault.
If yes, the process proceeds to step S13: determining that the memory channel has a fault, and reporting the fault in log form through the BMC.
Faults are divided into memory UCE faults and memory CE alarms.
It should be noted that the memory control chip contains many registers of various kinds, including at least a data register and an address register, and each memory channel has its own data register and address register. The address register is denoted LMU_ADDR_REG and the data register is denoted LMU_DATA_REG.
The method for reporting a fault is applied to a CPU (central processing unit) with a memory control chip and comprises the following steps: reading data from the data register corresponding to each memory channel through the register offset address; judging whether the value of the ECC error counter register in each data register is 0; if not, determining that the memory channel has no fault; if so, determining that the memory channel has a fault and reporting the fault in log form through the BMC, wherein faults are divided into memory UCE faults and memory CE alarms. The data of each memory channel is read by indirect addressing, whether a fault has occurred is determined from the value of the ECC error counter register, and a detected fault is reported in log form through the BMC. The server thus actively reports memory ECC faults in log form, the fault type can be determined from the log, and the fault can be repaired promptly.
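The overall flow of steps S10 to S13 can be sketched as a loop over the memory channels. This is a minimal illustration rather than the applicant's implementation: `read_counter` and `report` are hypothetical callables standing in for the register read and the BMC log call, and the sketch assumes that a nonzero error count is what marks a channel as faulty (the usual meaning of an error counter; the translated text words the yes/no branches the other way round).

```python
NUM_CHANNELS = 8    # the CPU is described as having 8 memory channels

def scan_channels(read_counter, report):
    """Check each channel's ECC error counter and report faulty channels.

    read_counter(ch) -> int : hypothetical accessor for the counter value
    report(message)         : hypothetical hook that logs through the BMC
    """
    faulty = []
    for ch in range(NUM_CHANNELS):
        count = read_counter(ch)
        if count != 0:      # ASSUMPTION: nonzero count means a fault
            report(f"channel {ch}: ECC error count {count}")
            faulty.append(ch)
    return faulty

# Simulated counters: channels 2 and 5 have recorded errors.
fake_counters = {2: 3, 5: 1}
log = []
faulty = scan_channels(lambda ch: fake_counters.get(ch, 0), log.append)
print(faulty, log)
```

Passing the register access and the logging hook as parameters keeps the decision logic testable without hardware or a BMC.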
Reading data from the data register corresponding to each memory channel through the register offset address comprises the following steps:
writing the register offset address into the address register, and setting the signal that represents the read operation of each memory channel to 1;
reading data from the data register according to the read operation.
The signal that represents the read operation is BIT28. BIT28 must be set to 1 for all 8 memory channels of the CPU, after which the data in the data register can be read. The registers are located according to the following relationship:
LMU_ADDR_REG = base address + 0x0;
LMU_DATA_REG = base address + 0x8;
Since the CPU is generally provided with 8 channels, 0x0 to 0x8 represent the offset addresses of the 8 channels. Reading data from the data register according to the read operation comprises: reading data according to the memory base address and the offset address of each memory channel.
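The indirect read described above (write the target offset with BIT28 set to the address register, then read the data register) can be modeled as follows. This is a sketch under stated assumptions: `FakeLmu` is a hypothetical stand-in for one channel's register window, the base address 0x1000 is invented, and combining the offset and BIT28 in a single 32-bit write is an assumption about how the two steps fit together.

```python
ADDR_REG_OFF = 0x0      # LMU_ADDR_REG = base address + 0x0
DATA_REG_OFF = 0x8      # LMU_DATA_REG = base address + 0x8
READ_BIT = 1 << 28      # BIT28 marks a read operation

class FakeLmu:
    """Hypothetical model of one channel's address/data register pair."""
    def __init__(self, base, regs):
        self.base = base
        self.regs = regs    # internal registers keyed by offset
        self.mmio = {}      # last value written to each window address

    def write32(self, addr, value):
        self.mmio[addr] = value

    def read32(self, addr):
        if addr == self.base + DATA_REG_OFF:
            cmd = self.mmio.get(self.base + ADDR_REG_OFF, 0)
            if cmd & READ_BIT:
                # Strip BIT28 to recover the requested register offset.
                return self.regs.get(cmd & ~READ_BIT, 0)
        return 0

def indirect_read(dev, base, reg_offset):
    """Select a register through the address register, then read its value."""
    dev.write32(base + ADDR_REG_OFF, reg_offset | READ_BIT)
    return dev.read32(base + DATA_REG_OFF)

dev = FakeLmu(base=0x1000, regs={0x80: 7})   # 0x80: ECC error counter offset
value = indirect_read(dev, 0x1000, 0x80)
print(value)
```

On real hardware the two accessor methods would be replaced by memory-mapped I/O; the fake device only makes the sequence of operations observable.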
Further, logging faults by the BMC includes:
judging whether the offset address is larger than 0;
if yes, determining to trigger the UCE fault of the memory, and reporting the UCE fault of the memory in a log mode;
if not, ending.
Table 1 shows the correspondence between the registers and the base addresses.
Table 1. Correspondence between registers and base addresses
[Table 1 is provided as images in the original publication: Figure BDA0003869180300000081, Figure BDA0003869180300000091]
On the basis of the foregoing embodiment, as a more preferred embodiment, before determining whether the offset address is greater than 0, the method further includes:
and performing address conversion on the registers corresponding to the memory channels according to a preset relationship, wherein the preset relationship is the corresponding relationship between the preset register address and the CPU.
Meanwhile, the memory UCE fault at the location CPUn_dimm_n is reported in log form. The log is a SEL log.
The preset register is ECCERRCNT. Table 2 lists the ECC registers:
Table 2. ECC register list
Register name   Size (bits)   Offset   Description
ECCCFG0         32            0x70     ECC configuration register 1
ECCCFG1         32            0x74     ECC configuration register 2
ECCSTAT         32            0x78     ECC status register
ECCCLR          32            0x7c     ECC clear register
ECCERRCNT       32            0x80     ECC error counter register
ECCCADDR0       32            0x84     ECC corrected error address register 0
ECCCADDR1       32            0x88     ECC corrected error address register 1
ECCBITMASK0     32            0x98     ECC corrected data mask register 0
ECCBITMASK1     32            0x9c     ECC corrected data mask register 1
ECCBITMASK2     32            0xa0     ECC corrected data mask register 2
ECCUADDR0       32            0xa4     ECC uncorrected error address register 0
ECCUADDR1       32            0xa8     ECC uncorrected error address register 1
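For software that walks these registers, the offsets in Table 2 can be captured as named constants. The sketch below simply transcribes the table into a Python enumeration; the class name `EccReg` and the helper `absolute` are hypothetical, and the base address in the example is invented.

```python
from enum import IntEnum

class EccReg(IntEnum):
    """Register offsets transcribed from Table 2 (all registers are 32-bit)."""
    ECCCFG0 = 0x70
    ECCCFG1 = 0x74
    ECCSTAT = 0x78
    ECCCLR = 0x7C
    ECCERRCNT = 0x80
    ECCCADDR0 = 0x84
    ECCCADDR1 = 0x88
    ECCBITMASK0 = 0x98
    ECCBITMASK1 = 0x9C
    ECCBITMASK2 = 0xA0
    ECCUADDR0 = 0xA4
    ECCUADDR1 = 0xA8

def absolute(base, reg):
    """Address of a register given a (hypothetical) controller base address."""
    return base + reg

print(hex(EccReg.ECCERRCNT), hex(absolute(0x1000, EccReg.ECCUADDR0)))
```

Using an enumeration lets a log message print the register name (for example `EccReg(0x80).name`) instead of a bare offset.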
On the basis of the above embodiment, as a more preferred embodiment, after the fault is reported in a log manner by the BMC, the method further includes:
outputting prompt information indicating that the fault was reported successfully. From the prompt information, operation and maintenance personnel can tell that the server is working abnormally, check the SEL log of the BMC, and replace the memory of the faulty memory channel, that is, replace the memory module in the corresponding slot to clear the fault.
It should be noted that, when the prompt information is expressed as text, it may be the word "success". When the prompt information is expressed as a data string, the string may be 1 bit, 2 bits, 4 bits, 8 bits and so on, for example "1", "10", "1100" and "00100111" respectively. These representations are only some of many possible embodiments and do not limit how the prompt information is represented. In addition, the data string may be converted into a decimal value and compared with a preset value, with the prompt information output when the value exceeds the preset value; the numbers of 0s and 1s in the data string may be counted, with the prompt information output when there are more 1s than 0s; or it may be judged whether the number of 0s or of 1s in the data string exceeds a preset number, with the prompt information output if it does. The above embodiments do not limit the prompt information in the present application, and its implementation may be determined according to the implementation scenario.
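The three data-string criteria mentioned above can be expressed as small predicates. This is only an illustrative sketch: the function names and the preset thresholds are hypothetical, and the sample strings are the ones given in the text.

```python
def exceeds_preset(bits, preset):
    """Interpret the string as a binary number and compare it to a preset."""
    return int(bits, 2) > preset

def more_ones_than_zeros(bits):
    """True when the string contains more 1s than 0s."""
    return bits.count("1") > bits.count("0")

def count_exceeds(bits, symbol, limit):
    """True when the count of '0' or '1' in the string exceeds a preset number."""
    return bits.count(symbol) > limit

sample = "00100111"                       # 8-bit example string from the text
print(int(sample, 2))                     # decimal value of the string
print(exceeds_preset(sample, 10))         # hypothetical preset value of 10
print(more_ones_than_zeros(sample))       # four 1s and four 0s
print(count_exceeds(sample, "1", 3))      # hypothetical preset number of 3
```

Which predicate, if any, gates the prompt is left open by the text; each one is independent of the others.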
In the above embodiments, the method for reporting a fault is described in detail, and the present application also provides embodiments corresponding to the apparatus for reporting a fault. It should be noted that the present application describes the embodiments of the apparatus portion from two perspectives, one from the perspective of the function module and the other from the perspective of the hardware.
Fig. 2 is a structural diagram of a device for reporting a fault according to an embodiment of the present application. As shown in fig. 2, the present application further provides a device for reporting a fault, which is applied to a CPU having a memory control chip, and includes:
a first reading module 20, configured to read data from a data register corresponding to each memory channel through a register offset address;
a first judging module 21, configured to judge whether the value of the ECC error counter register in each data register is 0;
if not, a first determining module 22 is triggered, configured to determine that the memory channel has no fault;
if yes, a second determining module 23 is triggered, configured to determine that the memory channel has a fault and to report the fault in log form through the BMC, wherein faults are divided into memory UCE faults and memory CE alarms.
The apparatus for reporting a fault includes: the first reading module, used for reading data from the data register corresponding to each memory channel through the register offset address; the first judging module, used for judging whether the value of the ECC error counter register in each data register is 0; if not, the first determining module is triggered to determine that the memory channel has no fault; if yes, the second determining module is triggered to determine that the memory channel has a fault and to report the fault in log form through the BMC, wherein faults are divided into memory UCE faults and memory CE alarms. The data of each memory channel is read by indirect addressing, whether a fault has occurred is determined from the value of the ECC error counter register, and a detected fault is reported in log form through the BMC (baseboard management controller). The server can thus actively report memory ECC faults in log form, the fault type can be determined from the log, and the fault can be repaired promptly.
In addition, the device further comprises the following modules.
For reading data from the data register corresponding to each memory channel through the register offset address:
the writing module is used for writing the register offset address into the address register and setting the signal that represents the read operation of each memory channel to 1;
the second reading module is used for reading data from the data register according to the read operation.
For reading data from the data register according to the read operation:
the third reading module is used for reading data according to the memory base address and the offset address of each memory channel.
For reporting the fault in log form through the BMC:
the second judging module is used for judging whether the offset address is greater than 0;
if yes, a third determining module is triggered, which determines that a memory UCE fault is triggered and reports the memory UCE fault in log form; if not, the process ends.
For the step before judging whether the offset address is greater than 0:
the address conversion module is used for performing address conversion on the registers corresponding to the memory channels according to a preset relationship, wherein the preset relationship is the correspondence between the preset register address and the CPU.
For reporting the memory UCE fault in log form:
the reporting module is used for reporting, in log form, the memory UCE fault at the location CPUn_dimm_n.
For the step after the fault is reported in log form through the BMC:
the output module is used for outputting prompt information indicating that the fault was reported successfully.
Since the embodiments of the apparatus portion correspond to those of the method portion, refer to the description of the method embodiments for details of the apparatus embodiments, which are not repeated here.
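The reporting path handled by the second judging module, the reporting module and the output module can be sketched as follows. This is an illustration under assumptions, not the applicant's implementation: the BMC log is modelled as a plain Python list, the two occurrences of n in CPUn_dim_n are treated as separate CPU and slot indices, the 0 to 7 range is taken from claim 6, and the function name and message texts are hypothetical.

```python
def report_memory_uce(offset_address, cpu_index, dimm_index, bmc_log):
    """Append a memory UCE entry to bmc_log when offset_address > 0.

    Returns the success prompt, or None when nothing is reported.
    """
    # Claim 6 restricts n to the natural numbers 0 to 7.
    for name, value in (("cpu_index", cpu_index), ("dimm_index", dimm_index)):
        if not 0 <= value <= 7:
            raise ValueError(f"{name} must be in 0..7, got {value}")

    if offset_address <= 0:
        return None  # offset address not greater than 0: the process ends

    # Location label in the CPUn_dim_n form used by the text.
    location = f"CPU{cpu_index}_dim_{dimm_index}"
    bmc_log.append(f"memory UCE fault at {location}")  # hypothetical log format
    return f"fault reported successfully: {location}"
```

In a real system the list would be replaced by whatever interface the BMC exposes for writing log entries; the sketch only shows the order of the check, the log entry and the success prompt.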
Fig. 3 is a structural diagram of a device for reporting a fault according to an embodiment of the present application. As shown in Fig. 3, the device for reporting a fault includes:
a memory 30 for storing a computer program;
a processor 31 for implementing the steps of the method of reporting faults as mentioned in the above embodiments when executing the computer program.
The device for reporting faults provided by the embodiment may include, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, or the like.
The processor 31 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 31 may be implemented in at least one hardware form of a Digital Signal Processor (DSP), a Field-Programmable Gate Array (FPGA), and a Programmable Logic Array (PLA). The processor 31 may also include a main processor and a coprocessor, where the main processor, also called a CPU, processes data in the awake state, and the coprocessor is a low-power processor that processes data in the standby state. In some embodiments, the processor 31 may be integrated with a Graphics Processing Unit (GPU), which is responsible for rendering and drawing the content to be shown on the display screen. In some embodiments, the processor 31 may further include an Artificial Intelligence (AI) processor for handling computational operations related to machine learning.
The memory 30 may include one or more computer-readable storage media, which may be non-transitory. The memory 30 may also include high-speed random access memory as well as non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In this embodiment, the memory 30 is used at least for storing a computer program which, after being loaded and executed by the processor 31, can implement the relevant steps of the method for reporting a fault disclosed in any one of the foregoing embodiments. In addition, the resources stored in the memory 30 may also include an operating system, data, and the like, and the storage may be transient or permanent. The operating system may include Windows, Unix, Linux, and the like. The data may include, but is not limited to, data involved in the method for reporting a fault.
In some embodiments, the device for reporting faults may further comprise a display screen, an input output interface, a communication interface, a power source, and a communication bus.
Those skilled in the art will appreciate that the configuration shown in fig. 3 does not constitute a limitation of the apparatus for reporting faults and may include more or fewer components than those shown.
The device for reporting the fault provided by the embodiment of the application comprises a memory 30 and a processor 31, wherein the processor 31 can realize the method for reporting the fault when executing the program stored in the memory 30.
Finally, the application also provides a corresponding embodiment of the computer readable storage medium. The computer-readable storage medium has stored thereon a computer program which, when being executed by a processor, carries out the steps as set forth in the above-mentioned method embodiments.
It is to be understood that if the method in the above embodiments is implemented in the form of software functional units and sold or used as a stand-alone product, it can be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the present application, in whole or in part, may be embodied in the form of a software product, which is stored in a storage medium and performs all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The method, apparatus, device and medium for reporting a fault provided by the present application are described in detail above. The embodiments in the specification are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Because the apparatus disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is brief, and the relevant points can be found in the description of the method. It should be noted that those skilled in the art can make several improvements and modifications to the present application without departing from its principle, and those improvements and modifications also fall within the protection scope of the claims of the present application.
It is further noted that, in the present specification, relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.

Claims (10)

1. A method for reporting a fault, applied to a CPU (central processing unit) having a memory control chip, comprising:
reading data from the data register corresponding to each memory channel through the register offset address;
judging whether the value of the register representing the ECC error counter in each data register is 0;
if not, determining that the memory channel has no fault;
if yes, determining that the memory channel has a fault, and reporting the fault in a log through the BMC, wherein the fault is classified as a memory UCE fault or a memory CE alarm.
2. The method of reporting a fault as claimed in claim 1, wherein the reading data from the data register corresponding to each memory channel through the register offset address comprises:
writing to an address register according to the register offset address, and setting a signal representing the read operation of each memory channel to 1;
reading the data from the data register according to the read operation.
3. The method of reporting a fault as claimed in claim 2, wherein the reading the data from the data register according to the read operation comprises:
reading the data according to the memory base address and the offset address of each memory channel.
4. The method of reporting a fault as claimed in claim 1, wherein the reporting the fault in a log through the BMC comprises:
judging whether the offset address is greater than 0;
if yes, determining that the memory UCE fault is triggered, and reporting the memory UCE fault in a log;
if not, ending the process.
5. The method of reporting a fault as claimed in claim 4, further comprising, before the judging whether the offset address is greater than 0:
performing address conversion on the register corresponding to each memory channel according to a preset relationship, wherein the preset relationship is a preset correspondence between register addresses and CPUs.
6. The method of reporting a fault as claimed in claim 5, wherein the reporting the memory UCE fault in a log comprises:
reporting the memory UCE fault at the location CPUn_dim_n in a log, wherein n is a natural number from 0 to 7.
7. The method of reporting a fault as claimed in claim 1, further comprising, after the reporting the fault in a log through the BMC:
outputting prompt information indicating that the fault was reported successfully.
8. An apparatus for reporting a fault, applied to a CPU having a memory control chip, comprising:
a first reading module, configured to read data from the data register corresponding to each memory channel through the register offset address;
a first judging module, configured to judge whether the value of the register representing the ECC error counter in each data register is 0;
if not, a first determining module is triggered, configured to determine that the memory channel has no fault;
if yes, a second determining module is triggered, configured to determine that the memory channel has a fault and to report the fault in a log through the BMC, wherein the fault is classified as a memory UCE fault or a memory CE alarm.
9. An apparatus for reporting a fault, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the method of reporting a fault as claimed in any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out the steps of the method of reporting a fault according to any one of claims 1 to 7.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211190582.5A CN115509786A (en) 2022-09-28 2022-09-28 Method, device, equipment and medium for reporting fault

Publications (1)

Publication Number Publication Date
CN115509786A true CN115509786A (en) 2022-12-23

Family

ID=84506047

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211190582.5A Pending CN115509786A (en) 2022-09-28 2022-09-28 Method, device, equipment and medium for reporting fault

Country Status (1)

Country Link
CN (1) CN115509786A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination