CN115509786A - Method, device, equipment and medium for reporting fault - Google Patents
- Publication number
- CN115509786A (CN202211190582.5A)
- Authority
- CN
- China
- Prior art keywords
- fault
- memory
- reporting
- register
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0793—Remedial or corrective actions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/08—Error detection or correction by redundancy in data representation, e.g. by using checking codes
- G06F11/10—Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
- G06F11/1008—Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices
- G06F11/1044—Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices with specific ECC/EDC distribution
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Quality & Reliability (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Debugging And Monitoring (AREA)
Abstract
The application discloses a method, an apparatus, a device and a medium for reporting faults, and relates to the field of computer technology. The method is applied to a CPU (central processing unit) with a memory control chip and comprises: reading data from the data register corresponding to each memory channel through the register offset address; judging whether the value of the ECC error counter register in each data register is 0; if not, determining that the memory channel has no fault; if so, determining that the memory channel has a fault and reporting the fault in log form through the BMC (baseboard management controller), wherein faults are divided into memory UCE faults and memory CE alarms. The data of each memory channel is read by indirect addressing, whether a fault has occurred is determined from the value of the ECC error counter register, and when a fault occurs it is reported in log form through the BMC. The server can thus actively report memory ECC faults in log form, so that the fault type can be determined from the log and the fault can be repaired in time.
Description
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a medium for reporting a fault.
Background
At present, China is vigorously promoting domestically produced Xinchuang (information technology application innovation) servers, and such servers are developing rapidly. The S2500 processor, developed by the CPU manufacturer Phytium (FT) on the ARMv8 architecture and supporting two-socket Phytium servers, is a first-generation product. Its CPU (Central Processing Unit) lacks a RAS (Reliability, Availability and Serviceability) fault-tolerance mechanism for actively raising alarms on memory faults. Memory ECC (Error Checking and Correcting) faults fall mainly into two types. The first is the memory CE alarm, a correctable error: when it occurs, the memory controller of the CPU corrects the error and upper-layer software is unaware of it. The second is the memory UCE alarm, an uncorrectable error: when it occurs, the kernel of the upper-layer OS crashes, causing the host to go down or hang. In other words, when the server encounters an error that ECC cannot correct (a memory UCE alarm), the host goes down or hangs. However, because the ECC check function is implemented by a chip integrated in the CPU of the server (the memory control chip), the server cannot actively report memory ECC faults, the fault type cannot be determined, and maintenance cannot be performed in time.
In view of the above, how to actively report memory ECC faults is a problem that those skilled in the art are working to solve.
Disclosure of Invention
The purpose of the present application is to provide a method, an apparatus, a device and a medium for reporting faults, which actively report memory ECC faults in log form, so that the fault type can be determined from the log and the fault can be repaired in time.
In order to solve the above technical problem, the present application provides a method for reporting a fault, which is applied to a CPU having a memory control chip, and includes:
reading data from the data register corresponding to each memory channel through the register offset address;
judging whether the value of the ECC error counter register in each data register is 0;
if not, determining that the memory channel has no fault;
if yes, determining that the memory channel has a fault, and reporting the fault in log form through the BMC, wherein faults are divided into memory UCE faults and memory CE alarms.
Preferably, reading data from the data register corresponding to each memory channel by the register offset address includes:
writing to the address register according to the register offset address, and setting the signal representing a read operation for each memory channel to 1;
reading data from the data register in accordance with the read operation.
Preferably, reading data from the data register according to the read operation comprises:
reading data according to the memory base address and the offset address of each memory channel.
Preferably, reporting the fault in log form through the BMC comprises:
judging whether the offset address is greater than 0;
if yes, determining that a memory UCE fault has been triggered, and reporting the memory UCE fault in log form;
if not, ending.
Preferably, before determining whether the offset address is greater than 0, the method further includes:
performing address conversion on the registers corresponding to the memory channels according to a preset relationship, wherein the preset relationship is the correspondence between the preset register address and the CPU.
Preferably, reporting the memory UCE fault in log form comprises:
reporting, in log form, that a memory UCE fault has occurred at the position CPUn_dimmm_n.
Preferably, after the fault is reported in log form through the BMC, the method further comprises:
outputting prompt information indicating that the fault was reported successfully.
In order to solve the above technical problem, the present application further provides a device for reporting a fault, which is applied to a CPU having a memory control chip, and includes:
the first reading module is used for reading data from the data register corresponding to each memory channel through the register offset address;
the first judging module is used for judging whether the value of the ECC error counter register in each data register is 0;
if not, a first determining module is triggered, which is used for determining that the memory channel has no fault;
if yes, a second determining module is triggered, which is used for determining that the memory channel has a fault and reporting the fault in log form through the BMC, wherein faults are divided into memory UCE faults and memory CE alarms.
In addition, the device further comprises the following modules.
For reading data from the data register corresponding to each memory channel through the register offset address:
the write-in module is used for writing to the address register according to the register offset address and setting the signal representing a read operation for each memory channel to 1;
the second reading module is used for reading data from the data register according to the read operation.
For reading data from the data register according to the read operation:
the third reading module is used for reading data according to the memory base address and the offset address of each memory channel.
For reporting the fault in log form through the BMC:
the second judging module is used for judging whether the offset address is greater than 0;
if yes, a third determining module is triggered, which is used for determining that a memory UCE fault has been triggered and reporting the memory UCE fault in log form; if not, the process ends.
For the step performed before judging whether the offset address is greater than 0:
the address conversion module is used for performing address conversion on the registers corresponding to the memory channels according to a preset relationship, wherein the preset relationship is the correspondence between the preset register address and the CPU.
For reporting the memory UCE fault in log form:
the reporting module is used for reporting, in log form, that a memory UCE fault has occurred at the position CPUn_dimmm_n.
For the step performed after the fault is reported in log form through the BMC:
the output module is used for outputting prompt information indicating that the fault was reported successfully.
In order to solve the above technical problem, the present application further provides a device for reporting a fault, including:
a memory for storing a computer program;
a processor for executing the computer program to implement the steps of the method for reporting a fault described above.
To solve the above technical problem, the present application further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements all the steps of the above method for reporting a fault.
The method for reporting a fault provided by the present application is applied to a CPU (central processing unit) with a memory control chip and comprises: reading data from the data register corresponding to each memory channel through the register offset address; judging whether the value of the ECC error counter register in each data register is 0; if not, determining that the memory channel has no fault; if so, determining that the memory channel has a fault and reporting the fault in log form through the BMC, wherein faults are divided into memory UCE faults and memory CE alarms. The data of each memory channel is read by indirect addressing, whether a fault has occurred is determined from the value of the ECC error counter register, and when a fault occurs it is reported in log form through the BMC. The server thus actively reports memory ECC faults in log form, so that the fault type can be determined from the log and the fault can be repaired in time.
The application also provides an apparatus, a device and a medium for reporting faults, which have the same beneficial effects as the method above.
Drawings
In order to more clearly illustrate the embodiments of the present application, the drawings needed for the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings can be obtained by those skilled in the art without inventive effort.
Fig. 1 is a flowchart of a method for reporting a fault according to an embodiment of the present application;
Fig. 2 is a structural diagram of an apparatus for reporting a fault according to an embodiment of the present disclosure;
Fig. 3 is a block diagram of a device for reporting a fault according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only some embodiments of the present application, and not all embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without any creative effort belong to the protection scope of the present application.
The core of the application is to provide a method, an apparatus, a device and a medium for reporting faults, which can actively report memory ECC faults in log form, so that the fault type can be determined from the log and the fault can be repaired in time.
RAS is mainly implemented by the MCA mechanism and the AER mechanism. The MCA (Machine Check Architecture) mechanism can report, and possibly repair, errors of the system bus, ECC, parity, cache and so on, identify the fault source, and record fault information in an MC bank. Through the MCA mechanism, correctable and uncorrectable errors in the CPU can be reported and recorded, and correctable hardware errors can be corrected. For uncorrectable errors, a warm restart is typically performed. The scope of MCA covers all modules in the processor: core, uncore and IIO (via IOMCA). The AER (IIO Advanced Error Reporting) mechanism is responsible for detecting, recording and signaling errors of the submodules under the IIO modules; its scope covers all submodules under the IIO modules, such as the PCIe interfaces, DMI, the core logic of the IIO modules and Intel VT-d.
Within RAS, MCA is the mechanism for detecting hardware errors ("machine" here means hardware), such as system bus errors and ECC errors. It is implemented by a number of MSRs (Model Specific Registers), which are divided into two parts: one for configuration and the other for describing the hardware errors that have occurred. When the CPU detects an uncorrectable MCE (Machine Check Error), it triggers a Machine Check Exception. Software registers a handler for this exception; the handler collects the MCE error information by reading the MSRs and then restarts the system. Of course, some MCEs may be so fatal that the CPU restarts directly and the MCE handler never gets to run; an uncorrectable MCE triggered inside the MCE handler itself may also cause the system to restart directly. The CPU can also detect correctable MCEs. When the number of correctable MCEs exceeds a certain threshold, a CMCI (Corrected Machine Check Error Interrupt) is triggered, and software can catch the interrupt and handle it. CMCI was added after MCA as an enhancement; before it, software could only poll the MSRs related to correctable MCEs. MCA manages errors by bank, and a set of global registers defines the capability to enable MCA. Each bank is associated with one class of error source, such as CPU, memory, cache or chipset. Each bank can be controlled individually, so software can handle each bank in its own way. Since MCA samples errors in time windows, more than one error may be found at the end of a sample while only one interrupt or exception is triggered, so software must poll all banks when handling it to ensure that every error is handled.
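The need to poll every bank after a single interrupt can be illustrated with a small simulation. This is a sketch only: the `Bank` class and the bank names are hypothetical, and the two bit positions are borrowed from the Intel `IA32_MCi_STATUS` layout (VAL in bit 63, UC in bit 61), which the text above does not itself specify.

```python
from dataclasses import dataclass

# Assumed bit positions, following the Intel IA32_MCi_STATUS layout.
VAL_BIT = 1 << 63   # the bank holds a valid error record
UC_BIT = 1 << 61    # the recorded error was not corrected


@dataclass
class Bank:
    name: str     # hypothetical label for the error source, e.g. "MEMORY"
    status: int   # simulated 64-bit status register


def poll_banks(banks):
    """Visit every bank, collect each valid error, then clear its status."""
    found = []
    for bank in banks:
        if bank.status & VAL_BIT:
            kind = "uncorrected" if bank.status & UC_BIT else "corrected"
            found.append((bank.name, kind))
            bank.status = 0  # clear the record once it has been logged
    return found


banks = [
    Bank("CPU", 0),
    Bank("MEMORY", VAL_BIT),
    Bank("CACHE", VAL_BIT | UC_BIT),
]
first_pass = poll_banks(banks)
second_pass = poll_banks(banks)
```

Because one interrupt may stand for several errors, the loop never stops at the first valid bank; a second pass finds nothing because every record was cleared.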
The RAS features of memory mainly include Single Device Data Correction (SDDC), Double Device Data Correction (DDDC) and Adaptive Double DRAM Device Correction (ADDDC). These features are built on ECC.
The PCIe specification defines two levels of error reporting. The first is basic functionality that all PCIe devices must support. The second is optional and uses a dedicated set of registers to provide more detailed error information so that software can locate errors and analyze their causes; it is called Advanced Error Reporting (AER). The basic error reporting mechanism uses two sets of configuration registers (in configuration space): the PCI-compatible registers and the additional PCI Express Capability registers. The advanced error reporting mechanism (AER) uses a further dedicated set of configuration registers (also in configuration space). AER provides more error information and helps software locate the error source and analyze the cause. PCIe bus errors are classified into correctable errors and uncorrectable errors. Correctable errors are recognized automatically by hardware and corrected or recovered automatically. Uncorrectable errors are further classified as non-fatal and fatal. Non-fatal errors are typically handled directly by device-specific software (the device driver), and the link is recoverable; data on the link may even be recovered without loss. Fatal errors can only be handled by system software and generally require operations such as a reset, so data on the link is lost. The PCIe bus has three error reporting modes: Completion, which returns error information to the requester through status bits in the Completion; Poisoned Packet (also known as Error Forwarding), which informs the receiver that the data payload of the current TLP is corrupted; and Error Message, which reports the error information to the host.
ECC memory is memory that implements the "error checking and correcting" technique. Before ECC appeared, the technique most commonly used in memory was parity. In a digital circuit, the smallest unit of data is the bit, which is also the smallest unit in memory; it represents the high and low levels of a data signal as "1" and "0". Eight consecutive bits form a byte. In memory without parity, each byte has only 8 bits; if one bit stores a wrong value, the data it belongs to changes and the application may fail. Memory with parity adds one extra bit to each byte (8 bits) for error detection. For example, if a byte stores the value (1, 0, 1, 0, 1, 0, 1, 1), the bits are added up (1+0+1+0+1+0+1+1 = 5). If the result is odd, the parity bit is 1 for even parity, and otherwise 0; for odd parity the opposite holds. When the CPU later reads the stored data, it adds up the 8 stored data bits again and checks whether the result agrees with the parity bit. When the two disagree, the CPU attempts to correct the error. Parity has a drawback, however: when a data bit is found to be wrong, it is not necessarily possible to determine which bit it is, so the error cannot necessarily be corrected. The main function of memory with parity is therefore only to "find the error"; it cannot correct even some simple errors.
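The parity rule described above can be worked through in a few lines. This is an illustrative sketch: the function names are made up for the example, and the byte is simply a sample value containing five 1 bits.

```python
def even_parity_bit(bits):
    """Return the bit that makes the total number of 1s (data + parity) even."""
    return sum(bits) % 2


def check_even_parity(bits, parity_bit):
    """True when the stored data still agrees with its even-parity bit."""
    return (sum(bits) + parity_bit) % 2 == 0


data = [1, 0, 1, 0, 1, 0, 1, 1]      # five 1 bits, so the sum is odd
parity = even_parity_bit(data)        # 1 under even parity

one_flip = list(data)
one_flip[2] ^= 1                      # a single corrupted bit is detected

two_flips = list(one_flip)
two_flips[3] ^= 1                     # a second flip cancels the first

single_detected = not check_even_parity(one_flip, parity)
double_detected = not check_even_parity(two_flips, parity)
```

The single flip is detected, but the check gives no indication of which bit changed, and the double flip passes unnoticed. These are the limitations that ECC addresses.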
ECC has the following characteristics: it can tolerate errors in memory; it can correct errors, so that the system continues to operate normally without being interrupted by them; and it corrects automatically, detecting and correcting erroneous bits that parity cannot detect.
In order that those skilled in the art will better understand the disclosure, the following detailed description is given with reference to the accompanying drawings.
First, the method is applied to domestically produced servers, mainly to the Phytium S2500 CPU in such servers. A memory control chip is integrated in the Phytium S2500 CPU and implements the ECC check function. The CPU is typically provided with 8 memory channels, named LMU0, LMU1, LMU2, LMU3, LMU4, LMU5, LMU6 and LMU7.
Fig. 1 is a flowchart of a method for reporting a fault according to an embodiment of the present disclosure. As shown in fig. 1, the method for reporting a fault is applied to a CPU having a memory control chip, and includes:
S10: reading data from the data register corresponding to each memory channel through the register offset address.
S11: judging whether the value of the ECC error counter register in each data register is 0.
If not, the process proceeds to step S12: determining that the memory channel has no fault.
If yes, the process proceeds to step S13: determining that the memory channel has a fault, and reporting the fault in log form through the BMC.
The faults are divided into memory UCE faults and memory CE alarms.
It should be noted that the memory control chip contains many registers of various kinds, including at least a data register and an address register, and each memory channel has its own data register and address register. The address register is denoted LMU_ADDR_REG; the data register is denoted LMU_DATA_REG.
The method for reporting a fault is applied to a CPU (central processing unit) with a memory control chip and comprises: reading data from the data register corresponding to each memory channel through the register offset address; judging whether the value of the ECC error counter register in each data register is 0; if not, determining that the memory channel has no fault; if so, determining that the memory channel has a fault and reporting the fault in log form through the BMC, wherein faults are divided into memory UCE faults and memory CE alarms. The data of each memory channel is read by indirect addressing, whether a fault has occurred is determined from the value of the ECC error counter register, and when a fault occurs it is reported in log form through the BMC. The server thus actively reports memory ECC faults in log form, so that the fault type can be determined from the log and the fault can be repaired in time.
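The control flow of steps S10 to S13 can be sketched as a loop over the channels. This is a minimal illustration under stated assumptions, not the implementation of the application: `read_counter` and `report` are hypothetical callbacks standing in for the register read and the BMC log write, and the fault test is passed in as the predicate `is_fault` so that the sketch does not fix a particular counter encoding. The criterion used in the demo (any nonzero count) is an assumption of the demo only.

```python
def scan_channels(channels, read_counter, is_fault, report):
    """Run S10 to S13 once over all channels; return the channels reported."""
    reported = []
    for channel in channels:
        count = read_counter(channel)      # S10: read the counter value
        if is_fault(count):                # S11: apply the fault test
            report(channel, count)         # S13: log the fault via the BMC
            reported.append(channel)
        # S12: otherwise the channel is treated as fault-free
    return reported


# Demo with simulated counter values (all hypothetical).
counters = {0: 0, 1: 3, 2: 0}
log = []
faulty = scan_channels(
    channels=sorted(counters),
    read_counter=counters.get,
    is_fault=lambda count: count != 0,     # assumed criterion, demo only
    report=lambda ch, count: log.append(f"channel {ch}: ECC error count {count}"),
)
```

Keeping the read, the test and the report as separate callables mirrors the module split described for the apparatus (reading module, judging module, determining module).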
Reading data from the data register corresponding to each memory channel by the register offset address comprises the following steps:
writing to the address register according to the register offset address, and setting the signal representing a read operation for each memory channel to 1;
reading data from the data register in accordance with the read operation.
The signal representing the read operation is denoted BIT28. BIT28 must be set to 1 for all 8 memory channels of the CPU, after which the data in the data register can be read. The data is read according to the following relationships:
LMU_ADDR_REG = base address + 0x0;
LMU_DATA_REG = base address + 0x8;
Since the CPU is generally provided with 8 channels, 0x0 to 0x8 represent the offset addresses of the 8 channels. Reading data from the data register according to the read operation comprises: reading data according to the memory base address and the offset address of each memory channel.
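The indirect addressing sequence (write the register offset with BIT28 set to LMU_ADDR_REG, then read LMU_DATA_REG) can be sketched against a simulated bus. Several things here are assumptions rather than facts from the text: the `FakeBus` class, the base addresses, and the way the offset and BIT28 are combined into one 32-bit word. Real hardware would be accessed through physical memory mappings, not a Python dictionary.

```python
READ_FLAG = 1 << 28        # BIT28: marks the request as a read operation
ADDR_REG_OFFSET = 0x0      # LMU_ADDR_REG = base address + 0x0
DATA_REG_OFFSET = 0x8      # LMU_DATA_REG = base address + 0x8
ECCERRCNT = 0x80           # offset of the ECC error counter register


class FakeBus:
    """Simulated register space for one or more memory channels."""

    def __init__(self, internal_regs):
        self.internal = internal_regs   # {(base, offset): value}
        self.latched = {}               # base -> last word written to LMU_ADDR_REG

    def write32(self, addr, value):
        # Only LMU_ADDR_REG is writable in this simulation.
        self.latched[addr - ADDR_REG_OFFSET] = value

    def read32(self, addr):
        base = addr - DATA_REG_OFFSET
        request = self.latched.get(base, 0)
        if not request & READ_FLAG:
            return 0                    # no read operation was requested
        offset = request & ~READ_FLAG   # assumed encoding: offset in the low bits
        return self.internal.get((base, offset), 0)


def read_lmu_register(bus, base, offset):
    """Indirect read: select the register, then fetch it from the data register."""
    bus.write32(base + ADDR_REG_OFFSET, offset | READ_FLAG)
    return bus.read32(base + DATA_REG_OFFSET)


# Hypothetical base addresses for two channels.
bus = FakeBus({(0x1000, ECCERRCNT): 5})
count_ch0 = read_lmu_register(bus, 0x1000, ECCERRCNT)
count_ch1 = read_lmu_register(bus, 0x2000, ECCERRCNT)
```

Only two bus addresses per channel are touched, which is the point of indirect addressing: the internal ECC registers are reached through the address/data register pair rather than being mapped individually.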
Further, reporting the fault in log form through the BMC includes:
judging whether the offset address is greater than 0;
if yes, determining that a memory UCE fault has been triggered, and reporting the memory UCE fault in log form;
if not, ending.
Table 1 shows the correspondence between the registers and the base addresses:
Table 1. Correspondence between registers and base addresses
On the basis of the foregoing embodiment, as a preferred embodiment, before determining whether the offset address is greater than 0, the method further includes:
performing address conversion on the registers corresponding to the memory channels according to a preset relationship, wherein the preset relationship is the correspondence between the preset register address and the CPU.
Meanwhile, the memory UCE fault at the position CPUn_dimmm_n is reported in log form. The log is the SEL log.
The preset register is ECCERRCNT. Table 2 lists the ECC registers:
Table 2. ECC register list
Register name | Size (bits) | Offset | Description |
ECCCFG0 | 32 | 0x70 | ECC configuration register 1 |
ECCCFG1 | 32 | 0x74 | ECC configuration register 2 |
ECCSTAT | 32 | 0x78 | ECC status register |
ECCCLR | 32 | 0x7c | ECC clear register |
ECCERRCNT | 32 | 0x80 | ECC error counter register |
ECCCADDR0 | 32 | 0x84 | ECC corrected error address register 0 |
ECCCADDR1 | 32 | 0x88 | ECC corrected error address register 1 |
ECCBITMASK0 | 32 | 0x98 | ECC corrected data mask register 0 |
ECCBITMASK1 | 32 | 0x9c | ECC corrected data mask register 1 |
ECCBITMASK2 | 32 | 0xa0 | ECC corrected data mask register 2 |
ECCUADDR0 | 32 | 0xa4 | ECC uncorrected error address register 0 |
ECCUADDR1 | 32 | 0xa8 | ECC uncorrected error address register 1 |
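For software that walks these registers, the offsets in Table 2 can be captured as named constants. The sketch below only transcribes the table; the enum name `EccReg` and the helper `name_for_offset` are hypothetical and not part of the application.

```python
from enum import IntEnum


class EccReg(IntEnum):
    """Offsets of the 32-bit ECC registers, transcribed from Table 2."""
    ECCCFG0 = 0x70
    ECCCFG1 = 0x74
    ECCSTAT = 0x78
    ECCCLR = 0x7C
    ECCERRCNT = 0x80
    ECCCADDR0 = 0x84
    ECCCADDR1 = 0x88
    ECCBITMASK0 = 0x98
    ECCBITMASK1 = 0x9C
    ECCBITMASK2 = 0xA0
    ECCUADDR0 = 0xA4
    ECCUADDR1 = 0xA8


def name_for_offset(offset):
    """Return the register name for an offset, or None if Table 2 has no entry."""
    try:
        return EccReg(offset).name
    except ValueError:
        return None
```

For example, `name_for_offset(0x80)` yields `"ECCERRCNT"`, the register the method inspects, while an offset absent from the table such as `0x8c` yields `None`.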
On the basis of the above embodiment, as a more preferred embodiment, after the fault is reported in a log manner by the BMC, the method further includes:
outputting prompt information indicating that the fault was reported successfully. From the prompt information, operation and maintenance personnel can learn that the server is working abnormally, check the SEL log of the BMC, and replace the memory in the faulty memory channel; specifically, the memory module in the corresponding slot is replaced to clear the fault.
It should be noted that when the prompt information is expressed as text, it may be the word "success". When it is expressed as a data string, the string may be 1, 2, 4 or 8 bits long, for example "1", "10", "1100" or "00100111" respectively. These representations are only some of many possible embodiments and do not limit how the prompt information is expressed. In addition, the data string may be converted to a decimal value and the prompt information output when the value exceeds a preset value; the numbers of 0s and 1s in the data string may be counted and the prompt information output when there are more 1s than 0s; or the prompt information may be output when the number of 0s or of 1s in the data string exceeds a preset number. The above embodiments do not limit the prompt information in the present application, and the specific implementation may be chosen according to the scenario.
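The three data-string conditions mentioned above can be written as small predicates. This is a sketch for illustration: the function names and the preset limits in the examples are hypothetical, and the text leaves open which condition, if any, an implementation uses.

```python
def exceeds_value(bits, preset_value):
    """Convert the bit string to a decimal value and compare with a preset value."""
    return int(bits, 2) > preset_value


def more_ones_than_zeros(bits):
    """True when the string contains more 1s than 0s."""
    return bits.count("1") > bits.count("0")


def count_exceeds(bits, symbol, preset_number):
    """True when the number of the given symbol ('0' or '1') exceeds a preset number."""
    return bits.count(symbol) > preset_number


# The sample strings are the ones given in the text; the limits are made up.
value_check = exceeds_value("1100", 8)            # 12 > 8
majority_check = more_ones_than_zeros("00100111") # four 1s against four 0s
count_check = count_exceeds("00100111", "1", 3)   # four 1s > 3
```

Each predicate returns a boolean, so the caller decides separately whether and how to output the prompt information.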
In the above embodiments, the method for reporting a fault is described in detail, and the present application also provides embodiments corresponding to the apparatus for reporting a fault. It should be noted that the present application describes the embodiments of the apparatus portion from two perspectives, one from the perspective of the function module and the other from the perspective of the hardware.
Fig. 2 is a structural diagram of a device for reporting a fault according to an embodiment of the present application. As shown in fig. 2, the present application further provides a device for reporting a fault, which is applied to a CPU having a memory control chip, and includes:
a first reading module 20, configured to read data from a data register corresponding to each memory channel through a register offset address;
a first judging module 21, configured to judge whether the value of the ECC error counter register in each data register is 0;
if not, a first determining module 22 is triggered, configured to determine that the memory channel has no fault;
if yes, a second determining module 23 is triggered, configured to determine that the memory channel has a fault and to report the fault in log form through the BMC, where faults are divided into memory UCE faults and memory CE alarms.
The apparatus for reporting a fault includes: the first reading module, used for reading data from the data register corresponding to each memory channel through the register offset address; the first judging module, used for judging whether the value of the ECC error counter register in each data register is 0; if not, the first determining module is triggered, which is used for determining that the memory channel has no fault; if yes, the second determining module is triggered, which is used for determining that the memory channel has a fault and reporting the fault in log form through the BMC, wherein faults are divided into memory UCE faults and memory CE alarms. The data of each memory channel is read by indirect addressing, whether a fault has occurred is determined from the value of the ECC error counter register, and when a fault occurs it is reported in log form through the BMC (baseboard management controller). The server can thus actively report memory ECC faults in log form, so that the fault type can be determined from the log and the fault can be repaired in time.
In addition, the apparatus further includes the following modules.
For reading data from the data register corresponding to each memory channel through the register offset address:
a writing module, configured to write to the address register according to the register offset address and to set the signal representing a read operation for each memory channel to 1;
a second reading module, configured to read the data from the data register according to the read operation.
For reading the data from the data register according to the read operation:
a third reading module, configured to read the data according to the memory base address and the offset address of each memory channel.
For reporting the fault in log form through the BMC:
a second judging module, configured to judge whether the offset address is greater than 0;
if yes, a third determining module is triggered, configured to determine that a memory UCE fault has been triggered and to report the memory UCE fault in log form; if not, the process ends.
Before judging whether the offset address is greater than 0:
an address conversion module, configured to perform address conversion on the register corresponding to each memory channel according to a preset relationship, where the preset relationship is the preset correspondence between register addresses and the CPU.
For reporting the memory UCE fault in log form:
a reporting module, configured to report the memory UCE fault at the position CPUn_dim_n in log form.
After the fault is reported in log form through the BMC:
an output module, configured to output prompt information indicating that the fault was reported successfully.
Since the embodiments of the apparatus portion correspond to those of the method portion, refer to the description of the method embodiments for details of the apparatus embodiments, which are not repeated here.
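The indirect-addressing read performed by the writing module and the reading modules (write the address register, set the read signal to 1, then read the data register) may be sketched as follows. This is a software simulation under stated assumptions, not the hardware interface of any particular CPU: the class `IndirectRegisterWindow`, the base addresses in `CHANNEL_BASES`, and the offset `ECC_COUNTER_OFFSET` are all hypothetical values chosen for illustration.

```python
class IndirectRegisterWindow:
    """Simulated address/data register pair backed by a dict of registers."""

    def __init__(self, backing):
        self._backing = dict(backing)  # absolute address -> register value
        self.address_reg = 0           # address register
        self.read_enable = 0           # signal representing a read operation
        self.data_reg = 0              # data register

    def _latch(self):
        # Stand-in for hardware: when the read signal is 1, copy the
        # addressed register into the data register and clear the signal.
        if self.read_enable == 1:
            self.data_reg = self._backing.get(self.address_reg, 0)
            self.read_enable = 0

    def read(self, base, offset):
        self.address_reg = base + offset  # step 1: write the address register
        self.read_enable = 1              # step 2: set the read signal to 1
        self._latch()
        return self.data_reg              # step 3: read the data register


# Hypothetical per-channel base addresses and counter offset.
CHANNEL_BASES = {0: 0x1000, 1: 0x2000}
ECC_COUNTER_OFFSET = 0x10

window = IndirectRegisterWindow({0x1010: 0, 0x2010: 5})
counts = {ch: window.read(base, ECC_COUNTER_OFFSET)
          for ch, base in CHANNEL_BASES.items()}
```

Here the target address is formed from the channel's base address plus the register offset, matching the description of the third reading module; how the real address register and read signal are exposed is left open by the text.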
Fig. 3 is a structural diagram of a device for reporting a fault according to an embodiment of the present application. As shown in Fig. 3, the device for reporting a fault includes:
a memory 30 for storing a computer program;
a processor 31 for implementing the steps of the method of reporting faults as mentioned in the above embodiments when executing the computer program.
The device for reporting faults provided by the embodiment may include, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, or the like.
The processor 31 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 31 may be implemented in at least one hardware form among a Digital Signal Processor (DSP), a Field-Programmable Gate Array (FPGA), and a Programmable Logic Array (PLA). The processor 31 may also include a main processor and a coprocessor: the main processor, also called a CPU, processes data in the awake state; the coprocessor is a low-power processor that processes data in the standby state. In some embodiments, the processor 31 may be integrated with a Graphics Processing Unit (GPU), which is responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 31 may further include an Artificial Intelligence (AI) processor for handling computational operations related to machine learning.
In some embodiments, the device for reporting faults may further comprise a display screen, an input output interface, a communication interface, a power source, and a communication bus.
Those skilled in the art will appreciate that the configuration shown in Fig. 3 does not limit the device for reporting faults, and that the device may include more or fewer components than those shown.
The device for reporting the fault provided by the embodiment of the application comprises a memory 30 and a processor 31, wherein the processor 31 can realize the method for reporting the fault when executing the program stored in the memory 30.
Finally, the application also provides a corresponding embodiment of the computer readable storage medium. The computer-readable storage medium has stored thereon a computer program which, when being executed by a processor, carries out the steps as set forth in the above-mentioned method embodiments.
It is to be understood that if the method in the above embodiments is implemented in the form of software functional units and sold or used as a stand-alone product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the present application, in whole or in part, may be embodied in the form of a software product that is stored in a storage medium and that performs all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disc.
The method, apparatus, device, and medium for reporting a fault provided by the present application are described in detail above. The embodiments in the specification are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar, the embodiments may be referred to one another. Because the apparatus disclosed in an embodiment corresponds to the method disclosed in that embodiment, it is described briefly, and the relevant points can be found in the description of the method. It should be noted that those skilled in the art can make several improvements and modifications to the present application without departing from its principle, and those improvements and modifications also fall within the protection scope of the claims of the present application.
It is further noted that, in the present specification, relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of additional identical elements in a process, method, article, or apparatus that comprises the element.
Claims (10)
1. A method for reporting a fault, applied to a CPU (central processing unit) having a memory control chip, comprising:
reading data from the data register corresponding to each memory channel through a register offset address;
judging whether the value of the register representing the ECC error counter in each data register is 0;
if not, determining that the memory channel has no fault;
if yes, determining that the memory channel has a fault, and reporting the fault in log form through the BMC, wherein faults are divided into memory UCE faults and memory CE alarms.
2. The method of claim 1, wherein reading data from the data register corresponding to each memory channel via the register offset address comprises:
writing to an address register according to the register offset address, and setting the signal representing a read operation for each memory channel to 1;
reading the data from the data register according to the read operation.
3. The method of reporting a fault as in claim 2 wherein reading the data from the data register in accordance with the read operation comprises:
and reading the data according to the memory base address and the offset address of each memory channel.
4. The method of reporting a fault as claimed in claim 1, wherein the reporting a fault in a log by the BMC comprises:
judging whether the offset address is larger than 0;
if yes, determining to trigger the memory UCE fault, and reporting the memory UCE fault in a log mode;
if not, the process is ended.
5. The method of reporting a failure of claim 4, prior to said determining whether said offset address is greater than 0, further comprising:
and performing address conversion on the register corresponding to each memory channel according to a preset relationship, wherein the preset relationship is the corresponding relationship between the preset register address and the CPU.
6. The method of reporting a fault as in claim 5, wherein the reporting the memory UCE fault in the logged manner comprises:
reporting the memory UCE fault at the position CPUn_dim_n in log form, wherein n is a natural number from 0 to 7.
7. The method of reporting a fault as claimed in claim 1, further comprising, after the reporting a fault in a log by the BMC:
and outputting prompt information indicating that the fault was reported successfully.
8. An apparatus for reporting a fault, applied to a CPU having a memory control chip, comprising:
a first reading module, configured to read data from the data register corresponding to each memory channel through a register offset address;
a first judging module, configured to judge whether the value of the register representing the ECC error counter in each data register is 0;
if not, a first determining module is triggered, configured to determine that the memory channel has no fault;
if yes, a second determining module is triggered, configured to determine that the memory channel has a fault and to report the fault in log form through the BMC, wherein faults are divided into memory UCE faults and memory CE alarms.
9. A device for reporting a fault, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the method of reporting a fault as claimed in any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out the steps of the method of reporting a fault according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211190582.5A CN115509786A (en) | 2022-09-28 | 2022-09-28 | Method, device, equipment and medium for reporting fault |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211190582.5A CN115509786A (en) | 2022-09-28 | 2022-09-28 | Method, device, equipment and medium for reporting fault |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115509786A (en) | 2022-12-23 |
Family
ID=84506047
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211190582.5A Pending CN115509786A (en) | 2022-09-28 | 2022-09-28 | Method, device, equipment and medium for reporting fault |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115509786A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117076182A (en) * | 2023-09-28 | 2023-11-17 | 飞腾信息技术有限公司 | Error reporting method, system on chip, computer equipment and storage medium |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117076182A (en) * | 2023-09-28 | 2023-11-17 | 飞腾信息技术有限公司 | Error reporting method, system on chip, computer equipment and storage medium |
CN117076182B (en) * | 2023-09-28 | 2024-01-19 | 飞腾信息技术有限公司 | Error reporting method, system on chip, computer equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN115629905B (en) | Memory fault early warning method and device, electronic equipment and readable medium | |
US7971112B2 (en) | Memory diagnosis method | |
US8020053B2 (en) | On-line memory testing | |
US20090150721A1 (en) | Utilizing A Potentially Unreliable Memory Module For Memory Mirroring In A Computing System | |
US11977744B2 (en) | Memory anomaly processing method and system, electronic device, and storage medium | |
CN102135925B (en) | Method and device for detecting error check and correcting memory | |
JPH1055320A (en) | On-line memory monitoring system and device | |
US11080135B2 (en) | Methods and apparatus to perform error detection and/or correction in a memory device | |
WO2024082844A1 (en) | Fault detection apparatus and detection method for random access memory | |
CN102968353A (en) | Fail address processing method and fail address processing device | |
WO2024041093A1 (en) | Memory fault processing method and related device thereof | |
CN115509786A (en) | Method, device, equipment and medium for reporting fault | |
CN116049249A (en) | Error information processing method, device, system, equipment and storage medium | |
US7447943B2 (en) | Handling memory errors in response to adding new memory to a system | |
US7246257B2 (en) | Computer system and memory control method thereof | |
CN114860487A (en) | Memory fault identification method and memory fault isolation method | |
CN114461436A (en) | Memory fault processing method and device and computer readable storage medium | |
CN115033441A (en) | PCIe equipment fault detection method, device, equipment and storage medium | |
CN114996065A (en) | Memory fault prediction method, device and equipment | |
CN114730607A (en) | Memory fault repairing method and device | |
US20120144245A1 (en) | Computing device and method for detecting pci system errors in the computing device | |
WO2024124862A1 (en) | Server-based memory processing method and apparatus, processor and an electronic device | |
CN105824719A (en) | Method and system for detecting random access memory | |
EP4280064A1 (en) | Systems and methods for expandable memory error handling | |
KR101001071B1 (en) | Method and apparatus of reporting memory bit correction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||