CN113918375A - Fault processing method and device, electronic equipment and storage medium - Google Patents

Fault processing method and device, electronic equipment and storage medium Download PDF

Info

Publication number
CN113918375A
Authority
CN
China
Prior art keywords
fault
component
register information
server
positioning result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111514178.4A
Other languages
Chinese (zh)
Other versions
CN113918375B (en)
Inventor
陈衍东
李道童
韩红瑞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd filed Critical Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202111514178.4A priority Critical patent/CN113918375B/en
Publication of CN113918375A publication Critical patent/CN113918375A/en
Application granted granted Critical
Publication of CN113918375B publication Critical patent/CN113918375B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0721Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment within a central processing unit [CPU]
    • G06F11/0766Error or fault reporting or storing
    • G06F11/079Root cause analysis, i.e. error or fault diagnosis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application discloses a fault processing method, a fault processing apparatus, an electronic device, and a computer-readable storage medium, wherein the method comprises the following steps: after a server is started, collecting resource information of the components in the server; after the server shuts down abnormally, collecting fault register information corresponding to the components in the server; and locating the faulty component based on the resource information of the components and the corresponding fault register information by using fault diagnosis rules. According to the fault processing method, after the server shuts down abnormally, the fault register information corresponding to each component in the server is collected, and the resource information of each component and the fault register information collected after the abnormal shutdown are analyzed under preset fault diagnosis rules, so that the faulty component is located, downtime for maintenance is shortened, and the reliability, availability, and serviceability of the server are enhanced.

Description

Fault processing method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for processing a fault, an electronic device, and a computer-readable storage medium.
Background
Servers, as the core of computing and data storage services, are widely used across industries. Under the pressure of business demand from these industries, server designs are increasingly complex, and the number of servers running online continues to grow exponentially. During long periods of uninterrupted operation under computing load, low-probability hardware or software faults that bring a server down remain unavoidable. Even at a downtime rate of only 1 per thousand per month, a huge installed base yields a large number of abnormally downed servers that are difficult to handle; moreover, the longer the unscheduled downtime for maintenance, the more serious the loss to the end customer.
Therefore, how to quickly and accurately locate the faulty component after a server goes down abnormally, and thereby shorten downtime for maintenance, is a technical problem to be solved by those skilled in the art.
Disclosure of Invention
The present application provides a fault processing method, a fault processing apparatus, an electronic device, and a computer-readable storage medium, which accurately locate the faulty component after an abnormal server shutdown and thereby shorten downtime for maintenance.
In order to achieve the above object, the present application provides a fault processing method, comprising:
after a server is started, collecting resource information of components in the server;
after the server shuts down abnormally, collecting fault register information corresponding to the components in the server;
and locating the faulty component based on the resource information of the components and the corresponding fault register information by using fault diagnosis rules.
Wherein locating the faulty component based on the resource information of the components and the corresponding fault register information by using the fault diagnosis rules comprises:
generating localization results for the faulty component based on the resource information of the components and the corresponding fault register information by using a plurality of fault diagnosis rules respectively;
if only one localization result exists, outputting that localization result;
if a plurality of localization results exist, judging whether they are consistent; if so, outputting any one of them; if not, generating a weighted value for each localization result based on the weight of each fault diagnosis rule, and outputting the localization result with the largest weighted value.
Wherein generating the localization results for the faulty component based on the resource information of the components and the corresponding fault register information by using the plurality of fault diagnosis rules respectively comprises:
generating a localization result for the faulty component based on the resource information of the components and the corresponding fault register information by using the fault diagnosis sub-rule corresponding to each component or each fault register under each fault diagnosis rule;
generating a fault weighted value for each component based on the weight of each piece of fault register information under each fault diagnosis sub-rule and the weight of each component relative to each fault register;
and determining the component with the largest fault weighted value as the faulty component.
Wherein the fault diagnosis rules comprise a CPU fault diagnosis rule, and generating a localization result for the faulty component based on the resource information of the components and the corresponding fault register information by using the CPU fault diagnosis rule comprises:
generating a localization result according to the state information and the address information of the MC Bank registers in the CPU, wherein the localization result comprises the CPU fault source, the faulty module within the CPU fault source, and the fault type.
Wherein the fault diagnosis rules comprise a historical fault record diagnosis rule, and generating a localization result for the faulty component based on the resource information of the components and the corresponding fault register information by using the historical fault record diagnosis rule comprises:
judging whether a fault event related to target fault register information exists in the historical fault records; if so, generating a localization result based on the target fault register information, wherein the localization result comprises the component corresponding to the target fault register information.
Wherein the fault diagnosis rules comprise a fault time diagnosis rule, and generating a localization result for the faulty component based on the resource information of the components and the corresponding fault register information by using the fault time diagnosis rule comprises:
generating a localization result based on the fault register information generated within a preset time period before the abnormal downtime, wherein the localization result comprises the component corresponding to the most recent fault register information generated within the preset time period.
Wherein the method further comprises:
if no fault diagnosis rule yields a corresponding localization result, outputting a fault log, wherein the fault log comprises the resource information of the components and the corresponding fault register information;
and creating a new fault diagnosis rule based on the types of error codes in the fault log.
Wherein the method further comprises:
collecting the fault register information of each fault type of each faulty component to generate a diagnosis fault tree, wherein the first-layer nodes of the diagnosis fault tree are classified by faulty component and the second-layer nodes are classified by fault type;
and matching the fault register information corresponding to each component collected after the server shuts down abnormally against the diagnosis fault tree to obtain a fault localization result.
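The two-layer diagnosis fault tree described above can be sketched as follows. This is a minimal Python illustration only; all component names, fault types, and register signatures below are hypothetical, not taken from the application.

```python
# Two-layer diagnosis fault tree: first-layer nodes are faulty components,
# second-layer nodes are fault types; each leaf holds the set of fault-register
# signatures observed for that fault. All names are illustrative.
FAULT_TREE = {
    "CPU": {
        "cache_error": {"MC_BANK_1_STATUS"},
        "bus_error":   {"MC_BANK_4_STATUS", "CSR_IIO_ERR"},
    },
    "Memory": {
        "uncorrectable_error": {"MC_BANK_7_STATUS", "DIMM_ECC_ERR"},
    },
    "PCIe_card": {
        "link_error": {"AER_UNCORR_STATUS"},
    },
}

def match_fault_tree(collected_registers):
    """Return (component, fault_type) pairs whose register signatures are
    fully contained in the registers collected after an abnormal shutdown."""
    collected = set(collected_registers)
    hits = []
    for component, fault_types in FAULT_TREE.items():
        for fault_type, signature in fault_types.items():
            if signature <= collected:  # every signature register was seen
                hits.append((component, fault_type))
    return hits
```

A lookup over the tree replaces a linear scan of per-rule checks: the collected post-shutdown registers are matched against each leaf's signature set.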
Wherein collecting the fault register information corresponding to the components in the server comprises:
capturing, by the BMC in the server, the fault register information from the components, or receiving the fault register information corresponding to the components sent by an auxiliary application program under the operating system and/or by the BIOS (Basic Input/Output System).
Wherein the method further comprises:
collecting the fault register information corresponding to the components in the server during server operation;
and obtaining a fault early-warning result for the server based on the resource information of the components and the corresponding fault register information by using monitoring and diagnosis rules.
Wherein obtaining the fault early-warning result for the server based on the resource information of the components and the corresponding fault register information by using the monitoring and diagnosis rules comprises:
if the fault register information of a target memory shows that a correctable-error storm has occurred or that correctable errors have occurred in adjacent storage units, outputting a fault early warning for the target memory;
and/or, if the number of correctable errors in a level-1 cache unit, a level-2 cache unit, or an instruction prefetch unit of a target CPU is greater than a first threshold, outputting a fault early warning for the target CPU;
and/or, if the number of bad tracks of a target hard disk is greater than a second threshold, or its number of bad blocks is greater than a third threshold, or its read-write error rate is greater than a fourth threshold, outputting a fault early warning for the target hard disk;
and/or, if the number of correctable errors of a target PCIe add-in card is greater than a fifth threshold, outputting a fault early warning for the target PCIe add-in card;
and/or, if the remaining life of a target component is lower than a sixth threshold, outputting a replacement prompt for the target component.
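The monitoring rules above can be sketched as a simple threshold check. All field names and threshold values in this sketch are illustrative assumptions, not specified by the application.

```python
def early_warnings(metrics, thresholds):
    """Evaluate the monitoring rules above on metrics collected at runtime.
    `metrics` maps illustrative metric names to values; `thresholds` supplies
    the first through sixth thresholds under assumed keys."""
    warnings = []
    if metrics.get("memory_ce_storm", False):                 # CE storm detected
        warnings.append("memory fault early warning")
    if metrics.get("cpu_cache_ce", 0) > thresholds["cpu_ce"]:  # first threshold
        warnings.append("CPU fault early warning")
    if (metrics.get("disk_bad_tracks", 0) > thresholds["bad_tracks"]       # second
            or metrics.get("disk_bad_blocks", 0) > thresholds["bad_blocks"]  # third
            or metrics.get("disk_rw_error_rate", 0.0) > thresholds["rw_error_rate"]):  # fourth
        warnings.append("hard disk fault early warning")
    if metrics.get("pcie_ce", 0) > thresholds["pcie_ce"]:      # fifth threshold
        warnings.append("PCIe add-in card fault early warning")
    if metrics.get("remaining_life_pct", 100) < thresholds["life_pct"]:  # sixth
        warnings.append("component replacement prompt")
    return warnings
```

Each "and/or" branch maps to an independent check, so any subset of warnings can fire for a single metrics snapshot.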
Wherein the method further comprises:
generating a failure model for each component based on the fault register information recorded before that component failed;
and matching the fault register information corresponding to each component collected during server operation against the failure model of each component to obtain a failure early-warning result.
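One way to realize the failure-model matching described above is to treat a model as the ordered sequence of fault-register events seen before a past failure and look for that sequence in the live event stream. This is an illustrative sketch with hypothetical event names; the application does not prescribe a model representation.

```python
def matches_failure_model(model, recent_events):
    """Return True if the failure model (an ordered sequence of fault-register
    event names) appears as a subsequence of the recently collected events."""
    it = iter(recent_events)
    # `event in it` advances the iterator, so all model events must occur in order.
    return all(event in it for event in model)
```

If the model sequence is found in order within the runtime stream, a failure early warning would be raised for the corresponding component.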
To achieve the above object, the present application provides a fault processing apparatus, comprising:
an acquisition module, configured to collect resource information of components in a server after the server is started;
a first acquisition module, configured to collect fault register information corresponding to the components in the server after the server shuts down abnormally;
and a locating module, configured to locate the faulty component based on the resource information of the components and the corresponding fault register information by using fault diagnosis rules.
To achieve the above object, the present application provides an electronic device including:
a memory for storing a computer program;
a processor for implementing the steps of the fault handling method as described above when executing the computer program.
To achieve the above object, the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the fault handling method as described above.
In summary, the fault processing method provided by the present application comprises: after a server is started, collecting resource information of components in the server; after the server shuts down abnormally, collecting fault register information corresponding to the components in the server; and locating the faulty component based on the resource information of the components and the corresponding fault register information by using fault diagnosis rules.
Thus, after the server shuts down abnormally, the fault processing method collects the fault register information corresponding to each component in the server, and analyzes the resource information of each component together with the fault register information collected after the abnormal shutdown under preset fault diagnosis rules. The faulty component is thereby accurately located, downtime for maintenance is shortened, a second round of downtime caused by replacing the wrong component is avoided, and the reliability, availability, and serviceability of the server are enhanced. The application also discloses a fault processing apparatus, an electronic device, and a computer-readable storage medium that achieve the same technical effects.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings in the following description show only some embodiments of the present application; those skilled in the art can derive other drawings from them without creative effort. The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and, together with the description, serve to explain the disclosure without limiting it. In the drawings:
FIG. 1 is a flow diagram illustrating a method of fault handling in accordance with an exemplary embodiment;
FIG. 2 is a flow diagram illustrating another fault handling method in accordance with an exemplary embodiment;
FIG. 3 is a schematic diagram illustrating a fault diagnosis tree in accordance with an exemplary embodiment;
FIG. 4 is a flow diagram illustrating yet another fault handling method in accordance with an exemplary embodiment;
FIG. 5 is a block diagram illustrating a fault handling device in accordance with an exemplary embodiment;
FIG. 6 is a block diagram illustrating an electronic device in accordance with an exemplary embodiment.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application. In addition, in the embodiments of the present application, "first", "second", and the like are used for distinguishing similar objects, and are not necessarily used for describing a specific order or a sequential order.
The embodiment of the application discloses a fault processing method that locates the faulty component after a server goes down abnormally, thereby shortening downtime for maintenance.
Referring to fig. 1 and 2, fig. 1 is a flowchart illustrating a fault handling method according to an exemplary embodiment, and fig. 2 is a flowchart illustrating another fault handling method according to an exemplary embodiment. As shown in fig. 1, the method includes:
s101: after a server is started, acquiring resource information of components in the server;
in this embodiment, when the server is started, the resource information of all the components is collected to a BMC (Baseboard Management Controller). In a specific implementation, the BIOS (Basic Input Output System) may be actively reported to the BMC, or the BMC may be actively grabbed from each component, which is not specifically limited in this embodiment.
The components in this embodiment include, but are not limited to: the CPU (Central Processing Unit), PCH (Platform Controller Hub), FCH (Fusion Controller Hub), PSU (Power Supply Unit), memory, hard disks, motherboard, backplane, fans, and PCIe (Peripheral Component Interconnect Express) add-in cards, where PCIe add-in cards include RAID (Redundant Array of Independent Disks) cards, SAS (Serial Attached SCSI) cards, network cards, GPU (Graphics Processing Unit) cards, FPGA (Field Programmable Gate Array) cards, HBA (Host Bus Adapter) cards, HCA (Host Channel Adapter) cards, and the like. The resource information of the components includes, but is not limited to, serial numbers, PCIe slot numbers, PCIe bus address allocation information, and IO (Input/Output) and MMIO (Memory-Mapped I/O) resource information.
S102: after the server is abnormally shut down, acquiring fault register information corresponding to components in the server;
In this step, after the server goes down abnormally, the fault register information corresponding to the components in the server is collected and summarized to the BMC. The fault register information in this embodiment includes, but is not limited to: the CPU fault registers, the PCH/FCH fault registers, the AER (Advanced Error Reporting) registers of PCIe add-in cards, the S.M.A.R.T. (Self-Monitoring Analysis And Reporting Technology) information of memory and hard disks, the status word registers of the PSU, and the fault registers defining the states of the circuit modules on the motherboard/backplane supervised by the CPLD (Complex Programmable Logic Device). It may also include the temperature of each component, the fan speed, and the motherboard VR (Voltage Regulator) voltage and current actively captured by the BMC.
As a possible implementation, collecting the fault register information corresponding to the components in the server comprises: the BMC in the server captures the fault register information from the components. In a specific implementation, the BMC may actively capture fault register information from each component in real time, for example when an abnormal pin level change of a component such as the CPU or PCH triggers the BMC to capture it; or the BMC may capture fault register information passively, for example when the BMC receives SEL (System Event Log) information sent over IPMI by an auxiliary application program under the OS, or by the BIOS, which triggers the capture action.
As another possible implementation, collecting the fault register information corresponding to the components in the server comprises: receiving the fault register information corresponding to the components sent by an auxiliary application program under the operating system and/or by the BIOS. In a specific implementation, the auxiliary application program under the OS may continuously capture the fault register information from each component and report it to the BMC in real time, or the BIOS may report the fault register information to the BMC through the various interrupts raised when a component behaves abnormally.
As shown in fig. 2, the collected information includes the asset resource information of all components, the fault register information captured after the server goes down abnormally (that is, the fault field data), and the fault register information captured during server operation (that is, the real-time collected data), which will be described in detail in the next embodiment.
S103: locating the faulty component based on the resource information of the components and the corresponding fault register information by using fault diagnosis rules.
In this step, the resource information of each component and the fault register information collected after the abnormal downtime are analyzed by using the fault diagnosis rules in the fault diagnosis rule base, so as to locate the faulty component. The fault diagnosis rules in this embodiment include a CPU fault diagnosis rule, a historical fault record diagnosis rule, and a fault time diagnosis rule.
The CPU fault diagnosis rule generates a localization result according to the state information and the address information of the MC Bank registers in the CPU; the localization result comprises the CPU fault source, the faulty module within the CPU fault source, and the fault type. In a specific implementation, the fault registers of the CPUs are searched, and the registers that record the CPU fault source (e.g., IERRLOGGINGREG and MCERRLOGGINGREG on Intel x86 CPUs) are used to determine which register under which CPU recorded the fault source that brought the server down. Taking an Intel x86 CPU as an example, IERRLOGGINGREG and MCERRLOGGINGREG are used to determine the source of the CPU failure. Generally, a CPU has a plurality of MC Bank (Machine Check Bank) registers, corresponding to the respective modules within the CPU. Each MC Bank comprises a status register, an address register, and a MISC register: Bit 63 of the status register indicates whether the status register is valid, Bit 59 indicates whether the MISC register is valid, Bit 58 indicates whether the address register is valid, and the MSCOD field of the status register indicates the fault type recorded by that MC Bank. For certain fault types, if the status, address, and MISC registers are all valid, the address information can be used to locate the faulty module according to the address space distribution topology. Furthermore, according to the current CPU model and microcode version, the collected CPU fault registers are analyzed to check for known problems already solved by a newer microcode version: the error code is matched against the known microcode problems, and if the match succeeds, the known problem is output and a microcode upgrade is proposed as the solution.
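The validity checks on the MC Bank status register can be sketched as a small bit-field decode. Bit positions 63, 59, and 58 follow the description above; the MSCOD field location (bits 16 to 31 here) varies by CPU generation and is an illustrative assumption.

```python
def decode_mc_bank_status(status):
    """Decode validity flags and an assumed MSCOD fault-type field from a
    64-bit MC Bank status register value."""
    return {
        "status_valid": bool(status >> 63 & 1),  # Bit 63: status register valid
        "misc_valid":   bool(status >> 59 & 1),  # Bit 59: MISC register valid
        "addr_valid":   bool(status >> 58 & 1),  # Bit 58: address register valid
        "mscod":        (status >> 16) & 0xFFFF, # assumed MSCOD field position
    }
```

When all three validity flags are set, the address register contents can then be mapped onto the address space distribution topology to name the faulty module.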
The historical fault record diagnosis rule judges whether a fault event related to the target fault register information exists in the historical fault records; if so, a localization result is generated based on the target fault register information, and the localization result comprises the component corresponding to the target fault register information. In a specific implementation, the historical fault records are reviewed to search for fault events related to the target fault register information collected in the current downtime; if such an event exists, the component corresponding to the target fault register information is determined to be the server fault source. For example, if the historical fault records contain a memory CE (correctable error) or UCE (uncorrectable error), and the current downtime log records fault register information corresponding to the CPU and the memory, the memory is determined to be the server fault source. If the historical fault records contain a fault event for a PCIe add-in card, and the current downtime log records fault register information related to the CPU IIO or to the PCIe AER, the PCIe add-in card is determined to be the server fault source.
The fault time diagnosis rule generates a localization result based on the fault register information generated within a preset time period before the abnormal downtime; the localization result comprises the component corresponding to the most recent fault register information generated within that period. In a specific implementation, the fault register information in the fault logs of each component is searched; the closer a record is to the fault time, the more credible it is that the corresponding component caused the fault. The higher-credibility fault register information within the preset period before the fault time is therefore sorted by record time, and the component corresponding to the most recently generated fault register information within the period is determined to be the faulty component of the server.
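The fault time diagnosis rule reduces to selecting the latest record inside a window before the downtime. A minimal sketch, with timestamps as plain seconds and hypothetical component names:

```python
def fault_time_diagnosis(records, downtime, window_seconds):
    """Pick the component whose fault register record is the most recent
    within `window_seconds` before the abnormal downtime.
    `records` is a list of (timestamp, component) tuples.
    Returns None if no record falls inside the window."""
    in_window = [(t, c) for t, c in records
                 if downtime - window_seconds <= t <= downtime]
    if not in_window:
        return None
    return max(in_window)[1]  # the latest timestamp wins
```

The preset window keeps older, unrelated register records from outvoting the fault that actually preceded the crash.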
As a preferred embodiment, this step comprises: generating localization results for the faulty component based on the resource information of the components and the corresponding fault register information by using a plurality of fault diagnosis rules respectively; if only one localization result exists, outputting it; if a plurality of localization results exist, judging whether they are consistent; if so, outputting any one of them; if not, generating a weighted value for each localization result based on the weight of each fault diagnosis rule, and outputting the localization result with the largest weighted value. For example, suppose the weight of fault diagnosis rule 1 is 4, the weight of rule 2 is 3, and the weight of rule 3 is 2. If the localization results of rules 1, 2, and 3 are all A, result A is output. If rule 1 yields A while rules 2 and 3 yield B, the weighted value of A is 4 and the weighted value of B is 3 + 2 = 5, so result B is output. If rule 1 yields A, rule 2 yields B, and rule 3 yields C, the weighted values of A, B, and C are 4, 3, and 2 respectively, so result A is output. If rules 1 and 2 yield no localization result and rule 3 yields C, result C is output.
Among the above fault diagnosis rules, the weight of the CPU fault diagnosis rule is higher than that of the historical fault record diagnosis rule, whose weight is in turn higher than that of the fault time diagnosis rule.
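The weighted voting in the worked example above can be sketched directly. The rule names and weights mirror the example; everything else is an illustrative assumption.

```python
def vote(results, rule_weights):
    """Combine per-rule localization results by weighted voting.
    `results` maps rule name -> localization result (or None if the rule
    produced nothing); `rule_weights` maps rule name -> weight."""
    valid = {rule: res for rule, res in results.items() if res is not None}
    if not valid:
        return None                      # no rule produced a result
    distinct = set(valid.values())
    if len(distinct) == 1:               # all rules agree: output any one
        return distinct.pop()
    totals = {}
    for rule, res in valid.items():      # weighted value per candidate result
        totals[res] = totals.get(res, 0) + rule_weights[rule]
    return max(totals, key=totals.get)   # largest weighted value wins
```

With the weights 4, 3, 2 from the example, a lone result A from rule 1 (weight 4) loses to B backed by rules 2 and 3 (weight 3 + 2 = 5), matching the text.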
In addition, if no fault diagnosis rule yields a corresponding localization result, a fault log is output, the fault log comprising the resource information of the components and the corresponding fault register information; a new fault diagnosis rule is then created based on the types of error codes in the fault log. In a specific implementation, if none of the fault diagnosis rules can output a localization result, the fault log is output and can be analyzed manually to determine whether a specific error code type in the log points directly or indirectly to a certain faulty component, or whether a combination of several error codes can indirectly identify one. If the faulty component can be diagnosed through manual analysis with a clear diagnosis rule, and that rule is verified and trusted by reproducing the fault across multiple downtimes, a new fault diagnosis rule is created and introduced into the fault diagnosis rule base. Thus, for the small number of cases where diagnosis is abnormal or fails, fault classification and deep data mining analysis can be performed: fault data are associated after the real faulty component is manually located, and once the accuracy of the association has been verified over multiple downtimes, a new diagnosis rule is formed, supplemented into the rule base, and verified through online application, forming a continuously optimized closed-loop diagnosis rule scheme.
Further, as a preferred embodiment, the generating, by using the plurality of fault diagnosis rules, the positioning result of the faulty component based on the resource information of the component and the corresponding fault register information respectively includes: generating the positioning result of the faulty component based on the resource information of the component and the corresponding fault register information by using the fault diagnosis sub-rule corresponding to each component or each fault register under each fault diagnosis rule; generating a fault weighted value of each component based on the weight of each piece of fault register information and the weight of each component relative to the fault registers under each fault diagnosis sub-rule; and determining the component with the largest fault weighted value as the faulty component. In a specific implementation, each fault diagnosis rule includes a plurality of fault diagnosis sub-rules for different components or different fault registers. For example, for the CPU fault diagnosis rule, fault diagnosis sub-rules may be formulated for registers in the CPU fault source such as MCA (Machine Check Architecture), CSR (Configuration Space Register) and AER (Advanced Error Reporting) registers; for the historical fault record diagnosis rule and the fault time diagnosis rule, a corresponding fault diagnosis sub-rule may be formulated for each component. In the following, Rule[i] denotes the ith fault diagnosis sub-rule, Reg[n] denotes the nth fault register, and Part[j] denotes the fault weighted value of the jth component.
Part[j][n] denotes the weight of the jth component relative to the nth fault register. At the fault moment of a specific component, fault data is recorded in only a subset of the registers; that is, some registers record fault data while the remaining registers do not. The weight of a component relative to a fault register that recorded fault data is larger, and its weight relative to a fault register that recorded no fault data is 0. Weight[i][n] denotes the weight of the nth fault register information when applied for diagnosis under the ith fault diagnosis sub-rule. A given fault diagnosis sub-rule may diagnose using several pieces of fault register information, and the credibility with which each piece of fault register information locates a given faulty component differs, so different fault register information carries different weights under the same fault diagnosis sub-rule. Part[j] is calculated from Part[j][n] and Weight[i][n], and the component corresponding to the maximum value of Part[j] is determined as the faulty component. It can be understood that, if the output positioning result is inconsistent with the actual faulty component, the positioning result can be corrected by adjusting Part[j][n] and Weight[i][n].
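The computation of Part[j] from Part[j][n] and Weight[i][n] can be sketched as a weighted sum for one sub-rule i; the specific combination formula, register names and weight values below are assumptions for illustration:

```python
# Sketch of the fault-score computation: for sub-rule i,
# Part[j] = sum over n of Weight[i][n] * Part[j][n].
# part_reg_weight[j][n] plays the role of Part[j][n] (0 when register n
# recorded no fault data for component j); reg_weight[n] plays Weight[i][n].

def locate_faulty_part(part_reg_weight, reg_weight):
    """Return (component with the largest fault weighted value, all scores)."""
    scores = {
        part: sum(reg_weight[n] * w for n, w in regs.items())
        for part, regs in part_reg_weight.items()
    }
    return max(scores, key=scores.get), scores

# Two components, three fault registers; all numbers are illustrative only.
part_reg_weight = {
    "CPU0":  {"MCA": 3, "CSR": 2, "AER": 0},  # AER recorded nothing for CPU0
    "PCIe1": {"MCA": 0, "CSR": 1, "AER": 3},
}
reg_weight = {"MCA": 2, "CSR": 1, "AER": 2}
part, scores = locate_faulty_part(part_reg_weight, reg_weight)
print(part, scores)  # CPU0 scores 3*2+2*1=8, PCIe1 scores 1*1+3*2=7
```

Adjusting the two weight tables, as the text notes, is exactly how a wrong positioning result would be corrected in this scheme.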
As a preferred embodiment, this embodiment further includes: acquiring fault register information of each fault type of each faulty component to generate a diagnostic fault tree, wherein the first-layer nodes of the diagnostic fault tree are classified by faulty component and the second-layer nodes are classified by fault type; and matching the fault register information corresponding to each component collected after the server goes down abnormally against the diagnostic fault tree to obtain a fault positioning result. In a specific implementation, a diagnostic fault tree is established as shown in fig. 3: the first-layer nodes are classified by component fault, including CPU faults, memory faults, PCIe faults and the like, and the second-layer nodes are classified by fault type. For example, CPU faults include PCU (Package Control Unit) faults, CHA (Caching and Home Agent) faults, iMC (Integrated Memory Controller) faults and the like; memory faults include DQ (data input/output channel) faults, Cell (basic storage unit) faults and ECC (Error Checking and Correcting) faults; and PCIe faults include Link faults, TLP (Transaction Layer Packet) faults, DLLP (Data Link Layer Packet) faults and the like. Each fault type of each faulty component corresponds to one or more sets of fault register information; for example, a PCU fault corresponds to fault register information DATA1 to DATA1n. The fault register information corresponding to each component collected after the server goes down is traversed and matched in the diagnostic fault tree, and if the matching succeeds, the fault type of the corresponding component is output.
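The two-layer traversal of the diagnostic fault tree can be sketched as nested dictionaries; the tree contents and the "DATA…" signatures below are placeholders, not real register values:

```python
# Sketch of the diagnostic fault tree: layer 1 keys by component, layer 2 by
# fault type, and each leaf holds one or more sets of fault-register
# signatures that identify that fault type.

FAULT_TREE = {
    "CPU":    {"PCU": [{"DATA1", "DATA2"}], "CHA": [{"DATA3"}]},
    "Memory": {"DQ": [{"DATA4"}], "ECC": [{"DATA5", "DATA6"}]},
    "PCIe":   {"Link": [{"DATA7"}]},
}

def match_fault(collected):
    """collected: set of fault-register signatures gathered after the crash.
    Returns (component, fault_type) on the first successful match, else None."""
    for component, fault_types in FAULT_TREE.items():
        for fault_type, signature_sets in fault_types.items():
            for sig in signature_sets:
                if sig <= collected:  # every register in the set was matched
                    return component, fault_type
    return None

print(match_fault({"DATA5", "DATA6", "DATA9"}))  # -> ('Memory', 'ECC')
```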
According to the fault processing method provided by the embodiment of the present application, after the server goes down abnormally, the fault register information corresponding to each component in the server is collected, and the resource information of each component and the fault register information collected after the abnormal downtime are analyzed by using the preset fault diagnosis rules, so that the faulty component is accurately located, the shutdown maintenance time is shortened, secondary downtime maintenance caused by replacing the wrong component is avoided, and the reliability, availability and serviceability of the server are enhanced.
The embodiment of the application discloses a fault processing method, and compared with the previous embodiment, the embodiment further optimizes the technical scheme. Specifically, the method comprises the following steps:
referring to fig. 4, a flowchart of yet another fault handling method is shown according to an exemplary embodiment, as shown in fig. 4, including:
s201: collecting fault register information corresponding to components in the server in the operation process of the server;
During the boot process of the server and during operation after the server enters the operating system (OS), various failures may occur in each component and cause server abnormalities, including IERR (internal error), UCE-FATAL (fatal uncorrectable error) and UCE-nonFATAL (non-fatal uncorrectable error); therefore, the fault register information corresponding to each component needs to be collected and summarized to the BMC. The acquisition process in this embodiment is similar to that in the above embodiments and is not described herein again.
S202: and obtaining a fault early warning result of the server by utilizing a monitoring and diagnosing rule based on the resource information of the component and the corresponding fault register information.
In this step, the resource information of each component and the fault register information collected in real time are analyzed by using the monitoring and diagnosing rules in the monitoring and diagnosing rule base to output a fault early warning result. The monitoring and diagnosing rules in the present embodiment include, but are not limited to, the following rules:
rule 1: and if the fault register information of the target memory detects that a correctable error storm occurs or correctable errors occur in adjacent reading units, outputting a fault early warning of the target memory. It can be understood that the CPU reads data in Cache line units, one Cache line reads 64 bytes, and reads in 8 bursts (read units), each burst reads 64-bit data and 8-bit ECC check bits, and each burst can correct a single bit error occurring in 64 bits, that is, CE, that is, Cache line [ n ] includes burst0, burst1, burst2, burst3, burst4, burst5, burst6, and burst 7. In specific implementation, if a correctable error storm is detected to occur in a certain memory (correctable errors are continuously generated), outputting a fault early warning of the memory; and if the correctable errors are detected to occur in adjacent reading units of a certain memory within a certain time, outputting fault early warning of the memory, for example, if correctable errors exist in Cache line [ n ] burst0 and Cache line [ n ] burst1, or correctable errors exist in Cache line [ n ] burst1 and Cache line [ n ] burst2, or correctable errors exist in Cache line [ n-1] burst7 and Cache line [ n ] burst0, correctable errors exist in Cache line [ n ] burst7 and Cache line [ n +1] burst0, and outputting fault early warning.
Rule 2: and if the correctable error quantity of the first-level cache unit or the second-level cache unit or the instruction pre-storage unit of the target CPU is greater than a first threshold value, outputting the fault early warning of the target CPU. In specific implementation, if the number of the first-level cache units or the second-level cache units or the instruction prefetch units CE of a certain CPU is greater than a first threshold, the fault warning of the CPU is output.
Rule 3: and if the number of bad tracks of the target hard disk is greater than a second threshold, or the bad block count is greater than a third threshold, or the read-write error rate is greater than a fourth threshold, outputting the fault early warning of the target hard disk. In a specific embodiment, the first and second electrodes are,
rule 4: if the correctable error quantity of the target PCIe external plug-in card is larger than a fifth threshold value, outputting a fault early warning of the target PCIe external plug-in card;
rule 5: and if the residual life of the target component is lower than a sixth threshold value, outputting a replacement prompt of the target component. In specific implementation, the life cycle of the component is monitored, when the component leaves a factory, the BMC counts down the life cycle of the component such as the SSD SATA hard disk, the SSD PCIe hard disk and the NVME, when the remaining life is lower than a sixth threshold value, the component replacement prompt is output in a diagnosis mode, and the life cycle of the component is counted down again after the component is replaced.
As a preferred embodiment, this embodiment further includes: generating a failure model corresponding to each part based on the fault register information before the part in the server fails; and matching the fault register information corresponding to each part acquired in the operation process of the server with the failure model corresponding to each part to obtain a failure fault early warning result.
It will be appreciated that the failure of a component can lead to failure of the server as a whole. Therefore, in this embodiment, fault register information recorded before the failure of various types of components, such as board circuit units, CPUs, memories and add-in cards, is collected, and failure models corresponding to the components are generated. During operation of the server, the fault register information collected in real time for each component is matched against the failure models; if a failure model is successfully matched, an early warning for the failure corresponding to that model is output, so that real-time monitoring and early warning are performed before the component actually fails.
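The failure-model matching step above can be sketched minimally, with a failure model simplified to a set of register patterns observed before past failures of that component type; the model contents and pattern names are illustrative assumptions:

```python
# Sketch of failure-model matching: each component type's failure model is
# reduced here to the set of register patterns seen before past failures,
# and live register data is checked against every model.

FAILURE_MODELS = {
    "memory": {"CE_STORM", "REFRESH_ERR"},
    "cpu":    {"THERM_TRIP", "MCA_CE"},
}

def match_failure_models(live_registers):
    """Return components whose full failure pattern appears in the live data."""
    live = set(live_registers)
    return [c for c, pattern in FAILURE_MODELS.items() if pattern <= live]

print(match_failure_models(["CE_STORM", "REFRESH_ERR", "OK"]))  # -> ['memory']
```

A production failure model would likely be statistical rather than an exact pattern set; this sketch only shows where the matching sits in the flow.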
Therefore, component faults are warned of in advance based on the preset monitoring and diagnosis rules, and an unplanned shutdown caused by a component fault can be converted into a planned shutdown, thereby reducing the probability of server downtime.
In the following, a fault handling apparatus provided in an embodiment of the present application is described, and a fault handling apparatus described below and a fault handling method described above may be referred to each other.
Referring to fig. 5, a block diagram of a fault handling apparatus according to an exemplary embodiment is shown, as shown in fig. 5, including:
an obtaining module 501, configured to obtain resource information of a component in a server after the server is powered on;
a first collecting module 502, configured to collect fault register information corresponding to a component in the server after the server is abnormally down;
and a positioning module 503, configured to perform positioning of the faulty component based on the resource information of the component and the corresponding fault register information by using the fault diagnosis rule.
The fault processing device provided by the embodiment of the application acquires the corresponding fault register information of each part in the server after the server is abnormally shut down, and analyzes the resource information of each part and the fault register information acquired after the server is abnormally shut down by using the preset fault diagnosis rule so as to accurately position the fault part, shorten the shutdown maintenance time, avoid secondary shut down maintenance caused by part replacement errors, and enhance the reliability, the availability and the maintainability of the server.
On the basis of the above embodiment, as a preferred implementation, the positioning module 503 includes:
a generation unit configured to generate a positioning result of a faulty component based on the resource information of the component and the corresponding fault register information, respectively, using a plurality of fault diagnosis rules;
the first output unit is used for outputting the positioning result if only one positioning result exists;
the second output unit is used for judging whether the plurality of positioning results are consistent or not if the plurality of positioning results exist; if yes, outputting any one positioning result; if not, generating a weighted value of each positioning result based on the weight of each fault diagnosis rule, and outputting the positioning result with the largest weighted value.
On the basis of the foregoing embodiment, as a preferred implementation, the generating unit is specifically configured to:
generating a positioning result of the fault component based on the resource information of the component and the corresponding fault register information by utilizing the fault diagnosis sub-rule corresponding to each component or each fault register under each fault diagnosis rule;
generating a fault weighted value of each component based on the weight of each fault register information and the weight of each component relative to the fault register under each fault diagnosis sub-rule;
and determining the component with the largest fault weighted value as the fault component.
On the basis of the foregoing embodiment, as a preferred implementation, the fault diagnosis rule includes a CPU fault diagnosis rule, and the generating unit includes:
the first generation subunit is used for generating a positioning result according to the state information and the address information of the MC Bank register in the CPU; and the positioning result comprises a CPU fault source, a fault module in the CPU fault source and a fault type.
On the basis of the foregoing embodiment, as a preferred implementation, the fault diagnosis rule includes a historical fault record diagnosis rule, and the generating unit includes:
the second generation subunit is used for judging whether a fault event related to the target fault register information exists in the historical fault record or not; if yes, generating a positioning result based on the target fault register information; and the positioning result comprises a component corresponding to the target fault register information.
On the basis of the foregoing embodiment, as a preferred implementation, the fault diagnosis rule includes a fault time diagnosis rule, and the generating unit includes:
the third generation subunit is used for generating a positioning result based on the fault register information generated in the preset time period before the abnormal downtime; and the positioning result comprises a component corresponding to the latest fault register information generated in the preset time period.
On the basis of the above embodiment, as a preferred implementation, the method further includes:
the output module is used for outputting a fault log if each fault diagnosis rule does not have a corresponding positioning result; wherein the fault log includes resource information of the component and corresponding fault register information;
and the creating module is used for creating a new fault diagnosis rule based on the error code type in the fault log.
On the basis of the above embodiment, as a preferred implementation, the method further includes:
the first generation module is used for acquiring fault register information of each fault type of each fault component to generate a diagnosis fault tree; the first layer of nodes of the diagnostic fault tree are classified by fault components, and the second layer of nodes are classified by fault types;
and the first matching module is used for matching the fault register information corresponding to each part collected after the server is abnormally shut down based on the diagnosis fault tree to obtain a fault positioning result.
On the basis of the foregoing embodiment, as a preferred implementation manner, the first collecting module 502 is specifically a module that captures the fault register information from the components after the server is abnormally down.
On the basis of the foregoing embodiment, as a preferred implementation manner, the first collecting module 502 is specifically a module that receives fault register information, which is sent by an auxiliary application program and/or a BIOS under an operating system, of the component after the server is abnormally down.
On the basis of the above embodiment, as a preferred implementation, the method further includes:
the second acquisition module is used for acquiring fault register information corresponding to components in the server in the operation process of the server;
and the early warning module is used for obtaining a fault early warning result of the server based on the resource information of the component and the corresponding fault register information by utilizing the monitoring and diagnosing rule.
On the basis of the above embodiment, as a preferred implementation manner, the early warning module includes:
The first output unit is used for outputting the fault early warning of the target memory if the fault register information of the target memory shows that a correctable error storm has occurred or that correctable errors have occurred in adjacent read units;
the second output unit is used for outputting the fault early warning of the target CPU if the correctable error quantity of a primary cache unit or a secondary cache unit or an instruction prestoring unit of the target CPU is greater than a first threshold value;
the third output unit is used for outputting the fault early warning of the target hard disk if the number of bad tracks of the target hard disk is greater than a second threshold, or the number of bad blocks is greater than a third threshold, or the read-write error rate is greater than a fourth threshold;
the fourth output unit is used for outputting the fault early warning of the target PCIe external plug-in card if the correctable error quantity of the target PCIe external plug-in card is greater than a fifth threshold value;
and the fifth output unit is used for outputting a replacement prompt of the target component if the residual life of the target component is lower than a sixth threshold.
On the basis of the above embodiment, as a preferred implementation, the method further includes:
the second generation module is used for generating a failure model corresponding to each part based on the information of the fault register before the part in the server fails;
and the second matching module is used for matching the fault register information corresponding to each part acquired in the operation process of the server with the failure models corresponding to each part to obtain a failure fault early warning result.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Based on the hardware implementation of the program module, and in order to implement the method according to the embodiment of the present application, an embodiment of the present application further provides an electronic device, and fig. 6 is a structural diagram of an electronic device according to an exemplary embodiment, as shown in fig. 6, the electronic device includes:
a communication interface 1 capable of information interaction with other devices such as network devices and the like;
and the processor 2 is connected with the communication interface 1 to realize information interaction with other equipment, and is used for executing the fault processing method provided by one or more technical schemes when running a computer program. And the computer program is stored on the memory 3.
In practice, of course, the various components in the electronic device are coupled together by the bus system 4. It will be appreciated that the bus system 4 is used to enable connection communication between these components. The bus system 4 comprises, in addition to a data bus, a power bus, a control bus and a status signal bus. For the sake of clarity, however, the various buses are labeled as bus system 4 in fig. 6.
The memory 3 in the embodiment of the present application is used to store various types of data to support the operation of the electronic device. Examples of such data include: any computer program for operating on an electronic device.
It will be appreciated that the memory 3 may be either volatile or nonvolatile memory, and may include both volatile and nonvolatile memory. The nonvolatile memory may be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Ferroelectric Random Access Memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disk, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be disk storage or tape storage. The volatile memory may be a Random Access Memory (RAM), which acts as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM) and Direct Rambus Random Access Memory (DRRAM). The memory 3 described in the embodiments of the present application is intended to comprise, without being limited to, these and any other suitable types of memory.
The method disclosed in the above embodiment of the present application may be applied to the processor 2, or implemented by the processor 2. The processor 2 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or instructions in the form of software in the processor 2. The processor 2 described above may be a general purpose processor, a DSP, or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like. The processor 2 may implement or perform the methods, steps and logic blocks disclosed in the embodiments of the present application. A general purpose processor may be a microprocessor or any conventional processor or the like. The steps of the method disclosed in the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software modules may be located in a storage medium located in the memory 3, and the processor 2 reads the program in the memory 3 and in combination with its hardware performs the steps of the aforementioned method.
When the processor 2 executes the program, the corresponding processes in the methods according to the embodiments of the present application are realized, and for brevity, are not described herein again.
In an exemplary embodiment, the present application further provides a storage medium, i.e. a computer storage medium, specifically a computer readable storage medium, for example, including a memory 3 storing a computer program, which can be executed by a processor 2 to implement the steps of the foregoing method. The computer readable storage medium may be Memory such as FRAM, ROM, PROM, EPROM, EEPROM, Flash Memory, magnetic surface Memory, optical disk, or CD-ROM.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.
Alternatively, the integrated units described above in the present application may be stored in a computer-readable storage medium if they are implemented in the form of software functional modules and sold or used as independent products. Based on such understanding, the technical solutions of the embodiments of the present application may be essentially implemented or portions thereof that contribute to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for enabling an electronic device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (14)

1. A method of fault handling, comprising:
after a server is started, acquiring resource information of components in the server;
after the server is abnormally shut down, acquiring fault register information corresponding to components in the server;
positioning the fault component based on the resource information of the component and the corresponding fault register information by using a fault diagnosis rule;
the positioning of the fault component based on the resource information of the component and the corresponding fault register information by using the fault diagnosis rule comprises the following steps:
respectively generating a positioning result of a fault component based on the resource information of the component and the corresponding fault register information by utilizing a plurality of fault diagnosis rules;
if only one positioning result exists, outputting the positioning result;
if a plurality of positioning results exist, judging whether the plurality of positioning results are consistent; if yes, outputting any one positioning result; if not, generating a weighted value of each positioning result based on the weight of each fault diagnosis rule, and outputting the positioning result with the largest weighted value.
2. The fault handling method according to claim 1, wherein the generating, by using the plurality of fault diagnosis rules, the positioning result of the faulty component based on the resource information of the component and the corresponding fault register information, respectively, comprises:
generating a positioning result of the fault component based on the resource information of the component and the corresponding fault register information by utilizing the fault diagnosis sub-rule corresponding to each component or each fault register under each fault diagnosis rule;
generating a fault weighted value of each component based on the weight of each fault register information and the weight of each component relative to the fault register under each fault diagnosis sub-rule;
and determining the component with the largest fault weighted value as the fault component.
3. The fault handling method according to claim 1, wherein the fault diagnosis rule includes a CPU fault diagnosis rule, and the step of generating the positioning result of the faulty component based on the resource information of the component and the corresponding fault register information by using the CPU fault diagnosis rule includes:
generating a positioning result according to the state information and the address information of an MC Bank register in the CPU; and the positioning result comprises a CPU fault source, a fault module in the CPU fault source and a fault type.
4. The fault handling method according to claim 1, wherein the fault diagnosis rule includes a historical fault record diagnosis rule, and the using the historical fault record diagnosis rule to generate the positioning result of the faulty component based on the resource information of the component and the corresponding fault register information respectively comprises:
judging whether a fault event related to target fault register information exists in the historical fault record or not; if yes, generating a positioning result based on the target fault register information; and the positioning result comprises a component corresponding to the target fault register information.
5. The fault handling method according to claim 1, wherein the fault diagnosis rule includes a fault time diagnosis rule, and the step of generating the positioning result of the faulty component based on the resource information of the component and the corresponding fault register information by using the fault time diagnosis rule comprises:
generating a positioning result based on fault register information generated in a preset time period before abnormal downtime; and the positioning result comprises a component corresponding to the latest fault register information generated in the preset time period.
6. The fault handling method of claim 1, further comprising:
if each fault diagnosis rule has no corresponding positioning result, outputting a fault log; wherein the fault log includes resource information of the component and corresponding fault register information;
creating a new fault diagnosis rule based on the type of error code in the fault log.
7. The fault handling method of claim 1, further comprising:
acquiring fault register information of each fault type of each fault component to generate a diagnosis fault tree; the first layer of nodes of the diagnostic fault tree are classified by fault components, and the second layer of nodes are classified by fault types;
and matching fault register information corresponding to each part collected after the server is abnormally shut down based on the diagnosis fault tree to obtain a fault positioning result.
8. The method according to claim 1, wherein the collecting fault register information corresponding to the component in the server includes:
and the BMC in the server captures fault register information from the component, or receives fault register information corresponding to the component sent by an auxiliary application program and/or a BIOS (basic input/output system) under an operating system.
9. The fault handling method of claim 1, further comprising:
collecting fault register information corresponding to the components in the server while the server is running;
and obtaining a fault early warning result for the server based on the resource information of the components and the corresponding fault register information by using a monitoring and diagnosis rule.
10. The method according to claim 9, wherein the obtaining of the fault early warning result of the server based on the resource information of the components and the corresponding fault register information by using the monitoring and diagnosis rule comprises:
outputting a fault early warning for a target memory if its fault register information indicates a correctable-error storm or a correctable error in an adjacent read unit;
and/or outputting a fault early warning for a target CPU if the number of correctable errors in its level-1 cache unit, level-2 cache unit, or instruction prefetch unit is greater than a first threshold;
and/or outputting a fault early warning for a target hard disk if its number of bad tracks is greater than a second threshold, its number of bad blocks is greater than a third threshold, or its read-write error rate is greater than a fourth threshold;
and/or outputting a fault early warning for a target PCIe add-in card if its number of correctable errors is greater than a fifth threshold;
and/or outputting a replacement prompt for a target component if its remaining life is lower than a sixth threshold.
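The threshold branches of claim 10 amount to comparing collected metrics against per-component limits. The sketch below covers only the numeric-threshold checks; the metric field names and threshold values are assumptions for demonstration, not values from the patent.

```python
# Illustrative threshold table for the early-warning checks described above.
# The numbered thresholds correspond to the first through sixth thresholds of claim 10.
THRESHOLDS = {
    "cpu_cache_ce": 100,          # first threshold: correctable cache/prefetch errors
    "disk_bad_tracks": 10,        # second threshold
    "disk_bad_blocks": 50,        # third threshold
    "disk_rw_error_rate": 0.01,   # fourth threshold
    "pcie_ce": 200,               # fifth threshold
    "remaining_life_pct": 10,     # sixth threshold (triggers a replacement prompt)
}

def early_warnings(metrics):
    """Return early-warning messages for any metric exceeding its threshold."""
    warnings = []
    if metrics.get("cpu_cache_ce", 0) > THRESHOLDS["cpu_cache_ce"]:
        warnings.append("CPU fault early warning")
    if (metrics.get("disk_bad_tracks", 0) > THRESHOLDS["disk_bad_tracks"]
            or metrics.get("disk_bad_blocks", 0) > THRESHOLDS["disk_bad_blocks"]
            or metrics.get("disk_rw_error_rate", 0) > THRESHOLDS["disk_rw_error_rate"]):
        warnings.append("hard disk fault early warning")
    if metrics.get("pcie_ce", 0) > THRESHOLDS["pcie_ce"]:
        warnings.append("PCIe add-in card fault early warning")
    if metrics.get("remaining_life_pct", 100) < THRESHOLDS["remaining_life_pct"]:
        warnings.append("component replacement prompt")
    return warnings
```

The memory storm and adjacent-read-unit conditions are pattern checks rather than single-number comparisons, so they would need their own detection logic on top of this table.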
11. The fault handling method of claim 9, further comprising:
generating, for each component in the server, a failure model based on the fault register information recorded before that component fails;
and matching the fault register information collected for each component during server operation against the failure model corresponding to that component to obtain a failure early warning result.
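One simple way to realize the failure model of claim 11 is to record the relative frequency of register events seen before past failures and flag a component whose live event stream mostly consists of those events. The similarity measure, event names, and 0.8 cutoff below are assumptions made for illustration; the patent does not specify a model form.

```python
from collections import Counter

def build_failure_model(pre_failure_events):
    """Model = relative frequency of each register event observed before failure."""
    counts = Counter(pre_failure_events)
    total = sum(counts.values())
    return {event: n / total for event, n in counts.items()}

def matches_model(model, live_events, threshold=0.8):
    """Flag a match when most live events were also seen before past failures."""
    if not live_events:
        return False
    known = sum(1 for e in live_events if e in model)
    return known / len(live_events) >= threshold

# Hypothetical pre-failure history for one memory component:
model = build_failure_model(["ce", "ce", "ce_storm", "patrol_scrub_err"])
```

A production system would likely use a richer model (event sequences, timing, or a learned classifier), but the match-against-history structure is the same.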
12. A fault handling device, comprising:
an acquisition module configured to acquire resource information of components in the server after the server is powered on;
a first acquisition module configured to collect fault register information corresponding to the components in the server after the server shuts down abnormally;
a positioning module configured to locate the faulty component based on the resource information of the components and the corresponding fault register information by using a fault diagnosis rule;
wherein the positioning module comprises:
a generation unit configured to generate positioning results of the faulty component based on the resource information of the components and the corresponding fault register information by using a plurality of fault diagnosis rules, respectively;
a first output unit configured to output the positioning result if only one positioning result exists;
a second output unit configured to, if a plurality of positioning results exist, judge whether the positioning results are consistent; if so, output any one of the positioning results; if not, generate a weighted value for each positioning result based on the weight of each fault diagnosis rule, and output the positioning result with the largest weighted value.
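The arbitration performed by the output units of claim 12 can be sketched as follows: when rules disagree, each rule's weight is summed per candidate component and the largest weighted value wins. The rule names and weight values are illustrative assumptions, not taken from the patent.

```python
from collections import defaultdict

# Hypothetical per-rule weights; the patent leaves the actual values unspecified.
RULE_WEIGHTS = {"history_rule": 0.5, "time_rule": 0.3, "tree_rule": 0.2}

def select_result(results):
    """results: mapping of rule name -> located component (or None if no result)."""
    located = {rule: comp for rule, comp in results.items() if comp is not None}
    if not located:
        return None                      # no rule produced a positioning result
    components = set(located.values())
    if len(components) == 1:             # only one result, or all results agree
        return components.pop()
    weighted = defaultdict(float)        # sum each rule's weight per candidate
    for rule, comp in located.items():
        weighted[comp] += RULE_WEIGHTS.get(rule, 0.0)
    return max(weighted, key=weighted.get)
```

With these weights, two agreeing lower-weight rules can outvote a single higher-weight rule, which is the point of weighting rules rather than picking one arbitrarily.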
13. An electronic device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the fault handling method of any one of claims 1 to 11 when executing the computer program.
14. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out the steps of the fault handling method according to any one of claims 1 to 11.
CN202111514178.4A 2021-12-13 2021-12-13 Fault processing method and device, electronic equipment and storage medium Active CN113918375B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111514178.4A CN113918375B (en) 2021-12-13 2021-12-13 Fault processing method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113918375A true CN113918375A (en) 2022-01-11
CN113918375B CN113918375B (en) 2022-04-22

Family

ID=79248643

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111514178.4A Active CN113918375B (en) 2021-12-13 2021-12-13 Fault processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113918375B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109840157A (en) * 2017-11-28 2019-06-04 中国移动通信集团浙江有限公司 Method, apparatus, electronic equipment and the storage medium of fault diagnosis
CN111737035A (en) * 2020-05-28 2020-10-02 苏州浪潮智能科技有限公司 Fault diagnosis method and system based on server log
CN111767184A (en) * 2020-09-01 2020-10-13 苏州浪潮智能科技有限公司 Fault diagnosis method and device, electronic equipment and storage medium
CN111913133A (en) * 2020-06-30 2020-11-10 北京航天测控技术有限公司 Distributed fault diagnosis and maintenance method, device, equipment and computer readable medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114441964A (en) * 2022-04-08 2022-05-06 苏州浪潮智能科技有限公司 Fault positioning method, device and medium in power supply process of storage system
CN116089155A (en) * 2023-04-11 2023-05-09 阿里云计算有限公司 Fault processing method, computing device and computer storage medium
CN117806900A (en) * 2023-07-28 2024-04-02 苏州浪潮智能科技有限公司 Server management method, device, electronic equipment and storage medium
CN117806900B (en) * 2023-07-28 2024-05-07 苏州浪潮智能科技有限公司 Server management method, device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN113918375B (en) 2022-04-22

Similar Documents

Publication Publication Date Title
CN113918375B (en) Fault processing method and device, electronic equipment and storage medium
US10761926B2 (en) Server hardware fault analysis and recovery
Gunawi et al. Fail-slow at scale: Evidence of hardware performance faults in large production systems
US9720758B2 (en) Diagnostic analysis tool for disk storage engineering and technical support
TWI317868B (en) System and method to detect errors and predict potential failures
CN109328340B (en) Memory fault detection method and device and server
WO2021169260A1 (en) System board card power supply test method, apparatus and device, and storage medium
US8108724B2 (en) Field replaceable unit failure determination
CN102893262B (en) Use the fault bus Air conduct measurement that syndrome is analyzed
CN111104293A (en) Method, apparatus and computer program product for supporting disk failure prediction
CN105468484A (en) Method and apparatus for determining fault location in storage system
Di et al. Characterizing and understanding HPC job failures over the 2k-day life of IBM BlueGene/Q system
Di et al. Exploring properties and correlations of fatal events in a large-scale HPC system
CN111414268A (en) Fault processing method and device and server
US20220100389A1 (en) Method, electronic device, and computer program product for managing disk
Du et al. Predicting uncorrectable memory errors for proactive replacement: An empirical study on large-scale field data
CN111858240A (en) Monitoring method, system, equipment and medium of distributed storage system
CN111984487A (en) Method and device for recording fault hardware position off-line
CN114758714A (en) Hard disk fault prediction method and device, electronic equipment and storage medium
US20200089558A1 (en) Method of determining potential anomaly of memory device
CN110489260A (en) Fault recognition method, device and BMC
CN113010341A (en) Method and equipment for positioning fault memory
CN112650612A (en) Memory fault positioning method and device
Yu et al. Himfp: Hierarchical intelligent memory failure prediction for cloud service reliability
CN113688564B (en) Method, device, terminal and storage medium for predicting residual life of SSD hard disk

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant