WO2024148857A1

WO2024148857A1 - Method and apparatus for filtering root cause of server fault, and non-volatile readable storage medium and electronic apparatus

Info

Publication number: WO2024148857A1
Application number: PCT/CN2023/121451
Authority: WO
Inventors: 张帅豪
Original assignee: 苏州元脑智能科技有限公司
Priority date: 2023-01-09
Filing date: 2023-09-26
Publication date: 2024-07-18
Also published as: CN115766402B; CN115766402A

Abstract

The present application relates to the technical field of computers. Disclosed are a method and apparatus for filtering the root cause of a server fault, and a non-volatile readable storage medium and an electronic apparatus. The method for filtering the root cause of a server fault comprises: acquiring a target fault alarm which is generated in a server; according to first alarm information which is carried in the target fault alarm, classifying the target fault alarm, so as to obtain a target alarm type, wherein alarm types of fault alarms comprise: a root cause alarm and an associated alarm, the root cause alarm being used for indicating that a corresponding fault alarm is the root cause of a server fault, and the associated alarm being used for indicating that the corresponding fault alarm is caused by an associated fault alarm that belongs to the root cause alarm; and according to the target alarm type, determining whether to report the target fault alarm. By using the technical solution, problems in the related art, such as the efficiency of repairing a server fault being relatively low, are solved.

Description

Method and device for filtering root causes of server failures, non-volatile readable storage medium, and electronic device

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to a Chinese patent application filed with the Chinese Patent Office on January 9, 2023, with application number 202310030520.6 and application name “Filtering method and device, storage medium and electronic device for root causes of server failures”, the entire contents of which are incorporated by reference into this application.

Technical Field

The present application relates to the field of computer technology, and in particular, to a method and device for filtering root causes of server failures, a non-volatile readable storage medium, and an electronic device.

Background Art

At present, in fields such as storage, servers, cloud data centers, IT (Information Technology), and embedded computing, all intelligent devices rely on the stability of firmware and systems. In these scenarios, when errors or failures occur in the hardware or software of a device during development, testing, or customer business operation, the general processing flow goes through error detection, fault diagnosis, fault repair, and fault alarm reporting. However, as computer hardware and software grow more complex, systems depend heavily on many features and services. After a problem occurs, the fault propagates across multiple services and features, and each of them repeatedly reports fault alarms.

The current situation has the following obvious defects: 1. A large number of service tickets are triggered, causing customer panic and leading to direct and indirect economic losses for equipment and service providers; 2. After a large number of alarms are reported, customers and service staff must manually determine the root cause of the fault and then complete the fault repair based on that root cause, which increases the RTO (Recovery Time Objective)/RPO (Recovery Point Objective).

With regard to the problems in related technologies such as low efficiency in repairing server failures, no effective solutions have been proposed yet.

Summary of the invention

The embodiments of the present application provide a method and device for filtering the root causes of server failures, a non-volatile readable storage medium, and an electronic device, so as to at least solve the problem of low efficiency in repairing server failures in the related art.

According to an embodiment of the present application, a method for filtering root causes of server failures is provided, including:

Get the target fault alarm generated in the server;

The target fault alarm is classified according to the first alarm information carried by the target fault alarm to obtain a target alarm type, wherein the alarm type of the fault alarm includes: a root cause alarm and an associated alarm, wherein the root cause alarm is used to indicate that the corresponding fault alarm is the root cause of the server fault, and the associated alarm is used to indicate that the corresponding fault alarm is caused by the fault alarm associated with the root cause alarm;

Determine whether to report the target fault alarm based on the target alarm type.

Optionally, the target fault alarm is classified according to the first alarm information carried by the target fault alarm to obtain the target alarm type, including:

Searching for an associated alarm field from the target fault alarm, wherein the associated alarm field is used to indicate whether the target fault alarm is an associated alarm, and the first alarm information includes the associated alarm field;

In the case where the associated alarm field is used to indicate that the target fault alarm is an associated alarm, determining that the target alarm type is an associated alarm;

In a case where the associated alarm field is used to indicate that the target fault alarm is not an associated alarm, it is determined that the target alarm type is a root cause alarm.
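As an illustrative, non-limiting sketch, the field-based classification above could look as follows in Python. The key name `is_associated` and the dictionary representation of an alarm are assumptions made for illustration; the application only states that an associated alarm field is carried in the first alarm information. Treating a missing field as a root cause alarm is likewise an assumption.

```python
from enum import Enum


class AlarmType(Enum):
    ROOT_CAUSE = "root_cause"
    ASSOCIATED = "associated"


def classify_by_field(alarm: dict) -> AlarmType:
    """Classify a fault alarm from its associated alarm field.

    "is_associated" is a hypothetical key name. An alarm without the field
    is treated as a root cause alarm, which is an assumption of this sketch.
    """
    if alarm.get("is_associated", False):
        return AlarmType.ASSOCIATED
    return AlarmType.ROOT_CAUSE
```

For example, `classify_by_field({"name": "TEMP_HIGH", "is_associated": True})` yields `AlarmType.ASSOCIATED`.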

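Optionally, the target fault alarm is classified according to the first alarm information carried by the target fault alarm to obtain the target alarm type, including:

Extracting a target alarm feature from the target fault alarm, wherein the target alarm feature is used to indicate a cause of the target fault alarm, and the first alarm information includes the target alarm feature;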

The target fault alarm is classified according to the target alarm characteristics to obtain the target alarm type.

Optionally, the target fault alarm is classified according to the target alarm feature to obtain the target alarm type, including:

When the target alarm feature is used to indicate that the cause of the target fault alarm is another fault alarm, the target alarm type is determined to be an associated alarm;

When the target alarm feature is used to indicate that the cause of the target fault alarm is a hardware device in the server, the target alarm type is determined to be a root cause alarm.
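A minimal sketch of the feature-based classification, assuming the target alarm feature has already been extracted into a hypothetical `cause_kind` value of either `"alarm"` (caused by another fault alarm) or `"hardware"` (caused by a hardware device in the server). These names and values are illustrative and do not come from the application.

```python
def classify_by_feature(alarm: dict) -> str:
    """Map the cause indicated by the alarm feature to an alarm type.

    "cause_kind" is a hypothetical field holding the extracted feature.
    """
    cause_kind = alarm["cause_kind"]
    if cause_kind == "alarm":
        # The alarm was triggered by another fault alarm.
        return "associated"
    if cause_kind == "hardware":
        # The alarm was triggered directly by a hardware device.
        return "root_cause"
    raise ValueError(f"unknown cause kind: {cause_kind!r}")
```

Raising an error for an unrecognized cause is a design choice of this sketch; the application does not say how such alarms are handled.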

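Optionally, the target fault alarm is classified according to the first alarm information carried by the target fault alarm to obtain the target alarm type, including:

Inputting the target fault alarm into the target alarm classification model, wherein the target alarm classification model is obtained by training the initial alarm classification model using the first alarm sample labeled with the root cause alarm type and the second alarm sample labeled with the associated alarm type;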

Get the target alarm type output by the target alarm classification model.

Optionally, whether to report a target fault alarm is determined according to the target alarm type, including:

When the target alarm type is a root cause alarm, report the target fault alarm;

In the case where the target alarm type is an associated alarm, it is determined whether to report the target fault alarm according to the second alarm information carried by the target fault alarm.

Optionally, determining whether to report the target fault alarm according to the second alarm information carried by the target fault alarm includes:

Obtaining a target association period corresponding to the target fault alarm, wherein the target association period is used to indicate a time interval in which a target associated fault alarm belonging to a root cause alarm having an associated relationship with the target fault alarm is located;

Searching for whether the target associated fault alarm is obtained within the time range of the target association period before and after the acquisition time of the target fault alarm;

When the target associated fault alarm is found, the target fault alarm is ignored;

If the target associated fault alarm is not found, the target fault alarm is reported.

Optionally, before searching whether a target associated fault alarm is obtained within a time range of a target associated period before and after the acquisition time of the target fault alarm, the method further includes one of the following:

Searching for a target associated fault alarm corresponding to the target fault alarm from the fault alarms and associated fault alarms having a corresponding relationship;

An associated fault alarm field is extracted from the target fault alarm, wherein the associated fault alarm field is used to record the target associated fault alarm that is a root cause alarm and has an associated relationship with the target fault alarm.
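The association period check can be sketched as follows, assuming each alarm carries its acquisition time, the name of its associated root cause alarm (the associated fault alarm field), and its association period. The `Alarm` class and all of its attribute names are hypothetical. The sketch only inspects alarms that have already been acquired; an implementation that also honors the part of the window after the acquisition time would presumably need to defer its decision until the period has elapsed.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Alarm:
    name: str                          # hypothetical alarm identifier
    acquired_at: float                 # acquisition time, in seconds
    root_alarm: Optional[str] = None   # associated fault alarm field; None for a root cause alarm
    association_period: float = 0.0    # target association period, in seconds


def should_report(alarm: Alarm, acquired: Iterable[Alarm]) -> bool:
    """Report root cause alarms; suppress an associated alarm when its
    root cause alarm was acquired within the association period."""
    if alarm.root_alarm is None:
        return True
    for other in acquired:
        same_root = other.name == alarm.root_alarm
        in_window = abs(other.acquired_at - alarm.acquired_at) <= alarm.association_period
        if same_root and in_window:
            return False  # root cause already known: ignore this alarm
    return True  # no matching root cause alarm found: report it
```

With a fan failure acquired at t = 100 s and an over-temperature alarm at t = 105 s whose association period is 10 s, the over-temperature alarm is ignored; if the fan failure had been acquired at t = 50 s, the over-temperature alarm would be reported.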

Optionally, after determining whether to report a target fault alarm according to the target alarm type, the method further includes:

Determine a restoration timing corresponding to the target fault alarm according to the third alarm information carried by the target fault alarm;

When it is detected that the server has reached the restoration timing, the target fault alarm is restored.

Optionally, determining a restoration timing corresponding to the target fault alarm according to the third alarm information carried by the target fault alarm includes:

searching for a restart recovery field from the target fault alarm, wherein the restart recovery field is used to indicate whether the target fault alarm is recovered after the target device generating the target fault alarm is restarted, and the third alarm information includes the restart recovery field;

In the case where the restart recovery field is used to indicate that the target fault alarm is recovered after the target device generating the target fault alarm is restarted, the restoration timing is determined to be the restart of the target device.

Optionally, when it is detected that the server has reached the restoration timing, the target fault alarm is restored, including:

Detect whether the target device has been rebooted;

When it is detected that the target device is restarted and the target device restarts successfully, the target fault alarm is restored.

Optionally, after searching the restart recovery field from the target fault alarm, the method further includes:

In the case where the restart recovery field is used to indicate that the target fault alarm is not restored after the target device generating the target fault alarm is restarted, searching the device identification field from the target fault alarm, wherein the device identification field is used to indicate the target device identification of the target device generating the target fault alarm;

The restoration timing is determined to be a change of the device identifier at the location of the target device.

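Optionally, when it is detected that the server has reached the restoration timing, the target fault alarm is restored, including:

Detecting the device identifier at the location of the target device;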

When it is detected that the device identifier at the location where the target device is located is changed from the target device identifier to the reference device identifier, the target fault alarm is restored.
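The two restoration timings described above (restart of the target device, or a change of the device identifier at its location) could be combined as in the following sketch. The alarm keys `restart_recovery`, `slot`, and `device_id`, and the event dictionaries with a `kind` of `"restart"` or `"id_scan"`, are hypothetical names introduced only for illustration.

```python
def should_restore(alarm: dict, event: dict) -> bool:
    """Decide whether a server event clears a previously raised alarm.

    alarm: {"slot", "restart_recovery", "device_id"}  (hypothetical keys)
    event: {"slot", "kind", ...}                      (hypothetical keys)
    """
    if event["slot"] != alarm["slot"]:
        return False  # the event concerns a different device location
    if alarm["restart_recovery"]:
        # Restoration timing: the target device restarted successfully.
        return event["kind"] == "restart" and event.get("success", False)
    # Restoration timing: the identifier at this location changed,
    # i.e. the device at this location is no longer the one that alarmed.
    return event["kind"] == "id_scan" and event["device_id"] != alarm["device_id"]
```

Under these assumptions, an alarm flagged as not recoverable by restart is cleared only when an identifier scan at the same location returns a different identifier.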

Optionally, obtain target fault alarms generated in the server, including:

In the case where a fault is detected in the server, target fault data of the fault is collected;

Locate the alarm source of the fault according to the target fault data to obtain the target alarm source;

Recover from the fault according to the target alarm source;

In the event of a failure in fault recovery, a target fault alarm is generated.
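The four steps above form a pipeline in which an alarm is produced only when automatic recovery fails. A minimal sketch follows, with the three stages passed in as callables; the function name, the parameter names, and the dictionary returned as the alarm are all assumptions of this sketch.

```python
from typing import Any, Callable, Optional


def acquire_alarm(
    collect: Callable[[], Any],
    locate: Callable[[Any], str],
    recover: Callable[[str], bool],
) -> Optional[dict]:
    """Run collection, localization, and recovery for one detected fault.

    Returns None when recovery succeeds (no alarm is generated), otherwise
    a dictionary standing in for the target fault alarm.
    """
    data = collect()          # collect target fault data
    source = locate(data)     # locate the target alarm source
    if recover(source):       # attempt automatic recovery
        return None
    return {"source": source, "data": data}
```

Passing the stages as callables keeps the control flow visible without committing to how each stage is implemented.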

Optionally, target fault data of the fault is collected, including:

Retry the fault a target number of times;

In case of retry failure, collect the initial fault data of the fault;

Eliminate the data that exceeds the target data interval in the initial fault data to obtain reference fault data;

The reference fault data is averaged to obtain the target fault data.

Optionally, locating the alarm source of the fault according to the target fault data to obtain the target alarm source includes:

Obtaining the fault cause corresponding to the target fault data from the fault data and the fault causes having a corresponding relationship as a candidate fault cause;

Find the target fault cause from the candidate fault causes according to the topological relationship of the devices in the server and the target fault data;

A field replaceable unit (FRU) corresponding to the target fault cause in the server is determined as a target alarm source.

Optionally, searching for a target fault cause from candidate fault causes according to the topological relationship of the devices in the server and the target fault data includes:

Find the target topological relationship corresponding to the target fault data from the topological relationship of the devices in the server;

According to the running status of the equipment in the target topology relationship, the candidate fault causes are checked to obtain the target fault cause.

Optionally, restore the fault according to the target alarm source, including:

Obtaining a target recovery process corresponding to a target alarm source from alarm sources and recovery processes having a corresponding relationship;

When the target recovery process is obtained, the target recovery process is executed;

When the target recovery process is not obtained or the target recovery process fails to be executed, it is determined that the fault recovery has failed.
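A sketch of the recovery lookup, assuming the correspondence between alarm sources and recovery processes is held in a dictionary of callables that return True on success. The names used here are hypothetical, and treating a raised exception as a failed recovery is an assumption of this sketch.

```python
from typing import Callable, Dict

RecoveryProcess = Callable[[], bool]  # returns True when recovery succeeds


def recover_fault(source: str, processes: Dict[str, RecoveryProcess]) -> bool:
    """Look up and run the recovery process for an alarm source."""
    process = processes.get(source)
    if process is None:
        return False  # no recovery process: fault recovery has failed
    try:
        return bool(process())
    except Exception:
        return False  # the recovery process itself failed to execute
```

A False result is the condition under which a target fault alarm would subsequently be generated.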

Optionally, generate target fault alarms, including:

Determine whether the target fault data falls within the alarm threshold range;

When the target fault data falls within the alarm threshold range, a target fault alarm is generated.
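The threshold gate can be sketched as a simple range check. The inclusive bounds and the shape of the returned alarm are assumptions; the application does not specify whether the alarm threshold range includes its endpoints.

```python
from typing import Optional


def maybe_generate_alarm(value: float, low: float, high: float, source: str) -> Optional[dict]:
    """Generate an alarm only when the fault data lies in [low, high]."""
    if low <= value <= high:
        return {"source": source, "value": value}
    return None  # outside the alarm threshold range: no alarm
```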

According to another embodiment of the present application, a device for filtering root causes of server failures is also provided, including:

An acquisition module is configured to acquire a target fault alarm generated in a server;

A classification module is configured to classify the target fault alarm according to the first alarm information carried by the target fault alarm to obtain a target alarm type, wherein the alarm types of the fault alarm include: a root cause alarm and an associated alarm, wherein the root cause alarm is used to indicate that the corresponding fault alarm is the root cause of the server fault, and the associated alarm is used to indicate that the corresponding fault alarm is caused by the fault alarm associated with the root cause alarm;

The first determination module is configured to determine whether to report a target fault alarm according to a target alarm type.

According to another aspect of the embodiments of the present application, a non-volatile readable storage medium is provided, in which a computer program is stored, wherein the computer program is configured to execute the above-mentioned server failure root cause filtering method when running.

According to another aspect of an embodiment of the present application, an electronic device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the method for filtering the root cause of server failure through the computer program.

In an embodiment of the present application, a target fault alarm generated in a server is obtained; the target fault alarm is classified according to the first alarm information carried by the target fault alarm to obtain a target alarm type, wherein the alarm types of a fault alarm include a root cause alarm and an associated alarm, the root cause alarm being used to indicate that the corresponding fault alarm is the root cause of the server failure, and the associated alarm being used to indicate that the corresponding fault alarm is caused by an associated fault alarm belonging to the root cause alarm; and whether to report the target fault alarm is determined according to the target alarm type. That is, each fault alarm is first classified as a root cause alarm or an associated alarm, and the decision to report it depends on that type, which avoids the situation where a large number of reported associated alarms reduces the efficiency of server fault repair. The above technical solution solves the problem of low efficiency of server fault repair in the related art and achieves the technical effect of improving the efficiency of server fault repair.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and, together with the description, serve to explain the principles of the present application.

In order to more clearly illustrate the technical solutions in the embodiments of the present application or the prior art, the drawings required for use in the embodiments or the description of the prior art will be briefly introduced below. Obviously, a person of ordinary skill in the art can obtain other drawings based on these drawings without creative effort.

FIG. 1 is a schematic diagram of a hardware environment of a method for filtering root causes of server failures according to an embodiment of the present application;

FIG. 2 is a flow chart of a method for filtering root causes of server failures according to an embodiment of the present application;

FIG. 3 is a schematic diagram of generating a target fault alarm according to an embodiment of the present application;

FIG. 4 is a schematic diagram of locating a target alarm source according to an embodiment of the present application;

FIG. 5 is a schematic diagram of a target fault alarm database according to an embodiment of the present application;

FIG. 6 is a structural block diagram of a device for filtering root causes of server failures according to an embodiment of the present application.

Detailed Description

In order to enable those skilled in the art to better understand the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below in conjunction with the drawings in the embodiments of the present application. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. Based on the embodiments in the present application, all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the scope of protection of this application.

It should be noted that the terms "first", "second", etc. in the specification and claims of the present application and the above-mentioned drawings are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence. It should be understood that the data used in this way can be interchangeable where appropriate, so that the embodiments of the present application described herein can be implemented in an order other than those illustrated or described herein. In addition, the terms "including" and "having" and any of their variations are intended to cover non-exclusive inclusions, for example, a process, method, system, product or device comprising a series of steps or units is not necessarily limited to those steps or units clearly listed, but may include other steps or units that are not clearly listed or inherent to these processes, methods, products or devices.

The method embodiment provided in the embodiments of the present application can be executed in a computer terminal, a device terminal, or a similar computing device. Taking execution on a computer terminal as an example, FIG. 1 is a schematic diagram of a hardware environment of a method for filtering root causes of server failures according to an embodiment of the present application. As shown in FIG. 1, the computer terminal may include one or more processors 102 (only one is shown in FIG. 1; the processor 102 may include, but is not limited to, a processing device such as a microprocessor MCU (Microcontroller Unit) or a programmable logic device FPGA (Field-Programmable Gate Array)) and a memory 104 configured to store data. In an exemplary embodiment, the computer terminal may also include a transmission device 106 configured for communication and an input/output device 108. Those skilled in the art can understand that the structure shown in FIG. 1 is only illustrative and does not limit the structure of the computer terminal. For example, the computer terminal may include more or fewer components than those shown in FIG. 1, or have a different configuration that provides the same functions as, or more functions than, the configuration shown in FIG. 1.

The memory 104 is configured to store computer programs, for example, software programs and modules of application software, such as the computer program corresponding to the filtering method for the root cause of server failure in the embodiment of the present application. The processor 102 executes various functional applications and data processing by running the computer program stored in the memory 104, that is, to implement the above method. The memory 104 may include a high-speed random access memory, and may also include a non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory 104 may include a memory remotely arranged relative to the processor 102, and these remote memories may be connected to the computer terminal via a network. Examples of the above-mentioned network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.

The transmission device 106 is configured to receive or send data via a network. Examples of the above-mentioned network may include a wireless network provided by the communication provider of the computer terminal. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, NIC), which can be connected to other network devices through a base station so as to communicate with the Internet. In one example, the transmission device 106 can be a radio frequency (RF) module, which is configured to communicate with the Internet wirelessly.

In this embodiment, a method for filtering root causes of server failures is provided, which is applied to the above-mentioned computer terminal. FIG. 2 is a flow chart of a method for filtering root causes of server failures according to an embodiment of the present application. As shown in FIG. 2 , the process includes the following steps:

Step S202, obtaining a target fault alarm generated in the server;

Step S204: classify the target fault alarm according to the first alarm information carried by the target fault alarm to obtain a target alarm type, wherein the alarm types of the fault alarm include: a root cause alarm and an associated alarm, wherein the root cause alarm is used to indicate that the corresponding fault alarm is the root cause of the server fault, and the associated alarm is used to indicate that the corresponding fault alarm is caused by the fault alarm associated with the root cause alarm;

Step S206: Determine whether to report a target fault alarm according to the target alarm type.

Through the above steps, the target fault alarm generated in the server is first obtained, and then the target fault alarm is classified according to the first alarm information carried by the target fault alarm to obtain the target alarm type. The target alarm type includes root cause alarm and associated alarm, wherein the root cause alarm is used to indicate that the corresponding fault alarm is the root cause of the server failure, and the associated alarm is used to indicate that the corresponding fault alarm is caused by the fault alarm associated with the root cause alarm. Finally, it is determined whether to report the target fault alarm according to the target alarm type, avoiding the situation where a large number of associated alarms are reported, resulting in a reduction in the efficiency of server fault repair. The above technical solution solves the problems of low efficiency of server fault repair in related technologies, and achieves the technical effect of improving the efficiency of server fault repair.
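Steps S202 to S206 can be sketched as a single filtering loop. This is an illustrative outline only: the classifier and the secondary check applied to associated alarms are passed in as callables, and the names `filter_alarms`, `classify`, and `check_associated` are hypothetical.

```python
from typing import Callable, Iterable, List


def filter_alarms(
    alarms: Iterable[dict],
    classify: Callable[[dict], str],
    check_associated: Callable[[dict], bool],
) -> List[dict]:
    """Return only the alarms that should be reported."""
    reported = []
    for alarm in alarms:                      # S202: obtain each fault alarm
        alarm_type = classify(alarm)          # S204: root cause or associated
        if alarm_type == "root_cause":        # S206: root cause alarms are reported
            reported.append(alarm)
        elif check_associated(alarm):         # S206: associated alarms need a further check
            reported.append(alarm)
    return reported
```

When `check_associated` always returns False, every associated alarm is filtered out and only root cause alarms reach the operator.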

In the technical solution provided in the above step S202, a target fault alarm generated in the server is obtained.

Optionally, in this embodiment, the target fault alarm may be, but is not limited to, an alarm generated for the server regarding any device or hardware abnormality, and the hardware may be, but is not limited to, components or devices such as a mainboard and a chassis.

Optionally, in this embodiment, the above-mentioned multiple devices may constitute a service cluster, and various services are deployed on each device node, and there are also software and hardware dependencies between these services.

In an exemplary embodiment, the target fault alarm generated in the server can be obtained in the following manner but is not limited to: when a fault is detected in the server, the target fault data of the fault is collected; the alarm source of the fault is located according to the target fault data to obtain the target alarm source; the fault is restored according to the target alarm source; and when the fault recovery fails, a target fault alarm is generated.

Optionally, in this embodiment, FIG. 3 is a schematic diagram of generating a target fault alarm according to an embodiment of the present application. As shown in FIG. 3, when a fault is detected in the server, the target fault data of the fault is collected through the hardware collection filtering layer; the alarm source of the fault is located through the fault diagnosis filtering layer according to the target fault data to obtain the target alarm source; and the fault is then recovered through the fault repair filtering layer according to the target alarm source. If the fault recovery fails, a target fault alarm is generated.

Optionally, in this embodiment, the target fault data may include but is not limited to data perceived by the hardware perception layer collected by the hardware acquisition filtering layer, such as: hardware temperature, voltage, RAS (Reliability, Availability and Serviceability) signals, etc.

Optionally, in this embodiment, as shown in FIG. 3, the hardware perception layer at the bottom provides basic hardware information about devices, components, mainboards, and chassis; multiple device nodes constitute a business cluster; various services are deployed on each device node, and there are software and hardware dependencies between these services; above the node chassis management service and the business management service is cluster management (CM, Cluster Management). At the cluster management level, the fault alarm information of each node is aggregated, and alarms are then suppressed through the root cause alarm filter according to the designed alarm dependencies; alarm and root cause alarm suppression can also be performed through the intelligent reasoning filter. In the above root cause alarm filtering framework, root causes are screened layer by layer through four filtering levels, which ultimately implements the root cause alarm solution.

In an exemplary embodiment, target fault data of a fault may be collected in the following manner, but is not limited to: retrying the fault a target number of times; if the retry fails, collecting initial fault data of the fault; eliminating data in the initial fault data that exceeds the target data interval to obtain reference fault data; and performing an average operation on the reference fault data to obtain target fault data.

Optionally, in this embodiment, retrying the fault a target number of times and collecting the initial fault data of the fault when the retries fail may refer to, but is not limited to, reporting a fault to the fault diagnosis layer only after multiple retries have failed, so as to filter out transient faults caused by devices, environmental interference, and the like;

Optionally, in this embodiment, data exceeding the target data interval in the initial fault data is eliminated to obtain reference fault data, which may refer to, but is not limited to, setting a reasonable data interval, and discarding data exceeding the reasonable interval as false values to eliminate instantaneous false values, and not reporting to the fault diagnosis layer;

Optionally, in this embodiment, performing an average operation on the reference fault data to obtain the target fault data may include, but is not limited to, taking the average of the data through an average algorithm and then reporting it to the fault diagnosis layer.
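The collection steps (retry, discard out-of-interval samples, average) can be sketched as follows. The function and parameter names are hypothetical, the arithmetic mean stands in for the unspecified "average algorithm", and returning None when every sample is discarded is an assumption of this sketch.

```python
from statistics import mean
from typing import Callable, List, Optional, Tuple


def collect_fault_data(
    operation: Callable[[], bool],
    read_samples: Callable[[], List[float]],
    retries: int,
    valid_range: Tuple[float, float],
) -> Optional[float]:
    """Return averaged fault data, or None when nothing should be reported."""
    for _ in range(retries):
        if operation():
            return None  # a retry succeeded: treat the fault as transient
    low, high = valid_range
    # Discard samples outside the reasonable interval as false values.
    kept = [s for s in read_samples() if low <= s <= high]
    if not kept:
        return None  # assumption: nothing valid left to report
    return mean(kept)
```

For instance, with readings of 70.0, 72.0, 500.0, and 74.0 and a reasonable interval of 0 to 150, the reading of 500.0 is discarded and the value passed to the fault diagnosis layer is 72.0.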

In an exemplary embodiment, the alarm source of the fault can be located according to the target fault data to obtain the target alarm source in the following manner but not limited to: obtaining the fault cause corresponding to the target fault data from the fault data and fault causes with a corresponding relationship as a candidate fault cause; searching for the target fault cause from the candidate fault causes according to the topological relationship of the equipment in the server and the target fault data; and determining the field replaceable unit FRU corresponding to the target fault cause in the server as the target alarm source.

Optionally, in this embodiment, FIG. 4 is a schematic diagram of locating a target alarm source according to an embodiment of the present application. As shown in FIG. 4, the fault cause corresponding to the target fault data is obtained, as a candidate fault cause, from fault data and fault causes having a corresponding relationship. For example, the hardware collection filter layer of the chassis management service of mainboard A collects target fault data indicating that IIC (Inter-Integrated Circuit) sensor D on FRU N (FRU, Field Replaceable Unit) fails, and the corresponding candidate fault causes may include:

1. The MCU B IIC controller of the mainboard A is faulty;

2. The IIC 1 channel from mainboard A to IIC switch C is faulty;

3. IIC switch C chip failure;

4. IIC switch C to IIC 2 channel on FRU N is faulty;

5. FRU N IIC sensor D is faulty.

According to the topological relationship of the devices in the server (MCU B of mainboard A is connected to IIC Switch C through IIC 1, IIC Switch C is connected to sensor D in FRU N through IIC 2, IIC Switch C is connected to sensor E in FRU N through IIC 3, and IIC Switch C is connected to sensor F in FRU M through IIC 4) and target fault data, the target fault cause is found from the candidate fault causes; the field replaceable unit FRU corresponding to the target fault cause in the server is determined as the target alarm source.

In an exemplary embodiment, the target fault cause can be found from the candidate fault causes based on the topological relationship of the devices in the server and the target fault data in the following manner, but is not limited to: finding the target topological relationship corresponding to the target fault data from the topological relationship of the devices in the server; and checking the candidate fault causes according to the operating status of the devices in the target topological relationship to obtain the target fault cause.

Optionally, in this embodiment, the candidate fault causes are checked according to the operating status of the devices in the target topological relationship to obtain the target fault cause, for example as shown in FIG. 4.

Given the above hardware topology of MCU B, if the hardware collection filter layer reports that MCU B fails to access sensor D, sensor E, and sensor F, it is determined that the IIC 1 path of mainboard A is faulty, and a mainboard A IIC 1 channel fault is reported. The target fault causes may include:

1. The MCU B IIC controller of the mainboard A is faulty;

2. The IIC 1 channel from mainboard A to IIC switch C is faulty;

3. IIC switch C chip failure.

Given the above hardware topology of MCU B, if the hardware collection filter layer reports that MCU B fails to access one of sensor D, sensor E, and sensor F while the other two sensors are accessed normally, the fault is judged to be a FRU sensor access failure. Taking a failure to access sensor D as an example, the target fault causes may include:

1. IIC switch C to IIC 2 channel on FRU N is faulty;

2. FRU N IIC sensor D is faulty.
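The two cases above can be sketched as a small check against the FIG. 4 topology. This is only an illustration under stated assumptions, not the implementation of the present application: the `TOPOLOGY` table, the cause strings, and the function `narrow_causes` are hypothetical names, and the case in which exactly two of the three sensors fail is not described in the text, so the sketch simply returns the per-branch causes for every failing sensor.

```python
# Hypothetical model of the FIG. 4 topology:
# sensor -> (IIC channel leaving switch C, FRU that hosts the sensor).
TOPOLOGY = {
    "D": ("IIC 2", "FRU N"),
    "E": ("IIC 3", "FRU N"),
    "F": ("IIC 4", "FRU M"),
}

# Causes on the path shared by every sensor behind IIC switch C.
UPSTREAM_CAUSES = [
    "MCU B IIC controller of mainboard A is faulty",
    "IIC 1 channel from mainboard A to IIC switch C is faulty",
    "IIC switch C chip failure",
]


def narrow_causes(failed_sensors):
    """Return the candidate causes consistent with the unreachable sensors."""
    failed = set(failed_sensors)
    unknown = failed - set(TOPOLOGY)
    if unknown:
        raise ValueError(f"sensors not in topology: {sorted(unknown)}")
    if not failed:
        return []
    if failed == set(TOPOLOGY):
        # Every sensor behind the switch is unreachable: suspect the shared path.
        return list(UPSTREAM_CAUSES)
    # Only some sensors fail: the shared path works, so suspect each failing branch.
    causes = []
    for sensor in sorted(failed):
        channel, fru = TOPOLOGY[sensor]
        causes.append(f"IIC switch C to {channel} channel on {fru} is faulty")
        causes.append(f"{fru} IIC sensor {sensor} is faulty")
    return causes
```

Calling `narrow_causes(["D", "E", "F"])` yields the three upstream causes, while `narrow_causes(["D"])` yields the two causes on the branch to sensor D.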

In an exemplary embodiment, the fault can be recovered according to the target alarm source in the following manner but is not limited to: obtaining the target recovery process corresponding to the target alarm source from the alarm sources and recovery processes with corresponding relationships; when the target recovery process is obtained, executing the target recovery process; when the target recovery process is not obtained, or the target recovery process fails to execute, determining that the fault recovery has failed.

Optionally, in this embodiment, as shown in FIG. 3, the field replaceable unit (FRU) corresponding to the target fault cause in the server is determined as the target alarm source, and the fault repair filter layer obtains the target recovery process corresponding to the target alarm source. When the target recovery process is obtained, it is executed. The fault repair filter layer is responsible for automatically recovering software and hardware systems that have entered an abnormal state by mistake, so that the abnormality does not persist or spread. For example: 1. A state machine enters an abnormal state due to some low-probability trigger and cannot complete normal negotiation, so the device cannot be connected to the system normally. The device can be connected to the system through a retraining mechanism, or by powering the endpoint device off and on to restart training negotiation, thereby improving device availability and avoiding the generation of alarms. 2. On some IIC buses, access to an IIC device fails because of a transient abnormality of a device or of the environment. The IIC bus can be restored by measures such as resetting the IIC device tree.
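The lookup-then-execute flow of the fault repair filter layer can be sketched as follows. This is a minimal sketch under assumptions: the function name `recover_fault`, the mapping from alarm source names to callables, and the convention that a recovery process returns `True` on success are all hypothetical and are not specified by the present application.

```python
from typing import Callable, Dict


def recover_fault(
    alarm_source: str,
    recovery_processes: Dict[str, Callable[[], bool]],
) -> bool:
    """Return True if the fault was recovered, False if fault recovery failed."""
    process = recovery_processes.get(alarm_source)
    if process is None:
        # No recovery process is registered for this alarm source.
        return False
    try:
        # Assumed convention: the process returns True when it succeeds.
        return bool(process())
    except Exception:
        # A recovery process that raises is treated as a failed execution.
        return False
```

When `recover_fault` returns `False`, the surrounding flow would go on to generate the target fault alarm, as described for the case in which fault recovery fails.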

In an exemplary embodiment, the target fault alarm may be generated in the following manner, but is not limited to: determining whether the target fault data falls within the alarm threshold range; and generating the target fault alarm if the target fault data falls within the alarm threshold range.

Optionally, in this embodiment, the alarm threshold range needs to be set reasonably, for example with hysteresis for temperature and voltage, to avoid the ping-pong effect of repeated alarms. For example, if the alarm value for a temperature is 39 degrees Celsius and the alarm recovery value is set to 37 degrees Celsius, then when the actual temperature hovers around 39 degrees Celsius a stable alarm is generated without repeated alarm/recovery cycles.
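The hysteresis idea can be illustrated with a short sketch that uses the 39/37 degrees Celsius values from the example. The class name `HysteresisAlarm`, the event strings, and the choice of inclusive comparisons (`>=` and `<=`) are assumptions made for illustration only.

```python
class HysteresisAlarm:
    """Raise an alarm at alarm_value and clear it only at recovery_value."""

    def __init__(self, alarm_value=39.0, recovery_value=37.0):
        if recovery_value >= alarm_value:
            raise ValueError("recovery_value must be below alarm_value")
        self.alarm_value = alarm_value
        self.recovery_value = recovery_value
        self.active = False

    def update(self, reading):
        """Return 'alarm', 'recovery', or None for a new sensor reading."""
        if not self.active and reading >= self.alarm_value:
            self.active = True
            return "alarm"
        if self.active and reading <= self.recovery_value:
            self.active = False
            return "recovery"
        # Readings between the two values never change the alarm state.
        return None
```

With this design, readings that hover between 37 and 39 degrees Celsius after the alarm has been raised produce no further events, so a single alarm is followed by a single recovery.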

In the technical solution provided in the above step S204, the target fault alarm is classified according to the first alarm information carried by the target fault alarm to obtain the target alarm type, wherein the alarm types of the fault alarm include: root cause alarm and associated alarm, the root cause alarm is used to indicate that the corresponding fault alarm is the root cause of the server failure, and the associated alarm is used to indicate that the corresponding fault alarm is caused by the fault alarm associated with the root cause alarm.

Optionally, in this embodiment, as shown in FIG. 4, if a fault alarm occurs in IIC Switch C, a fault alarm occurs in sensor D, and the fault of IIC Switch C is the root cause of the fault of sensor D, then the fault alarm of IIC Switch C is a root cause alarm, and the fault alarm of sensor D is an associated alarm.

In an exemplary embodiment, the target fault alarm can be classified according to the first alarm information carried by the target fault alarm to obtain the target alarm type in the following manner but not limited to: searching the associated alarm field from the target fault alarm, wherein the associated alarm field is used to indicate whether the target fault alarm is an associated alarm, and the first alarm information includes the associated alarm field; when the associated alarm field is used to indicate that the target fault alarm is an associated alarm, determining that the target alarm type is an associated alarm; when the associated alarm field is used to indicate that the target fault alarm is not an associated alarm, determining that the target alarm type is a root cause alarm.

Optionally, in this embodiment, the associated alarm field may include, but is not limited to, an alarm ID (identifier). The alarm ID may be a globally unique alarm type code; this field serves as the unique index that distinguishes a given type of alarm event.

Optionally, in this embodiment, the target alarm type of the target fault alarm can be determined from the associated alarm field, which indicates whether the target fault alarm depends on other alarms. If the target fault alarm depends on other alarms, the target alarm type is determined to be an associated alarm, and it is then necessary to determine whether a corresponding root cause alarm exists. If the target fault alarm has no such dependency, the target alarm type is determined to be a root cause alarm, and the alarm can be reported directly.
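Classification by the associated alarm field can be sketched as a lookup keyed by alarm ID. The sketch assumes a hypothetical in-memory event database `EVENT_DB` whose records carry an `is_associated` flag; the alarm IDs and the function name `classify` are illustrative and do not come from the present application.

```python
# Hypothetical event database keyed by globally unique alarm ID.
EVENT_DB = {
    "IIC_SWITCH_C_FAULT": {"is_associated": False},
    "SENSOR_D_ACCESS_FAIL": {"is_associated": True},
}


def classify(alarm_id, event_db=EVENT_DB):
    """Return 'associated alarm' or 'root cause alarm' for a known alarm ID."""
    record = event_db[alarm_id]  # raises KeyError for an unknown alarm ID
    if record["is_associated"]:
        # The alarm depends on other alarms; a root cause check is still needed.
        return "associated alarm"
    # No dependency on other alarms: the alarm can be reported directly.
    return "root cause alarm"
```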

In an exemplary embodiment, the target fault alarm can be classified according to the first alarm information carried by the target fault alarm to obtain the target alarm type in the following manner but is not limited to: extracting a target alarm feature from the target fault alarm, wherein the target alarm feature is used to indicate the cause of the target fault alarm, and the first alarm information includes the target alarm feature; classifying the target fault alarm according to the target alarm feature to obtain the target alarm type.

Optionally, in this embodiment, the target fault alarm may be classified by, but is not limited to, determining based on a target alarm feature corresponding to the target fault alarm.

In an exemplary embodiment, the target fault alarm can be classified according to the target alarm feature to obtain the target alarm type in the following manner but not limited to: when the target alarm feature is used to indicate that the cause of the target fault alarm is other fault alarms, the target alarm type is determined to be an associated alarm; when the target alarm feature is used to indicate that the cause of the target fault alarm is a hardware device in the server, the target alarm type is determined to be a root cause alarm.

Optionally, in this embodiment, when the target alarm feature is used to indicate that the cause of the target fault alarm is a hardware device in the server, it may include but is not limited to physical damage to the hardware. In this case, the target alarm type may be determined to be a root cause alarm.

In an exemplary embodiment, the target fault alarm can be classified according to the first alarm information carried by the target fault alarm to obtain the target alarm type in the following manner but not limited to: the target fault alarm is input into a target alarm classification model, wherein the target alarm classification model is obtained by training an initial alarm classification model using a first alarm sample labeled with a root cause alarm and a second alarm sample labeled with an associated alarm type; and the target alarm type output by the target alarm classification model is obtained.

Optionally, in this embodiment, the target alarm classification model may classify the input target fault alarm to determine the target alarm type of the target fault alarm.

In the technical solution provided in the above step S206, whether to report the target fault alarm is determined according to the target alarm type.

Optionally, in this embodiment, whether to report the target fault alarm depends on the target alarm type. To avoid generating a large number of associated alarms that interfere with the root cause judgment of the system fault, and to improve repair efficiency, a target fault alarm whose target alarm type is a root cause alarm can be reported.

In an exemplary embodiment, whether to report a target fault alarm can be determined according to the target alarm type in the following manner but is not limited to: when the target alarm type is a root cause alarm, the target fault alarm is reported; when the target alarm type is an associated alarm, whether to report the target fault alarm is determined according to the second alarm information carried by the target fault alarm.

Optionally, in this embodiment, when the target alarm type is an associated alarm, it is determined whether to report the target fault alarm according to the second alarm information carried by the target fault alarm.

In an exemplary embodiment, it is possible but not limited to determine whether to report the target fault alarm based on the second alarm information carried by the target fault alarm in the following manner: obtain the target association period corresponding to the target fault alarm, wherein the target association period is used to indicate the time interval in which the target associated fault alarm belonging to the root cause alarm that has an association relationship with the target fault alarm is located; search whether the target associated fault alarm is obtained within the time range of the target association period before and after the acquisition time of the target fault alarm; if the target associated fault alarm is found, ignore the target fault alarm; if the target associated fault alarm is not found, report the target fault alarm.

Optionally, in this embodiment, the target association period can be, but is not limited to, the time interval within which the root cause alarm of an associated alarm is reported. If the root cause alarm is generated within the association period, the associated alarm is invalid and does not need to be reported; if the root cause alarm is not generated within the association period, the associated alarm is valid and is reported. The target association period can be designed based on the time difference between the target fault alarm and the target associated fault alarm that belongs to the root cause alarm. For example, if the maximum possible time difference between the target fault alarm and the target associated fault alarm is 1 minute, the association period can be set to 1 minute. This attribute can be saved in the event database of the cluster alarm root cause filtering layer, in the entry corresponding to the target fault alarm, as an inherent attribute of the target fault alarm. After the associated alarm is reported to the CM, the CM determines whether a root cause alarm was reported within 1 minute before or after the associated alarm. If so, the associated alarm does not need to be reported, and only the root cause alarm is reported.
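The association period check can be sketched as a symmetric time window around the acquisition time of the associated alarm. The function name `should_report_associated` and its parameters are hypothetical, and treating the window boundaries as inclusive is an assumption.

```python
from datetime import datetime, timedelta


def should_report_associated(alarm_time, association_period, root_cause_times):
    """Return True if no root cause alarm falls within +/- association_period."""
    for root_time in root_cause_times:
        if abs(root_time - alarm_time) <= association_period:
            # A root cause alarm was reported inside the window: suppress.
            return False
    # No root cause alarm was found: the associated alarm is reported.
    return True
```

For example, with a 1 minute association period, an associated alarm acquired at 12:00:00 is suppressed if a root cause alarm was reported at 11:59:30, and is reported if the only root cause alarm was reported at 12:05:00.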

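In an exemplary embodiment, before searching whether a target associated fault alarm is acquired within a time range of a target association period before and after the acquisition time of the target fault alarm, the method further includes one of the following:

searching for the target associated fault alarm corresponding to the target fault alarm from fault alarms and associated fault alarms having a corresponding relationship;

extracting an associated fault alarm field from the target fault alarm, wherein the associated fault alarm field is used to record the target associated fault alarm that belongs to the root cause alarm and has an association relationship with the target fault alarm.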

Optionally, in this embodiment, FIG. 5 is a schematic diagram of a target fault alarm database according to an embodiment of the present application. As shown in FIG. 5, the target associated fault alarm corresponding to the target fault alarm is searched from fault alarms and associated fault alarms having a corresponding relationship. For example, when the alarm ID of the target fault alarm is known, the target associated fault alarms (root cause alarm ID 1 and root cause alarm ID N) corresponding to that alarm ID are looked up from the fault alarms and associated fault alarms having a corresponding relationship.

Optionally, in this embodiment, as shown in FIG. 5, the possible root cause alarms of each associated alarm are determined when the associated alarms and root cause alarms are designed. This attribute is stored in the alarm event database of the cluster alarm root cause filtering layer as an inherent attribute of the alarm. After the target fault alarm is reported to the CM, the CM searches the root cause alarms stored in the database at the alarm root cause filtering layer to determine whether a root cause alarm has been reported. If a root cause alarm has been reported, the associated alarm does not need to be reported, and only the root cause alarm is reported. If no root cause alarm is found for the associated alarm, the associated alarm is itself treated as the root cause and can be reported.
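The database lookup described here can be sketched with a mapping from each associated alarm ID to its possible root cause alarm IDs. The table `ROOT_CAUSES_BY_ALARM`, the alarm IDs, and the function `report_associated_alarm` are hypothetical; the time window of the association period is left out to keep the sketch short.

```python
# Hypothetical alarm event database: associated alarm ID -> possible root cause alarm IDs.
ROOT_CAUSES_BY_ALARM = {
    "SENSOR_D_ACCESS_FAIL": ["IIC_SWITCH_C_FAULT", "MAINBOARD_A_IIC1_FAULT"],
    "SENSOR_E_ACCESS_FAIL": ["IIC_SWITCH_C_FAULT", "MAINBOARD_A_IIC1_FAULT"],
}


def report_associated_alarm(alarm_id, reported_alarm_ids,
                            root_causes_by_alarm=ROOT_CAUSES_BY_ALARM):
    """Return True if the associated alarm should itself be reported."""
    possible_roots = root_causes_by_alarm.get(alarm_id, [])
    # Suppress the associated alarm when any of its root cause alarms was reported.
    return not any(root in reported_alarm_ids for root in possible_roots)
```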

Optionally, in this embodiment, when the target alarm type of a target fault alarm is designed, it is first analyzed whether the target fault alarm is a root cause alarm for the problem or has an associated dependency on an existing alarm, that is, whether it may be the result of fault propagation from other root cause alarms. The attribute of the [Is Alarm Associated] field is then confirmed and saved in the alarm event database of the cluster alarm root cause filter layer as an inherent attribute of the alarm.

In an exemplary embodiment, after determining whether to report a target fault alarm based on the target alarm type, the following methods may be used, but are not limited to: determining a recovery timing corresponding to the target fault alarm based on the third alarm information carried by the target fault alarm; and restoring the target fault alarm when it is detected that the server has reached the recovery timing.

Optionally, in this embodiment, the target fault alarm can be restored in different ways. When it is detected that the server reaches the recovery timing, that is, when the corresponding recovery event is detected, the target fault alarm is restored.

In an exemplary embodiment, the recovery timing corresponding to the target fault alarm can be determined based on the third alarm information carried by the target fault alarm in the following manner, but is not limited to: searching the restart recovery field from the target fault alarm, wherein the restart recovery field is used to indicate whether the target fault alarm is restored after the target device that generated the target fault alarm is restarted, and the third alarm information includes the restart recovery field; when the restart recovery field is used to indicate that the target fault alarm is restored after the target device that generated the target fault alarm is restarted, the recovery timing is determined to be the restart of the target device.

Optionally, in this embodiment, as shown in FIG. 5, the restart recovery field is searched from the target fault alarm; for example, when [Restart Recovery] is “Yes”, the recovery timing is determined to be the restart of the target device. For device restart and power-off/power-on scenarios, the attribute of the [Restart Recovery] field needs to be determined, and it is stored in the alarm event database of the cluster alarm root cause filter layer as an inherent attribute of the alarm. If the alarm is restored after restart, the CM reports that the alarm has been restored after the device restarts and updates the local database; if the alarm is not restored after restart, the CM does not report alarm recovery after the device restarts.

In an exemplary embodiment, the target fault alarm can be restored when it is detected that the server has reached the recovery time in the following manner but is not limited to: detecting whether the target device has been restarted; and restoring the target fault alarm when it is detected that the target device has been restarted and the target device has been restarted successfully.

Optionally, in this embodiment, it is detected whether a restart operation is executed on the target device. If a restart operation is detected and the target device restarts successfully, the recovery timing has been reached, and the target fault alarm is restored.

In an exemplary embodiment, after searching for the restart recovery field in the target fault alarm, the following methods may be used but are not limited to: when the restart recovery field is used to indicate that the target fault alarm is not recovered after the target device that generates the target fault alarm is restarted, searching for the device identification field from the target fault alarm, wherein the device identification field is used to indicate the target device identification of the target device that generates the target fault alarm; and determining the recovery timing as the replacement of the device identification at the location of the target device.

Optionally, in this embodiment, as shown in FIG. 5, an alarm whose [Restart Recovery] field indicates that it is not recovered after restart is a hardware/equipment failure that does not recover when the system restarts. Such a failure can only be recovered after the customer or service personnel replaces the device, so the CM restores the alarm after receiving notification that the unique identification information of the device at that location (such as the SN (serial number)) has changed.

In an exemplary embodiment, the target fault alarm can be restored when it is detected that the server has reached the recovery time in the following manner, but is not limited to: detecting the device identifier at the location of the target device; and restoring the target fault alarm when it is detected that the device identifier at the location of the target device has changed from the target device identifier to the reference device identifier.

Optionally, in this embodiment, when it is detected that the device identifier at the location of the target device is changed from the target device identifier to the reference device identifier, it indicates that the target device has been replaced, and the target fault alarm is restored.
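The two recovery timings (restart of the target device, or replacement of the device at its location) can be sketched as a single decision function. The `ActiveAlarm` record, the event dictionary keys (`type`, `slot`, `success`, `new_sn`), and the function `should_restore` are hypothetical names chosen for illustration; the present application does not define an event format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActiveAlarm:
    alarm_id: str
    restart_recovery: bool  # value of the [Restart Recovery] field
    device_slot: str        # location of the target device
    device_sn: str          # SN of the device that generated the alarm


def should_restore(alarm, event):
    """Return True if the event means the recovery timing has been reached."""
    if event.get("slot") != alarm.device_slot:
        # The event concerns a different location.
        return False
    if alarm.restart_recovery:
        # Restored only after a successful restart of the target device.
        return event.get("type") == "restart" and event.get("success") is True
    # Otherwise restored only when the SN at this location changes.
    new_sn = event.get("new_sn")
    return (event.get("type") == "sn_change"
            and new_sn is not None
            and new_sn != alarm.device_sn)
```

An alarm marked as restart-recoverable ignores SN change events, and a hardware alarm that is not restart-recoverable ignores restarts, which mirrors the two branches described above.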

After the above four layers (hardware collection filter layer, fault diagnosis filter layer, fault repair filter layer, alarm root cause filter layer), the customer interface of the management software can present real fault alarms to customers and service personnel, who can then perform equipment maintenance based on accurate fault alarms. The four layers of fault filtering provide accurate alarm reports, which effectively improves the accuracy and efficiency of service; customers are not alarmed by multiple simultaneous alarms, which reduces direct and indirect economic losses for equipment and service providers; and faults pass through detection, diagnosis, repair, aggregation, root cause diagnosis, and root cause alarming, which ensures that root cause alarms are issued after problems occur, improves problem-solving efficiency, reduces the recovery time objective (RTO) and recovery point objective (RPO), and improves customer satisfaction.

Through the description of the above implementation modes, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus the necessary general-purpose hardware platform, and of course can also be implemented by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The software product is stored in a non-volatile readable storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes a number of instructions that enable a terminal device (which can be a mobile phone, computer, server, network device, etc.) to execute the methods of each embodiment of the present application.

FIG. 6 is a structural block diagram of a device for filtering root causes of server failures according to an embodiment of the present application. As shown in FIG. 6, the device comprises:

An acquisition module 602 is configured to acquire a target fault alarm generated in a server;

The classification module 604 is configured to classify the target fault alarm according to the first alarm information carried by the target fault alarm to obtain a target alarm type, wherein the alarm type of the fault alarm includes: a root cause alarm and an associated alarm, wherein the root cause alarm is used to indicate that the corresponding fault alarm is the root cause of the server fault, and the associated alarm is used to indicate that the corresponding fault alarm is caused by the fault alarm associated with the root cause alarm;

The first determination module 606 is configured to determine whether to report a target fault alarm according to a target alarm type.

Through the above embodiment, the target fault alarm generated in the server is first obtained, and then the target fault alarm is classified according to the first alarm information carried by the target fault alarm to obtain the target alarm type, which includes root cause alarm and associated alarm, wherein the root cause alarm is used to indicate that the corresponding fault alarm is the root cause of the server failure, and the associated alarm is used to indicate that the corresponding fault alarm is caused by the fault alarm associated with the root cause alarm. Finally, it is determined whether to report the target fault alarm according to the target alarm type, avoiding the situation where a large number of associated alarms are reported, resulting in a reduction in the efficiency of server fault repair. The above technical solution solves the problems of low efficiency of server fault repair in related technologies, and achieves the technical effect of improving the efficiency of server fault repair.

In an exemplary embodiment, the classification module includes:

A first search unit is configured to search for an associated alarm field from a target fault alarm, wherein the associated alarm field is used to indicate whether the target fault alarm is an associated alarm, and the first alarm information includes the associated alarm field;

A first determining unit is configured to determine that the target alarm type is an associated alarm when the associated alarm field is used to indicate that the target fault alarm is an associated alarm;

The second determining unit is configured to determine that the target alarm type is a root cause alarm when the associated alarm field is used to indicate that the target fault alarm is not an associated alarm.

In an exemplary embodiment, the classification module includes:

an extraction unit, configured to extract a target alarm feature from the target fault alarm, wherein the target alarm feature is used to indicate a cause of the target fault alarm, and the first alarm information includes the target alarm feature;

The classification unit is configured to classify the target fault alarm according to the target alarm feature to obtain the target alarm type.

In an exemplary embodiment, the classification unit is further configured to:

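In the case where the target alarm feature is used to indicate that the cause of the target fault alarm is other fault alarms, determine that the target alarm type is an associated alarm;

In the case where the target alarm feature is used to indicate that the cause of the target fault alarm is a hardware device in the server, determine that the target alarm type is a root cause alarm.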

In an exemplary embodiment, the classification module includes:

An input unit is configured to input a target fault alarm into a target alarm classification model, wherein the target alarm classification model is obtained by training an initial alarm classification model using a first alarm sample labeled with a root cause alarm and a second alarm sample labeled with an associated alarm type;

The acquisition unit is configured to acquire the target alarm type output by the target alarm classification model.

In an exemplary embodiment, the first determining module includes:

The reporting unit is configured to report a target fault alarm when the target alarm type is a root cause alarm;

The third determination unit is configured to determine whether to report the target fault alarm according to the second alarm information carried by the target fault alarm when the target alarm type is an associated alarm.

In an exemplary embodiment, the third determining unit is further configured to:

Obtain a target association period corresponding to the target fault alarm, wherein the target association period is used to indicate the time interval in which the target associated fault alarm, which belongs to the root cause alarm and has an association relationship with the target fault alarm, is located;

Search whether the target associated fault alarm is acquired within the time range of the target association period before and after the acquisition time of the target fault alarm;

If the target associated fault alarm is found, ignore the target fault alarm;

If the target associated fault alarm is not found, report the target fault alarm.

In an exemplary embodiment, the apparatus further comprises one of the following:

The first search module is configured to search for a target associated fault alarm corresponding to the target fault alarm from the fault alarms and associated fault alarms having a corresponding relationship before searching whether the target associated fault alarm is obtained within a time range of a target associated period before and after the acquisition time of the target fault alarm;

The extraction module is configured to extract an associated fault alarm field from the target fault alarm, wherein the associated fault alarm field is used to record the target associated fault alarm that is a root cause alarm and has an associated relationship with the target fault alarm.

In an exemplary embodiment, the apparatus further comprises:

The second determination module is configured to determine the recovery timing corresponding to the target fault alarm according to the third alarm information carried by the target fault alarm after determining whether to report the target fault alarm according to the target alarm type;

The recovery module is configured to restore the target fault alarm when it is detected that the server reaches the recovery timing.

In an exemplary embodiment, the second determining module includes:

A second search unit is configured to search for a restart recovery field from the target fault alarm, wherein the restart recovery field is used to indicate whether the target fault alarm is recovered after the target device generating the target fault alarm is restarted, and the third alarm information includes the restart recovery field;

The fourth determining unit is configured to determine the recovery timing as the restart of the target device when the restart recovery field is used to indicate that the target fault alarm is recovered after the target device that generated the target fault alarm is restarted.

In an exemplary embodiment, the recovery module includes:

A first detection unit is configured to detect whether a restart operation is performed on the target device;

The first recovery unit is configured to recover the target fault alarm when it is detected that the target device is restarted and the target device is restarted successfully.

In an exemplary embodiment, the apparatus further comprises:

A second search module is configured to search the target fault alarm for a device identification field after searching the restart recovery field in the target fault alarm, in the case where the restart recovery field is used to indicate that the target fault alarm is not recovered after the target device generating the target fault alarm is restarted, wherein the device identification field is used to indicate the target device identification of the target device generating the target fault alarm;

The third determining module is configured to determine the recovery timing as a device identifier replacement at a location where the target device is located.

In an exemplary embodiment, the recovery module includes:

A second detection unit is configured to detect a device identifier at a location where a target device is located;

The second restoring unit is configured to restore the target fault alarm when it is detected that the device identifier at the location where the target device is located is changed from the target device identifier to the reference device identifier.

In an exemplary embodiment, the acquisition module includes:

A collection unit is configured to collect target fault data of a fault when a fault is detected in the server;

A positioning unit is configured to locate the alarm source of the fault according to the target fault data to obtain the target alarm source;

A third recovery unit is configured to recover the fault according to a target alarm source;

The generating unit is configured to generate a target fault alarm in case of failure of fault recovery.

In an exemplary embodiment, the collection unit is further configured to:

Retry the failure a target number of times;

In case of retry failure, collect the initial fault data of the fault;

The reference fault data is averaged to obtain the target fault data.

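In an exemplary embodiment, the positioning unit is configured to:

Obtain the fault cause corresponding to the target fault data from fault data and fault causes having a corresponding relationship as a candidate fault cause;

Search for the target fault cause from the candidate fault causes according to the topological relationship of the devices in the server and the target fault data;

Determine the field replaceable unit (FRU) corresponding to the target fault cause in the server as the target alarm source.

In an exemplary embodiment, the positioning unit is further configured to:

Find the target topological relationship corresponding to the target fault data from the topological relationship of the devices in the server;

Check the candidate fault causes according to the operating status of the devices in the target topological relationship to obtain the target fault cause.

In an exemplary embodiment, the third recovery unit is further configured to:

Obtain the target recovery process corresponding to the target alarm source from alarm sources and recovery processes having a corresponding relationship;

When the target recovery process is obtained, execute the target recovery process;

When the target recovery process is not obtained, or the target recovery process fails to execute, determine that the fault recovery has failed.

In an exemplary embodiment, the generating unit is further configured to:

Determine whether the target fault data falls within the alarm threshold range;

Generate the target fault alarm if the target fault data falls within the alarm threshold range.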

Optionally, the optional examples in this embodiment may refer to the examples described in the above embodiments and optional implementation modes, and this embodiment will not be described in detail here.

Obviously, those skilled in the art should understand that the above modules or steps of the present application can be implemented by a general computing device, they can be concentrated on a single computing device, or distributed on a network composed of multiple computing devices, and optionally, they can be implemented by a program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device, and in some cases, the steps shown or described can be executed in a different order from that herein, or they can be made into individual integrated circuit modules, or multiple modules or steps therein can be made into a single integrated circuit module for implementation. Thus, the present application is not limited to any specific combination of hardware and software.

The above are only optional implementation modes of the present application. It should be pointed out that, for ordinary technicians in this technical field, several improvements and modifications can be made without departing from the principles of the present application. These improvements and modifications should also be regarded as the protection scope of the present application.

Claims

A method for filtering root causes of server failures, characterized by comprising:

Acquiring a target fault alarm generated in a server;

Classifying the target fault alarm according to the first alarm information carried by the target fault alarm to obtain a target alarm type, wherein the alarm type of the fault alarm includes: a root cause alarm and an associated alarm, wherein the root cause alarm is used to indicate that the corresponding fault alarm is the root cause of the server fault, and the associated alarm is used to indicate that the corresponding fault alarm is caused by the fault alarm associated with the root cause alarm;

Determine whether to report the target fault alarm according to the target alarm type.
The method according to claim 1, characterized in that the classifying the target fault alarm according to the first alarm information carried by the target fault alarm to obtain the target alarm type comprises:

Searching for an associated alarm field from the target fault alarm, wherein the associated alarm field is used to indicate whether the target fault alarm is the associated alarm, and the first alarm information includes the associated alarm field;

In a case where the associated alarm field indicates that the target fault alarm is the associated alarm, determining that the target alarm type is the associated alarm;

In a case where the associated alarm field indicates that the target fault alarm is not the associated alarm, determining that the target alarm type is the root cause alarm.
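
The field-based classification recited above can be sketched as follows. This is a minimal illustration only: it assumes a fault alarm is represented as a dictionary and that the associated alarm field is a boolean stored under the key `is_associated`. Neither the representation nor the key name is specified by the application, and treating a missing field as "not associated" is likewise an assumption.

```python
from enum import Enum


class AlarmType(Enum):
    ROOT_CAUSE = "root_cause"
    ASSOCIATED = "associated"


def classify_by_field(alarm: dict) -> AlarmType:
    """Classify an alarm from its associated alarm field.

    Assumption: the field lives under the hypothetical key
    "is_associated"; an alarm without the field is treated as
    not being an associated alarm.
    """
    if alarm.get("is_associated", False):
        return AlarmType.ASSOCIATED
    return AlarmType.ROOT_CAUSE
```

For example, `classify_by_field({"is_associated": True})` yields `AlarmType.ASSOCIATED`, while an alarm whose field is false is classified as a root cause alarm.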
The method according to claim 1, characterized in that the classifying the target fault alarm according to the first alarm information carried by the target fault alarm to obtain the target alarm type comprises:

Extracting a target alarm feature from the target fault alarm, wherein the target alarm feature is used to indicate a cause of the target fault alarm, and the first alarm information includes the target alarm feature;

Classifying the target fault alarm according to the target alarm feature to obtain the target alarm type.
The method according to claim 3, characterized in that the classifying the target fault alarm according to the target alarm feature to obtain the target alarm type comprises:

In a case where the target alarm feature indicates that the cause of the target fault alarm is another fault alarm, determining that the target alarm type is the associated alarm;

In a case where the target alarm feature indicates that the cause of the target fault alarm is a hardware device in the server, determining that the target alarm type is the root cause alarm.
The method according to claim 1, characterized in that the classifying the target fault alarm according to the first alarm information carried by the target fault alarm to obtain the target alarm type comprises:

Inputting the target fault alarm into a target alarm classification model, wherein the target alarm classification model is obtained by training an initial alarm classification model using a first alarm sample labeled as the root cause alarm and a second alarm sample labeled as the associated alarm;

Obtaining the target alarm type output by the target alarm classification model.
The method according to claim 1, characterized in that the determining whether to report the target fault alarm according to the target alarm type comprises:

When the target alarm type is the root cause alarm, reporting the target fault alarm;

When the target alarm type is the associated alarm, determining whether to report the target fault alarm according to second alarm information carried by the target fault alarm.
The method according to claim 6, characterized in that the determining whether to report the target fault alarm according to the second alarm information carried by the target fault alarm comprises:

Obtaining a target association period corresponding to the target fault alarm, wherein the target association period is used to indicate a time interval in which a target associated fault alarm belonging to the root cause alarm and having an associated relationship with the target fault alarm is located;

Determining whether the target associated fault alarm has been acquired within the time range of the target association period before and after the acquisition time of the target fault alarm;

When the target associated fault alarm is found, ignoring the target fault alarm;

When the target associated fault alarm is not found, reporting the target fault alarm.
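
The time-window check recited above can be sketched as follows. This is an illustrative sketch under stated assumptions: the `Alarm` record, the function name `should_report_associated`, the use of seconds as the time unit, and the inclusive window bounds are all hypothetical choices that the application does not specify.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Alarm:
    name: str
    timestamp: float  # acquisition time in seconds (assumed unit)


def should_report_associated(
    target: Alarm,
    root_name: str,
    period: float,
    history: Iterable[Alarm],
) -> bool:
    """Decide whether an associated alarm should be reported.

    The alarm is ignored (False) when the root cause alarm it is
    associated with was acquired within `period` seconds before or
    after the target alarm; otherwise it is reported (True).
    """
    window_start = target.timestamp - period
    window_end = target.timestamp + period
    found = any(
        a.name == root_name and window_start <= a.timestamp <= window_end
        for a in history
    )
    return not found
```

With a root cause alarm acquired at t = 100 s and an association period of 60 s, an associated alarm acquired at t = 130 s is ignored, whereas one acquired at t = 500 s is reported because no matching root cause alarm falls inside its window.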
The method according to claim 7, characterized in that before determining whether the target associated fault alarm has been acquired within the time range of the target association period before and after the acquisition time of the target fault alarm, the method further comprises one of the following:

Searching for the target associated fault alarm corresponding to the target fault alarm from stored correspondences between fault alarms and associated fault alarms;

Extracting an associated fault alarm field from the target fault alarm, wherein the associated fault alarm field is used to record the target associated fault alarm that belongs to the root cause alarm and has an association relationship with the target fault alarm.
The method according to claim 1, characterized in that after determining whether to report the target fault alarm according to the target alarm type, the method further comprises:

Determining a recovery timing corresponding to the target fault alarm according to third alarm information carried by the target fault alarm;

When it is detected that the server has reached the recovery timing, restoring the target fault alarm.
The method according to claim 9, characterized in that the determining a recovery timing corresponding to the target fault alarm according to the third alarm information carried by the target fault alarm comprises:

Searching for a restart recovery field from the target fault alarm, wherein the restart recovery field is used to indicate whether the target fault alarm is restored after the target device that generated the target fault alarm is restarted, and the third alarm information includes the restart recovery field;

In a case where the restart recovery field indicates that the target fault alarm is restored after the target device that generated the target fault alarm is restarted, determining that the recovery timing is the restart of the target device.
The method according to claim 10, characterized in that, when it is detected that the server has reached the recovery timing, restoring the target fault alarm comprises:

Detecting whether a restart operation is performed on the target device;

When it is detected that the restart operation has been performed on the target device and the target device has restarted successfully, restoring the target fault alarm.
The method according to claim 10, characterized in that after searching the restart recovery field from the target fault alarm, the method further comprises:

In a case where the restart recovery field indicates that the target fault alarm is not restored after the target device that generated the target fault alarm is restarted, searching for a device identification field from the target fault alarm, wherein the device identification field is used to indicate a target device identifier of the target device that generated the target fault alarm;

Determining that the recovery timing is a replacement of the device identifier at the location where the target device is located.
The method according to claim 12, characterized in that, when it is detected that the server has reached the recovery timing, restoring the target fault alarm comprises:

Detecting a device identifier at a location where the target device is located;

When it is detected that the device identifier at the location where the target device is located has changed from the target device identifier to a reference device identifier, restoring the target fault alarm.
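
The two recovery timings recited in the preceding claims (restart of the target device, or replacement of the device identifier at its location) can be sketched together as follows. This is a hedged illustration: the dictionary keys `restart_recovery` and `device_id`, the string labels for the two timings, and both function names are assumptions introduced here, not names from the application.

```python
def recovery_timing(alarm: dict) -> str:
    """Pick the recovery timing from the restart recovery field.

    Assumption: the field is a boolean under the hypothetical key
    "restart_recovery"; a missing field means a restart does not
    restore the alarm.
    """
    if alarm.get("restart_recovery", False):
        return "device_restart"
    return "device_id_replacement"


def should_restore(alarm: dict, restarted_ok: bool, current_device_id: str) -> bool:
    """Return True when the alarm's recovery timing has been reached.

    restarted_ok: whether a restart was performed and succeeded.
    current_device_id: identifier now detected at the device location.
    """
    if recovery_timing(alarm) == "device_restart":
        return restarted_ok
    # Otherwise the alarm is restored only once the identifier at the
    # location differs from the one recorded in the alarm, i.e. the
    # device has been replaced.
    return current_device_id != alarm["device_id"]
```

In this sketch, an alarm that is not restart-recoverable stays active across any number of restarts and is restored only after a different device identifier is detected at the same location.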
The method according to claim 1, characterized in that the acquiring a target fault alarm generated in a server comprises:

In a case where a fault is detected in the server, collecting target fault data of the fault;

Locating the alarm source of the fault according to the target fault data to obtain a target alarm source;

Performing fault recovery on the fault according to the target alarm source;

In a case where the fault recovery fails, generating the target fault alarm.
The method according to claim 14, characterized in that the collecting target fault data of the fault comprises:

Performing a target number of retries on the fault;

In a case where the retries fail, collecting initial fault data of the fault;

Eliminating data that falls outside a target data interval from the initial fault data to obtain reference fault data;

Averaging the reference fault data to obtain the target fault data.
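
The elimination and averaging steps recited above can be sketched as follows. This is a minimal sketch assuming that the fault data are numeric samples and that the target data interval is a closed range `[low, high]`; the function name and the choice to raise an error when no sample survives are assumptions. The retry step is omitted.

```python
from statistics import mean
from typing import Sequence


def aggregate_fault_data(samples: Sequence[float], low: float, high: float) -> float:
    """Drop samples outside [low, high] and average the remainder.

    Assumption: an empty remainder is treated as an error, since the
    application does not say how that case is handled.
    """
    reference = [s for s in samples if low <= s <= high]
    if not reference:
        raise ValueError("no samples within the target data interval")
    return mean(reference)
```

For instance, with samples `[70.0, 72.0, 250.0, 74.0]` and an interval of `[0.0, 150.0]`, the reading of 250.0 is discarded and the remaining three values average to 72.0.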
The method according to claim 14, characterized in that locating the alarm source of the fault according to the target fault data to obtain the target alarm source comprises:

Acquiring, from stored correspondences between fault data and fault causes, the fault cause corresponding to the target fault data as a candidate fault cause;

Searching for a target fault cause from the candidate fault causes according to the topological relationship of the devices in the server and the target fault data;

Determining a field replaceable unit (FRU) in the server that corresponds to the target fault cause as the target alarm source.
The method according to claim 16, characterized in that the searching for a target fault cause from the candidate fault causes according to the topological relationship of the devices in the server and the target fault data comprises:

Searching for a target topological relationship corresponding to the target fault data from the topological relationship of the devices in the server;

Checking the candidate fault causes according to the operating status of the devices in the target topological relationship to obtain the target fault cause.
The method according to claim 14, characterized in that the recovering the fault according to the target alarm source comprises:

Obtaining a target recovery process corresponding to the target alarm source from stored correspondences between alarm sources and recovery processes;

When the target recovery process is obtained, executing the target recovery process;

When the target recovery process is not obtained or the target recovery process fails to execute, determining that the fault recovery has failed.
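
The lookup-and-execute logic recited above can be sketched as follows. This is an illustration under assumptions: the correspondence between alarm sources and recovery processes is modeled as a mapping from a source name to a callable that returns `True` on success, and an exception raised by the callable is treated as an execution failure. None of these modeling choices come from the application.

```python
from typing import Callable, Mapping


def recover_fault(
    alarm_source: str,
    processes: Mapping[str, Callable[[], bool]],
) -> bool:
    """Run the recovery process registered for an alarm source.

    Returns False (fault recovery failed) when no process is
    registered for the source, or when the process reports failure
    or raises an exception.
    """
    process = processes.get(alarm_source)
    if process is None:
        return False
    try:
        return bool(process())
    except Exception:
        return False
```

A `False` result corresponds to the case in which the fault recovery has failed, which in the method is the condition for generating the target fault alarm.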
The method according to claim 14, characterized in that the generating the target fault alarm comprises:

Determining whether the target fault data falls within an alarm threshold range;

When the target fault data falls within the alarm threshold range, generating the target fault alarm.
A device for filtering root causes of server failures, characterized by comprising:

An acquisition module, configured to acquire a target fault alarm generated in a server;

A classification module, configured to classify the target fault alarm according to first alarm information carried by the target fault alarm to obtain a target alarm type, wherein the alarm types of fault alarms include: a root cause alarm and an associated alarm, wherein the root cause alarm is used to indicate that the corresponding fault alarm is the root cause of a server fault, and the associated alarm is used to indicate that the corresponding fault alarm is caused by an associated fault alarm that belongs to the root cause alarm;

A first determination module, configured to determine whether to report the target fault alarm according to the target alarm type.
A non-volatile readable storage medium, characterized in that the non-volatile readable storage medium includes a stored program, wherein the program, when run, executes the method according to any one of claims 1 to 19.
An electronic device, comprising a memory and a processor, wherein a computer program is stored in the memory, and the processor is configured to execute the method according to any one of claims 1 to 19 through the computer program.