WO2024148857A1 - Method and apparatus for filtering root cause of server fault, and non-volatile readable storage medium and electronic apparatus - Google Patents
- Publication number
- WO2024148857A1 (PCT/CN2023/121451)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- alarm
- target
- fault
- fault alarm
- target fault
- Prior art date
Classifications
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0631—Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
Definitions
- the present application relates to the field of computer technology, and in particular, to a method and device for filtering root causes of server failures, a non-volatile readable storage medium, and an electronic device.
- current practice has the following obvious defects: 1. it triggers a large number of service tickets, causing customer panic and direct and indirect economic losses for equipment and service providers; 2. after a large number of alarms are reported, customers and service personnel must manually determine the root cause of the fault before they can repair it, which increases the RTO (Recovery Time Objective)/RPO (Recovery Point Objective).
- the embodiments of the present application provide a method and device for filtering the root causes of server failures, a non-volatile readable storage medium, and an electronic device, so as to at least solve the problem of low efficiency in repairing server failures in the related art.
- a method for filtering root causes of server failures including:
- the target fault alarm is classified according to the first alarm information carried by the target fault alarm to obtain a target alarm type, wherein the alarm types of a fault alarm include a root cause alarm and an associated alarm, the root cause alarm indicating that the corresponding fault alarm is the root cause of the server fault, and the associated alarm indicating that the corresponding fault alarm was caused by an associated fault alarm that is a root cause alarm;
- the target fault alarm is classified according to the first alarm information carried by the target fault alarm to obtain the target alarm type, including:
- when the associated alarm field indicates that the target fault alarm is an associated alarm, determining that the target alarm type is an associated alarm;
- when the associated alarm field indicates that the target fault alarm is not an associated alarm, determining that the target alarm type is a root cause alarm.
- the target fault alarm is classified according to the first alarm information carried by the target fault alarm to obtain the target alarm type, including:
- extracting a target alarm feature from the target fault alarm, wherein the target alarm feature is used to indicate a cause of the target fault alarm, and the first alarm information includes the target alarm feature;
- the target fault alarm is classified according to the target alarm feature to obtain the target alarm type.
- the target fault alarm is classified according to the target alarm feature to obtain the target alarm type, including:
- the target alarm type is determined to be an associated alarm;
- the target alarm type is determined to be a root cause alarm.
- the target fault alarm is classified according to the first alarm information carried by the target fault alarm to obtain the target alarm type, including:
- the target alarm classification model is obtained by training the initial alarm classification model using the first alarm sample labeled with the root cause alarm and the second alarm sample labeled with the associated alarm type;
- whether to report a target fault alarm is determined according to the target alarm type, including:
- when the target alarm type is an associated alarm, it is determined whether to report the target fault alarm according to the second alarm information carried by the target fault alarm.
- determining whether to report the target fault alarm according to the second alarm information carried by the target fault alarm includes:
- the target fault alarm is reported.
- the method before searching whether a target associated fault alarm is obtained within a time range of a target associated period before and after the acquisition time of the target fault alarm, the method further includes one of the following:
- An associated fault alarm field is extracted from the target fault alarm, wherein the associated fault alarm field is used to record the target associated fault alarm that is a root cause alarm and has an associated relationship with the target fault alarm.
- the method further includes:
- the target fault alarm is restored.
- determining a restoration timing corresponding to the target fault alarm according to the third alarm information carried by the target fault alarm includes:
- the restart recovery field is used to indicate whether the target fault alarm is recovered after the target device generating the target fault alarm is restarted, and the third alarm information includes the restart recovery field;
- the recovery timing is determined to be the restart of the target device.
- the target fault alarm is restored, including:
- the target fault alarm is restored.
- the method further includes:
- when the restart recovery field indicates that the target fault alarm is not restored after the target device generating the target fault alarm is restarted, searching for the device identification field in the target fault alarm, wherein the device identification field is used to indicate the target device identification of the target device generating the target fault alarm;
- the recovery timing is determined to be a device ID change at the location of the target device.
- the target fault alarm is restored, including:
- the target fault alarm is restored.
- obtain target fault alarms generated in the server including:
- target fault data of the fault is collected
- target fault data of the fault is collected, including:
- the reference fault data is averaged to obtain the target fault data.
- locating the alarm source of the fault according to the target fault data to obtain the target alarm source includes:
- a field replaceable unit (FRU) corresponding to the target fault cause in the server is determined as a target alarm source.
- searching for a target fault cause from candidate fault causes according to the topological relationship of the devices in the server and the target fault data includes:
- the candidate fault causes are checked to obtain the target fault cause.
- restore the fault according to the target alarm source including:
- the target recovery process is executed
- generate target fault alarms including:
- a device for filtering root causes of server failures including:
- An acquisition module is configured to acquire a target fault alarm generated in a server
- a classification module is configured to classify the target fault alarm according to the first alarm information carried by the target fault alarm to obtain a target alarm type, wherein the alarm types of a fault alarm include a root cause alarm and an associated alarm, the root cause alarm indicating that the corresponding fault alarm is the root cause of the server fault, and the associated alarm indicating that the corresponding fault alarm was caused by an associated fault alarm that is a root cause alarm;
- the first determination module is configured to determine whether to report a target fault alarm according to a target alarm type.
- a non-volatile readable storage medium in which a computer program is stored, wherein the computer program is configured to execute the above-mentioned server failure root cause filtering method when running.
- an electronic device including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the method for filtering the root cause of server failure through the computer program.
- a target fault alarm generated in a server is obtained; the target fault alarm is classified according to the first alarm information carried by the target fault alarm to obtain a target alarm type, wherein the alarm types of the fault alarm include: a root cause alarm and an associated alarm, wherein the root cause alarm is used to indicate that the corresponding fault alarm is the root cause of the server failure, and the associated alarm is used to indicate that the corresponding fault alarm is caused by the associated fault alarm belonging to the root cause alarm; whether to report the target fault alarm is determined according to the target alarm type, that is, firstly the target fault alarm generated in the server is obtained, and then the target fault alarm is classified according to the first alarm information carried by the target fault alarm to obtain a target alarm type, wherein the target alarm type includes a root cause alarm and an associated alarm, wherein the root cause alarm is used to indicate that the corresponding fault alarm is the root cause of the server failure, and the associated alarm is used to indicate that the corresponding fault alarm is caused by the associated fault alarm belonging to the root cause alarm, and finally, whether
- FIG1 is a schematic diagram of a hardware environment of a method for filtering root causes of server failures according to an embodiment of the present application
- FIG2 is a flow chart of a method for filtering root causes of server failures according to an embodiment of the present application
- FIG3 is a schematic diagram of generating a target fault alarm according to an embodiment of the present application.
- FIG4 is a schematic diagram of positioning a target alarm source according to an embodiment of the present application.
- FIG5 is a schematic diagram of a target fault alarm database according to an embodiment of the present application.
- FIG6 is a structural block diagram of a device for filtering root causes of server failures according to an embodiment of the present application.
- FIG1 is a hardware environment diagram of a method for filtering the root cause of a server failure according to an embodiment of the present application.
- the computer terminal may include one or more processors 102 (only one is shown in FIG1; the processor 102 may include, but is not limited to, a processing device such as a microcontroller unit (MCU) or a field-programmable gate array (FPGA) programmable logic device) and a memory 104 configured to store data.
- the above-mentioned computer terminal may also include a transmission device 106 configured to have a communication function and an input and output device 108.
- the structure shown in FIG1 is only for illustration and does not limit the structure of the above-mentioned computer terminal.
- the computer terminal may also include more or fewer components than those shown in FIG1 , or have a different configuration with the same function as that shown in FIG1 or more functions than those shown in FIG1 .
- the memory 104 is configured to store computer programs, for example, software programs and modules of application software, such as the computer program corresponding to the filtering method for the root cause of server failure in the embodiment of the present application.
- the processor 102 executes various functional applications and data processing by running the computer program stored in the memory 104, that is, to implement the above method.
- the memory 104 may include a high-speed random access memory, and may also include a non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory.
- the memory 104 may include a memory remotely arranged relative to the processor 102, and these remote memories may be connected to the computer terminal via a network. Examples of the above-mentioned network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
- the transmission device 106 is configured to receive or send data via a network.
- the above-mentioned network example may include a wireless network provided by a communication provider of a computer terminal.
- the transmission device 106 includes a network adapter (Network Interface Controller, referred to as NIC), which can be connected to other network devices through a base station so as to communicate with the Internet.
- the transmission device 106 can be a radio frequency (Radio Frequency, referred to as RF) module, which is configured to communicate with the Internet wirelessly.
- FIG. 2 is a flow chart of a method for filtering root causes of server failures according to an embodiment of the present application. As shown in FIG. 2 , the process includes the following steps:
- Step S202: obtain a target fault alarm generated in the server;
- Step S204: classify the target fault alarm according to the first alarm information carried by the target fault alarm to obtain a target alarm type, wherein the alarm types of a fault alarm include a root cause alarm and an associated alarm, the root cause alarm indicating that the corresponding fault alarm is the root cause of the server fault, and the associated alarm indicating that the corresponding fault alarm was caused by an associated fault alarm that is a root cause alarm;
- Step S206: determine whether to report the target fault alarm according to the target alarm type.
- the target fault alarm generated in the server is first obtained, and then the target fault alarm is classified according to the first alarm information carried by the target fault alarm to obtain the target alarm type.
- the target alarm type is either a root cause alarm or an associated alarm, wherein the root cause alarm indicates that the corresponding fault alarm is the root cause of the server failure, and the associated alarm indicates that the corresponding fault alarm was caused by an associated fault alarm that is a root cause alarm.
- whether to report the target fault alarm is then determined according to the target alarm type, which avoids reporting a large number of associated alarms and the resulting reduction in the efficiency of server fault repair.
- in step S202, a target fault alarm generated in the server is obtained.
- the target fault alarm may be, but is not limited to, an alarm generated for the server regarding any device or hardware abnormality
- the hardware may be, but is not limited to, components or devices such as a mainboard and a chassis.
- the above-mentioned multiple devices may constitute a service cluster, and various services are deployed on each device node, and there are also software and hardware dependencies between these services.
- the target fault alarm generated in the server can be obtained in the following manner but is not limited to: when a fault is detected in the server, the target fault data of the fault is collected; the alarm source of the fault is located according to the target fault data to obtain the target alarm source; the fault is restored according to the target alarm source; and when the fault recovery fails, a target fault alarm is generated.
- Figure 3 is a schematic diagram of the generation of a target fault alarm according to an embodiment of the present application.
- the target fault data of the fault is collected through the hardware collection filtering layer; the fault diagnosis filtering layer locates the alarm source of the fault according to the target fault data to obtain the target alarm source; the fault repair filtering layer then restores the fault according to the target alarm source. If the fault recovery fails, a target fault alarm is generated.
- the target fault data may include, but is not limited to, data perceived by the hardware perception layer and collected by the hardware collection filtering layer, such as hardware temperature, voltage, and RAS (Reliability, Availability and Serviceability) signals.
- starting from the bottom, the hardware perception layer provides basic hardware information for devices, components, motherboards, and chassis; multiple device nodes constitute a business cluster; various services are deployed on each device node, and there are software and hardware dependencies between these services; above the node chassis management service and the business management service is cluster management (CM, Cluster Management).
- at the cluster management level, the fault alarm information of each node must be aggregated, and alarms are then suppressed by the root cause alarm filter according to the designed alarm dependencies; alarms and root cause alarms can also be suppressed by the intelligent reasoning filter; in this root cause alarm filtering framework, filtering is applied layer by layer through four filtering levels, which finally implements the root cause alarm solution.
- target fault data of a fault may be collected in the following manner, but is not limited to: retrying the fault a target number of times; if the retry fails, collecting initial fault data of the fault; eliminating data in the initial fault data that exceeds the target data interval to obtain reference fault data; and performing an average operation on the reference fault data to obtain target fault data.
- retrying the fault a target number of times and collecting the initial fault data only if the retries fail may mean, but is not limited to, that a fault is reported to the fault diagnosis layer only after multiple retries have failed, which filters out transient faults caused by environmental interference and the like;
- eliminating data in the initial fault data that exceeds the target data interval to obtain reference fault data may refer to, but is not limited to, setting a reasonable data interval and discarding data outside that interval as false values, so that transient false values are eliminated and are not reported to the fault diagnosis layer;
- performing an average operation on the reference fault data to obtain the target fault data may include, but is not limited to, taking the average of the data through an average algorithm and then reporting it to the fault diagnosis layer.
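The collection steps above (retry, collect, discard out-of-interval values, average) can be sketched as follows. This is only a minimal illustration under stated assumptions: the function name, the `operation_ok` and `read_sample` callbacks, the default retry and sample counts, and the default interval are all hypothetical and do not come from the application. Returning `None` when every sample is discarded is likewise an assumption.

```python
from statistics import mean
from typing import Callable, Optional, Tuple


def collect_target_fault_data(
    operation_ok: Callable[[], bool],   # hypothetical: re-runs the failed operation
    read_sample: Callable[[], float],   # hypothetical: reads one raw value
    retries: int = 3,                   # assumed "target number" of retries
    samples: int = 5,                   # assumed number of raw samples
    valid_interval: Tuple[float, float] = (-40.0, 125.0),  # assumed data interval
) -> Optional[float]:
    """Return averaged fault data, or None if nothing should be reported."""
    # Retry first: a transient fault that clears on retry is not reported.
    for _ in range(retries):
        if operation_ok():
            return None
    # All retries failed, so collect the initial fault data.
    initial = [read_sample() for _ in range(samples)]
    # Discard values outside the interval as false values (reference fault data).
    low, high = valid_interval
    reference = [value for value in initial if low <= value <= high]
    if not reference:
        return None  # assumption: nothing trustworthy is left to report
    # Average the reference fault data to obtain the target fault data.
    return mean(reference)
```

In this sketch only the averaged value would be passed on to the fault diagnosis layer, which matches the intent of filtering transient faults and false values before diagnosis.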
- the alarm source of the fault can be located according to the target fault data to obtain the target alarm source in the following manner but not limited to: obtaining the fault cause corresponding to the target fault data from the fault data and fault causes with a corresponding relationship as a candidate fault cause; searching for the target fault cause from the candidate fault causes according to the topological relationship of the equipment in the server and the target fault data; and determining the field replaceable unit FRU corresponding to the target fault cause in the server as the target alarm source.
- FIG. 4 is a schematic diagram of positioning a target alarm source according to an embodiment of the present application.
- the fault cause corresponding to the target fault data is obtained from the fault data and the fault cause having a corresponding relationship as a candidate fault cause.
- for example, the hardware collection filter layer of the chassis management service of mainboard A collects target fault data indicating that the IIC (Inter-Integrated Circuit) sensor D on FRU N (FRU, Field Replaceable Unit) has failed; the corresponding candidate fault causes may include:
- the MCU B IIC controller of the mainboard A is faulty
- IIC switch C to IIC 2 channel on FRU N is faulty
- the target fault cause is found from the candidate fault causes; the field replaceable unit FRU corresponding to the target fault cause in the server is determined as the target alarm source.
- the target fault cause can be found from the candidate fault causes based on the topological relationship of the devices in the server and the target fault data in the following manner, but is not limited to: finding the target topological relationship corresponding to the target fault data from the topological relationship of the devices in the server; and checking the candidate fault causes according to the operating status of the devices in the target topological relationship to obtain the target fault cause.
- the candidate fault causes are checked according to the running status of the devices in the target topology relationship to obtain the target fault cause, as shown in FIG4 , for example,
- the target fault causes may include:
- the MCU B IIC controller of the mainboard A is faulty
- the target failure causes may include:
- IIC switch C to IIC 2 channel on FRU N is faulty
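Checking candidate fault causes against the operating status of devices in the topology can be sketched as below. The selection rule is an assumption made for illustration, not the rule of the application: a candidate component is kept only if every device behind it is failing, and among the remaining candidates the one that explains the most failed devices is chosen. The component names, the extra device `sensor_E`, and the data layout are hypothetical.

```python
from typing import Dict, List, Optional


def find_target_cause(
    candidates: List[str],
    topology: Dict[str, List[str]],   # component -> devices reached through it
    working: Dict[str, bool],         # device -> True if it currently works
) -> Optional[str]:
    """Pick the candidate cause that is consistent with the device status."""
    # Assumed rule: a component can be the cause only if all devices behind it fail.
    consistent = [
        cause for cause in candidates
        if all(not working[device] for device in topology[cause])
    ]
    if not consistent:
        return None
    # Assumed tie-break: prefer the cause that explains the most failed devices.
    return max(consistent, key=lambda cause: len(topology[cause]))


# Hypothetical topology loosely following the FIG4 example.
TOPOLOGY = {
    "MCU_B_IIC_controller": ["sensor_D", "sensor_E"],
    "IIC_switch_C_channel_2": ["sensor_D"],
}
CANDIDATES = ["MCU_B_IIC_controller", "IIC_switch_C_channel_2"]
```

With this rule, a working `sensor_E` rules out the controller and leaves the switch channel as the target fault cause, while a failure of both sensors points to the controller.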
- the fault can be recovered according to the target alarm source in the following manner but is not limited to: obtaining the target recovery process corresponding to the target alarm source from the alarm sources and recovery processes with corresponding relationships; when the target recovery process is obtained, executing the target recovery process; when the target recovery process is not obtained, or the target recovery process fails to execute, determining that the fault recovery has failed.
- the field replaceable unit (FRU) corresponding to the target fault cause in the server is determined as the target alarm source, and the fault repair filter layer obtains the target recovery process corresponding to that target alarm source.
- the target recovery process is executed.
- the fault repair filter layer is responsible for automatically recovering software and hardware systems that have entered an abnormal state by mistake, so that the abnormality does not persist or spread. For example:
- 1. the state machine enters the abnormal state due to some low-probability triggering reason and cannot complete the normal negotiation, so the device cannot be connected to the system normally. The device can be connected to the system through the retraining mechanism, or by powering the endpoint device off and on to restart the training negotiation, which improves device availability and avoids generating alarms;
- 2. the IIC bus can be restored by measures such as resetting the IIC device tree.
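The lookup-and-execute logic of the fault repair filter layer can be sketched as follows. This is a minimal illustration: the mapping from alarm source to recovery process, the function name, and the convention that a recovery process returns `True` on success are assumptions, not details from the application.

```python
from typing import Callable, Dict


def recover_fault(
    alarm_source: str,
    recovery_processes: Dict[str, Callable[[], bool]],  # hypothetical registry
) -> bool:
    """Return True if the fault was recovered, False if recovery failed."""
    process = recovery_processes.get(alarm_source)
    if process is None:
        # No recovery process is registered for this alarm source.
        return False
    try:
        # Assumed convention: the process returns True on success.
        return bool(process())
    except Exception:
        # A recovery process that raises is treated as a failed recovery.
        return False
```

A `False` result corresponds to the case where the fault recovery has failed, which is the point at which a target fault alarm would be generated.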
- the target fault alarm may be generated in the following manner, but is not limited to: determining whether the target fault data falls within the alarm threshold range; and generating the target fault alarm if the target fault data falls within the alarm threshold range.
- the alarm threshold range needs to be set reasonably, for example with a hysteresis design for temperature or voltage, to avoid the ping-pong effect of repeated alarms. For example, if the alarm value of a certain temperature is 39 degrees Celsius, the alarm recovery value can be set to 37 degrees Celsius.
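The hysteresis design can be illustrated with a small sketch that uses the 39 and 37 degree values from the example. The class name and the choice of inclusive comparisons (alarm at or above 39, recovery at or below 37) are assumptions made for illustration.

```python
class HysteresisAlarm:
    """Alarm that trips at one threshold and clears at a lower one."""

    def __init__(self, alarm_at: float = 39.0, recover_at: float = 37.0) -> None:
        self.alarm_at = alarm_at      # value at which the alarm is raised
        self.recover_at = recover_at  # value at which the alarm is cleared
        self.active = False

    def update(self, value: float) -> bool:
        """Feed one reading and return whether the alarm is active."""
        if not self.active and value >= self.alarm_at:
            self.active = True
        elif self.active and value <= self.recover_at:
            self.active = False
        # Readings between the two thresholds leave the state unchanged.
        return self.active
```

Because readings between the two thresholds do not change the state, a temperature hovering around 38 degrees neither raises nor clears the alarm repeatedly.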
- the target fault alarm is classified according to the first alarm information carried by the target fault alarm to obtain the target alarm type, wherein the alarm types of a fault alarm include a root cause alarm and an associated alarm, the root cause alarm indicating that the corresponding fault alarm is the root cause of the server failure, and the associated alarm indicating that the corresponding fault alarm was caused by an associated fault alarm that is a root cause alarm.
- the target fault alarm can be classified according to the first alarm information carried by the target fault alarm to obtain the target alarm type in the following manner but not limited to: searching the associated alarm field from the target fault alarm, wherein the associated alarm field is used to indicate whether the target fault alarm is an associated alarm, and the first alarm information includes the associated alarm field; when the associated alarm field is used to indicate that the target fault alarm is an associated alarm, determining that the target alarm type is an associated alarm; when the associated alarm field is used to indicate that the target fault alarm is not an associated alarm, determining that the target alarm type is a root cause alarm.
- the associated alarm field may include, but is not limited to, an alarm ID (identifier), wherein the alarm ID may be a globally unique alarm type code; this field is a unique identity index field for distinguishing a certain type of alarm event.
- the target alarm type of the target fault alarm can be determined based on the associated alarm field.
- the target alarm type can indicate whether the target fault alarm has an associated dependency on other alarms.
- if the target alarm type indicates that the target fault alarm has an associated dependency on other alarms, the target alarm type is determined to be an associated alarm, and it is necessary to determine whether a corresponding root cause alarm exists; if the target fault alarm has no such dependency, it is not an associated alarm, the target alarm type is determined to be a root cause alarm, and the alarm can be reported directly.
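Classification by the associated alarm field can be sketched as a simple lookup. The sketch assumes that the field is an alarm ID (an alarm type code) and that a table of alarm IDs known to depend on other alarms is available; the table contents, the dictionary layout of an alarm, and the function name are hypothetical.

```python
# Hypothetical alarm type codes that are known to depend on another alarm.
ASSOCIATED_ALARM_IDS = {"0x2001", "0x2002"}


def classify_by_field(alarm: dict) -> str:
    """Return 'associated' or 'root_cause' based on the alarm ID field."""
    # Assumption: the associated alarm field is stored under the key 'alarm_id'.
    if alarm["alarm_id"] in ASSOCIATED_ALARM_IDS:
        return "associated"
    return "root_cause"
```

An alarm classified as `root_cause` here could be reported directly, while an `associated` result would still require checking whether its root cause alarm exists.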
- the target fault alarm can be classified according to the first alarm information carried by the target fault alarm to obtain the target alarm type in the following manner but is not limited to: extracting a target alarm feature from the target fault alarm, wherein the target alarm feature is used to indicate the cause of the target fault alarm, and the first alarm information includes the target alarm feature; classifying the target fault alarm according to the target alarm feature to obtain the target alarm type.
- the target fault alarm may be classified based on, but is not limited to, the target alarm feature corresponding to the target fault alarm.
- the target fault alarm can be classified according to the target alarm feature to obtain the target alarm type in the following manner but not limited to: when the target alarm feature is used to indicate that the cause of the target fault alarm is other fault alarms, the target alarm type is determined to be an associated alarm; when the target alarm feature is used to indicate that the cause of the target fault alarm is a hardware device in the server, the target alarm type is determined to be a root cause alarm.
- a target alarm feature indicating that the cause of the target fault alarm is a hardware device in the server may include, but is not limited to, physical damage to the hardware; in that case the target alarm type may be determined to be a root cause alarm.
- the target fault alarm can be classified according to the first alarm information carried by the target fault alarm to obtain the target alarm type in the following manner but not limited to: the target fault alarm is input into a target alarm classification model, wherein the target alarm classification model is obtained by training an initial alarm classification model using a first alarm sample labeled with a root cause alarm and a second alarm sample labeled with an associated alarm type; and the target alarm type output by the target alarm classification model is obtained.
- the target alarm classification model may classify the input target fault alarm to determine the target alarm type of the target fault alarm.
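The application does not specify the model family, so the following is only a stand-in sketch: a nearest-centroid classifier trained on one set of samples labeled as root cause alarms and one set labeled as associated alarms. The numeric feature vectors (here imagined as a hardware error count and a count of upstream alarms seen) and all names are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Sequence


def train_centroid_model(
    root_samples: Sequence[Sequence[float]],        # samples labeled root cause
    associated_samples: Sequence[Sequence[float]],  # samples labeled associated
) -> Dict[str, List[float]]:
    """'Train' by computing one centroid per alarm type."""
    def centroid(samples: Sequence[Sequence[float]]) -> List[float]:
        return [mean(column) for column in zip(*samples)]

    return {
        "root_cause": centroid(root_samples),
        "associated": centroid(associated_samples),
    }


def classify(model: Dict[str, List[float]], features: Sequence[float]) -> str:
    """Return the label whose centroid is closest to the feature vector."""
    def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(model, key=lambda label: squared_distance(model[label], features))
```

Any classifier with the same train-then-predict shape could take the place of this stand-in; the point is only that labeled samples of both alarm types produce a model that outputs the target alarm type.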
- step S206: whether to report the target fault alarm is determined according to the target alarm type.
- whether to report the target fault alarm depends on the target alarm type.
- a target fault alarm whose target alarm type is a root cause alarm can be reported.
- whether to report a target fault alarm can be determined according to the target alarm type in, but not limited to, the following manner: when the target alarm type is a root cause alarm, the target fault alarm is reported; when the target alarm type is an associated alarm, whether to report the target fault alarm is determined according to the second alarm information carried by the target fault alarm.
- when the target alarm type is an associated alarm, whether to report the target fault alarm is determined according to the second alarm information carried by the target fault alarm.
- the target association period can be, but is not limited to, the time interval within which the root cause alarm of the associated alarm is reported. If the root cause alarm is generated within the association period, the associated alarm is invalid and does not need to be reported; if the root cause alarm is not generated within the association period, the associated alarm is valid and is reported. The target association period can be designed based on the time difference between the target fault alarm and the target associated fault alarm, that is, the root cause alarm associated with it. For example, if the maximum possible time difference between the target fault alarm and the target associated fault alarm is 1 minute, the association period can be set to 1 minute. This attribute can be saved in the cluster alarm root cause filtering layer and in the event database corresponding to the target fault alarm as an inherent attribute of the target fault alarm. After the associated alarm is reported to the CM, the CM needs to determine whether a root cause alarm was reported within 1 minute before or after the associated alarm. If so, the associated alarm does not need to be reported, and only the root cause alarm is reported.
- before searching whether a target associated fault alarm is acquired within the time range of the target association period before and after the acquisition time of the target fault alarm, the method further includes one of the following:
- An associated fault alarm field is extracted from the target fault alarm, wherein the associated fault alarm field is used to record the target associated fault alarm that is a root cause alarm and has an associated relationship with the target fault alarm.
- Figure 5 is a schematic diagram of a target fault alarm database according to an embodiment of the present application.
- a target associated fault alarm corresponding to the target fault alarm is searched from the fault alarms and associated fault alarms with corresponding relationships.
- the alarm ID of the target fault alarm is known, and the target associated fault alarm (root cause alarm ID 1 and root cause alarm ID N) corresponding to the target fault alarm (alarm ID) is searched from the fault alarms and associated fault alarms with corresponding relationships.
- the possible root cause alarms of the associated alarms are determined.
- This attribute is stored in the alarm event database of the cluster alarm root cause filtering layer as an inherent attribute of the alarm.
- the CM searches the root cause alarms stored in the database at the alarm root cause filtering layer to find out whether the root cause alarm has been reported. If a root cause alarm is found, the associated alarms do not need to be reported, and only the root cause alarm is reported. If no root cause alarm is found for an associated alarm, the associated alarm is itself treated as the root cause and can be reported.
- when designing the target alarm type of a target fault alarm, it is first analyzed whether the target fault alarm is a root cause alarm for the problem or has an associated dependency on an existing alarm, that is, whether it may be the result of fault transmission from other root cause alarms; the attribute of the [Is Alarm Associated] field is then confirmed, and the attribute is saved in the alarm event database of the cluster alarm root cause filter layer as an inherent attribute of the alarm.
- the following methods may be used, but are not limited to: determining a recovery timing corresponding to the target fault alarm based on the third alarm information carried by the target fault alarm; and restoring the target fault alarm when it is detected that the server has reached the recovery timing.
- the target fault alarm can be restored in different ways.
- the target fault alarm is restored.
- the recovery timing corresponding to the target fault alarm can be determined based on the third alarm information carried by the target fault alarm in the following manner, but is not limited to: searching the restart recovery field from the target fault alarm, wherein the restart recovery field is used to indicate whether the target fault alarm is restored after the target device that generated the target fault alarm is restarted, and the third alarm information includes the restart recovery field; when the restart recovery field is used to indicate that the target fault alarm is restored after the target device that generated the target fault alarm is restarted, the recovery timing is determined to be the restart of the target device.
- the restart recovery field is searched for in the target fault alarm; for example, when [Restart Recovery] is “Yes”, the recovery timing is determined to be the restart of the target device.
- the attribute of the [Restart Recovery] field is stored in the alarm event database of the cluster alarm root cause filter layer as an inherent attribute of the alarm. If the alarm is restored after a restart, the CM needs to report that the alarm has been restored once the device is restarted and update the local database; if the alarm is not restored after a restart, the CM does not report alarm recovery after the device is restarted.
- the target fault alarm can be restored when it is detected that the server has reached the recovery timing in, but not limited to, the following manner: detecting whether a restart operation has been performed on the target device; and restoring the target fault alarm when it is detected that the restart operation has been performed and the target device has restarted successfully.
- the following methods may be used, but are not limited to: when the restart recovery field indicates that the target fault alarm is not recovered after the target device that generates the target fault alarm is restarted, searching for the device identification field in the target fault alarm, wherein the device identification field indicates the target device identifier of the target device that generates the target fault alarm; and determining the recovery timing as the replacement of the device identifier at the location of the target device.
- the target fault alarm can be restored when it is detected that the server has reached the recovery timing in, but not limited to, the following manner: detecting the device identifier at the location of the target device; and restoring the target fault alarm when it is detected that the device identifier at the location of the target device has changed from the target device identifier to the reference device identifier.
- when the device identifier at the location of the target device changes from the target device identifier to the reference device identifier, this indicates that the target device has been replaced, and the target fault alarm is restored.
- the management software customer interface can provide real fault alarms to customers/services, and customers/services can perform equipment maintenance based on accurate fault alarms.
- accurate alarm reports are provided, effectively improving the accuracy and efficiency of services; customers will not panic due to multiple alarms, reducing direct and indirect economic losses of equipment and service providers; faults go through detection, diagnosis, repair, aggregation, root cause diagnosis, and root cause alarming, ensuring that root cause alarms can be issued after problems occur, improving problem solving efficiency, reducing RTO and RPO, and improving customer satisfaction.
- the method according to the above embodiment can be implemented by software plus the required general hardware platform; of course, it can also be implemented through hardware, but in many cases the former is the better implementation.
- the technical solution of the present application, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product, which is stored in a non-volatile readable storage medium (such as ROM/RAM, disk, CD) and includes a number of instructions to enable a terminal device (which can be a mobile phone, computer, server, or network device, etc.) to execute the methods of each embodiment of the present application.
- FIG. 6 is a structural block diagram of a device for filtering root causes of server failures according to an embodiment of the present application; as shown in FIG. 6, the device comprises:
- An acquisition module 602 is configured to acquire a target fault alarm generated in a server;
- the classification module 604 is configured to classify the target fault alarm according to the first alarm information carried by the target fault alarm to obtain a target alarm type, wherein the alarm type of the fault alarm includes: a root cause alarm and an associated alarm, wherein the root cause alarm indicates that the corresponding fault alarm is the root cause of the server fault, and the associated alarm indicates that the corresponding fault alarm is caused by an associated fault alarm that is a root cause alarm;
- the first determination module 606 is configured to determine whether to report a target fault alarm according to a target alarm type.
- the target fault alarm generated in the server is first obtained, and then the target fault alarm is classified according to the first alarm information carried by the target fault alarm to obtain the target alarm type. The alarm types include a root cause alarm and an associated alarm, wherein the root cause alarm indicates that the corresponding fault alarm is the root cause of the server failure, and the associated alarm indicates that the corresponding fault alarm is caused by an associated fault alarm that is a root cause alarm. Finally, whether to report the target fault alarm is determined according to the target alarm type, which avoids reporting a large number of associated alarms and the resulting reduction in the efficiency of server fault repair.
- the above technical solution solves the problems of low efficiency of server fault repair in related technologies, and achieves the technical effect of improving the efficiency of server fault repair.
- the classification module includes:
- a first search unit is configured to search for an associated alarm field from a target fault alarm, wherein the associated alarm field is used to indicate whether the target fault alarm is an associated alarm, and the first alarm information includes the associated alarm field;
- a first determining unit is configured to determine that the target alarm type is an associated alarm when the associated alarm field is used to indicate that the target fault alarm is an associated alarm;
- the second determining unit is configured to determine that the target alarm type is a root cause alarm when the associated alarm field is used to indicate that the target fault alarm is not an associated alarm.
- the classification module includes:
- an extraction unit configured to extract a target alarm feature from the target fault alarm, wherein the target alarm feature is used to indicate a cause of the target fault alarm, and the first alarm information includes the target alarm feature;
- the classification unit is configured to classify the target fault alarm according to the target alarm feature to obtain the target alarm type.
- the classification unit is further configured to:
- when the target alarm feature indicates that the cause of the target fault alarm is another fault alarm, determine the target alarm type to be an associated alarm;
- when the target alarm feature indicates that the cause of the target fault alarm is a hardware device in the server, determine the target alarm type to be a root cause alarm.
- the classification module includes:
- An input unit is configured to input a target fault alarm into a target alarm classification model, wherein the target alarm classification model is obtained by training an initial alarm classification model using a first alarm sample labeled with a root cause alarm and a second alarm sample labeled with an associated alarm type;
- the acquisition unit is configured to acquire the target alarm type output by the target alarm classification model.
- the first determining module includes:
- the reporting unit is configured to report a target fault alarm when the target alarm type is a root cause alarm
- the third determination unit is configured to determine whether to report the target fault alarm according to the second alarm information carried by the target fault alarm when the target alarm type is an associated alarm.
- the third determining unit is further configured to:
- the target fault alarm is reported.
- the apparatus further comprises one of the following:
- the first search module is configured to search for a target associated fault alarm corresponding to the target fault alarm from the fault alarms and associated fault alarms having a corresponding relationship, before searching whether the target associated fault alarm is acquired within the time range of the target association period before and after the acquisition time of the target fault alarm;
- the extraction module is configured to extract an associated fault alarm field from the target fault alarm, wherein the associated fault alarm field is used to record the target associated fault alarm that is a root cause alarm and has an associated relationship with the target fault alarm.
- the apparatus further comprises:
- the second determination module is configured to determine the recovery timing corresponding to the target fault alarm according to the third alarm information carried by the target fault alarm after determining whether to report the target fault alarm according to the target alarm type;
- the recovery module is configured to restore the target fault alarm when it is detected that the server reaches the recovery timing.
- the second determining module includes:
- a second search unit is configured to search for a restart recovery field from the target fault alarm, wherein the restart recovery field is used to indicate whether the target fault alarm is recovered after the target device generating the target fault alarm is restarted, and the third alarm information includes the restart recovery field;
- the fourth determining unit is configured to determine the recovery timing as the restart of the target device when the restart recovery field is used to indicate that the target fault alarm is recovered after the target device that generated the target fault alarm is restarted.
- the recovery module includes:
- a first detection unit is configured to detect whether a restart operation is performed on the target device
- the first recovery unit is configured to recover the target fault alarm when it is detected that the target device is restarted and the target device is restarted successfully.
- the apparatus further comprises:
- a second search module is configured to search the target fault alarm for a device identification field after searching the restart recovery field in the target fault alarm, in the case where the restart recovery field is used to indicate that the target fault alarm is not recovered after the target device generating the target fault alarm is restarted, wherein the device identification field is used to indicate the target device identification of the target device generating the target fault alarm;
- the third determining module is configured to determine the recovery timing as a device identifier replacement at a location where the target device is located.
- the recovery module includes:
- a second detection unit is configured to detect a device identifier at a location where a target device is located
- the second restoring unit is configured to restore the target fault alarm when it is detected that the device identifier at the location where the target device is located is changed from the target device identifier to the reference device identifier.
- the acquisition module includes:
- a collection unit is configured to collect target fault data of a fault when a fault is detected in the server;
- a positioning unit is configured to locate the alarm source of the fault according to the target fault data to obtain the target alarm source;
- a third recovery unit is configured to recover the fault according to a target alarm source
- the generating unit is configured to generate a target fault alarm in case of failure of fault recovery.
- the acquisition unit is configured to:
- the reference fault data is averaged to obtain the target fault data.
- the positioning unit is configured to:
- a field replaceable unit (FRU) corresponding to the target fault cause in the server is determined as a target alarm source.
- the positioning unit is further configured to:
- the candidate fault causes are checked to obtain the target fault cause.
- the third recovery unit is further configured to:
- the target recovery process is executed
- the generating unit is further configured to:
- the modules or steps of the present application can be implemented by a general-purpose computing device; they can be concentrated on a single computing device or distributed over a network composed of multiple computing devices; optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; in some cases, the steps shown or described can be executed in an order different from that described herein; alternatively, they can be made into individual integrated circuit modules, or multiple modules or steps among them can be made into a single integrated circuit module.
- the present application is not limited to any specific combination of hardware and software.
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Debugging And Monitoring (AREA)
Abstract
The present application relates to the technical field of computers. Disclosed are a method and apparatus for filtering the root cause of a server fault, and a non-volatile readable storage medium and an electronic apparatus. The method for filtering the root cause of a server fault comprises: acquiring a target fault alarm which is generated in a server; according to first alarm information which is carried in the target fault alarm, classifying the target fault alarm, so as to obtain a target alarm type, wherein alarm types of fault alarms comprise: a root cause alarm and an associated alarm, the root cause alarm being used for indicating that a corresponding fault alarm is the root cause of a server fault, and the associated alarm being used for indicating that the corresponding fault alarm is caused by an associated fault alarm that belongs to the root cause alarm; and according to the target alarm type, determining whether to report the target fault alarm. By using the technical solution, problems in the related art, such as the efficiency of repairing a server fault being relatively low, are solved.
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on January 9, 2023, with application number 202310030520.6 and entitled “Filtering method and device, storage medium and electronic device for root causes of server failures”, the entire contents of which are incorporated herein by reference.
The present application relates to the field of computer technology, and in particular to a method and device for filtering root causes of server failures, a non-volatile readable storage medium, and an electronic device.
At present, in fields such as storage, servers, cloud data centers, IT (Information Technology), and embedded computing, all intelligent devices rely on the stability of firmware and systems. In these scenarios, during development, testing, and customer business operation, when an error or fault occurs in the hardware or software of a device, the general processing flow goes through error detection, fault diagnosis, fault repair, and fault alarm reporting. However, as computer hardware and software become increasingly complex, the features and services in a system depend on one another deeply. After a problem occurs, the fault propagates across multiple services and features and triggers them to report fault alarms repeatedly.
The current situation has the following obvious defects: 1. a large number of service tickets are raised, causing customer panic and leading to direct and indirect economic losses for equipment and service providers; 2. after a large number of alarms are reported, customers and services need to determine the root cause of the fault manually and then repair the fault based on that root cause, which increases the RTO (Recovery Time Objective) and RPO (Recovery Point Objective).
No effective solution has yet been proposed for problems in the related art such as the low efficiency of server fault repair.
Summary of the invention
The embodiments of the present application provide a method and device for filtering the root causes of server failures, a non-volatile readable storage medium, and an electronic device, so as to at least solve problems in the related art such as the low efficiency of server fault repair.
According to an embodiment of the present application, a method for filtering root causes of server failures is provided, including:
acquiring a target fault alarm generated in a server;
classifying the target fault alarm according to the first alarm information carried by the target fault alarm to obtain a target alarm type, wherein the alarm types of fault alarms include a root cause alarm and an associated alarm, the root cause alarm indicates that the corresponding fault alarm is the root cause of the server fault, and the associated alarm indicates that the corresponding fault alarm is caused by an associated fault alarm that is a root cause alarm;
determining whether to report the target fault alarm according to the target alarm type.
Optionally, classifying the target fault alarm according to the first alarm information carried by the target fault alarm to obtain the target alarm type includes:
searching for an associated alarm field in the target fault alarm, wherein the associated alarm field indicates whether the target fault alarm is an associated alarm, and the first alarm information includes the associated alarm field;
when the associated alarm field indicates that the target fault alarm is an associated alarm, determining that the target alarm type is an associated alarm;
when the associated alarm field indicates that the target fault alarm is not an associated alarm, determining that the target alarm type is a root cause alarm.
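The field-based classification above can be sketched as follows. This is a minimal illustration, not the application's implementation: the `FaultAlarm` record, its `is_associated` attribute (standing in for the associated alarm field), and the `AlarmType` enum are hypothetical names introduced here.

```python
from dataclasses import dataclass
from enum import Enum


class AlarmType(Enum):
    ROOT_CAUSE = "root_cause"
    ASSOCIATED = "associated"


@dataclass
class FaultAlarm:
    alarm_id: str
    # Hypothetical stand-in for the associated alarm field
    # carried in the first alarm information.
    is_associated: bool


def classify_by_field(alarm: FaultAlarm) -> AlarmType:
    """Map the associated alarm field directly to an alarm type."""
    if alarm.is_associated:
        return AlarmType.ASSOCIATED
    return AlarmType.ROOT_CAUSE
```

Because the field is an inherent attribute of the alarm, this variant needs no lookup beyond the alarm record itself.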
Optionally, classifying the target fault alarm according to the first alarm information carried by the target fault alarm to obtain the target alarm type includes:
extracting a target alarm feature from the target fault alarm, wherein the target alarm feature indicates the cause of the target fault alarm, and the first alarm information includes the target alarm feature;
classifying the target fault alarm according to the target alarm feature to obtain the target alarm type.
Optionally, classifying the target fault alarm according to the target alarm feature to obtain the target alarm type includes:
when the target alarm feature indicates that the cause of the target fault alarm is another fault alarm, determining that the target alarm type is an associated alarm;
when the target alarm feature indicates that the cause of the target fault alarm is a hardware device in the server, determining that the target alarm type is a root cause alarm.
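A feature-based variant might look like the sketch below. The `CauseKind` enum and the string labels are assumptions made for illustration. The text only defines the outcome for two kinds of cause, so the sketch raises an error for anything else instead of guessing.

```python
from enum import Enum


class CauseKind(Enum):
    # Hypothetical encoding of the target alarm feature.
    OTHER_FAULT_ALARM = "other_fault_alarm"
    HARDWARE_DEVICE = "hardware_device"
    UNKNOWN = "unknown"


def classify_by_feature(cause: CauseKind) -> str:
    """Classify an alarm from the cause indicated by its alarm feature."""
    if cause is CauseKind.OTHER_FAULT_ALARM:
        return "associated"
    if cause is CauseKind.HARDWARE_DEVICE:
        return "root_cause"
    # The source text does not cover other causes; fail loudly.
    raise ValueError(f"no classification rule for cause {cause!r}")
```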
Optionally, classifying the target fault alarm according to the first alarm information carried by the target fault alarm to obtain the target alarm type includes:
inputting the target fault alarm into a target alarm classification model, wherein the target alarm classification model is obtained by training an initial alarm classification model using a first alarm sample labeled as a root cause alarm and a second alarm sample labeled as an associated alarm;
obtaining the target alarm type output by the target alarm classification model.
Optionally, determining whether to report the target fault alarm according to the target alarm type includes:
when the target alarm type is a root cause alarm, reporting the target fault alarm;
when the target alarm type is an associated alarm, determining whether to report the target fault alarm according to the second alarm information carried by the target fault alarm.
Optionally, determining whether to report the target fault alarm according to the second alarm information carried by the target fault alarm includes:
obtaining a target association period corresponding to the target fault alarm, wherein the target association period indicates the time interval in which a target associated fault alarm, which is a root cause alarm associated with the target fault alarm, is located;
searching whether the target associated fault alarm was acquired within the time range of the target association period before and after the acquisition time of the target fault alarm;
when the target associated fault alarm is found, ignoring the target fault alarm;
when the target associated fault alarm is not found, reporting the target fault alarm.
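The association-period check for an associated alarm can be sketched as below, assuming acquisition times are available as `datetime` values. The function name `should_report_associated` and its parameters are illustrative and not taken from the application. Treating the window as inclusive at both ends is also an assumption.

```python
from datetime import datetime, timedelta
from typing import Iterable


def should_report_associated(
    alarm_time: datetime,
    association_period: timedelta,
    root_cause_times: Iterable[datetime],
) -> bool:
    """Return True if the associated alarm should be reported.

    root_cause_times holds the acquisition times of the root cause
    alarms linked to this alarm (hypothetical input).
    """
    window_start = alarm_time - association_period
    window_end = alarm_time + association_period
    # If any linked root cause alarm falls inside the window,
    # the associated alarm is ignored.
    found = any(window_start <= t <= window_end for t in root_cause_times)
    return not found
```

With an association period of 1 minute, an associated alarm acquired at 12:00:00 is ignored if a linked root cause alarm was acquired at 12:00:30, and reported if the only linked root cause alarm arrived at 12:02:00.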
Optionally, before searching whether the target associated fault alarm was acquired within the time range of the target association period before and after the acquisition time of the target fault alarm, the method further includes one of the following:
searching for the target associated fault alarm corresponding to the target fault alarm from the fault alarms and associated fault alarms having a corresponding relationship;
extracting an associated fault alarm field from the target fault alarm, wherein the associated fault alarm field records the target associated fault alarm, which is a root cause alarm associated with the target fault alarm.
Optionally, after determining whether to report the target fault alarm according to the target alarm type, the method further includes:
determining a recovery timing corresponding to the target fault alarm according to the third alarm information carried by the target fault alarm;
restoring the target fault alarm when it is detected that the server has reached the recovery timing.
Optionally, determining the recovery timing corresponding to the target fault alarm according to the third alarm information carried by the target fault alarm includes:
searching for a restart recovery field in the target fault alarm, wherein the restart recovery field indicates whether the target fault alarm is recovered after the target device that generated the target fault alarm is restarted, and the third alarm information includes the restart recovery field;
when the restart recovery field indicates that the target fault alarm is recovered after the target device that generated the target fault alarm is restarted, determining the recovery timing to be the restart of the target device.
可选的,在检测到服务器达到恢复时机的情况下,恢复目标故障告警,包括:Optionally, when it is detected that the server has reached the recovery time, the target fault alarm is restored, including:
检测目标设备是否被执行重启操作;Detect whether the target device has been rebooted;
在检测到目标设备被执行重启操作且目标设备重启成功的情况下,恢复目标故障告警。When it is detected that the target device is restarted and the target device restarts successfully, the target fault alarm is restored.
可选的,在从目标故障告警中查找重启恢复字段之后,方法还包括:Optionally, after searching the restart recovery field from the target fault alarm, the method further includes:
在重启恢复字段用于指示产生目标故障告警的目标设备重启后目标故障告警不恢复的情况下,从目标故障告警中查找设备标识字段,其中,设备标识字段用于指示产生目标故障告警的目标设备的目标设备标识;In the case where the restart recovery field is used to indicate that the target fault alarm is not restored after the target device generating the target fault alarm is restarted, searching the device identification field from the target fault alarm, wherein the device identification field is used to indicate the target device identification of the target device generating the target fault alarm;
确定恢复时机为目标设备所在位置上的设备标识更换。The recovery timing is determined to be a device ID change at the location of the target device.
可选的,在检测到服务器达到恢复时机的情况下,恢复目标故障告警,包括:Optionally, when it is detected that the server has reached the recovery time, the target fault alarm is restored, including:
检测目标设备所在位置上的设备标识;Detecting the device identification at the location of the target device;
在检测到目标设备所在位置上的设备标识从目标设备标识更换为参考设备标识的情况下,恢复目标故障告警。
When it is detected that the device identifier at the location where the target device is located is changed from the target device identifier to the reference device identifier, the target fault alarm is restored.
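The two recovery timings described above can be illustrated with a minimal sketch. The names `FaultAlarm`, `restart_recovers`, `device_id`, `recovery_timing` and `should_recover` are hypothetical and chosen only for illustration; the application does not specify field names or data types.

```python
from dataclasses import dataclass
from enum import Enum


class RecoveryTiming(Enum):
    DEVICE_RESTART = "device_restart"
    DEVICE_ID_CHANGE = "device_id_change"


@dataclass
class FaultAlarm:
    # Hypothetical stand-in for the restart recovery field.
    restart_recovers: bool
    # Hypothetical stand-in for the device identification field.
    device_id: str


def recovery_timing(alarm: FaultAlarm) -> RecoveryTiming:
    """Pick the recovery timing from the restart recovery field."""
    if alarm.restart_recovers:
        return RecoveryTiming.DEVICE_RESTART
    return RecoveryTiming.DEVICE_ID_CHANGE


def should_recover(alarm: FaultAlarm, restarted_ok: bool, id_at_location: str) -> bool:
    """Return True when the recovery timing for this alarm has been reached."""
    if recovery_timing(alarm) is RecoveryTiming.DEVICE_RESTART:
        # Recover only after the device was restarted and came up successfully.
        return restarted_ok
    # Otherwise recover once a different device occupies the same location.
    return id_at_location != alarm.device_id
```

In this sketch an alarm that does not recover on restart stays active until the identifier read at the device location differs from the one recorded in the alarm, which corresponds to the device having been replaced.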
Optionally, acquiring the target fault alarm generated in the server includes:
in a case where a fault is detected in the server, collecting target fault data of the fault;
locating the alarm source of the fault according to the target fault data to obtain a target alarm source;
recovering from the fault according to the target alarm source;
in a case where the fault recovery fails, generating the target fault alarm.
Optionally, collecting the target fault data of the fault includes:
retrying the fault a target number of times;
in a case where the retries fail, collecting initial fault data of the fault;
removing, from the initial fault data, data that falls outside a target data interval to obtain reference fault data;
averaging the reference fault data to obtain the target fault data.
Optionally, locating the alarm source of the fault according to the target fault data to obtain the target alarm source includes:
obtaining, from fault data and fault causes that have a correspondence, the fault causes corresponding to the target fault data as candidate fault causes;
searching the candidate fault causes for a target fault cause according to the topological relationship of the devices in the server and the target fault data;
determining the field replaceable unit (FRU) that corresponds to the target fault cause in the server as the target alarm source.
Optionally, searching the candidate fault causes for the target fault cause according to the topological relationship of the devices in the server and the target fault data includes:
searching the topological relationship of the devices in the server for a target topological relationship corresponding to the target fault data;
checking the candidate fault causes according to the operating status of the devices in the target topological relationship to obtain the target fault cause.
Optionally, recovering from the fault according to the target alarm source includes:
obtaining, from alarm sources and recovery processes that have a correspondence, a target recovery process corresponding to the target alarm source;
in a case where the target recovery process is obtained, executing the target recovery process;
in a case where the target recovery process is not obtained, or the target recovery process fails to execute, determining that the fault recovery has failed.
Optionally, generating the target fault alarm includes:
determining whether the target fault data falls within an alarm threshold range;
in a case where the target fault data falls within the alarm threshold range, generating the target fault alarm.
According to another embodiment of the present application, an apparatus for filtering root causes of server faults is further provided, including:
an acquisition module, configured to acquire a target fault alarm generated in a server;
a classification module, configured to classify the target fault alarm according to first alarm information carried by the target fault alarm to obtain a target alarm type, wherein the alarm types of a fault alarm include a root cause alarm and an associated alarm, the root cause alarm indicates that the corresponding fault alarm is the root cause of the server fault, and the associated alarm indicates that the corresponding fault alarm is caused by an associated fault alarm that is a root cause alarm;
a first determination module, configured to determine, according to the target alarm type, whether to report the target fault alarm.
According to yet another aspect of the embodiments of the present application, a non-volatile readable storage medium is further provided, in which a computer program is stored, wherein the computer program is configured to execute, when run, the above method for filtering root causes of server faults.
According to yet another aspect of the embodiments of the present application, an electronic apparatus is further provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the above method for filtering root causes of server faults by means of the computer program.
In the embodiments of the present application, a target fault alarm generated in a server is acquired; the target fault alarm is classified according to first alarm information carried by the target fault alarm to obtain a target alarm type, wherein the alarm types of a fault alarm include a root cause alarm and an associated alarm, the root cause alarm indicates that the corresponding fault alarm is the root cause of the server fault, and the associated alarm indicates that the corresponding fault alarm is caused by an associated fault alarm that is a root cause alarm; and whether to report the target fault alarm is determined according to the target alarm type. That is, the target fault alarm generated in the server is first acquired, it is then classified as a root cause alarm or an associated alarm according to the first alarm information it carries, and finally whether to report it is determined according to the resulting target alarm type. This avoids the situation in which reporting a large number of associated alarms reduces the efficiency of server fault repair. The above technical solution solves problems in the related art such as the low efficiency of server fault repair, and achieves the technical effect of improving the efficiency of server fault repair.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and, together with the description, serve to explain the principles of the present application.
In order to explain the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, a person of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a schematic diagram of a hardware environment of a method for filtering root causes of server faults according to an embodiment of the present application;
FIG. 2 is a flowchart of a method for filtering root causes of server faults according to an embodiment of the present application;
FIG. 3 is a schematic diagram of the generation of a target fault alarm according to an embodiment of the present application;
FIG. 4 is a schematic diagram of the locating of a target alarm source according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a database of target fault alarms according to an embodiment of the present application;
FIG. 6 is a structural block diagram of an apparatus for filtering root causes of server faults according to an embodiment of the present application.
In order to enable a person skilled in the art to better understand the solutions of the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without creative effort shall fall within the scope of protection of the present application.
It should be noted that the terms "first", "second" and the like in the specification, the claims and the above drawings of the present application are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence. It should be understood that data used in this way are interchangeable where appropriate, so that the embodiments of the present application described herein can be implemented in an order other than those illustrated or described herein. In addition, the terms "include" and "have" and any variations thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device that includes a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product or device.
The method embodiments provided in the embodiments of the present application may be executed in a computer terminal, a device terminal or a similar computing apparatus. Taking execution on a computer terminal as an example, FIG. 1 is a schematic diagram of a hardware environment of a method for filtering root causes of server faults according to an embodiment of the present application. As shown in FIG. 1, the computer terminal may include one or more processors 102 (only one is shown in FIG. 1; the processor 102 may include, but is not limited to, a processing apparatus such as a microcontroller unit (MCU) or a field-programmable gate array (FPGA)) and a memory 104 configured to store data. In an exemplary embodiment, the computer terminal may further include a transmission device 106 configured for communication and an input/output device 108. A person of ordinary skill in the art will understand that the structure shown in FIG. 1 is only illustrative and does not limit the structure of the computer terminal. For example, the computer terminal may include more or fewer components than shown in FIG. 1, or have a different configuration that provides functions equivalent to or greater than those shown in FIG. 1.
The memory 104 is configured to store computer programs, for example, software programs and modules of application software, such as the computer program corresponding to the method for filtering root causes of server faults in the embodiments of the present application. By running the computer program stored in the memory 104, the processor 102 executes various functional applications and data processing, that is, implements the above method. The memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may include memory located remotely from the processor 102, and such remote memory may be connected to the computer terminal via a network. Examples of such a network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
The transmission device 106 is configured to receive or send data via a network. Examples of such a network may include a wireless network provided by the communication provider of the computer terminal. In one example, the transmission device 106 includes a network interface controller (NIC), which can be connected to other network devices through a base station so as to communicate with the Internet. In one example, the transmission device 106 may be a radio frequency (RF) module, which is configured to communicate with the Internet wirelessly.
This embodiment provides a method for filtering root causes of server faults, applied to the above computer terminal. FIG. 2 is a flowchart of a method for filtering root causes of server faults according to an embodiment of the present application. As shown in FIG. 2, the process includes the following steps:
Step S202: acquire a target fault alarm generated in a server;
Step S204: classify the target fault alarm according to first alarm information carried by the target fault alarm to obtain a target alarm type, wherein the alarm types of a fault alarm include a root cause alarm and an associated alarm, the root cause alarm indicates that the corresponding fault alarm is the root cause of the server fault, and the associated alarm indicates that the corresponding fault alarm is caused by an associated fault alarm that is a root cause alarm;
Step S206: determine, according to the target alarm type, whether to report the target fault alarm.
Through the above steps, the target fault alarm generated in the server is first acquired, and the target fault alarm is then classified according to the first alarm information it carries to obtain the target alarm type, which is either a root cause alarm or an associated alarm, wherein the root cause alarm indicates that the corresponding fault alarm is the root cause of the server fault, and the associated alarm indicates that the corresponding fault alarm is caused by an associated fault alarm that is a root cause alarm. Finally, whether to report the target fault alarm is determined according to the target alarm type, which avoids the situation in which reporting a large number of associated alarms reduces the efficiency of server fault repair. The above technical solution solves problems in the related art such as the low efficiency of server fault repair, and achieves the technical effect of improving the efficiency of server fault repair.
In the technical solution provided in step S202, the target fault alarm generated in the server is acquired.
Optionally, in this embodiment, the target fault alarm may be, but is not limited to, an alarm generated by the server about an abnormality of any device or hardware, and the hardware may include, but is not limited to, components or equipment such as a mainboard and a chassis.
Optionally, in this embodiment, a plurality of such devices may form a service cluster, various services are deployed on each device node, and there are software and hardware dependencies among these services.
In an exemplary embodiment, the target fault alarm generated in the server may be acquired in, but not limited to, the following manner: in a case where a fault is detected in the server, collecting target fault data of the fault; locating the alarm source of the fault according to the target fault data to obtain a target alarm source; recovering from the fault according to the target alarm source; and, in a case where the fault recovery fails, generating the target fault alarm.
Optionally, in this embodiment, FIG. 3 is a schematic diagram of the generation of a target fault alarm according to an embodiment of the present application. As shown in FIG. 3, in a case where a fault is detected in the server, the hardware collection filtering layer collects the target fault data of the fault; the fault diagnosis filtering layer locates the alarm source of the fault according to the target fault data to obtain the target alarm source; the fault repair filtering layer then recovers from the fault according to the target alarm source; and, in a case where the fault recovery fails, the target fault alarm is generated.
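The collect, locate and recover sequence described above might be sketched as follows. This is only an illustration under stated assumptions: the function name `generate_alarm_if_unrecovered`, the three callables and the dictionary layout of the alarm are hypothetical and are not defined by the application.

```python
from typing import Any, Callable, Dict, Optional


def generate_alarm_if_unrecovered(
    collect: Callable[[], Any],
    locate: Callable[[Any], str],
    recover: Callable[[str], bool],
) -> Optional[Dict[str, Any]]:
    """Run the three filtering layers in order; return an alarm only if recovery fails."""
    data = collect()        # hardware collection filtering layer
    source = locate(data)   # fault diagnosis filtering layer
    if recover(source):     # fault repair filtering layer
        return None         # fault recovered, so no alarm is generated
    # Hypothetical alarm record: the alarm source plus the data that led to it.
    return {"source": source, "data": data}
```

Passing the layers in as callables keeps the sketch independent of any particular hardware interface; a real system would bind them to its own collection, diagnosis and repair services.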
Optionally, in this embodiment, the target fault data may include, but is not limited to, data sensed by the hardware perception layer and collected by the hardware collection filtering layer, such as hardware temperature, voltage, and RAS (Reliability, Availability and Serviceability) signals.
Optionally, in this embodiment, as shown in FIG. 3, the hardware perception layer at the bottom provides basic hardware information about devices, components, mainboards and chassis; multiple device nodes form a service cluster; various services are deployed on each device node, and there are software and hardware dependencies among these services; above the node chassis management service and the service management service is cluster management (CM). At the cluster management layer, the fault alarm information of all nodes needs to be aggregated, after which alarms are suppressed and root cause alarms are filtered according to the designed alarm dependencies; alarm suppression and root cause alarm identification may also be performed by an intelligent inference filter. In this root cause alarm filtering framework, four filtering levels block root causes layer by layer, and together they implement the root cause alarm solution.
In an exemplary embodiment, the target fault data of the fault may be collected in, but not limited to, the following manner: retrying the fault a target number of times; in a case where the retries fail, collecting initial fault data of the fault; removing, from the initial fault data, data that falls outside a target data interval to obtain reference fault data; and averaging the reference fault data to obtain the target fault data.
Optionally, in this embodiment, retrying the fault a target number of times and collecting the initial fault data in a case where the retries fail may mean, but is not limited to, that for a given fault the failing operation is retried several times before an error is reported to the fault diagnosis layer, so that transient device faults caused by environmental interference and the like are filtered out;
Optionally, in this embodiment, removing data that falls outside the target data interval from the initial fault data to obtain the reference fault data may mean, but is not limited to, setting a reasonable data interval, treating data outside this interval as false values and discarding them, so that transient false values are removed and not reported to the fault diagnosis layer;
Optionally, in this embodiment, averaging the reference fault data to obtain the target fault data may mean, but is not limited to, taking the average of the data and then reporting the average to the fault diagnosis layer.
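The retry, range-filter and averaging steps above can be sketched as follows, assuming that the faulty operation can be re-attempted through a callable and that fault data are numeric samples. The names `collect_target_fault_data`, `operation`, `sample`, `retries`, `n_samples`, `low` and `high` are hypothetical.

```python
from statistics import mean
from typing import Callable, Optional


def collect_target_fault_data(
    operation: Callable[[], bool],
    sample: Callable[[], float],
    retries: int,
    n_samples: int,
    low: float,
    high: float,
) -> Optional[float]:
    """Return the averaged in-range fault data, or None if nothing should be reported."""
    # Retry the failing operation; one success means the fault was transient.
    for _ in range(retries):
        if operation():
            return None
    # All retries failed: collect the initial fault data.
    initial = [sample() for _ in range(n_samples)]
    # Discard values outside the target data interval as transient false values.
    reference = [x for x in initial if low <= x <= high]
    if not reference:
        return None
    # Average the remaining reference fault data to obtain the target fault data.
    return mean(reference)
```

For example, with the samples 40.0, 41.0, 999.0 and 39.0 and an interval of 0 to 125, the value 999.0 is discarded and the reported value is the mean of the remaining three samples.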
In an exemplary embodiment, the alarm source of the fault may be located according to the target fault data to obtain the target alarm source in, but not limited to, the following manner: obtaining, from fault data and fault causes that have a correspondence, the fault causes corresponding to the target fault data as candidate fault causes; searching the candidate fault causes for a target fault cause according to the topological relationship of the devices in the server and the target fault data; and determining the field replaceable unit (FRU) that corresponds to the target fault cause in the server as the target alarm source.
Optionally, in this embodiment, FIG. 4 is a schematic diagram of the locating of a target alarm source according to an embodiment of the present application. As shown in FIG. 4, the fault causes corresponding to the target fault data are obtained from the fault data and fault causes that have a correspondence, and are used as candidate fault causes. For example, if the hardware collection filtering layer of the chassis management service on mainboard A collects target fault data indicating that access to IIC (Inter-Integrated Circuit) sensor D on FRU N (FRU, Field Replaceable Unit) has failed, the corresponding candidate fault causes may include:
1. The IIC controller of MCU B on mainboard A is faulty;
2. The IIC 1 channel from mainboard A to the IIC switch is faulty;
3. The IIC switch C chip is faulty;
4. The IIC 2 channel from IIC switch C to FRU N is faulty;
5. IIC sensor D on FRU N is faulty.
The target fault cause is searched for among the candidate fault causes according to the target fault data and the topological relationship of the devices in the server (MCU B on mainboard A is connected to IIC switch C through IIC 1; IIC switch C is connected to sensor D in FRU N through IIC 2, to sensor E in FRU N through IIC 3, and to sensor F in FRU M through IIC 4); the field replaceable unit (FRU) that corresponds to the target fault cause in the server is determined as the target alarm source.
In an exemplary embodiment, the target fault cause may be searched for among the candidate fault causes according to the topological relationship of the devices in the server and the target fault data in, but not limited to, the following manner: searching the topological relationship of the devices in the server for a target topological relationship corresponding to the target fault data; and checking the candidate fault causes according to the operating status of the devices in the target topological relationship to obtain the target fault cause.
Optionally, in this embodiment, the candidate fault causes are checked according to the operating status of the devices in the target topological relationship to obtain the target fault cause, as shown in FIG. 4. For example:
Given the above hardware topology of MCU B, if the hardware collection filtering layer reports that MCU B fails to access all of sensor D, sensor E and sensor F, it is determined that IIC 1 of mainboard A is faulty, and a mainboard A IIC 1 channel fault is reported. The target fault cause may include:
1. The IIC controller of MCU B on mainboard A is faulty;
2. The IIC 1 channel from mainboard A to the IIC switch is faulty;
3. The IIC switch C chip is faulty.
Given the above hardware topology of MCU B, if the hardware collection filtering layer reports that MCU B fails to access one of sensor D, sensor E and sensor F while access to the other two sensors is normal, it is determined that an FRU sensor access fault has occurred. The target fault cause may include:
1. The IIC 2 channel from IIC switch C to FRU N is faulty;
2. IIC sensor D on FRU N is faulty.
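The narrowing of candidate causes in the two cases above can be sketched against the FIG. 4 topology. The tables `DOWNSTREAM` and `SHARED` and the function `narrow_causes` are hypothetical names, and the handling of exactly two failing sensors is an assumption of this sketch, since the description only covers the all-fail and single-fail cases.

```python
from typing import Dict, List

# Per-sensor path below IIC switch C, taken from the topology in FIG. 4:
# sensor -> (channel from switch C, FRU holding the sensor).
DOWNSTREAM = {
    "D": ("IIC 2", "FRU N"),
    "E": ("IIC 3", "FRU N"),
    "F": ("IIC 4", "FRU M"),
}

# Elements shared by every sensor access path.
SHARED = ["MCU B IIC controller", "IIC 1 channel", "IIC switch C chip"]


def narrow_causes(access_ok: Dict[str, bool]) -> List[str]:
    """Narrow candidate fault causes from per-sensor access results."""
    failed = [s for s, ok in access_ok.items() if not ok]
    if not failed:
        return []
    if len(failed) == len(DOWNSTREAM):
        # Every sensor is unreachable: suspect the shared upstream path.
        return list(SHARED)
    # Only some sensors fail: suspect each failed sensor and its own channel.
    # (Assumption: two failing sensors are treated like two single failures.)
    causes: List[str] = []
    for s in failed:
        channel, fru = DOWNSTREAM[s]
        causes += [f"{channel} channel", f"{fru} sensor {s}"]
    return causes
```

When only sensor D is unreachable, the result contains the IIC 2 channel and sensor D on FRU N, which matches the second case above; when all three sensors are unreachable, only the shared upstream elements remain.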
In an exemplary embodiment, the fault may be recovered from according to the target alarm source in, but not limited to, the following manner: obtaining, from alarm sources and recovery processes that have a correspondence, a target recovery process corresponding to the target alarm source; in a case where the target recovery process is obtained, executing the target recovery process; and, in a case where the target recovery process is not obtained or the target recovery process fails to execute, determining that the fault recovery has failed.
Optionally, in this embodiment, as shown in FIG. 3, the field replaceable unit (FRU) that corresponds to the target fault cause in the server is determined as the target alarm source, the fault repair filtering layer obtains the target recovery process corresponding to the target alarm source, and, in a case where the target recovery process is obtained, the target recovery process is executed. The fault repair filtering layer is responsible for automatically recovering software and hardware systems that have entered an abnormal state by mistake, so that the abnormality does not persist or spread. For example: 1. A state machine enters an abnormal state because of some low-probability trigger and cannot complete normal negotiation, so the device cannot be connected to the system normally. The device can be connected to the system through a retraining mechanism, or by powering the endpoint device off and on to restart training and negotiation, which improves device availability and avoids generating an alarm. 2. An IIC bus is hung by a transient abnormality of a device or of the environment, so that access to IIC devices fails. The IIC bus can be restored by measures such as resetting the IIC device tree.
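A minimal sketch of the recovery lookup follows, assuming that the correspondence between alarm sources and recovery processes is held in a dictionary of callables. The name `recover_fault` and the convention that a recovery process returns a truthy value on success are assumptions for illustration.

```python
from typing import Callable, Dict


def recover_fault(source: str, flows: Dict[str, Callable[[], bool]]) -> bool:
    """Return True if a recovery process exists for the source and succeeds."""
    flow = flows.get(source)
    if flow is None:
        # No recovery process is registered for this alarm source.
        return False
    try:
        return bool(flow())
    except Exception:
        # A recovery process that raises is treated as a failed recovery.
        return False
```

A return value of False covers both failure conditions in the text, a missing recovery process and a recovery process that fails to execute, and is the point at which the target fault alarm would be generated.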
In an exemplary embodiment, the target fault alarm may be generated in, but not limited to, the following manner: determining whether the target fault data falls within an alarm threshold range; and, in a case where the target fault data falls within the alarm threshold range, generating the target fault alarm.
Optionally, in this embodiment, the alarm threshold range needs to be set reasonably, for example with a hysteresis design for temperature or voltage, to avoid the ping-pong effect of an alarm being raised and recovered repeatedly. For example, the alarm value for a certain temperature is 39 degrees Celsius and the alarm recovery value is set to 37 degrees Celsius. When the actual temperature hovers around 39 degrees Celsius, a stable alarm is produced instead of repeated alarm and recovery events.
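The hysteresis design can be sketched with the 39 and 37 degree values from the example. The class name `HysteresisAlarm` and the choice of inclusive comparisons (raise at or above 39, recover at or below 37) are assumptions of this sketch.

```python
class HysteresisAlarm:
    """Alarm with separate raise and recovery thresholds to avoid ping-pong."""

    def __init__(self, raise_at: float = 39.0, clear_at: float = 37.0) -> None:
        self.raise_at = raise_at
        self.clear_at = clear_at
        self.active = False

    def update(self, value: float) -> bool:
        """Feed one reading and return whether the alarm is active afterwards."""
        if not self.active and value >= self.raise_at:
            self.active = True
        elif self.active and value <= self.clear_at:
            self.active = False
        return self.active


alarm = HysteresisAlarm()
readings = [38.5, 39.2, 38.8, 39.1, 38.6, 36.9]
states = [alarm.update(r) for r in readings]
```

With these readings the alarm is raised at 39.2 and stays active while the temperature moves between 38.6 and 39.1; it recovers only at 36.9, so a single alarm is produced rather than a series of alarm and recovery events.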
在上述步骤S204提供的技术方案中,根据目标故障告警携带的第一告警信息对目标故障告警进行分类,得到目标告警类型,其中,故障告警的告警类型包括:根因告警和关联告警,根因告警用于指示对应的故障告警是引起服务器故障的根本原因,关联告警用于指示对应的故障告警是由所关联的属于根因告警的故障告警引起的。In the technical solution provided in the above step S204, the target fault alarm is classified according to the first alarm information carried by the target fault alarm to obtain the target alarm type, wherein the alarm types of the fault alarm include: root cause alarm and associated alarm, the root cause alarm is used to indicate that the corresponding fault alarm is the root cause of the server failure, and the associated alarm is used to indicate that the corresponding fault alarm is caused by the fault alarm associated with the root cause alarm.
Optionally, in this embodiment, as shown in FIG. 4, suppose IIC Switch C raises a fault alarm, sensor D raises a fault alarm, and the fault of IIC Switch C is the root cause of the fault of sensor D. Then the fault alarm of IIC Switch C is a root cause alarm and the fault alarm of sensor D is an associated alarm.
In an exemplary embodiment, the target fault alarm may be classified according to the first alarm information it carries to obtain the target alarm type in, but not limited to, the following manner: searching the target fault alarm for an associated alarm field, where the associated alarm field indicates whether the target fault alarm is an associated alarm and the first alarm information includes the associated alarm field; when the associated alarm field indicates that the target fault alarm is an associated alarm, determining that the target alarm type is an associated alarm; and when the associated alarm field indicates that the target fault alarm is not an associated alarm, determining that the target alarm type is a root cause alarm.
Optionally, in this embodiment, the associated alarm field may include, but is not limited to, an alarm ID (identifier). The alarm ID may be a globally unique alarm type code; it is the unique index field that distinguishes one type of alarm event from another.
Optionally, in this embodiment, the target alarm type of the target fault alarm can be determined from the associated alarm field. The target alarm type indicates whether the target fault alarm depends on other alarms. If it does, the target alarm type is determined to be an associated alarm, and it is then necessary to check whether a root cause alarm exists. If the alarm has no such dependency, the target alarm type is determined to be a root cause alarm, and the alarm can be reported directly.
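The field-based classification can be illustrated with a short sketch. The names `FaultAlarm`, `AlarmType`, and `is_associated` are hypothetical; the application only states that the alarm carries a field indicating whether it is an associated alarm.

```python
from dataclasses import dataclass
from enum import Enum


class AlarmType(Enum):
    ROOT_CAUSE = "root_cause"
    ASSOCIATED = "associated"


@dataclass(frozen=True)
class FaultAlarm:
    alarm_id: str        # globally unique alarm type code
    is_associated: bool  # assumed encoding of the associated alarm field


def classify(alarm: FaultAlarm) -> AlarmType:
    """Map the associated alarm field to one of the two alarm types."""
    if alarm.is_associated:
        return AlarmType.ASSOCIATED
    return AlarmType.ROOT_CAUSE


def can_report_directly(alarm: FaultAlarm) -> bool:
    """Root cause alarms are reported directly; associated alarms need a further check."""
    return classify(alarm) is AlarmType.ROOT_CAUSE
```

In the FIG. 4 scenario, an alarm from IIC Switch C would carry `is_associated=False` and be reported directly, while the alarm from sensor D would carry `is_associated=True` and be held for the root cause check.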
In an exemplary embodiment, the target fault alarm may be classified according to the first alarm information it carries to obtain the target alarm type in, but not limited to, the following manner: extracting a target alarm feature from the target fault alarm, where the target alarm feature indicates the cause of the target fault alarm and the first alarm information includes the target alarm feature; and classifying the target fault alarm according to the target alarm feature to obtain the target alarm type.
Optionally, in this embodiment, the classification of the target fault alarm may be, but is not limited to being, decided on the basis of the target alarm feature corresponding to the target fault alarm.
In an exemplary embodiment, the target fault alarm may be classified according to the target alarm feature to obtain the target alarm type in, but not limited to, the following manner: when the target alarm feature indicates that the target fault alarm was caused by another fault alarm, determining that the target alarm type is an associated alarm; and when the target alarm feature indicates that the target fault alarm was caused by a hardware device in the server, determining that the target alarm type is a root cause alarm.
Optionally, in this embodiment, the case in which the target alarm feature indicates that the target fault alarm was caused by a hardware device in the server may include, but is not limited to, physical damage to the hardware; in that case the target alarm type may be determined to be a root cause alarm.
In an exemplary embodiment, the target fault alarm may be classified according to the first alarm information it carries to obtain the target alarm type in, but not limited to, the following manner: inputting the target fault alarm into a target alarm classification model, where the target alarm classification model is obtained by training an initial alarm classification model with first alarm samples labeled as root cause alarms and second alarm samples labeled as associated alarms; and obtaining the target alarm type output by the target alarm classification model.
Optionally, in this embodiment, the target alarm classification model classifies the input target fault alarm and determines its target alarm type.
In the technical solution provided in step S206 above, whether to report the target fault alarm is determined according to the target alarm type.
Optionally, in this embodiment, whether the target fault alarm is reported depends on the target alarm type. To avoid a flood of associated alarms that would interfere with identifying the root cause of a system fault, and to speed up repair, target fault alarms whose target alarm type is root cause alarm may be reported.
In an exemplary embodiment, whether to report the target fault alarm may be determined according to the target alarm type in, but not limited to, the following manner: when the target alarm type is a root cause alarm, reporting the target fault alarm; and when the target alarm type is an associated alarm, determining whether to report the target fault alarm according to second alarm information carried by the target fault alarm.
Optionally, in this embodiment, when the target alarm type is an associated alarm, whether to report the target fault alarm is determined according to the second alarm information carried by the target fault alarm.
In an exemplary embodiment, whether to report the target fault alarm may be determined according to the second alarm information it carries in, but not limited to, the following manner: obtaining a target association period corresponding to the target fault alarm, where the target association period indicates the time interval in which the target associated fault alarm, that is, the root cause alarm associated with the target fault alarm, is expected to occur; searching for the target associated fault alarm within a time range extending one target association period before and after the time at which the target fault alarm was acquired; when the target associated fault alarm is found, ignoring the target fault alarm; and when the target associated fault alarm is not found, reporting the target fault alarm.
Optionally, in this embodiment, the target association period may be, but is not limited to, the time window within which the root cause of an associated alarm is reported. If a root cause alarm is generated within the association period, the present alarm is invalid and need not be reported; if no root cause alarm is generated within the association period, the present alarm is valid and is reported. The target association period may be chosen according to the time difference between the reporting of the target fault alarm and the reporting of its target associated fault alarm, that is, the root cause alarm associated with it. For example, if the maximum possible time difference between the two reports is 1 minute, the association period can be set to 1 minute. This attribute may be stored in the event database entry for the target fault alarm in the cluster alarm root cause filter layer, as an inherent attribute of that alarm. After an associated alarm is reported to the CM, the CM checks whether a root cause alarm was reported within 1 minute before or after the alarm report; if so, the associated alarm need not be reported and only the root cause alarm is reported.
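The association period check can be sketched as a window search over received alarms. This is an illustration under stated assumptions: `AlarmEvent`, `should_report_associated`, and the use of timestamps in seconds are hypothetical, and the sketch assumes the caller has already waited until the window after the alarm has elapsed.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class AlarmEvent:
    alarm_id: str
    timestamp: float  # seconds, on any monotonic scale


def should_report_associated(
    alarm: AlarmEvent,
    root_cause_ids: Set[str],
    association_period_s: float,
    received: Iterable[AlarmEvent],
) -> bool:
    """Return True if no associated root cause alarm lies inside the window.

    The window spans one association period before and after the alarm.
    """
    earliest = alarm.timestamp - association_period_s
    latest = alarm.timestamp + association_period_s
    for event in received:
        if event.alarm_id in root_cause_ids and earliest <= event.timestamp <= latest:
            return False  # root cause alarm found: ignore the associated alarm
    return True  # no root cause alarm found: report the associated alarm
```

Because the window also extends after the alarm, a real filter would have to defer the decision for an associated alarm by up to one association period; how the CM schedules that wait is not specified in the text.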
In an exemplary embodiment, before searching for the target associated fault alarm within the time range extending one target association period before and after the acquisition time of the target fault alarm, the method further includes one of the following:
searching, among fault alarms and associated fault alarms that have a correspondence, for the target associated fault alarm corresponding to the target fault alarm;
extracting an associated fault alarm field from the target fault alarm, where the associated fault alarm field records the target associated fault alarm, that is, the root cause alarm associated with the target fault alarm.
Optionally, in this embodiment, FIG. 5 is a schematic diagram of a database of target fault alarms according to an embodiment of the present application. As shown in FIG. 5, the target associated fault alarm corresponding to the target fault alarm is looked up among fault alarms and associated fault alarms that have a correspondence. For example, given the alarm ID of the target fault alarm, the target associated fault alarms (root cause alarm ID 1 through root cause alarm ID N) corresponding to that alarm ID are looked up.
Optionally, in this embodiment, as shown in FIG. 5, the possible root cause alarms of an associated alarm are determined when the associated alarm and the root cause alarms are designed. This attribute is stored in the alarm's entry in the event database of the cluster alarm root cause filter layer, as an inherent attribute of the alarm. After the target fault alarm is reported to the CM, the CM, at the alarm root cause filter layer, checks whether any of the root cause alarms recorded in the database has been reported. If a root cause alarm has been reported, the associated alarm need not be reported and only the root cause alarm is reported. If no root cause alarm is found for the associated alarm, the associated alarm is itself the root cause and can be reported.
Optionally, in this embodiment, when the target alarm type of a target fault alarm is designed, it is first analyzed whether the alarm is a root cause alarm for the problem or depends on existing alarms, that is, whether it may be the result of a fault propagated from other root cause alarms. The value of the [Is associated alarm] field is then fixed, and this attribute is stored in the alarm's entry in the event database of the cluster alarm root cause filter layer, as an inherent attribute of the alarm.
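One way to picture the per-alarm attributes that the text says are stored in the event database is a record keyed by alarm ID. The sketch below is an assumption about layout only: the class `AlarmDefinition`, the dictionary `EVENT_DB`, the alarm IDs, and the attribute values are all invented for illustration and are not taken from FIG. 5.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class AlarmDefinition:
    alarm_id: str                          # globally unique alarm type code
    is_associated: bool                    # [Is associated alarm]
    root_cause_ids: Tuple[str, ...] = ()   # root cause alarm ID 1..N
    association_period_s: float = 0.0      # association period, in seconds
    recovers_on_restart: bool = False      # [Recovers on restart]


# Hypothetical entries for the IIC Switch C / sensor D example.
EVENT_DB: Dict[str, AlarmDefinition] = {
    "IIC_SWITCH_C_FAULT": AlarmDefinition("IIC_SWITCH_C_FAULT", is_associated=False),
    "SENSOR_D_FAULT": AlarmDefinition(
        "SENSOR_D_FAULT",
        is_associated=True,
        root_cause_ids=("IIC_SWITCH_C_FAULT",),
        association_period_s=60.0,
        recovers_on_restart=True,
    ),
}


def candidate_root_causes(alarm_id: str) -> Tuple[str, ...]:
    """Look up the root cause alarm IDs recorded for an associated alarm."""
    definition = EVENT_DB[alarm_id]
    return definition.root_cause_ids if definition.is_associated else ()
```

Keeping these attributes with the alarm definition, rather than computing them at run time, matches the text's description of them as inherent attributes fixed at design time.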
In an exemplary embodiment, after determining whether to report the target fault alarm according to the target alarm type, the method may further include, but is not limited to: determining a recovery timing corresponding to the target fault alarm according to third alarm information carried by the target fault alarm; and recovering the target fault alarm when it is detected that the server has reached the recovery timing.
Optionally, in this embodiment, the target fault alarm can be recovered in different ways. When it is detected that the server has reached the recovery timing, that is, when the corresponding recovery event is detected, the target fault alarm is recovered.
In an exemplary embodiment, the recovery timing corresponding to the target fault alarm may be determined according to the third alarm information it carries in, but not limited to, the following manner: searching the target fault alarm for a restart recovery field, where the restart recovery field indicates whether the target fault alarm recovers after the target device that generated it is restarted and the third alarm information includes the restart recovery field; and when the restart recovery field indicates that the target fault alarm recovers after the target device is restarted, determining that the recovery timing is the restart of the target device.
Optionally, in this embodiment, as shown in FIG. 5, if the restart recovery field found in the target fault alarm, for example [Recovers on restart], is "Yes", the recovery timing is determined to be the restart of the target device. For the specific scenario of a device restart or power cycle, the value of the [Recovers on restart] field needs to be defined; this attribute is stored in the alarm's entry in the event database of the cluster alarm root cause filter layer, as an inherent attribute of the alarm. If the alarm recovers after a restart, the CM reports that the alarm has recovered once the device has restarted and updates the local database; if the alarm does not recover after a restart, the CM does not report alarm recovery after the device restarts.
In an exemplary embodiment, the target fault alarm may be recovered when it is detected that the server has reached the recovery timing in, but not limited to, the following manner: detecting whether a restart operation has been performed on the target device; and recovering the target fault alarm when it is detected that a restart operation has been performed on the target device and the target device has restarted successfully.
Optionally, in this embodiment, it is detected whether a restart operation has been performed on the target device. If a restart operation has been performed and the target device has restarted successfully, the recovery timing has been reached and the target fault alarm is recovered.
In an exemplary embodiment, after searching the target fault alarm for the restart recovery field, the method may further include, but is not limited to: when the restart recovery field indicates that the target fault alarm does not recover after the target device that generated it is restarted, searching the target fault alarm for a device identification field, where the device identification field indicates the target device identifier of the target device that generated the target fault alarm; and determining that the recovery timing is a change of the device identifier at the location of the target device.
Optionally, in this embodiment, as shown in FIG. 5, an alarm whose [Recovers on restart] field is "No" indicates a hardware or device fault that does not recover when the system restarts. Such a fault can be recovered only after the customer or service personnel replace the device, so the CM recovers the alarm only after it receives an alarm indicating that the unique identification information of the device at that location, such as its SN (serial number), has changed.
In an exemplary embodiment, the target fault alarm may be recovered when it is detected that the server has reached the recovery timing in, but not limited to, the following manner: detecting the device identifier at the location of the target device; and recovering the target fault alarm when it is detected that the device identifier at the location of the target device has changed from the target device identifier to a reference device identifier.
Optionally, in this embodiment, when it is detected that the device identifier at the location of the target device has changed from the target device identifier to a reference device identifier, the target device has been replaced, and the target fault alarm is recovered.
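The two recovery timings described above, device restart and device replacement, can be sketched as two predicates over an active alarm. The names `ActiveAlarm`, `slot`, and `device_sn`, and the event shapes passed to the functions, are assumptions made for this example.

```python
from dataclasses import dataclass
from enum import Enum


class RecoveryTrigger(Enum):
    DEVICE_RESTART = "device_restart"
    DEVICE_REPLACED = "device_replaced"


@dataclass(frozen=True)
class ActiveAlarm:
    alarm_id: str
    slot: str                  # location of the device that raised the alarm
    device_sn: str             # serial number recorded when the alarm was raised
    recovers_on_restart: bool  # [Recovers on restart]


def recovery_trigger(alarm: ActiveAlarm) -> RecoveryTrigger:
    """Choose the recovery timing from the restart recovery field."""
    if alarm.recovers_on_restart:
        return RecoveryTrigger.DEVICE_RESTART
    return RecoveryTrigger.DEVICE_REPLACED


def recovered_by_restart(alarm: ActiveAlarm, restarted_slot: str, restart_ok: bool) -> bool:
    """A successful restart of the same device recovers a restart-recoverable alarm."""
    return alarm.recovers_on_restart and restart_ok and restarted_slot == alarm.slot


def recovered_by_replacement(alarm: ActiveAlarm, slot: str, observed_sn: str) -> bool:
    """A changed serial number at the same location recovers a hardware-fault alarm."""
    return (
        not alarm.recovers_on_restart
        and slot == alarm.slot
        and observed_sn != alarm.device_sn
    )
```

Comparing the observed serial number with the one stored in the alarm is one simple way to detect the change from the target device identifier to a reference device identifier; the text does not prescribe how the comparison is made.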
After passing through the four layers above (the hardware collection filter layer, the fault diagnosis filter layer, the fault repair filter layer, and the alarm root cause filter layer), the customer interface of the management software can present genuine fault alarms to customers and service personnel, who can then maintain the equipment on the basis of accurate fault alarms. The four layers of fault filtering provide accurate alarm reports and improve the accuracy and efficiency of service; they avoid alarming customers with an excess of alarms and reduce the direct and indirect economic losses of equipment and service providers. Faults are detected, diagnosed, repaired, aggregated, diagnosed for root cause, and reported as root cause alarms, which ensures that a root cause alarm can be raised after a problem occurs, improves the efficiency of problem resolution, lowers the recovery time objective (RTO) and recovery point objective (RPO), and improves customer satisfaction.
From the description of the above implementations, those skilled in the art will clearly understand that the method according to the above embodiments can be implemented by software together with the necessary general-purpose hardware platform, or alternatively by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product is stored in a non-volatile readable storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes a number of instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods of the embodiments of the present application.
FIG. 6 is a structural block diagram of an apparatus for filtering root causes of server faults according to an embodiment of the present application. As shown in FIG. 6, the apparatus includes:
an acquisition module 602, configured to acquire a target fault alarm generated in a server;
a classification module 604, configured to classify the target fault alarm according to first alarm information carried by the target fault alarm to obtain a target alarm type, where the alarm types of a fault alarm include a root cause alarm and an associated alarm, a root cause alarm indicates that the corresponding fault alarm is the root cause of the server fault, and an associated alarm indicates that the corresponding fault alarm was caused by an associated fault alarm that is a root cause alarm;
a first determination module 606, configured to determine whether to report the target fault alarm according to the target alarm type.
In the above embodiment, the target fault alarm generated in the server is first acquired and then classified according to the first alarm information it carries to obtain the target alarm type. The target alarm type is either a root cause alarm, which indicates that the corresponding fault alarm is the root cause of the server fault, or an associated alarm, which indicates that the corresponding fault alarm was caused by an associated fault alarm that is a root cause alarm. Finally, whether to report the target fault alarm is determined according to the target alarm type, which prevents a flood of reported associated alarms from lowering the efficiency of server fault repair. This technical solution addresses the low efficiency of server fault repair in the related art and achieves the technical effect of improving that efficiency.
In an exemplary embodiment, the classification module includes:
a first search unit, configured to search the target fault alarm for an associated alarm field, where the associated alarm field indicates whether the target fault alarm is an associated alarm and the first alarm information includes the associated alarm field;
a first determination unit, configured to determine that the target alarm type is an associated alarm when the associated alarm field indicates that the target fault alarm is an associated alarm;
a second determination unit, configured to determine that the target alarm type is a root cause alarm when the associated alarm field indicates that the target fault alarm is not an associated alarm.
In an exemplary embodiment, the classification module includes:
an extraction unit, configured to extract a target alarm feature from the target fault alarm, where the target alarm feature indicates the cause of the target fault alarm and the first alarm information includes the target alarm feature;
a classification unit, configured to classify the target fault alarm according to the target alarm feature to obtain the target alarm type.
In an exemplary embodiment, the classification unit is further configured to:
determine that the target alarm type is an associated alarm when the target alarm feature indicates that the target fault alarm was caused by another fault alarm;
determine that the target alarm type is a root cause alarm when the target alarm feature indicates that the target fault alarm was caused by a hardware device in the server.
In an exemplary embodiment, the classification module includes:
an input unit, configured to input the target fault alarm into a target alarm classification model, where the target alarm classification model is obtained by training an initial alarm classification model with first alarm samples labeled as root cause alarms and second alarm samples labeled as associated alarms;
an obtaining unit, configured to obtain the target alarm type output by the target alarm classification model.
In an exemplary embodiment, the first determination module includes:
a reporting unit, configured to report the target fault alarm when the target alarm type is a root cause alarm;
a third determination unit, configured to determine, when the target alarm type is an associated alarm, whether to report the target fault alarm according to second alarm information carried by the target fault alarm.
In an exemplary embodiment, the third determination unit is further configured to:
obtain a target association period corresponding to the target fault alarm, where the target association period indicates the time interval in which the target associated fault alarm, that is, the root cause alarm associated with the target fault alarm, is expected to occur;
search for the target associated fault alarm within a time range extending one target association period before and after the acquisition time of the target fault alarm;
ignore the target fault alarm when the target associated fault alarm is found;
report the target fault alarm when the target associated fault alarm is not found.
In an exemplary embodiment, the apparatus further includes one of the following:
a first search module, configured to search, among fault alarms and associated fault alarms that have a correspondence, for the target associated fault alarm corresponding to the target fault alarm, before the search for the target associated fault alarm within the time range extending one target association period before and after the acquisition time of the target fault alarm;
an extraction module, configured to extract an associated fault alarm field from the target fault alarm, where the associated fault alarm field records the target associated fault alarm, that is, the root cause alarm associated with the target fault alarm.
In an exemplary embodiment, the apparatus further includes:
a second determination module, configured to determine, after whether to report the target fault alarm has been determined according to the target alarm type, a recovery timing corresponding to the target fault alarm according to third alarm information carried by the target fault alarm;
a recovery module, configured to recover the target fault alarm when it is detected that the server has reached the recovery timing.
In an exemplary embodiment, the second determination module includes:
a second search unit, configured to search the target fault alarm for a restart recovery field, where the restart recovery field indicates whether the target fault alarm recovers after the target device that generated it is restarted and the third alarm information includes the restart recovery field;
a fourth determination unit, configured to determine that the recovery timing is the restart of the target device when the restart recovery field indicates that the target fault alarm recovers after the target device is restarted.
In an exemplary embodiment, the recovery module includes:
a first detection unit, configured to detect whether a restart operation has been performed on the target device;
a first recovery unit, configured to recover the target fault alarm when it is detected that a restart operation has been performed on the target device and the target device has restarted successfully.
In an exemplary embodiment, the apparatus further includes:
a second search module, configured to search the target fault alarm for a device identification field after the restart recovery field has been searched for, when the restart recovery field indicates that the target fault alarm does not recover after the target device that generated it is restarted, where the device identification field indicates the target device identifier of the target device that generated the target fault alarm;
a third determination module, configured to determine that the recovery timing is a change of the device identifier at the location of the target device.
在一个示例性实施例中,恢复模块,包括:In an exemplary embodiment, the recovery module includes:
第二检测单元,被设置为检测目标设备所在位置上的设备标识;A second detection unit is configured to detect a device identifier at a location where a target device is located;
第二恢复单元,被设置为在检测到目标设备所在位置上的设备标识从目标设备标识更换为参考设备标识的情况下,恢复目标故障告警。The second restoring unit is configured to restore the target fault alarm when it is detected that the device identifier at the location where the target device is located is changed from the target device identifier to the reference device identifier.
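As a non-limiting illustration, the recovery-timing logic described above (restart recovery field first, device identifier replacement otherwise) might be sketched in Python as follows. The class and field names (`FaultAlarm`, `restart_recovers`, `device_id`) and the way the restart result and the current identifier are passed in are assumptions made for this sketch; the application does not specify a data format.

```python
from dataclasses import dataclass
from enum import Enum


class RecoveryTiming(Enum):
    DEVICE_RESTART = "device_restart"          # alarm recovers once the device restarts
    DEVICE_ID_REPLACED = "device_id_replaced"  # alarm recovers once the device is swapped


@dataclass
class FaultAlarm:
    restart_recovers: bool  # hypothetical name for the restart recovery field
    device_id: str          # hypothetical name for the device identification field


def recovery_timing(alarm: FaultAlarm) -> RecoveryTiming:
    """Choose the recovery timing from the restart recovery field."""
    if alarm.restart_recovers:
        return RecoveryTiming.DEVICE_RESTART
    return RecoveryTiming.DEVICE_ID_REPLACED


def should_recover(alarm: FaultAlarm, *, restarted_ok: bool,
                   current_device_id: str) -> bool:
    """Return True when the recovery timing for this alarm has been reached."""
    if recovery_timing(alarm) is RecoveryTiming.DEVICE_RESTART:
        # Restart was performed and succeeded.
        return restarted_ok
    # Assumption: any non-empty identifier that differs from the recorded one
    # counts as the "reference device identifier" now occupying the location.
    return bool(current_device_id) and current_device_id != alarm.device_id
```

In this sketch an empty identifier (an unpopulated slot) is deliberately not treated as a replacement; that choice is an assumption, not something the text states.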
In an exemplary embodiment, the acquisition module includes:
A collection unit, configured to collect target fault data of a fault when the fault is detected in the server;
A positioning unit, configured to locate the alarm source of the fault according to the target fault data to obtain a target alarm source;
A third recovery unit, configured to recover the fault according to the target alarm source;
A generating unit, configured to generate the target fault alarm when recovery of the fault fails.
In an exemplary embodiment, the collection unit is configured to:
Perform a target number of retries for the fault;
If the retries fail, collect initial fault data of the fault;
Remove data that falls outside a target data interval from the initial fault data to obtain reference fault data;
Average the reference fault data to obtain the target fault data.
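As a non-limiting illustration, the four collection steps above (retry, collect, discard out-of-interval data, average) might be sketched in Python as follows. The callables `retry` and `read_samples`, the inclusive interval bounds, and the behaviour when no sample survives filtering are all assumptions made for this sketch.

```python
from statistics import mean
from typing import Callable, Optional, Sequence


def collect_target_fault_data(
    retry: Callable[[], bool],                   # hypothetical: True if a retry succeeds
    read_samples: Callable[[], Sequence[float]], # hypothetical: returns initial fault data
    retries: int,
    low: float,
    high: float,
) -> Optional[float]:
    """Return the averaged fault data, or None if a retry cleared the fault."""
    # Step 1: perform the target number of retries.
    for _ in range(retries):
        if retry():
            return None  # the fault went away, so nothing is collected

    # Step 2: the retries failed, so collect the initial fault data.
    initial = read_samples()

    # Step 3: drop samples outside the target data interval [low, high].
    reference = [x for x in initial if low <= x <= high]
    if not reference:
        # Assumption: the text does not say what happens in this case.
        raise ValueError("no fault data inside the target data interval")

    # Step 4: average the reference fault data.
    return mean(reference)
```

Filtering before averaging keeps a single spurious reading from dominating the value that is later used for locating the alarm source and for the threshold check.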
In an exemplary embodiment, the positioning unit is configured to:
Obtain, from fault data and fault causes that have a correspondence, the fault causes corresponding to the target fault data as candidate fault causes;
Search the candidate fault causes for a target fault cause according to the topological relationship of the devices in the server and the target fault data;
Determine the field replaceable unit (FRU) in the server that corresponds to the target fault cause as the target alarm source.
In an exemplary embodiment, the positioning unit is further configured to:
Search the topological relationships of the devices in the server for the target topological relationship corresponding to the target fault data;
Check the candidate fault causes against the running status of the devices in the target topological relationship to obtain the target fault cause.
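As a non-limiting illustration, the localization steps above might be sketched in Python as follows. The lookup tables, the use of a symptom string to stand in for the target fault data, and the convention that a candidate cause is named after the device it blames are all assumptions made for this sketch; the application does not define these structures.

```python
from typing import Dict, List, Optional

# Hypothetical correspondence between fault data (reduced here to a symptom
# string) and candidate fault causes.
CAUSES_BY_SYMPTOM: Dict[str, List[str]] = {
    "pcie_link_down": ["riser_card", "nic"],
}

# Hypothetical topology: devices on the path related to each symptom.
TOPOLOGY_BY_SYMPTOM: Dict[str, List[str]] = {
    "pcie_link_down": ["cpu", "riser_card", "nic"],
}

# Hypothetical mapping from a fault cause to its field replaceable unit.
FRU_BY_CAUSE: Dict[str, str] = {
    "riser_card": "FRU-RISER-1",
    "nic": "FRU-NIC-0",
}


def locate_alarm_source(symptom: str,
                        device_healthy: Dict[str, bool]) -> Optional[str]:
    """Return the FRU blamed for the fault, or None if no cause is confirmed."""
    candidates = CAUSES_BY_SYMPTOM.get(symptom, [])
    path = TOPOLOGY_BY_SYMPTOM.get(symptom, [])
    # Walk the target topological relationship and keep the first candidate
    # whose device is reported as not running normally.
    for device in path:
        if device in candidates and not device_healthy.get(device, True):
            return FRU_BY_CAUSE.get(device)
    return None
```

Devices missing from `device_healthy` are assumed healthy here, so an incomplete status report cannot by itself single out a FRU.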
In an exemplary embodiment, the third recovery unit is further configured to:
Obtain, from alarm sources and recovery processes that have a correspondence, the target recovery process corresponding to the target alarm source;
Execute the target recovery process when the target recovery process is obtained;
Determine that recovery of the fault has failed when the target recovery process is not obtained, or when execution of the target recovery process fails.
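As a non-limiting illustration, the recovery lookup above might be sketched in Python as follows. Representing each recovery process as a callable that returns `True` on success, and treating a raised exception as a failed execution, are assumptions made for this sketch.

```python
from typing import Callable, Dict


def recover_fault(alarm_source: str,
                  procedures: Dict[str, Callable[[], bool]]) -> bool:
    """Run the recovery process registered for the alarm source.

    Returns False, meaning fault recovery failed, when no process is
    registered or when the registered process does not succeed.
    """
    procedure = procedures.get(alarm_source)
    if procedure is None:
        return False  # no target recovery process was obtained
    try:
        return bool(procedure())
    except Exception:
        # Assumption: an exception counts as a failed execution.
        return False
```

A `False` result is the condition under which the generating unit goes on to consider raising the target fault alarm.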
In an exemplary embodiment, the generating unit is further configured to:
Determine whether the target fault data falls within an alarm threshold range;
Generate the target fault alarm when the target fault data falls within the alarm threshold range.
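As a non-limiting illustration, the threshold check above, combined with the condition that fault recovery has failed, might be sketched in Python as follows. The inclusive bounds and the dictionary used to represent the generated alarm are assumptions made for this sketch.

```python
from typing import Optional


def maybe_generate_alarm(target_fault_data: float, recovery_failed: bool,
                         low: float, high: float) -> Optional[dict]:
    """Return a fault alarm record, or None if no alarm should be raised."""
    if not recovery_failed:
        return None  # the fault was recovered, so no alarm is needed
    # Assumption: the alarm threshold range is the closed interval [low, high].
    if low <= target_fault_data <= high:
        return {"value": target_fault_data, "range": (low, high)}
    return None
```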
Optionally, for optional examples of this embodiment, reference may be made to the examples described in the above embodiments and optional implementations; details are not repeated here.
Obviously, those skilled in the art should understand that the modules or steps of the present application described above can be implemented by a general-purpose computing device. They can be concentrated on a single computing device or distributed across a network of multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device. In some cases, the steps shown or described can be executed in an order different from the one given here, or the modules or steps can each be made into a separate integrated circuit module, or several of them can be made into a single integrated circuit module. Thus, the present application is not limited to any specific combination of hardware and software.
The above are only optional implementations of the present application. It should be noted that a person of ordinary skill in the art may make several improvements and refinements without departing from the principles of the present application, and these improvements and refinements shall also be regarded as falling within the protection scope of the present application.
Claims (22)
- A method for filtering root causes of server faults, characterized by comprising: acquiring a target fault alarm generated in a server; classifying the target fault alarm according to first alarm information carried by the target fault alarm to obtain a target alarm type, wherein the alarm types of a fault alarm include a root cause alarm and an associated alarm, the root cause alarm indicates that the corresponding fault alarm is the root cause of the server fault, and the associated alarm indicates that the corresponding fault alarm is caused by an associated fault alarm that is a root cause alarm; and determining, according to the target alarm type, whether to report the target fault alarm.
- The method according to claim 1, characterized in that classifying the target fault alarm according to the first alarm information carried by the target fault alarm to obtain the target alarm type comprises: searching the target fault alarm for an associated alarm field, wherein the associated alarm field indicates whether the target fault alarm is the associated alarm, and the first alarm information includes the associated alarm field; when the associated alarm field indicates that the target fault alarm is the associated alarm, determining that the target alarm type is the associated alarm; and when the associated alarm field indicates that the target fault alarm is not the associated alarm, determining that the target alarm type is the root cause alarm.
- The method according to claim 1, characterized in that classifying the target fault alarm according to the first alarm information carried by the target fault alarm to obtain the target alarm type comprises: extracting a target alarm feature from the target fault alarm, wherein the target alarm feature indicates the cause of the target fault alarm, and the first alarm information includes the target alarm feature; and classifying the target fault alarm according to the target alarm feature to obtain the target alarm type.
- The method according to claim 3, characterized in that classifying the target fault alarm according to the target alarm feature to obtain the target alarm type comprises: when the target alarm feature indicates that the target fault alarm was caused by another fault alarm, determining that the target alarm type is the associated alarm; and when the target alarm feature indicates that the target fault alarm was caused by a hardware device in the server, determining that the target alarm type is the root cause alarm.
- The method according to claim 1, characterized in that classifying the target fault alarm according to the first alarm information carried by the target fault alarm to obtain the target alarm type comprises: inputting the target fault alarm into a target alarm classification model, wherein the target alarm classification model is obtained by training an initial alarm classification model using first alarm samples labeled as the root cause alarm and second alarm samples labeled as the associated alarm; and obtaining the target alarm type output by the target alarm classification model.
- The method according to claim 1, characterized in that determining, according to the target alarm type, whether to report the target fault alarm comprises: when the target alarm type is the root cause alarm, reporting the target fault alarm; and when the target alarm type is the associated alarm, determining whether to report the target fault alarm according to second alarm information carried by the target fault alarm.
- The method according to claim 6, characterized in that determining whether to report the target fault alarm according to the second alarm information carried by the target fault alarm comprises: obtaining a target association period corresponding to the target fault alarm, wherein the target association period indicates the time interval in which a target associated fault alarm, being a root cause alarm associated with the target fault alarm, is located; searching, within a time range extending the target association period before and after the acquisition time of the target fault alarm, for whether the target associated fault alarm has been acquired; when the target associated fault alarm is found, ignoring the target fault alarm; and when the target associated fault alarm is not found, reporting the target fault alarm.
- The method according to claim 7, characterized in that, before searching, within the time range extending the target association period before and after the acquisition time of the target fault alarm, for whether the target associated fault alarm has been acquired, the method further comprises one of the following: looking up the target associated fault alarm corresponding to the target fault alarm from fault alarms and associated fault alarms that have a correspondence; or extracting an associated fault alarm field from the target fault alarm, wherein the associated fault alarm field records the target associated fault alarm, being a root cause alarm associated with the target fault alarm.
- The method according to claim 1, characterized in that, after determining, according to the target alarm type, whether to report the target fault alarm, the method further comprises: determining a recovery timing corresponding to the target fault alarm according to third alarm information carried by the target fault alarm; and recovering the target fault alarm when it is detected that the server has reached the recovery timing.
- The method according to claim 9, characterized in that determining the recovery timing corresponding to the target fault alarm according to the third alarm information carried by the target fault alarm comprises: searching the target fault alarm for a restart recovery field, wherein the restart recovery field indicates whether the target fault alarm recovers after the target device that generated the target fault alarm is restarted, and the third alarm information includes the restart recovery field; and when the restart recovery field indicates that the target fault alarm recovers after the target device that generated the target fault alarm is restarted, determining that the recovery timing is the restart of the target device.
- The method according to claim 10, characterized in that recovering the target fault alarm when it is detected that the server has reached the recovery timing comprises: detecting whether a restart operation has been performed on the target device; and recovering the target fault alarm when it is detected that the restart operation has been performed on the target device and the target device has restarted successfully.
- The method according to claim 10, characterized in that, after searching the target fault alarm for the restart recovery field, the method further comprises: when the restart recovery field indicates that the target fault alarm does not recover after the target device that generated the target fault alarm is restarted, searching the target fault alarm for a device identification field, wherein the device identification field indicates the target device identifier of the target device that generated the target fault alarm; and determining that the recovery timing is the replacement of the device identifier at the location of the target device.
- The method according to claim 12, characterized in that recovering the target fault alarm when it is detected that the server has reached the recovery timing comprises: detecting the device identifier at the location of the target device; and recovering the target fault alarm when it is detected that the device identifier at the location of the target device has changed from the target device identifier to a reference device identifier.
- The method according to claim 1, characterized in that acquiring the target fault alarm generated in the server comprises: when a fault is detected in the server, collecting target fault data of the fault; locating the alarm source of the fault according to the target fault data to obtain a target alarm source; recovering the fault according to the target alarm source; and generating the target fault alarm when recovery of the fault fails.
- The method according to claim 14, characterized in that collecting the target fault data of the fault comprises: performing a target number of retries for the fault; if the retries fail, collecting initial fault data of the fault; removing data that falls outside a target data interval from the initial fault data to obtain reference fault data; and averaging the reference fault data to obtain the target fault data.
- The method according to claim 14, characterized in that locating the alarm source of the fault according to the target fault data to obtain the target alarm source comprises: obtaining, from fault data and fault causes that have a correspondence, the fault causes corresponding to the target fault data as candidate fault causes; searching the candidate fault causes for a target fault cause according to the topological relationship of the devices in the server and the target fault data; and determining the field replaceable unit (FRU) in the server that corresponds to the target fault cause as the target alarm source.
- The method according to claim 16, characterized in that searching the candidate fault causes for the target fault cause according to the topological relationship of the devices in the server and the target fault data comprises: searching the topological relationships of the devices in the server for the target topological relationship corresponding to the target fault data; and checking the candidate fault causes against the running status of the devices in the target topological relationship to obtain the target fault cause.
- The method according to claim 14, characterized in that recovering the fault according to the target alarm source comprises: obtaining, from alarm sources and recovery processes that have a correspondence, the target recovery process corresponding to the target alarm source; executing the target recovery process when the target recovery process is obtained; and determining that recovery of the fault has failed when the target recovery process is not obtained, or when execution of the target recovery process fails.
- The method according to claim 14, characterized in that generating the target fault alarm comprises: determining whether the target fault data falls within an alarm threshold range; and generating the target fault alarm when the target fault data falls within the alarm threshold range.
- An apparatus for filtering root causes of server faults, characterized by comprising: an acquisition module, configured to acquire a target fault alarm generated in a server; a classification module, configured to classify the target fault alarm according to first alarm information carried by the target fault alarm to obtain a target alarm type, wherein the alarm types of a fault alarm include a root cause alarm and an associated alarm, the root cause alarm indicates that the corresponding fault alarm is the root cause of the server fault, and the associated alarm indicates that the corresponding fault alarm is caused by an associated fault alarm that is a root cause alarm; and a first determination module, configured to determine, according to the target alarm type, whether to report the target fault alarm.
- A non-volatile readable storage medium, characterized in that the non-volatile readable storage medium includes a stored program, wherein the program, when run, executes the method according to any one of claims 1 to 19.
- An electronic apparatus, comprising a memory and a processor, characterized in that a computer program is stored in the memory, and the processor is configured to execute, by means of the computer program, the method according to any one of claims 1 to 19.
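As a non-limiting illustration, the classification by associated alarm field and the association-period filtering recited in the claims above might be sketched in Python as follows. The `Alarm` class, its field names, and the use of seconds for timestamps and periods are assumptions made for this sketch; the claims do not prescribe a data format.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Alarm:
    alarm_id: str
    timestamp: float                     # acquisition time, in seconds
    is_associated: bool = False          # hypothetical associated alarm field
    root_alarm_id: Optional[str] = None  # hypothetical associated fault alarm field
    association_period: float = 0.0      # target association period, in seconds


def classify(alarm: Alarm) -> str:
    """Classify an alarm from its associated alarm field."""
    return "associated" if alarm.is_associated else "root_cause"


def should_report(alarm: Alarm, received: Iterable[Alarm]) -> bool:
    """Decide whether the alarm is reported or ignored."""
    if classify(alarm) == "root_cause":
        return True  # root cause alarms are always reported

    # For an associated alarm, look for its root cause alarm within one
    # association period before and after the acquisition time.
    start = alarm.timestamp - alarm.association_period
    end = alarm.timestamp + alarm.association_period
    for other in received:
        if (other.alarm_id == alarm.root_alarm_id
                and classify(other) == "root_cause"
                and start <= other.timestamp <= end):
            return False  # the root cause was seen, so ignore this alarm
    return True  # no root cause found in the window, so report it
```

Reporting an associated alarm when its root cause is absent from the window keeps a fault from going unreported merely because the alarm that normally explains it never arrived.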
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310030520.6 | 2023-01-09 | ||
CN202310030520.6A CN115766402B (en) | 2023-01-09 | 2023-01-09 | Method and device for filtering server fault root cause, storage medium and electronic device |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2024148857A1 (en) | 2024-07-18 |
Family
ID=85348787
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2023/121451 WO2024148857A1 (en) | 2023-01-09 | 2023-09-26 | Method and apparatus for filtering root cause of server fault, and non-volatile readable storage medium and electronic apparatus |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN115766402B (en) |
WO (1) | WO2024148857A1 (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115766402B (en) * | 2023-01-09 | 2023-04-28 | 苏州浪潮智能科技有限公司 | Method and device for filtering server fault root cause, storage medium and electronic device |
CN117389997B (en) * | 2023-12-12 | 2024-04-16 | 云和恩墨(北京)信息技术有限公司 | Fault detection method and device for database installation flow, electronic equipment and medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110399347A (en) * | 2018-04-23 | 2019-11-01 | 华为技术有限公司 | Alarm log compression method, apparatus and system, storage medium |
CN110609759A (en) * | 2018-06-15 | 2019-12-24 | 华为技术有限公司 | Fault root cause analysis method and device |
CN110891283A (en) * | 2019-11-22 | 2020-03-17 | 超讯通信股份有限公司 | Small base station monitoring device and method based on edge calculation model |
CN115766402A (en) * | 2023-01-09 | 2023-03-07 | 苏州浪潮智能科技有限公司 | Method and device for filtering fault root cause of server, storage medium and electronic device |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107615708A (en) * | 2015-07-10 | 2018-01-19 | 华为技术有限公司 | Alarm information reporting method and device |
CN111459695B (en) * | 2020-03-12 | 2024-09-27 | 平安科技(深圳)有限公司 | Root cause positioning method, root cause positioning device, computer equipment and storage medium |
CN114253610A (en) * | 2021-11-25 | 2022-03-29 | 苏州浪潮智能科技有限公司 | Improved method and device for preventing system from being started normally due to device aging |
CN114356499A (en) * | 2021-12-27 | 2022-04-15 | 山东浪潮科学研究院有限公司 | Kubernetes cluster alarm root cause analysis method and device |
- 2023
  - 2023-01-09 CN CN202310030520.6A (CN115766402B), status: Active
  - 2023-09-26 WO PCT/CN2023/121451 (WO2024148857A1), status: unknown
Also Published As
Publication number | Publication date |
---|---|
CN115766402B (en) | 2023-04-28 |
CN115766402A (en) | 2023-03-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2024148857A1 (en) | Method and apparatus for filtering root cause of server fault, and non-volatile readable storage medium and electronic apparatus | |
US9672085B2 (en) | Adaptive fault diagnosis | |
Chen et al. | Towards intelligent incident management: why we need it and how we make it | |
CN109039740B (en) | Method and equipment for processing operation and maintenance monitoring alarm | |
CN106789306B (en) | Method and system for detecting, collecting and recovering software fault of communication equipment | |
CN106209405B (en) | Method for diagnosing faults and device | |
CN111814999B (en) | Fault work order generation method, device and equipment | |
CN110178121A (en) | A kind of detection method and its terminal of database | |
CN107124289B (en) | Weblog time alignment method, device and host | |
JP2017517060A (en) | Fault processing method, related apparatus, and computer | |
US20240272975A1 (en) | Method and system for upgrading cpe firmware | |
CN112000502B (en) | Processing method and device for mass error logs, electronic device and storage medium | |
WO2006117833A1 (en) | Monitoring simulating device, method, and program | |
US10938623B2 (en) | Computing element failure identification mechanism | |
CN113672456A (en) | Modular self-monitoring method, system, terminal and storage medium of application platform | |
CN105607973A (en) | Method, device and system for processing equipment failures in virtual machine system | |
CN115333923B (en) | Fault point tracing analysis method, device, equipment and medium | |
CN110489260B (en) | Fault identification method and device and BMC | |
CN108809729A (en) | The fault handling method and device that CTDB is serviced in a kind of distributed system | |
CN111611097A (en) | Fault detection method, device, equipment and storage medium | |
CN116723085A (en) | Service conflict processing method and device, storage medium and electronic device | |
US9690639B2 (en) | Failure detecting apparatus and failure detecting method using patterns indicating occurrences of failures | |
CN114327988B (en) | Visual network fault relation determination method and device | |
CN116264541A (en) | Multi-dimension-based database disaster recovery method and device | |
AU2014200806B1 (en) | Adaptive fault diagnosis |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 23915615; Country of ref document: EP; Kind code of ref document: A1 |