CN116684306B

CN116684306B - Fault prediction method, device, equipment and readable storage medium

Info

Publication number: CN116684306B
Application number: CN202310781383.XA
Authority: CN
Inventors: 李纪元; 张秀波; 王龙飞
Original assignee: Suzhou Inspur Intelligent Technology Co Ltd
Current assignee: Suzhou Inspur Intelligent Technology Co Ltd
Priority date: 2023-06-29
Filing date: 2023-06-29
Publication date: 2023-11-03
Anticipated expiration: 2043-06-29
Also published as: CN116684306A

Abstract

The invention discloses a fault prediction method, apparatus, device, and readable storage medium in the field of computer technology. Based on the data set collected by each sensor in a server at an alarm time, the invention constructs a graph structure for the alarm information. It then calculates fault prediction data from the attribute information of each edge node in the graph structure and performs fault prediction on the target component according to the fault prediction data. The attribute information of each edge node comprises the data change trend, the difference between the data and a preset threshold, and the degree of association between the data and the alarm information. As a result, the changes in many kinds of sensor data at a single alarm time are considered together. This ensures prediction accuracy and enables accurate prediction of faults in server components at a future time.

Description

Fault prediction method, device, equipment and readable storage medium

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a fault prediction method, apparatus, device, and readable storage medium.

Background

Currently, fault detection on server components is performed by the BMC (baseboard management controller) on the server, which promptly informs the user which server components have failed. However, this method can only indicate a fault once it has occurred. It cannot predict whether a component will fail at a future time.

Therefore, those skilled in the art need a way to predict whether each component in a server will fail at a future time.

Disclosure of Invention

In view of this, an object of the present invention is to provide a fault prediction method, apparatus, device, and readable storage medium for predicting whether each component in a server will fail at a future time. The specific solution is as follows:

In a first aspect, the present invention provides a fault prediction method, including:

acquiring the data set collected by each sensor in a server at the alarm time of a target component in the server;

creating an initial graph structure for the alarm information at the alarm time, with the target component as the central node and each sensor as an edge node connected to the central node;

marking attribute information for each edge node in the initial graph structure according to the data set to obtain a real-time graph structure, where the attribute information of each edge node comprises a data change trend, the difference between the data and a preset threshold, and the degree of association between the data and the alarm information; and

calculating fault prediction data according to the attribute information of each edge node in the real-time graph structure, and performing fault prediction on the target component according to the fault prediction data.

Optionally, marking attribute information for each edge node in the initial graph structure according to the data set to obtain a real-time graph structure includes:

acquiring a history graph structure, where the history graph structure is obtained from the historical sensor data collected when the target component raised alarm information of the same type at an earlier alarm time;

traversing the initial graph structure and, if an edge node in the initial graph structure does not exist in the history graph structure, creating that edge node in the history graph structure to obtain a graph structure to be marked; and

re-marking attribute information for each edge node in the graph structure to be marked according to the data set to obtain the real-time graph structure.

Optionally, the method further comprises:

if an edge node in the initial graph structure already exists in the history graph structure, skipping that edge node and continuing the traversal until the whole initial graph structure has been traversed.

Optionally, re-marking attribute information for each edge node in the graph structure to be marked according to the data set includes:

for each non-newly built edge node in the graph structure to be marked, querying the data set for the real-time sensor data corresponding to that edge node; and

updating, in the graph structure to be marked, the attribute information of the non-newly built edge node according to its historical attribute information in the history graph structure and the real-time sensor data.

Optionally, updating, in the graph structure to be marked, the attribute information of the non-newly built edge node according to its historical attribute information in the history graph structure and the real-time sensor data includes:

determining, in the graph structure to be marked, the data change trend and the difference between the data and the preset threshold in the attribute information of the non-newly built edge node according to the historical attribute information and the real-time sensor data; and

increasing the association degree in the historical attribute information according to a preset first rule, and adding the increased association degree to the attribute information of the non-newly built edge node in the graph structure to be marked.

Optionally, re-marking attribute information for each edge node in the graph structure to be marked according to the data set includes:

for each newly built edge node in the graph structure to be marked, querying the data set for the real-time sensor data corresponding to that edge node; and

updating, in the graph structure to be marked, the attribute information of the newly built edge node according to the real-time sensor data.

Optionally, updating, in the graph structure to be marked, the attribute information of the newly built edge node according to the real-time sensor data includes:

determining, in the graph structure to be marked, the data change trend and the difference between the data and the preset threshold in the attribute information of the newly built edge node according to the real-time sensor data; and

generating the association degree in the attribute information of the newly built edge node according to a preset second rule.

Optionally, obtaining the history graph structure from the historical sensor data includes:

acquiring the historical sensor data; and

taking the target component as the central node and each sensor corresponding to the historical sensor data as an edge node connected to the central node, and marking attribute information for each edge node according to the historical sensor data to obtain the history graph structure.

Optionally, before the fault prediction data are calculated according to the attribute information of each edge node in the real-time graph structure, the method further includes:

pruning the edge nodes in the real-time graph structure according to the attribute information of each edge node in the real-time graph structure.

Optionally, pruning the edge nodes in the real-time graph structure according to the attribute information of each edge node includes:

deleting the edge nodes whose association degree is smaller than an association degree threshold in the real-time graph structure.

Optionally, deleting the edge nodes whose association degree is smaller than the association degree threshold in the real-time graph structure includes:

sorting all edge nodes in the real-time graph structure in descending order of association degree to obtain a node sequence, and

retaining the first N edge nodes of the node sequence in the real-time graph structure;

or

sorting all edge nodes in the real-time graph structure in ascending order of association degree to obtain a node sequence, and

retaining the last N edge nodes of the node sequence in the real-time graph structure.

Optionally, calculating the fault prediction data according to the attribute information of each edge node in the real-time graph structure includes:

calculating a predicted value for each edge node in the real-time graph structure according to its attribute information, and summing all the predicted values to obtain the fault prediction data.

Optionally, calculating the predicted value of each edge node in the real-time graph structure according to its attribute information, and summing all the predicted values to obtain the fault prediction data, includes:

calculating the fault prediction data according to a target formula, where the target formula is:

$$S = [M_1 \times a + N_1 \times (1-a)] \times Q_1 + \dots + [M_i \times a + N_i \times (1-a)] \times Q_i + \dots + [M_L \times a + N_L \times (1-a)] \times Q_L = \sum_{i=1}^{L} [M_i \times a + N_i \times (1-a)] \times Q_i$$

where $S$ is the fault prediction data, $M_i$ is the data change trend in the attribute information of edge node $i$ in the real-time graph structure, $a$ is a preset coefficient, $N_i$ is the difference between the data and the preset threshold in the attribute information of edge node $i$, $Q_i$ is the association degree in the attribute information of edge node $i$, $i = 1, 2, 3, \dots, L$, and $L$ is the total number of edge nodes in the real-time graph structure.

Optionally, the method further includes:

acquiring another real-time graph structure corresponding to other alarm information raised by the target component at the same alarm time; and

if the other real-time graph structure and the real-time graph structure share edge nodes, merging the two graph structures through the shared edge nodes and displaying the merged graph structure.

Optionally, the method further includes:

calculating additional prediction data according to the merged graph structure, and superposing the additional prediction data on the fault prediction data.

Optionally, calculating the additional prediction data according to the merged graph structure includes:

determining the weight ratio of the shared edge nodes in the other real-time graph structure; and

calculating the product of the weight ratio and the fault prediction data corresponding to the other real-time graph structure, and taking the product as the additional prediction data.

Optionally, determining the weight ratio of the shared edge nodes in the other real-time graph structure includes:

taking the ratio of the number of shared edge nodes to the total number of edge nodes in the other real-time graph structure as the weight ratio;

or

taking the sum of the association degrees of the shared edge nodes in the other real-time graph structure as the weight ratio.

Optionally, performing fault prediction on the target component according to the fault prediction data includes:

determining the data interval to which the fault prediction data belong; and

calculating a fault occurrence probability interval of the target component based on the data interval.

In a second aspect, the present invention provides a fault prediction apparatus, including:

an acquisition module, configured to acquire the data set collected by each sensor in a server at the alarm time of a target component in the server;

a creation module, configured to create an initial graph structure for the alarm information at the alarm time, with the target component as the central node and each sensor as an edge node connected to the central node;

a marking module, configured to mark attribute information for each edge node in the initial graph structure according to the data set to obtain a real-time graph structure, where the attribute information of each edge node comprises a data change trend, the difference between the data and a preset threshold, and the degree of association between the data and the alarm information; and

a prediction module, configured to calculate fault prediction data according to the attribute information of each edge node in the real-time graph structure and perform fault prediction on the target component according to the fault prediction data.

Optionally, the marking module includes:

an acquisition unit, configured to acquire a history graph structure, where the history graph structure is obtained from the historical sensor data collected when the target component raised alarm information of the same type at an earlier alarm time;

a traversing unit, configured to traverse the initial graph structure and, if an edge node in the initial graph structure does not exist in the history graph structure, create that edge node in the history graph structure to obtain a graph structure to be marked; and

a marking unit, configured to re-mark attribute information for each edge node in the graph structure to be marked according to the data set to obtain the real-time graph structure.

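Optionally, the traversing unit is further configured to: if an edge node in the initial graph structure already exists in the history graph structure, skip that edge node and continue the traversal until the whole initial graph structure has been traversed.

Optionally, the marking unit is specifically configured to: for each non-newly built edge node in the graph structure to be marked, query the data set for its real-time sensor data and update its attribute information according to its historical attribute information in the history graph structure and the real-time sensor data; and for each newly built edge node, query the data set for its real-time sensor data and update its attribute information according to the real-time sensor data.

Optionally, the apparatus further includes a history graph structure generating module, configured to obtain the history graph structure from the historical sensor data;

correspondingly, the history graph structure generating module is specifically configured to:

acquire the historical sensor data; and take the target component as the central node and each sensor corresponding to the historical sensor data as an edge node connected to the central node, and mark attribute information for each edge node according to the historical sensor data to obtain the history graph structure.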

Optionally, the apparatus further includes:

a pruning module, configured to prune the edge nodes in the real-time graph structure according to the attribute information of each edge node before the fault prediction data are calculated from that attribute information.

Optionally, the pruning module is specifically configured to: delete the edge nodes whose association degree is smaller than an association degree threshold in the real-time graph structure.

Optionally, the prediction module is specifically configured to: calculate a predicted value for each edge node in the real-time graph structure according to its attribute information, and sum all the predicted values to obtain the fault prediction data.

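Optionally, the apparatus further includes:

a graph merging module, configured to acquire another real-time graph structure corresponding to other alarm information raised by the target component at the same alarm time and, if the other real-time graph structure and the real-time graph structure share edge nodes, merge the two graph structures through the shared edge nodes and display the merged graph structure.

Optionally, the apparatus further includes:

a data superposition module, configured to calculate additional prediction data according to the merged graph structure and superpose the additional prediction data on the fault prediction data.

Optionally, the data superposition module is specifically configured to: determine the weight ratio of the shared edge nodes in the other real-time graph structure, where the weight ratio is either the ratio of the number of shared edge nodes to the total number of edge nodes in the other real-time graph structure or the sum of the association degrees of the shared edge nodes in the other real-time graph structure; and take the product of the weight ratio and the fault prediction data corresponding to the other real-time graph structure as the additional prediction data.

Optionally, the prediction module is specifically configured to:

determine the data interval to which the fault prediction data belong, and calculate a fault occurrence probability interval of the target component based on the data interval.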

In a third aspect, the present invention provides an electronic device, comprising:

a memory, configured to store a computer program; and

a processor, configured to execute the computer program to implement the fault prediction method disclosed above.

In a fourth aspect, the present invention provides a readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the previously disclosed fault prediction method.

As can be seen from the above solution, the fault prediction method provided by the invention includes: acquiring the data set collected by each sensor in a server at the alarm time of a target component in the server; creating an initial graph structure for the alarm information at the alarm time, with the target component as the central node and each sensor as an edge node connected to the central node; marking attribute information for each edge node in the initial graph structure according to the data set to obtain a real-time graph structure, where the attribute information of each edge node comprises a data change trend, the difference between the data and a preset threshold, and the degree of association between the data and the alarm information; and calculating fault prediction data according to the attribute information of each edge node in the real-time graph structure, and performing fault prediction on the target component according to the fault prediction data.

The technical effect of the invention is as follows. When the target component in the server raises alarm information, the data set collected by each sensor in the server at the current alarm time is acquired. An initial graph structure for the alarm information at that alarm time is then created, with the target component as the central node and each sensor as an edge node connected to the central node. Next, attribute information is marked for each edge node in the initial graph structure according to the data set to obtain a real-time graph structure. Finally, fault prediction data are calculated from the attribute information of each edge node in the real-time graph structure, and fault prediction is performed on the target component according to the fault prediction data. The invention can thus predict, from the sensor data related to a given alarm at its alarm time, the probability that this type of alarm will occur at a future time. For many kinds of sensor data at one alarm time, the scheme considers together the data change trend, the difference between the data and the preset threshold, and the degree of association between the data and the alarm information. This ensures prediction accuracy and enables accurate prediction of faults of the target component at a future time.

Correspondingly, the fault prediction apparatus, device, and readable storage medium provided by the invention have the same technical effects.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flowchart of a fault prediction method disclosed by the invention;

FIG. 2 is a schematic diagram of an initial graph structure disclosed by the invention;

FIG. 3 is a flowchart of an implementation of the fault prediction scheme disclosed by the invention;

FIG. 4 is a schematic diagram of the trend of sensor data disclosed by the invention;

FIG. 5 is a schematic diagram of a real-time graph structure disclosed by the invention;

FIG. 6 is a schematic diagram of a fault prediction apparatus disclosed by the invention;

FIG. 7 is a schematic diagram of an electronic device disclosed by the invention;

FIG. 8 is a structural diagram of a server disclosed by the invention;

FIG. 9 is a structural diagram of a terminal disclosed by the invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

At present, fault detection on server components is performed by the BMC on the server, which promptly informs the user which server components have failed. However, this method can only indicate a fault once it has occurred. It cannot predict whether a component will fail at a future time. To address this, the present invention provides a fault prediction scheme that can predict whether each component in a server will fail at a future time.

Referring to fig. 1, the embodiment of the invention discloses a fault prediction method, which comprises the following steps:

S101: Acquire the data set collected by each sensor in the server at the alarm time of the target component in the server.

In this embodiment, the target component may be the CPU (Central Processing Unit), memory, a disk, a fan, or the like in the server. The sensors in the server may include temperature sensors, voltage sensors, and so on. To keep the data set from becoming too large, only the sensor data associated with the alarm information at the current alarm time may be acquired. For example, if the alarm information at the alarm time is that the CPU temperature is too high, the associated sensor data may be the fan speed, the power supply temperature, and so on. The sensor data associated with a given alarm can be recorded in advance in a preset data table, built from experience and from observed server operation. When the data set is acquired, the corresponding record in this table is queried to determine which sensors to read.
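The preset data table can be pictured as a simple lookup from an alarm type to the sensors associated with it. The following Python sketch uses hypothetical alarm names, sensor names, and a hypothetical `collect_data_set` helper, none of which come from the patent. Reading the actual sensor values, for example through the BMC, is left out.

```python
# Hypothetical preset data table: alarm type -> sensors whose data are associated
# with that alarm. The names are illustrative assumptions, not from the patent.
ASSOCIATED_SENSORS: dict[str, list[str]] = {
    "cpu_temperature_high": ["fan_speed", "psu_temperature"],
    "disk_error": ["disk_temperature", "disk_voltage"],
}


def collect_data_set(alarm: str, readings: dict[str, float]) -> dict[str, float]:
    """Keep only the readings of sensors that the table associates with `alarm`.

    `readings` stands for a snapshot of sensor values taken at the alarm time;
    sensors listed in the table but missing from the snapshot are skipped.
    """
    wanted = ASSOCIATED_SENSORS.get(alarm, [])
    return {name: readings[name] for name in wanted if name in readings}


snapshot = {"fan_speed": 8200.0, "psu_temperature": 61.5, "inlet_temperature": 24.0}
print(collect_data_set("cpu_temperature_high", snapshot))
# {'fan_speed': 8200.0, 'psu_temperature': 61.5}
```

An alarm type that is missing from the table yields an empty data set in this sketch. A real deployment might instead fall back to collecting all sensors and pruning after S103, as the next paragraph allows.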

Alternatively, all the sensor data in the server may be acquired first to form the data set. In that case, after S103 the real-time graph structure is pruned to reduce prediction complexity, and S104 is performed afterwards. In one embodiment, before the fault prediction data are calculated according to the attribute information of each edge node in the real-time graph structure, the method further includes: pruning the edge nodes in the real-time graph structure according to the attribute information of each edge node.

In one embodiment, pruning the edge nodes in the real-time graph structure according to the attribute information of each edge node includes: deleting the edge nodes whose association degree is smaller than an association degree threshold in the real-time graph structure. The association degree between the data and the alarm information represents how strongly the sensor data influence that alarm. The greater the influence, the higher the association degree, and the more likely it is that changes in the magnitude of that sensor data will cause a fault.

In one embodiment, deleting the edge nodes whose association degree is smaller than the association degree threshold includes: sorting all edge nodes in the real-time graph structure in descending order of association degree to obtain a node sequence, and retaining the first N edge nodes of the sequence in the real-time graph structure. Alternatively, all edge nodes are sorted in ascending order of association degree, and the last N edge nodes of the resulting sequence are retained. N may be a preset value.
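Both pruning variants can be sketched in a few lines of Python. The sketch assumes that the association degrees of the edge nodes are held in a dictionary keyed by a hypothetical sensor name. Sorting in descending order and keeping the first N gives the same nodes, up to ties, as sorting in ascending order and keeping the last N, so only the first form is shown. The threshold variant is included for comparison.

```python
def prune_by_threshold(association: dict[str, float], threshold: float) -> dict[str, float]:
    """Delete edge nodes whose association degree is smaller than `threshold`."""
    return {node: q for node, q in association.items() if q >= threshold}


def prune_top_n(association: dict[str, float], n: int) -> dict[str, float]:
    """Sort edge nodes by association degree (descending) and retain the first N."""
    ranked = sorted(association.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:max(n, 0)])


degrees = {"fan_speed": 0.9, "psu_temperature": 0.4, "inlet_temperature": 0.1}
print(prune_top_n(degrees, 2))           # {'fan_speed': 0.9, 'psu_temperature': 0.4}
print(prune_by_threshold(degrees, 0.3))  # {'fan_speed': 0.9, 'psu_temperature': 0.4}
```

The top-N form keeps the graph at a fixed size regardless of how the association degrees are scaled. The threshold form keeps a variable number of nodes.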

S102: Create an initial graph structure for the alarm information at the alarm time, with the target component as the central node and each sensor as an edge node connected to the central node.

S103: Mark attribute information for each edge node in the initial graph structure according to the data set to obtain a real-time graph structure.

The attribute information of each edge node comprises a data change trend, a difference between the data and a preset threshold value and a degree of association between the data and alarm information.

The initial graph structure and the real-time graph structure have the same nodes and the same connections. The only difference is that the edge nodes of the initial graph structure are not yet marked with attribute information. An example initial graph structure is shown in FIG. 2, in which a disk is the central node and four edge nodes are connected to it.
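The initial and real-time graph structures are star-shaped: one central node with sensor edge nodes around it. One possible in-memory representation is sketched below in Python. The class and field names, the use of `None` for an unmarked edge node, and the sensor names around the disk are all assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EdgeAttributes:
    trend: float        # M: data change trend
    deviation: float    # N: difference between the data and the preset threshold
    association: float  # Q: association degree between the data and the alarm


@dataclass
class AlarmGraph:
    """Star graph: one central node (the target component) linked to sensor edge nodes.

    An edge node mapped to None has no attribute information yet, as in the
    initial graph structure; once every edge node is marked, the same object
    plays the role of the real-time graph structure.
    """
    center: str
    alarm: str
    edges: dict[str, Optional[EdgeAttributes]] = field(default_factory=dict)

    def is_marked(self) -> bool:
        return all(attrs is not None for attrs in self.edges.values())


def create_initial_graph(component: str, alarm: str, sensors: list[str]) -> AlarmGraph:
    """S102: connect every sensor in the data set to the target component, unmarked."""
    return AlarmGraph(center=component, alarm=alarm, edges={s: None for s in sensors})


# Hypothetical disk-centred example with four edge nodes, in the spirit of FIG. 2.
graph = create_initial_graph(
    "disk0", "disk_error",
    ["disk_temperature", "disk_voltage", "fan_speed", "backplane_temperature"],
)
graph.edges["fan_speed"] = EdgeAttributes(trend=0.2, deviation=-0.1, association=0.5)
print(len(graph.edges), graph.is_marked())  # 4 False
```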

In one embodiment, marking attribute information for each edge node in the initial graph structure according to the data set to obtain a real-time graph structure includes the following steps:

- acquiring a history graph structure, which is obtained from the historical sensor data collected when the target component raised alarm information of the same type at an earlier alarm time;
- traversing the initial graph structure and, if an edge node in the initial graph structure does not exist in the history graph structure, creating that edge node in the history graph structure to obtain a graph structure to be marked;
- if an edge node in the initial graph structure already exists in the history graph structure, skipping that edge node and continuing the traversal until the whole initial graph structure has been traversed; and
- re-marking attribute information for each edge node in the graph structure to be marked according to the data set to obtain the real-time graph structure.

Alarm information of the same type has the same content as the alarm information in S102 but an earlier alarm time. For example, if the alarm information in S102 is that the CPU temperature is too high, alarm information of the same type is an earlier alarm that the CPU temperature was too high.

The history graph structure, the graph structure to be marked, the initial graph structure, and the real-time graph structure share the same central node. The graph structure to be marked, the initial graph structure, and the real-time graph structure also have the same number of nodes and the same connections. The history graph structure may have more or fewer edge nodes than the initial graph structure. A newly built edge node is an edge node that is in the initial graph structure but not in the history graph structure. A non-newly built edge node is an edge node that is in both the history graph structure and the initial graph structure.
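The traversal that turns the history graph structure into the graph structure to be marked can be sketched in Python as follows. The dictionary layout and the function name are assumptions. Taken literally, the traversal only adds edge nodes and never removes one. The sketch therefore carries over edge nodes that exist only in the history graph structure. Where the graph structure to be marked must match the initial graph structure node for node, those extra nodes would also have to be dropped.

```python
from typing import Optional

Attributes = dict[str, float]  # e.g. {"M": ..., "N": ..., "Q": ...}


def build_graph_to_be_marked(
    history_edges: dict[str, Attributes],
    initial_edges: list[str],
) -> tuple[dict[str, Optional[Attributes]], set[str]]:
    """Traverse the initial graph's edge nodes against the history graph.

    Returns the edge nodes of the graph structure to be marked (history
    attributes kept, newly built nodes mapped to None) and the set of newly
    built edge nodes.
    """
    to_be_marked: dict[str, Optional[Attributes]] = dict(history_edges)  # history left untouched
    newly_built: set[str] = set()
    for node in initial_edges:
        if node in to_be_marked:
            continue  # already in the history graph: skip and keep traversing
        to_be_marked[node] = None  # not in the history graph: create it
        newly_built.add(node)
    return to_be_marked, newly_built


history = {"fan_speed": {"M": 0.1, "N": -0.2, "Q": 0.5}}
edges, new_nodes = build_graph_to_be_marked(history, ["fan_speed", "inlet_temperature"])
print(sorted(edges), new_nodes)  # ['fan_speed', 'inlet_temperature'] {'inlet_temperature'}
```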

In one embodiment, re-marking attribute information for each edge node in the graph structure to be marked according to the data set includes: for each non-newly built edge node in the graph structure to be marked, querying the data set for the corresponding real-time sensor data; and updating, in the graph structure to be marked, the attribute information of the non-newly built edge node according to its historical attribute information in the history graph structure and the real-time sensor data. This update includes two steps. First, the data change trend and the difference between the data and the preset threshold in the attribute information of the non-newly built edge node are determined according to the historical attribute information and the real-time sensor data. Second, the association degree in the historical attribute information is increased according to a preset first rule, and the increased association degree is added to the attribute information of the non-newly built edge node in the graph structure to be marked. The sensor data of a non-newly built edge node appear again at this alarm time, so their influence on the alarm information can be considered large. This is why the association degree is increased. The preset first rule defines how the association degree is increased, for example by a fixed step.

In one embodiment, re-marking attribute information for each edge node in the graph structure to be marked according to the data set includes: for each newly built edge node in the graph structure to be marked, querying the data set for the corresponding real-time sensor data; and updating, in the graph structure to be marked, the attribute information of the newly built edge node according to the real-time sensor data. This update includes two steps. First, the data change trend and the difference between the data and the preset threshold in the attribute information of the newly built edge node are determined according to the real-time sensor data. Second, the association degree in the attribute information of the newly built edge node is generated according to a preset second rule. Under the second rule, for example, a weight table can be preset according to the influence of each kind of sensor data on the alarm information. The association degree of a newly built edge node is then determined by looking it up in this table.
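The two association-degree rules can be sketched in Python as below. The patent does not state how the data change trend M and the threshold difference N are computed. For non-newly built edge nodes it only says that the historical attributes are also used. The helper `trend_and_deviation` is therefore a hypothetical stand-in, applied to a short window of readings for every node. The fixed step, the weight table, and the default weight are also assumed values. Edge nodes without a reading in the data set are left out of the result.

```python
from typing import Optional

FIRST_RULE_STEP = 0.1  # hypothetical fixed step used by the first rule
WEIGHT_TABLE = {"fan_speed": 0.6, "psu_temperature": 0.5}  # hypothetical second-rule table
DEFAULT_WEIGHT = 0.2  # hypothetical fallback for sensors absent from the table


def trend_and_deviation(window: list[float], threshold: float) -> tuple[float, float]:
    """Hypothetical M and N: mean change per sample, and relative gap to a non-zero threshold."""
    m = (window[-1] - window[0]) / (len(window) - 1) if len(window) > 1 else 0.0
    n = (window[-1] - threshold) / threshold
    return m, n


def re_mark(
    to_be_marked: dict[str, Optional[dict[str, float]]],
    newly_built: set[str],
    data_set: dict[str, list[float]],
    thresholds: dict[str, float],
) -> dict[str, dict[str, float]]:
    """Mark M, N and Q for every edge node that has real-time data in the data set."""
    real_time: dict[str, dict[str, float]] = {}
    for node, history_attrs in to_be_marked.items():
        if node not in data_set:
            continue  # no real-time reading for this sensor at this alarm time
        m, n = trend_and_deviation(data_set[node], thresholds[node])
        if node in newly_built or history_attrs is None:
            q = WEIGHT_TABLE.get(node, DEFAULT_WEIGHT)  # second rule: table lookup
        else:
            q = history_attrs["Q"] + FIRST_RULE_STEP  # first rule: fixed-step increase
        real_time[node] = {"M": m, "N": n, "Q": q}
    return real_time


marked = re_mark(
    {"fan_speed": {"M": 100.0, "N": -0.2, "Q": 0.5}, "inlet_temperature": None},
    {"inlet_temperature"},
    {"fan_speed": [8000.0, 8200.0, 8400.0], "inlet_temperature": [24.0, 26.0]},
    {"fan_speed": 10000.0, "inlet_temperature": 40.0},
)
```

Here the recurring `fan_speed` node has its association degree raised from 0.5 to about 0.6. The newly built `inlet_temperature` node takes the default weight because it is not in the assumed table.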

In one embodiment, obtaining the history graph structure from the historical sensor data includes: acquiring the historical sensor data; taking the target component as the central node and each sensor corresponding to the historical sensor data as an edge node connected to the central node; and marking attribute information for each edge node according to the historical sensor data to obtain the history graph structure.

S104: Calculate fault prediction data according to the attribute information of each edge node in the real-time graph structure, and perform fault prediction on the target component according to the fault prediction data.

In one embodiment, calculating the fault prediction data according to the attribute information of each edge node in the real-time graph structure includes: calculating a predicted value for each edge node in the real-time graph structure according to its attribute information, and summing all the predicted values to obtain the fault prediction data.

In one embodiment, calculating the predicted value of each edge node in the real-time graph structure according to its attribute information, and summing all the predicted values to obtain the fault prediction data, includes calculating the fault prediction data according to a target formula:

$$S = [M_1 \times a + N_1 \times (1-a)] \times Q_1 + \dots + [M_i \times a + N_i \times (1-a)] \times Q_i + \dots + [M_L \times a + N_L \times (1-a)] \times Q_L = \sum_{i=1}^{L} [M_i \times a + N_i \times (1-a)] \times Q_i$$

where $S$ is the fault prediction data, $M_i$ is the data change trend in the attribute information of edge node $i$ in the real-time graph structure, $a$ is a preset coefficient, $N_i$ is the difference between the data and the preset threshold in the attribute information of edge node $i$, $Q_i$ is the association degree in the attribute information of edge node $i$, $i = 1, 2, 3, \dots, L$, and $L$ is the total number of edge nodes in the real-time graph structure.
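The target formula translates directly into code. In the Python sketch below, each edge node's attributes are assumed to be stored under the keys `M`, `N`, and `Q`. The attribute values and the choice a = 0.5 are made up for illustration. Restricting a to [0, 1] is an added assumption, suggested by the complementary weights a and 1 - a.

```python
def fault_prediction_data(real_time: dict[str, dict[str, float]], a: float) -> float:
    """Target formula: S = sum over edge nodes i of [M_i * a + N_i * (1 - a)] * Q_i."""
    if not 0.0 <= a <= 1.0:
        raise ValueError("preset coefficient a is assumed to lie in [0, 1]")
    return sum(
        (attrs["M"] * a + attrs["N"] * (1.0 - a)) * attrs["Q"]
        for attrs in real_time.values()
    )


edges = {
    "fan_speed": {"M": 0.2, "N": 0.5, "Q": 0.8},
    "psu_temperature": {"M": 0.1, "N": 0.3, "Q": 0.5},
}
s = fault_prediction_data(edges, a=0.5)  # (0.1 + 0.25) * 0.8 + (0.05 + 0.15) * 0.5, about 0.38
```

A larger a weights the data change trend more heavily. A smaller a weights the distance from the threshold more heavily. The association degree Q scales each edge node's contribution in either case.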

In one embodiment, the method further includes: acquiring another real-time graph structure corresponding to other alarm information raised by the target component at the same alarm time; and, if the other real-time graph structure and the real-time graph structure share edge nodes, merging the two graph structures through the shared edge nodes and displaying the merged graph structure. Additional prediction data are then calculated according to the merged graph structure and superposed on the fault prediction data. In this way, the merged graph structure can be used to predict the likelihood that the target component raises the alarm information of S102 again at a future time, as well as the likelihood that it raises the other alarm information again. The other alarm information is an alarm of the target component generated at the same time as the alarm information in S102. For example, if the alarm information in S102 is that the CPU temperature is too high, another alarm raised by the CPU at the same time may be that the load is too high. If the two real-time graph structures share edge nodes, they can be linked and merged through those shared edge nodes. Besides being displayed, the merged graph structure can also take part in fault prediction for the target component. For example, it can be used to predict the likelihood that the CPU temperature becomes too high again at a future time, or that the CPU load becomes too high again at a future time.

In one embodiment, calculating the additional prediction data from the merged graph structure includes: computing the weight proportion of the shared edge nodes in the other real-time graph structure; and calculating the product of the weight proportion and the fault prediction data corresponding to the other real-time graph structure, and taking the product as the additional prediction data. The weight proportion of the shared edge nodes in the other real-time graph structure is computed in either of two ways: the ratio of the number of shared edge nodes to the total number of edge nodes in the other real-time graph structure is used as the weight proportion; or the sum of the association degrees of the shared edge nodes in the other real-time graph structure is used as the weight proportion.
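The two ways of computing the weight proportion, and the resulting additional prediction data, could look like the following sketch. It assumes the other real-time graph's edge nodes are given as a hypothetical mapping from sensor name to association degree. The patent does not say whether association degrees are normalized, so the second option is implemented exactly as stated, as a plain sum.

```python
from typing import Dict, Set


def weight_proportion_by_count(shared: Set[str], other_edges: Dict[str, float]) -> float:
    """Option 1: number of shared edge nodes / total edge nodes in the other graph."""
    return len(shared) / len(other_edges)


def weight_proportion_by_association(shared: Set[str], other_edges: Dict[str, float]) -> float:
    """Option 2: sum of the shared nodes' association degrees in the other graph."""
    return sum(other_edges[node] for node in shared)


def additional_prediction(weight: float, other_score: float) -> float:
    """Additional prediction data = weight proportion x the other graph's fault prediction data."""
    return weight * other_score


other_edges = {"cpu_temp": 4.0, "cpu_usage": 2.0}  # hypothetical: sensor -> association degree
shared = {"cpu_temp"}
w = weight_proportion_by_count(shared, other_edges)  # 1 shared node out of 2
print(additional_prediction(w, 10.0))  # → 5.0
```

The additional prediction data would then be added to the fault prediction data of the current graph, i.e. S + additional_prediction(w, S_other).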

In one embodiment, performing fault prediction on the target component based on the fault prediction data includes: determining the data interval to which the fault prediction data belongs; and calculating a failure occurrence probability interval of the target component based on that data interval. The larger the fault prediction data, the more likely the target component is to generate the alarm information in S102 again. To describe the failure probability quantitatively, this embodiment presets a plurality of data intervals; the data interval into which the fault prediction data falls is determined, and the two endpoints of that interval are converted into percentages to obtain the failure occurrence probability interval.
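The interval lookup can be sketched as follows. The patent specifies neither the preset intervals nor how their endpoints become percentages, so the `boundaries` list and the normalization constant `scale` below are hypothetical parameters.

```python
import bisect
from typing import List, Tuple


def failure_probability_interval(score: float, boundaries: List[float],
                                 scale: float) -> Tuple[float, float]:
    """Find the preset data interval containing `score` and express its
    endpoints as percentages.

    boundaries: sorted endpoints, e.g. [0, 10, 20, 30] -> [0,10), [10,20), [20,30]
    scale: hypothetical constant mapping an endpoint to a percentage
    """
    if not boundaries[0] <= score <= boundaries[-1]:
        raise ValueError("score lies outside the preset data intervals")
    idx = bisect.bisect_right(boundaries, score) - 1
    idx = min(idx, len(boundaries) - 2)  # the top endpoint belongs to the last interval
    low, high = boundaries[idx], boundaries[idx + 1]
    return 100 * low / scale, 100 * high / scale


print(failure_probability_interval(13, [0, 10, 20, 30], 40))  # → (25.0, 50.0)
```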

It can be seen that, in this embodiment, based on the alarm information of the target component in the server, the data set collected by each sensor in the server at the current alarm moment is acquired. An initial graph structure of the alarm information at the alarm moment is then created, with the target component as the central node and each sensor as an edge node connected to the central node. Attribute information is marked for each edge node in the initial graph structure according to the data set to obtain a real-time graph structure. Finally, fault prediction data is calculated according to the attribute information of each edge node in the real-time graph structure, and fault prediction is performed on the target component according to the fault prediction data. The invention can thus predict, from the sensor data related to a given alarm at the alarm moment, the probability that this type of alarm will occur at a future moment. Because the scheme jointly considers, for each kind of sensor data at the alarm moment, the data change trend, the difference between the data and the preset threshold, and the association degree between the data and the alarm information, it can ensure prediction accuracy and achieve accurate prediction of faults of the target component at a future moment.

The invention can be implemented with deep learning technology: a deep learning model is constructed, and the method provided by the invention is carried out by that model.

First, a deep learning model is built. Referring to Fig. 3, the deep learning model includes an input layer, a network layer and an output layer. The input layer receives a given piece of alarm information of a given server component together with the sensor data at the current moment; the network layer constructs and updates the graph structure based on the alarm information and the sensor data at the current moment, and calculates the fault prediction data; and the output layer performs fault prediction according to the fault prediction data. For the specific structures of the input, network and output layers, the training process of the model, and the loss function and optimizer used for training, reference may be made to the related art; they are not described again in this embodiment.

Specifically, a given piece of alarm information of a given server component may be obtained from the system event log recorded by the BMC or the BIOS (Basic Input Output System), and the sensor data may be obtained from the monitoring data recorded by the BMC. Generally, one piece of alarm information includes: the log number, the log generation time, the log source (BMC or BIOS), the type of the alarming component, the sensor number of the alarm, the alarm type (e.g., threshold-triggered, descriptive, or status type), whether the current alarm is being triggered or released, and a detailed description of the alarm.
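One way to hold a parsed alarm record is a small data class mirroring the fields listed above. The class and field names, and the example values, are hypothetical; parsing the actual BMC/BIOS system event log format is out of scope for this sketch.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class LogSource(Enum):
    BMC = "BMC"
    BIOS = "BIOS"


@dataclass(frozen=True)
class AlarmRecord:
    """Fields of one piece of alarm information (hypothetical names)."""
    log_number: int
    generated_at: datetime  # log generation time
    source: LogSource       # BMC or BIOS
    component_type: str     # type of the alarming component, e.g. "CPU"
    sensor_number: int      # number of the sensor that raised the alarm
    alarm_type: str         # e.g. threshold-triggered, descriptive or status type
    triggered: bool         # True: alarm triggered; False: alarm released
    description: str        # detailed description of the alarm


record = AlarmRecord(
    log_number=1,
    generated_at=datetime(2023, 6, 29, 8, 30),
    source=LogSource.BMC,
    component_type="CPU",
    sensor_number=0x01,
    alarm_type="threshold",
    triggered=True,
    description="CPU Thermal Trip Occurred",
)
print(record.source.value)  # → BMC
```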

The processing of alarm information A and the related sensor data by the network layer includes: determining the time of the alarm; acquiring each piece of sensor data at that time, identifying the sensor data strongly correlated and weakly correlated with the alarm, and assigning a weight (that is, the association degree) to each piece of sensor data; pruning the sensor data by weight, e.g., retaining the sensor data with large weights; and constructing a graph structure from the retained sensor data, centered on the currently alarming component, and storing the graph structure.
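A sketch of the pruning-and-construction step, assuming the weights (association degrees) have already been assigned, since the patent leaves the correlation analysis to the model. Pruning is shown as a hypothetical minimum-weight cutoff `min_weight`, and the store is a plain dict keyed by the central node; the example reuses the weights 4, 2, 3 and 1 from the CPU example in this description.

```python
from typing import Dict


def build_alarm_graph(center: str, weights: Dict[str, float], min_weight: float,
                      store: Dict[str, dict]) -> dict:
    """Build and store a star graph for one alarm.

    center: the currently alarming component / alarm (central node)
    weights: sensor name -> weight (association degree)
    min_weight: hypothetical pruning cutoff; lighter sensors are dropped
    """
    kept = {sensor: w for sensor, w in weights.items() if w >= min_weight}
    graph = {"center": center, "edges": kept}
    store[center] = graph  # keep the graph for later updates
    return graph


store: Dict[str, dict] = {}
graph = build_alarm_graph(
    "07000102",
    {"cpu_temp": 4, "inlet_temp": 2, "fan_speed": 3, "psu_temp": 1},
    min_weight=2,
    store=store,
)
print(sorted(graph["edges"]))  # → ['cpu_temp', 'fan_speed', 'inlet_temp']
```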

If alarm information A occurs again, the graph structure is updated according to the same flow, re-marking the change trend of each piece of sensor data, the difference between the data and the preset threshold, and the weight of the sensor data. The change trend of a piece of sensor data is determined by analyzing the data collected by the same sensor at two moments.
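The re-marking of an existing edge node can be sketched as below. Computing the trend as the difference between the two most recent readings, and the threshold gap as reading minus threshold, are assumptions about form and sign. The fixed `step` increment is a hypothetical stand-in for the unspecified preset first rule that the marking-unit embodiment uses to increase the association degree.

```python
from dataclasses import dataclass


@dataclass
class EdgeState:
    value: float        # last sensor reading
    trend: float        # data change trend (M)
    gap: float          # difference from the preset threshold (N)
    association: float  # association degree / weight (Q)


def update_edge(prev: EdgeState, reading: float, threshold: float,
                step: float = 1.0) -> EdgeState:
    """Re-mark an edge node when the same alarm occurs again."""
    return EdgeState(
        value=reading,
        trend=reading - prev.value,             # change between two readings of the same sensor
        gap=reading - threshold,                # distance from the preset threshold
        association=prev.association + step,    # hypothetical "first rule"
    )


updated = update_edge(EdgeState(value=90, trend=0, gap=-10, association=4), reading=97, threshold=100)
print(updated.trend, updated.gap)  # → 7 -3
```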

Illustratively, suppose alarm information A is CPU Thermal Trip Occurred, that is, the CPU has reached its limit temperature. The time at which the alarm occurs is extracted, together with the alarm event code at that moment, 07000102. This alarm event code, which represents both the CPU component and alarm information A, is taken as the central node for constructing the graph structure. The sensor data at each alarm moment is determined according to the chronological order of the multiple occurrences of alarm information A, and the data change trend diagram shown in Fig. 4 is then drawn. The sensor data at each alarm moment includes the CPU temperature, the air inlet temperature, the fan speed and the power supply temperature. Different weights can be assigned to different sensor data; in this embodiment, the weights assigned to the CPU temperature, air inlet temperature, fan speed and power supply temperature are 4, 2, 3 and 1, respectively.
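Collecting the sensor data of each alarm moment in chronological order, as a trend diagram such as Fig. 4 requires, might look like this sketch. Snapshots are assumed to be hypothetical `(timestamp, {sensor: value})` pairs, one per occurrence of alarm information A, possibly arriving out of order; the readings in the example are made up.

```python
from typing import Dict, List, Tuple


def sensor_series(snapshots: List[Tuple[float, Dict[str, float]]]) -> Dict[str, List[float]]:
    """Turn per-alarm sensor snapshots into per-sensor series ordered by alarm time."""
    series: Dict[str, List[float]] = {}
    for _, readings in sorted(snapshots, key=lambda snap: snap[0]):
        for sensor, value in readings.items():
            series.setdefault(sensor, []).append(value)
    return series


snapshots = [
    (2.0, {"cpu_temp": 85, "fan_speed": 4200}),
    (1.0, {"cpu_temp": 80, "fan_speed": 5000}),
]
print(sensor_series(snapshots))  # → {'cpu_temp': [80, 85], 'fan_speed': [5000, 4200]}
```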

Referring to Fig. 4, analyzing the numerical trends over the first two historical moments shows that the CPU temperature rises sharply, the fan speed drops sharply, the air inlet temperature rises, and the power supply temperature decreases steadily. From this, the real-time graph structure shown in Fig. 5 can be formed. When alarm information A is subsequently input into the model, the corresponding graph structure is updated; the updated content includes the number of nodes connected to the central node and, for each node, its weight, its trend and its difference from the preset threshold (the normal value collected by each sensor). As the deep learning model continues to be applied, the output graph structure reflects increasingly accurate information.
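Labels such as "large rise" or "large decline" need a quantitative cutoff that the patent does not give. The sketch below assumes a hypothetical relative-change cutoff `large`; relative change is used because the sensors have different units (degrees versus fan revolutions per minute).

```python
def classify_trend(previous: float, current: float, large: float = 0.2) -> str:
    """Classify the change between two readings of the same sensor.

    `large` is a hypothetical relative-change cutoff for a "large" rise or decline."""
    if previous == 0:
        raise ValueError("relative change is undefined for a zero previous reading")
    change = (current - previous) / abs(previous)
    if change >= large:
        return "large rise"
    if change <= -large:
        return "large decline"
    if change > 0:
        return "rise"
    if change < 0:
        return "decline"
    return "steady"


print(classify_trend(60, 80))      # → large rise
print(classify_trend(5000, 3000))  # → large decline
```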

For any alarm information of any server component, when the BMC inputs the sensor data at each moment into the model, the model builds and updates the graph structure, calculates the fault prediction data, and then performs fault prediction according to the fault prediction data. When the predicted fault occurrence probability exceeds a threshold, the user is prompted. The prompt can take various forms; depending on the user's settings, it can be presented as a log entry or sent by email.
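A minimal notification sketch using Python's standard `logging` module. The email path is modelled as a hypothetical callback `send_mail` rather than any particular mail API, and the threshold comparison mirrors the rule above.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("fault_prediction")


def notify_if_needed(component: str, probability: float, threshold: float,
                     send_mail: Optional[Callable[[str], None]] = None) -> bool:
    """Prompt the user when the predicted fault probability exceeds the threshold.

    Returns True if a prompt was issued."""
    if probability <= threshold:
        return False
    message = (f"{component}: predicted failure probability {probability:.0%} "
               f"exceeds threshold {threshold:.0%}")
    if send_mail is not None:
        send_mail(message)       # hypothetical mail hook selected by user settings
    else:
        logger.warning(message)  # default: present the prompt as a log entry
    return True


outbox = []
notify_if_needed("CPU", 0.8, 0.5, outbox.append)
print(outbox[0])  # → CPU: predicted failure probability 80% exceeds threshold 50%
```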

Therefore, in this embodiment, a graph structure related to the alarm information is constructed from the data set collected by each sensor in the server at the alarm moment; fault prediction data is calculated from that graph structure, and fault prediction is then performed on the target component according to the fault prediction data. Because the prediction process jointly considers how each kind of sensor data changes at an alarm moment, it can ensure prediction accuracy and achieve accurate prediction of server component faults at a future moment.

A fault prediction apparatus provided in the embodiments of the present invention is described below; the fault prediction apparatus described below and the other embodiments described herein may be cross-referenced.

Referring to fig. 6, an embodiment of the present invention discloses a failure prediction apparatus, including:

the acquisition module 601, configured to acquire the data set collected by each sensor in the server at the alarm moment of a target component in the server;

the creation module 602, configured to create an initial graph structure of the alarm information at the alarm moment, with the target component as the central node and each sensor as an edge node connected to the central node;

a marking module 603, configured to mark attribute information for each edge node in the initial graph structure according to the data set, so as to obtain a real-time graph structure; the attribute information of each edge node comprises a data change trend, the difference between the data and a preset threshold value and the association degree of the data and alarm information;

and the prediction module 604, configured to calculate fault prediction data according to the attribute information of each edge node in the real-time graph structure and perform fault prediction on the target component according to the fault prediction data.

In one embodiment, the marking module includes:

the traversal unit, configured to traverse the initial graph structure and, if any edge node in the initial graph structure does not exist in the history graph structure, newly create that edge node in the history graph structure to obtain the graph structure to be marked;
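The traversal that produces the graph structure to be marked can be sketched as follows, assuming graphs are hypothetical dicts from edge-node name to attribute dict. History nodes absent from the initial graph are kept, since the text only describes adding new nodes; each node is flagged as new or not so that the two marking paths can be told apart.

```python
from typing import Dict, Tuple


def graph_to_be_marked(initial_edges: Dict[str, dict],
                       history_edges: Dict[str, dict]) -> Dict[str, Tuple[dict, bool]]:
    """Return {node: (attributes, is_newly_created)}."""
    # Start from the history graph; existing nodes carry their history attributes.
    merged: Dict[str, Tuple[dict, bool]] = {
        name: (dict(attrs), False) for name, attrs in history_edges.items()
    }
    for name in initial_edges:
        if name not in history_edges:
            # Newly created node: attributes are marked later from real-time data.
            merged[name] = ({}, True)
    return merged


print(graph_to_be_marked({"cpu_temp": {}, "psu_temp": {}}, {"cpu_temp": {"association": 4}}))
# → {'cpu_temp': ({'association': 4}, False), 'psu_temp': ({}, True)}
```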

In one embodiment, the traversal unit is further configured to:

In one embodiment, the marking unit is specifically configured to:

for each non-newly-created edge node in the graph structure to be marked, query the data set for the real-time sensor data corresponding to that node;

and, in the graph structure to be marked, update the attribute information of the non-newly-created edge node according to its history attribute information in the history graph structure and the real-time sensor data.

In one embodiment, the marking unit is specifically configured to:

in the graph structure to be marked, determine the data change trend and the difference of the data from the preset threshold in the attribute information of the non-newly-created edge node according to the history attribute information and the real-time sensor data;

and increase the association degree in the history attribute information according to a preset first rule, and add the increased association degree to the attribute information of the non-newly-created edge node in the graph structure to be marked.

In one embodiment, the marking unit is specifically configured to:

for each newly created edge node in the graph structure to be marked, query the data set for the real-time sensor data corresponding to that node;

and, in the graph structure to be marked, update the attribute information of the newly created edge node according to the real-time sensor data.

In one embodiment, the marking unit is specifically configured to:

in the graph structure to be marked, determine the data change trend and the difference of the data from the preset threshold in the attribute information of the newly created edge node according to the real-time sensor data;

In one embodiment, the history graph structure generation module is configured to obtain the history graph structure from historical sensor data by:

acquiring the historical sensor data;

and taking the target component as the central node and each sensor corresponding to the historical sensor data as an edge node connected to the central node, and marking attribute information for each edge node according to the historical sensor data, to obtain the history graph structure.

In one embodiment, the apparatus further comprises:

In one embodiment, the pruning module is specifically configured to:

the first N edge nodes in the node sequence are retained in the real-time graph structure.

In one embodiment, the pruning module is specifically configured to:

arrange all edge nodes in the real-time graph structure in ascending order of association degree to obtain a node sequence;

the last N edge nodes in the node sequence are retained in the real-time graph structure.
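Both pruning embodiments end up retaining the N edge nodes with the highest association degree. A sketch of the ascending-order variant, with edge nodes assumed to be a hypothetical mapping from name to association degree:

```python
from typing import Dict


def keep_top_n(edges: Dict[str, float], n: int) -> Dict[str, float]:
    """Sort edge nodes in ascending order of association degree and keep the last N."""
    ordered = sorted(edges.items(), key=lambda item: item[1])  # ascending
    kept = ordered[-n:] if n > 0 else []  # guard: ordered[-0:] would keep everything
    return dict(kept)


print(keep_top_n({"cpu_temp": 4, "inlet_temp": 2, "fan_speed": 3, "psu_temp": 1}, 2))
# → {'fan_speed': 3, 'cpu_temp': 4}
```

Sorting in descending order and keeping the first N would give the same set of nodes.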

In one embodiment, the prediction module is specifically configured to:

calculate the predicted value of each edge node in the real-time graph structure according to its attribute information, and sum all the predicted values to obtain the fault prediction data.

In one embodiment, the prediction module is specifically configured to:

calculate the fault prediction data according to a target formula: $S = [M_1 \times a + N_1 \times (1-a)] \times Q_1 + \cdots + [M_i \times a + N_i \times (1-a)] \times Q_i + \cdots + [M_L \times a + N_L \times (1-a)] \times Q_L = \sum_{i=1}^{L} [M_i \times a + N_i \times (1-a)] \times Q_i$;

where $S$ is the fault prediction data; $M_i$ is the data change trend in the attribute information of edge node $i$ in the real-time graph structure; $a$ is a preset coefficient; $N_i$ is the difference between the data and the preset threshold in the attribute information of edge node $i$; $Q_i$ is the association degree in the attribute information of edge node $i$; $i = 1, 2, 3, \ldots, L$; and $L$ is the total number of edge nodes in the real-time graph structure.

In one embodiment, the apparatus further comprises:

In one embodiment, the data superposition module is specifically configured to:

compute the weight proportion of the shared edge nodes in the other real-time graph structure;

and calculate the product of the weight proportion and the fault prediction data corresponding to the other real-time graph structure, and take the product as the additional prediction data.

In one embodiment, the data superposition module is specifically configured to:

use the ratio of the number of shared edge nodes to the total number of edge nodes in the other real-time graph structure as the weight proportion;

or

use the sum of the association degrees of the shared edge nodes in the other real-time graph structure as the weight proportion.

In one embodiment, the prediction module is specifically configured to:

determine the data interval to which the fault prediction data belongs;

and calculate the failure occurrence probability interval of the target component based on that data interval.

For the more specific working process of each module and unit in this embodiment, reference may be made to the corresponding content disclosed in the foregoing embodiments; it is not described again here.

It can be seen that this embodiment provides a fault prediction apparatus capable of predicting whether each component in a server will fail at a future moment.

An electronic device provided in the embodiments of the present invention is described below; the electronic device described below and the other embodiments described herein may be cross-referenced.

Referring to fig. 7, an embodiment of the present invention discloses an electronic device, including:

a memory 701 for storing a computer program;

a processor 702 for executing the computer program to implement the method disclosed in any of the embodiments above.

Further, the embodiment of the invention also provides electronic equipment. The electronic device may be the server 50 shown in fig. 8 or the terminal 60 shown in fig. 9. Fig. 8 and 9 are each a block diagram of an electronic device according to an exemplary embodiment, and the contents of the drawings should not be construed as any limitation on the scope of use of the present invention.

Fig. 8 is a schematic structural diagram of a server according to an embodiment of the present invention. The server 50 may specifically include at least one processor 51, at least one memory 52, a power supply 53, a communication interface 54, an input/output interface 55, and a communication bus 56. The memory 52 stores a computer program that is loaded and executed by the processor 51 to implement the relevant steps of the fault prediction method disclosed in any of the foregoing embodiments.

In this embodiment, the power supply 53 is configured to provide an operating voltage for each hardware device on the server 50; the communication interface 54 can create a data transmission channel between the server 50 and an external device, and the communication protocol to be followed is any communication protocol applicable to the technical solution of the present invention, which is not specifically limited herein; the input/output interface 55 is used for acquiring external input data or outputting external output data, and the specific interface type thereof may be selected according to the specific application needs, which is not limited herein.

The memory 52 may be a carrier for storing resources, such as a read-only memory, a random access memory, a magnetic disk, or an optical disk, and the resources stored thereon include an operating system 521, a computer program 522, and data 523, and the storage may be temporary storage or permanent storage.

The operating system 521 manages and controls the hardware devices on the server 50 and the computer program 522, enabling the processor 51 to operate on and process the data 523 in the memory 52; it may be Windows Server, NetWare, Unix, Linux, etc. In addition to the computer program for performing the fault prediction method disclosed in any of the foregoing embodiments, the computer program 522 may further include computer programs for performing other specific tasks. The data 523 may include data such as application program developer information in addition to data such as application program update information.

Fig. 9 is a schematic structural diagram of a terminal according to an embodiment of the present invention, and the terminal 60 may specifically include, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, or the like.

Generally, the terminal 60 in this embodiment includes: a processor 61 and a memory 62.

The processor 61 may include one or more processing cores, such as a 4-core or an 8-core processor. The processor 61 may be implemented in at least one hardware form among DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array) and PLA (Programmable Logic Array). The processor 61 may also include a main processor and a coprocessor: the main processor, also called a CPU (Central Processing Unit), processes data in the awake state, and the coprocessor is a low-power processor that processes data in the standby state. In some embodiments, the processor 61 may integrate a GPU (Graphics Processing Unit) for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 61 may also include an AI (Artificial Intelligence) processor for computing operations related to machine learning.

The memory 62 may include one or more computer-readable storage media, which may be non-transitory. The memory 62 may also include high-speed random access memory as well as non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In this embodiment, the memory 62 is used at least to store a computer program 621 which, when loaded and executed by the processor 61, implements the relevant steps of the fault prediction method executed on the terminal side as disclosed in any of the foregoing embodiments. In addition, the resources stored in the memory 62 may also include an operating system 622, data 623 and the like, stored either transiently or permanently. The operating system 622 may include Windows, Unix, Linux, among others. The data 623 may include, but is not limited to, update information of the application.

In some embodiments, the terminal 60 may further include a display 63, an input-output interface 64, a communication interface 65, a sensor 66, a power supply 67, and a communication bus 68.

Those skilled in the art will appreciate that the structure shown in fig. 9 is not limiting of the terminal 60 and may include more or fewer components than shown.

A readable storage medium provided in the embodiments of the present invention is described below; the readable storage medium described below and the other embodiments described herein may be cross-referenced.

A readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the fault prediction method disclosed in the foregoing embodiments. The readable storage medium is a computer readable storage medium, and can be used as a carrier for storing resources, such as read-only memory, random access memory, magnetic disk or optical disk, wherein the resources stored on the readable storage medium comprise an operating system, a computer program, data and the like, and the storage mode can be transient storage or permanent storage.

In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, so that the same or similar parts between the embodiments are referred to each other.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of readable storage medium known in the art.

The principles and embodiments of the present invention have been described herein with reference to specific examples, the description of which is intended only to assist in understanding the methods of the present invention and the core ideas thereof; meanwhile, as those skilled in the art will have variations in the specific embodiments and application scope in accordance with the ideas of the present invention, the present description should not be construed as limiting the present invention in view of the above.

Claims

1. A method of fault prediction, comprising:

calculating fault prediction data according to attribute information of each edge node in the real-time graph structure, and carrying out fault prediction on the target component according to the fault prediction data;

Wherein the calculating fault prediction data according to the attribute information of each edge node in the real-time graph structure includes:

2. The method according to claim 1, wherein marking attribute information for each edge node in the initial graph structure according to the dataset results in a real-time graph structure, comprising:

3. The method as recited in claim 2, further comprising:

4. The method according to claim 2, wherein the re-labeling attribute information for each edge node in the graph structure to be labeled according to the dataset comprises:

5. The method according to claim 4, wherein the updating, in the graph structure to be marked, the attribute information of the non-newly built edge node according to the history attribute information of the non-newly built edge node in the history graph structure and the real-time sensor data includes:

6. The method according to claim 2, wherein the re-labeling attribute information for each edge node in the graph structure to be labeled according to the dataset comprises:

7. The method according to claim 6, wherein updating the attribute information of the newly created edge node according to the real-time sensor data in the graph structure to be marked comprises:

8. The method of claim 2, wherein obtaining the history graph structure from the historical sensor data comprises:

acquiring the historical sensor data;

9. The method of claim 1, wherein before calculating the failure prediction data based on the attribute information of each edge node in the real-time graph structure, further comprising:

10. The method according to claim 9, wherein the pruning of edge nodes in the real-time graph structure according to the attribute information of each edge node in the real-time graph structure comprises:

11. The method according to claim 10, wherein deleting edge nodes in the real-time graph structure having a degree of association less than a threshold degree of association comprises:

12. The method according to claim 10, wherein deleting edge nodes in the real-time graph structure having a degree of association less than a threshold degree of association comprises:

13. The method according to claim 1, wherein calculating the predicted value of each edge node in the real-time graph structure according to the attribute information of each edge node in the real-time graph structure, and aggregating all the predicted values to obtain the fault prediction data includes:

14. The method according to any one of claims 1 to 13, further comprising:

15. The method as recited in claim 14, further comprising:

16. The method of claim 15, wherein said calculating additional prediction data from the merged graph structure comprises:

17. The method of claim 16, wherein computing the weight proportion of the same edge node in the other real-time graph structure comprises:

or

18. The method according to any one of claims 1 to 13, wherein said performing fault prediction on said target component from said fault prediction data comprises:

determining a data interval to which the fault prediction data belong;

19. A failure prediction apparatus, comprising:

the prediction module is used for calculating fault prediction data according to the attribute information of each edge node in the real-time graph structure and carrying out fault prediction on the target component according to the fault prediction data;

the prediction module is specifically configured to:

20. An electronic device, comprising:

a memory for storing a computer program;

a processor for executing the computer program to implement the method of any one of claims 1 to 18.

21. A readable storage medium for storing a computer program, wherein the computer program when executed by a processor implements the method of any one of claims 1 to 18.