CN110727553A

CN110727553A - Method and device for predicting and diagnosing faults of processor system

Info

Publication number: CN110727553A
Application number: CN201910980020.2A
Authority: CN
Inventors: 付宇卓; 刘婷; 戴宗哲; 吉学刚; 曹德明; 申子正
Original assignee: Shanghai Jiaotong University; Zhongtong Bus Holding Co Ltd
Current assignee: Shanghai Jiaotong University; Zhongtong Bus Holding Co Ltd
Priority date: 2019-10-15
Filing date: 2019-10-15
Publication date: 2020-01-24

Abstract

The invention provides a method and a device for predicting and diagnosing faults of a processor system, wherein the method for predicting and diagnosing faults of the processor system comprises the following steps: constructing a structured fault sample data set by using a data cleaning method; establishing a propagation relation network of the processor system fault by using a Bayesian network model; and cascading the Bayesian network model by using an LSTM network to predict and diagnose the fault of the processor system. The device for predicting and diagnosing the fault of the processor system is suitable for executing the method for predicting and diagnosing the fault of the processor system. The method can realize bidirectional probabilistic reasoning of 'fault source-system failure behavior' of the processor system in a data driving mode, and realize failure type prediction of the processor system fault in a model cascading mode.

Description

Method and device for predicting and diagnosing faults of processor system

Technical Field

The invention relates to the technical field of prediction and diagnosis of multi-component system faults, in particular to a method and a device for predicting and diagnosing processor system faults.

Background

In the aerospace field, processor systems with high operational stability are used for program control, data transmission and state monitoring. However, high-energy electrons in the space environment can accumulate on the surface of a device, and once the accumulation reaches a certain threshold it can cause logic inversion (bit flips). This leads to soft errors in the processor system and threatens its stable operation. Early health assessment of the processor system is therefore particularly important. Health assessment covers two aspects, fault diagnosis and fault prediction. Fault diagnosis means that, when the system signals a fault, the fault source is located by analyzing the fault characteristics of the current system. Fault prediction means that, before a fault has occurred, the type of fault that may occur is predicted from the current characteristics of the system's observation nodes.

At present, fault diagnosis models are either based on signal feature analysis or data-driven. Signal feature analysis locates the fault source through threshold judgment after signal processing, whereas the data-driven approach trains a fault source locating model on historical data. Common fault prediction models are either model-based or data-driven: the model-based approach requires that the data model of the system under study be known, whereas the data-driven approach trains the fault prediction model on historical data. In the prior art, the fault diagnosis model and the fault prediction model are therefore usually separate, which is inconvenient to use and costly to maintain.

Further, regarding fault diagnosis, most current algorithms support fault diagnosis of only a single node. The fault types and fault causes of a processor system are relatively complex, and hierarchy and fault propagation exist among its components, so the cause is difficult to locate after a fault occurs. Regarding fault prediction, the dynamic nature of the time series calls for a dynamic time series algorithm; however, the intermediate-time states of the processor system cannot be obtained, so fault prediction for the processor system has to be realized in an indirect manner.

Therefore, how to provide a multi-node fault diagnosis algorithm becomes a problem to be solved urgently by those skilled in the art.

It is noted that the information disclosed in this background section is only for enhancement of understanding of the general background of the invention and should not be taken as an acknowledgement or any form of suggestion that this information forms the prior art already known to a person skilled in the art.

Disclosure of Invention

It is a first object of the present invention to provide a method for predicting and diagnosing processor system faults and a second object of the present invention to provide an apparatus for predicting and diagnosing processor system faults.

In order to achieve the first purpose of the invention, the invention is realized by the following technical scheme: a method for predicting and diagnosing processor system faults comprises the following steps:

S100: constructing a structured fault sample data set by using a data cleaning method;

S200: establishing a propagation relation network of the processor system fault by using a Bayesian network model;

S300: cascading the Bayesian network model with an LSTM network to predict and diagnose the fault of the processor system.

Optionally, before step S100 is performed, a fault injection point and an observation node are determined, a fault is injected, and fault time series data are collected to obtain a fault sample initial data set.

Optionally, in step S100, constructing the structured fault sample data set by using a data cleaning method includes:

standardizing the fault sample initial data set according to a preset standardization rule; and

labeling the failure behavior of the processor system by using the standardized fault sample initial data set, and constructing the structured fault sample data set consisting of a fault source, an observation node and a failure label.

Optionally, the preset standardization rule includes filtering of abnormal values and constant values of the fault time series data, contraction of the characteristic value interval, and observation node abnormality judgment.

Optionally, the observation node abnormality judgment includes calculating the mean Euclidean distance between the fault time series data and reference time series data into which no fault was injected, and using it as a similarity measure to judge whether the observation node is abnormal.

Optionally, the system failure types used in the failure behavior labeling include no impact, error output, timeout, and CPU exception.

Optionally, in step S200, establishing the propagation relation network of the processor system fault by using the Bayesian network model includes:

S210: dividing the structured fault sample data set into a first training set and a first test set;

S220: establishing an initial Bayesian network model by using the structured fault sample data set;

S230: initializing a conditional probability table, and training the initial Bayesian network model by using the first training set to obtain a trained conditional probability table;

S240: using the first test set to evaluate the prediction and diagnosis accuracy of the trained initial Bayesian network model, and continuing to train it according to that accuracy to obtain the Bayesian network model.

Optionally, prediction and diagnosis with the trained Bayesian network model include:

predicting the failure behavior of the processor system from the fault source by forward reasoning over the trained Bayesian network model, and diagnosing the fault source from the failure behavior of the processor system by reverse reasoning over the trained Bayesian network model.

Optionally, cascading the Bayesian network model with the LSTM network to predict the fault of the processor system includes:

S310: dividing the structured fault sample data set into a second training set and a second test set;

S320: normalizing the second training set and the second test set so that their value ranges fall within a set threshold range;

S330: adjusting the time step, and training the LSTM network by using the second training set and the second test set until the prediction accuracy of the LSTM network saturates;

S340: inversely normalizing the predicted values so that they return to the original value range;

S350: performing observation node abnormality judgment on the predicted values, and then cascading the Bayesian network to predict the fault of the processor system.

To achieve the second object, the present invention further provides an apparatus for predicting and diagnosing processor system faults, comprising:

a fault injection unit, configured to construct a structured fault sample data set by using a data cleaning method;

a model construction unit, configured to establish a propagation relation network of the processor system fault by using a Bayesian network model; and

a prediction diagnosis unit, configured to cascade the Bayesian network model with an LSTM network to predict and diagnose faults of the processor system.

Compared with the prior art, the method and the device for diagnosing and predicting the fault of the processor system, provided by the invention, are not only suitable for diagnosing the fault of the processor system, but also suitable for predicting the fault of the processor system, and further have the following beneficial effects:

1. Large-scale data driving makes the training of the Bayesian network more objective and realizes bidirectional probabilistic reasoning of "fault source-system failure behavior" for the processor system, so that both diagnosis and prediction of processor system faults can be realized.

2. The LSTM network, a type of recurrent neural network, predicts the time series values of the observation nodes, and the system failure type is then predicted by cascading the Bayesian network.

3. The fault sample data set is obtained by injecting faults and collecting fault time series data, which is more objective: the hierarchy among the components of the processor system and the fault propagation need not be considered, and prediction and diagnosis do not rely on human experience with processor system faults.

Furthermore, because the fault sample data set is obtained by injecting faults and collecting fault time series data, and the subsequent processing no longer relies on human experience, the method and the device are not limited to processor systems. When the fault sample data set comes from a general complex system, they can be extended to that system, that is, a system consisting of a plurality of components in which the components are locally observable and the associations between components are dynamic and uncertain.

Drawings

FIG. 1 is a flow chart of a method for predicting and diagnosing processor system faults according to an embodiment of the present invention;

FIG. 2 is a flowchart illustrating steps for constructing structured fault sample data using a data cleaning method according to an embodiment of the present invention;

FIG. 3 is a flowchart illustrating steps for establishing a propagation relationship network for a processor system fault using a Bayesian network model, in accordance with an embodiment of the present invention;

FIG. 4 is a flowchart illustrating steps for predicting processor system failure behavior using an LSTM network and a Bayesian network cascade according to an embodiment of the present invention;

FIG. 5 is a block diagram of an apparatus for predicting and diagnosing processor system faults according to an embodiment of the present invention;

wherein the reference numerals are as follows:

100-fault injection unit, 200-model building unit, 300-prediction diagnosis unit.

Detailed Description

Before specifically describing the method and apparatus for predicting and diagnosing processor system faults provided by the present invention, a brief description will be given of the core idea of the present invention. Through intensive research and study, the inventor finds that the fault diagnosis and prediction method of the processor system in the prior art has the following problems:

Problem 1: the systems used for diagnosing and for predicting processor system faults are independent of each other, which makes them inconvenient to maintain and use.

Problem 2: most diagnosis methods support fault diagnosis of only a single node. The fault types and failure mechanisms of the processor system are relatively complex, and hierarchy and fault propagation exist among components, so accurate locating relies heavily on human experience.

Problem 3: for fault prediction, the dynamics of the time series are not taken into account.

Based on this, the invention provides a method for predicting and diagnosing processor system faults, which realizes bidirectional probabilistic reasoning of the fault source-system failure behavior of the processor system in a data-driven manner and realizes failure type prediction for the processor system through model cascading.

Meanwhile, the invention also provides a device for predicting and diagnosing the faults of the processor system, which is used for implementing the method for predicting and diagnosing the faults of the processor system.

To make the objects, advantages and features of the present invention more apparent, the method and the apparatus for predicting and diagnosing processor system faults according to the present invention are described in detail below with reference to the accompanying drawings. It is to be noted that the drawings are in a very simplified form and are not drawn to precise scale; they serve only to illustrate the embodiments of the present invention conveniently and clearly. The drawings may also present a somewhat simplified representation of various features that illustrate the basic principles of the invention. Specific design features of the invention disclosed herein, including, for example, specific dimensions, orientations, locations, and configurations, will be determined in part by the particular intended application and use environment.

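In one embodiment of the present invention, a method for diagnosing and predicting a fault of a processor system is provided. Referring to FIG. 1, the method includes the following steps:

S100: constructing a structured fault sample data set by using a data cleaning method;

S200: establishing a propagation relation network of the processor system fault by using a Bayesian network model;

S300: cascading the Bayesian network model with an LSTM network to predict and diagnose the fault of the processor system.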

Further, before step S100 is executed, a fault injection point and an observation node are determined, a fault is injected, and fault time series data are collected to obtain a fault sample initial data set. The invention does not limit the specific means of obtaining the fault sample initial data set, nor the method of selecting the fault injection point and the observation node. In this embodiment, a system simulation platform is used to model the critical components and functions of the processor system, and an automatic fault injection and time series acquisition method is then used to obtain the fault sample initial data set. In other embodiments, non-automated fault injection and time series acquisition methods may also be employed.

Preferably, a full-system simulation platform is used for acquisition, including but not limited to Simics and gem5. Simics is a full-system simulation technology that allows software and system developers, architects and test engineers to build and use virtual systems for various purposes, or to create multiple interconnected virtual systems. gem5 is a modular, discrete-event-driven full-system simulator that combines the best aspects of M5 (a multiprocessor simulator) and GEMS (a memory hierarchy simulator); it is a highly configurable architecture simulator that integrates multiple ISAs and multiple CPU models. gem5 supports a variety of commercial ISAs, including x86, ARM, Alpha, MIPS, Power and SPARC, can load the Linux operating system on x86, ARM and Alpha, and is a true full-system computer architecture simulation tool.

Similarly, the fault injection point includes but is not limited to the CPU, memory and external devices. Preferably, the observation nodes are set according to the fault injection point and the actual requirements. For example, all accessible attributes of the CPU can be used as observation nodes, including but not limited to registers, interrupt status and other time-related attributes.
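The automated fault injection and time series acquisition described above can be pictured with a toy harness. The following is a minimal sketch only and is not tied to Simics or gem5: the accumulator "register", the single bit-flip fault model, and the names `flip_bit`, `run_trace` and `collect_samples` are hypothetical stand-ins for a real simulation platform.

```python
import random

def flip_bit(value, bit, width=32):
    """Invert one bit of an unsigned register value (assumed bit-flip fault model)."""
    return (value ^ (1 << bit)) & ((1 << width) - 1)

def run_trace(steps, inject_at=None, bit=0):
    """Run a toy accumulator workload and sample its observation node at each step.

    inject_at: time step at which a bit flip is injected into the accumulator;
               None yields the fault-free reference trace.
    """
    acc = 0
    trace = []
    for t in range(steps):
        acc = (acc + t) & 0xFFFFFFFF   # stand-in for real program behavior
        if t == inject_at:
            acc = flip_bit(acc, bit)   # inject the fault
        trace.append(acc)              # sample the observation node
    return trace

def collect_samples(steps, n_faults, seed=0):
    """Collect (fault source, trace) pairs for randomly chosen injection points."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_faults):
        t, bit = rng.randrange(steps), rng.randrange(32)
        samples.append(({"time": t, "bit": bit}, run_trace(steps, t, bit)))
    return samples

reference = run_trace(8)                   # fault-free reference time series
faulty = run_trace(8, inject_at=3, bit=4)  # same workload with one injected fault
```

The reference trace and the faulty trace agree up to the injection step and diverge afterwards, which is the kind of raw material the later cleaning steps operate on.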

In this embodiment, referring to fig. 2, in step S100, the specific steps of constructing a structured fault sample data set by using a data cleaning manner are as follows:

S110: formatting the fault sample initial data set. Specifically, to suit the later LSTM network prediction, abnormal and duplicate fault time series data in the fault sample initial data set are removed to obtain a first fault sample initial data set.

The first fault sample initial data set is then standardized according to a preset standardization rule. The preset standardization rule comprises filtering of abnormal values and constant values of the fault time series data, characteristic value interval contraction, and observation node abnormality judgment. The specific steps are as follows:

S120: filtering non-numerical and constant data out of the formatted first fault sample initial data set. Specifically, non-numerical data are the fault time series data of those observation nodes that are difficult to quantify in subsequent processing, and constant data are the fault time series data of those observation nodes whose values are constant.

S130: contracting the characteristic value interval. The values of the fault time series data are distributed over a wide range. To suit the later LSTM prediction and to prevent the predicted values from differing by orders of magnitude because of this spread, characteristic value interval contraction is applied to the first fault sample initial data set so that the values of the fault time series data fall within a preset interval.

S140: judging whether the observation node is abnormal. Specifically, the mean Euclidean distance between the fault time series data and the reference time series data without fault injection is calculated as a similarity measure to judge whether the observation node is abnormal.

S150: labeling the failure behavior of the processor system by using the standardized fault sample initial data set, and constructing the structured fault sample data set consisting of a fault source, an observation node and a failure label. The system failure types used for labeling include no impact, error output, timeout and CPU exception. In practical applications the distribution of system failure types is unbalanced; for example, the proportion of no impact is relatively large and the proportion of CPU exception is relatively small.

It should be apparent to those skilled in the art that the above steps merely illustrate the preferred embodiment and are not limiting, and that in practice some steps may be omitted. For example, if the fault sample initial data set contains no abnormal, invalid or otherwise unexpected data, step S110 may be omitted.
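Steps S120 to S150 can be sketched as follows. This is an illustrative reading rather than the embodiment's actual implementation: min-max scaling is assumed for the interval contraction, the per-sample Euclidean distance is taken on scalar samples, and the threshold value and all function names are hypothetical.

```python
import math

def is_numeric(x):
    """True for finite ints/floats; booleans and strings do not count as numerical."""
    return isinstance(x, (int, float)) and not isinstance(x, bool) and math.isfinite(x)

def filter_nodes(series_by_node):
    """S120: drop observation nodes whose series is non-numerical or constant."""
    kept = {}
    for node, series in series_by_node.items():
        if not all(is_numeric(v) for v in series):
            continue                      # non-numerical type
        if len(set(series)) <= 1:
            continue                      # constant type
        kept[node] = list(series)
    return kept

def contract(series, lo=0.0, hi=1.0):
    """S130: map values linearly into [lo, hi] (min-max scaling is an assumption)."""
    s_min, s_max = min(series), max(series)
    span = (s_max - s_min) or 1.0
    return [lo + (v - s_min) * (hi - lo) / span for v in series]

def mean_euclidean_distance(fault, reference):
    """S140: average per-sample distance; for scalars this is the absolute difference."""
    n = min(len(fault), len(reference))
    return sum(abs(f - r) for f, r in zip(fault, reference)) / n

def is_abnormal(fault, reference, threshold=0.05):
    """Judge a node abnormal when its series deviates from the reference (threshold assumed)."""
    return mean_euclidean_distance(fault, reference) > threshold

def build_record(fault_source, fault_by_node, ref_by_node, failure_label):
    """S150: one structured sample = fault source + node abnormality flags + failure label."""
    flags = {node: is_abnormal(fault_by_node[node], ref_by_node[node])
             for node in fault_by_node}
    return {"fault_source": fault_source, "nodes": flags, "label": failure_label}

raw = {"r0": [1, 2, 3, 4], "mode": ["usr", "usr", "svc", "usr"], "zero": [0, 0, 0, 0]}
cleaned = filter_nodes(raw)               # only "r0" survives the filter
record = build_record("cpu.r0", {"r0": [0.0, 0.5, 0.5]}, {"r0": [0.0, 0.5, 1.0]},
                      "error_output")
```

Each record pairs a fault source with per-node abnormality flags and a failure label, which matches the three-part structure of the structured fault sample data set.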

Further, in step S200, a propagation relation network of the processor system fault is established by using the Bayesian network model. A Bayesian network is a probabilistic graphical model used for reasoning under uncertainty. It is based on the Bayesian formula and represents the influence relationships among observation nodes as a directed graph. It can therefore explain the propagation path of a fault, predict faults by forward reasoning, diagnose faults by reverse reasoning, and handle uncertainty. For example, when a result has occurred, the probability of each possible cause can be calculated with the Bayesian formula. Training a Bayesian network comprises two parts: topology learning and conditional probability table learning. After the network topology is determined, the parameters are learned by Bayesian estimation, that is, the conditional probability table is initialized and then updated iteratively with the samples.
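The forward and reverse reasoning described here can be illustrated on a very small network. The sketch below assumes a hypothetical three-node chain (fault source, one observation node, failure type) with made-up probabilities, and performs inference by plain enumeration; a real propagation network would have many more nodes.

```python
# Hypothetical toy network: fault_source -> node_abnormal -> failure_type.
P_SOURCE = {"cpu": 0.5, "memory": 0.5}          # prior over fault sources (assumed)
P_NODE = {                                       # P(node_abnormal | fault_source)
    "cpu": {True: 0.8, False: 0.2},
    "memory": {True: 0.3, False: 0.7},
}
P_FAIL = {                                       # P(failure_type | node_abnormal)
    True: {"no_impact": 0.2, "error_output": 0.4, "timeout": 0.2, "cpu_exception": 0.2},
    False: {"no_impact": 0.9, "error_output": 0.05, "timeout": 0.04, "cpu_exception": 0.01},
}

def joint(source, abnormal, failure):
    """Joint probability of one full assignment, following the chain factorization."""
    return P_SOURCE[source] * P_NODE[source][abnormal] * P_FAIL[abnormal][failure]

def predict_failure(source):
    """Forward reasoning: P(failure_type | fault_source), summing out the node state."""
    return {f: sum(P_NODE[source][a] * P_FAIL[a][f] for a in (True, False))
            for f in P_FAIL[True]}

def diagnose_source(failure):
    """Reverse reasoning: P(fault_source | failure_type) via the Bayesian formula."""
    unnorm = {s: sum(joint(s, a, failure) for a in (True, False)) for s in P_SOURCE}
    total = sum(unnorm.values())
    return {s: p / total for s, p in unnorm.items()}

forward = predict_failure("cpu")
posterior = diagnose_source("error_output")
```

With these assumed numbers, observing an error output makes the CPU the more probable fault source, which is the "result to cause" calculation mentioned in the text.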

In this embodiment, referring to fig. 3, the establishing a propagation relationship network of a processor system fault by using a bayesian network model includes the following steps:

S210: dividing the structured fault sample data set into a first training set and a first test set. Specifically, because the failure types are unbalanced, the structured fault sample data set is divided per failure type so that the performance of the model can be evaluated accurately: preferably, 80% of the samples of each failure type are randomly selected as the first training set, and the remaining 20% form the first test set. After the network model has been trained, the prediction accuracy for each failure type can be calculated on the first training set and on the first test set respectively.

S220: establishing an initial Bayesian network model by using the structured fault sample data set. Specifically, the Bayesian network topology can be defined semi-automatically, by combining manual constraints with sample learning. The manual constraints include adding, based on experience, the observation nodes in which anomalies are caused during certain fault injection processes.

S230: initializing a conditional probability table, and training the initial Bayesian network model by using the first training set to obtain the trained conditional probability table. After the network topology is defined, parameter learning of the initial Bayesian network model is performed. Specifically, the conditional probability table is first initialized and then updated iteratively through sample learning on the first training set. After the entire first training set has been traversed, the corresponding trained conditional probability table is output.

S240: using the first test set to evaluate the prediction and diagnosis accuracy of the trained initial Bayesian network model, and continuing to train it according to that accuracy to obtain the Bayesian network model. Specifically, steps S210 and S230 are repeated a preset number of times, the prediction accuracy on the first test set is recorded after each training of the Bayesian network model, and the Bayesian network model with the highest prediction accuracy is selected as the optimal Bayesian network model. Once the optimal Bayesian network model is obtained, when a failure type appears at the final system state node, the model can be traced backward to find where the corresponding fault occurred. Specifically, prediction and diagnosis with the trained Bayesian network model comprise diagnosing the fault source from the failure behavior of the processor system by reverse reasoning, and predicting the failure behavior of the processor system from the fault source by forward reasoning.
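The split, train, evaluate and select loop of steps S210 to S240 can be sketched as below. To keep the example short, the network is collapsed to a single conditional probability table P(failure label | fault source), and Bayesian estimation is approximated by initializing every count with a uniform pseudo-count `alpha`; both simplifications, and all function names, are assumptions of this sketch.

```python
import random
from collections import defaultdict

LABELS = ["no_impact", "error_output", "timeout", "cpu_exception"]

def stratified_split(samples, train_frac=0.8, seed=0):
    """S210: randomly take train_frac of each failure type for training, the rest for test."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for s in samples:
        by_label[s["label"]].append(s)
    train, test = [], []
    for group in by_label.values():
        group = group[:]
        rng.shuffle(group)
        cut = int(round(train_frac * len(group)))
        train += group[:cut]
        test += group[cut:]
    return train, test

def train_cpt(train, alpha=1.0):
    """S230: initialize counts with a prior, update them sample by sample, then normalize."""
    counts = defaultdict(lambda: {l: alpha for l in LABELS})
    for s in train:
        counts[s["fault_source"]][s["label"]] += 1
    return {src: {l: c / sum(row.values()) for l, c in row.items()}
            for src, row in counts.items()}

def predict(cpt, source):
    """Forward reasoning on the collapsed model: most probable failure label for a source."""
    row = cpt.get(source)
    return max(row, key=row.get) if row else LABELS[0]

def accuracy(cpt, data):
    """Fraction of samples whose predicted label matches the recorded label."""
    return sum(predict(cpt, s["fault_source"]) == s["label"] for s in data) / len(data)

def select_best(samples, repeats=5):
    """S240: repeat split and training, keep the model with the highest test accuracy."""
    best = None
    for seed in range(repeats):
        train, test = stratified_split(samples, seed=seed)
        cpt = train_cpt(train)
        acc = accuracy(cpt, test)
        if best is None or acc > best[0]:
            best = (acc, cpt)
    return best

samples = ([{"fault_source": "cpu", "label": "cpu_exception"}] * 10
           + [{"fault_source": "memory", "label": "no_impact"}] * 40)
best_acc, best_cpt = select_best(samples)
```

Splitting per failure type keeps rare types such as CPU exception represented in both sets, which is the reason the text gives for not splitting the unbalanced data set as a whole.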

Preferably, predicting the failure behavior of the processor system from the fault source by forward reasoning over the trained Bayesian network model comprises cascading the Bayesian network model with an LSTM (Long Short-Term Memory) network to predict the fault of the processor system. The LSTM network can predict the state values of multiple observation nodes and retains long-term memory.
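To make the "long-term memory" property concrete, the following sketch implements one time step of a standard LSTM cell in plain Python. It shows the textbook gate equations only; the weights are hand-set for illustration rather than trained, and nothing here reflects the specific network architecture used in the embodiment.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(weights, bias, vec):
    """Return weights @ vec + bias for plain Python lists."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]

def lstm_step(x, h, c, params):
    """One LSTM time step; params maps a gate name to its (weights, bias) pair."""
    z = h + x  # concatenate previous hidden state and current input
    i = [sigmoid(v) for v in affine(*params["input"], z)]        # input gate
    f = [sigmoid(v) for v in affine(*params["forget"], z)]       # forget gate
    o = [sigmoid(v) for v in affine(*params["output"], z)]       # output gate
    g = [math.tanh(v) for v in affine(*params["candidate"], z)]  # candidate cell value
    c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
    return h_new, c_new

# One hidden unit and one input: each gate has a 1x2 weight matrix and a 1-element bias.
zero = {name: ([[0.0, 0.0]], [0.0]) for name in ("input", "forget", "output", "candidate")}
# Hand-set biases: forget gate almost fully open, input gate almost fully closed.
keep = dict(zero, forget=([[0.0, 0.0]], [10.0]), input=([[0.0, 0.0]], [-10.0]))

h, c = [0.0], [1.0]
for x in [0.3, -0.7, 0.1]:
    h, c = lstm_step([x], h, c, keep)  # the cell state stays close to its initial value
```

With the forget gate near one and the input gate near zero, the cell state is carried across time steps almost unchanged, which is the mechanism that lets an LSTM retain information over long sequences.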

Specifically, referring to FIG. 4, cascading the Bayesian network model with the LSTM network to predict the fault of the processor system includes the following steps:

S310: dividing the structured fault sample data set into a second training set and a second test set. In this embodiment, the structured fault sample data set is divided randomly, with 80% used as the second training set and 20% as the second test set.

S320: normalizing the second training set and the second test set so that their value ranges fall within a set threshold range. In this embodiment, the set threshold range is the interval [0,1].

S330: adjusting the time step, and training the LSTM network by using the second training set and the second test set until the prediction accuracy of the LSTM network saturates. For a preset number of sampling points, the LSTM network performs fault time series prediction until the value of the observation node at the final moment of the processor system is predicted.

S340: inversely normalizing the predicted values so that they return to the original value range. Specifically, each time LSTM network training is completed, the inverse normalization operation is applied to the predicted values to restore them from the set threshold range to the original value range. In this embodiment, the set threshold range is the interval [0,1].

S350: performing observation node abnormality judgment on the predicted values, and then cascading the Bayesian network to predict the fault of the processor system. The predicted fault of the processor system includes the failure type at the final moment.
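The data handling around the LSTM in steps S320 to S340 can be sketched as follows. Min-max scaling into [0,1] is assumed for the normalization, and the sliding-window construction is one common way to expose the time step as a tunable parameter; the function names are hypothetical and the LSTM itself is left out.

```python
def fit_minmax(series):
    """Record the original value range so that it can be restored later."""
    return min(series), max(series)

def normalize(series, lo, hi):
    """S320: scale values into [0, 1] using the recorded range."""
    span = (hi - lo) or 1.0
    return [(v - lo) / span for v in series]

def denormalize(series, lo, hi):
    """S340: map values from [0, 1] back to the original value range."""
    span = (hi - lo) or 1.0
    return [v * span + lo for v in series]

def make_windows(series, time_step):
    """S330: build (input window, next value) pairs for a given time step."""
    return [(series[i:i + time_step], series[i + time_step])
            for i in range(len(series) - time_step)]

series = [10, 20, 30, 40, 50]
lo, hi = fit_minmax(series)
scaled = normalize(series, lo, hi)         # [0.0, 0.25, 0.5, 0.75, 1.0]
windows = make_windows(scaled, time_step=2)
restored = denormalize(scaled, lo, hi)     # back in the original value range
```

Changing `time_step` changes how much history each training pair carries, which is the knob that step S330 adjusts while training.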

By adjusting the preset number of sampling points during testing, an optimal balance between the number of sampling points used by the cascade model and the prediction accuracy is achieved, which further reduces the overhead of sampling the fault time series data of the observation nodes.
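The trade-off between the number of sampling points and the prediction accuracy can be illustrated with a toy cascade. In the sketch below a linear extrapolation stands in for the trained LSTM, and a two-row table stands in for the Bayesian network; the traces, threshold, probabilities and function names are all hypothetical.

```python
def extrapolate(prefix, horizon):
    """Stand-in for the trained LSTM: extend the last observed slope over the horizon."""
    slope = prefix[-1] - prefix[-2] if len(prefix) > 1 else 0
    return prefix[-1] + slope * horizon

# Stand-in for the Bayesian network: P(failure_type | observation node abnormal).
P_FAIL = {
    True: {"no_impact": 0.2, "error_output": 0.5, "timeout": 0.2, "cpu_exception": 0.1},
    False: {"no_impact": 0.9, "error_output": 0.05, "timeout": 0.04, "cpu_exception": 0.01},
}

def cascade_predict(prefix, total_len, reference_final, threshold=0.5):
    """Forecast the final node value, judge abnormality, then look up the failure type."""
    final = extrapolate(prefix, total_len - len(prefix))
    abnormal = abs(final - reference_final) > threshold
    dist = P_FAIL[abnormal]
    return max(dist, key=dist.get), final

reference = list(range(10))                   # fault-free trace 0, 1, ..., 9
fault = [0, 1, 2, 4, 6, 8, 10, 12, 14, 16]    # diverges from the reference after step 2

# Sweep the number of sampled points fed to the cascade.
results = {k: cascade_predict(fault[:k], len(fault), reference[-1])[0] for k in (3, 5, 7)}
```

With only three sampled points the divergence has not yet been observed and the cascade reports no impact, whereas five points already give the same answer as seven, so the extra sampling adds overhead without improving the prediction in this toy case.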

Referring to FIG. 5, yet another embodiment of the present invention provides an apparatus for predicting and diagnosing processor system faults, which is adapted to perform the method for predicting and diagnosing processor system faults provided in the foregoing embodiment and includes:

Fault injection unit 100: configured to construct a structured fault sample data set by using a data cleaning method.

Model construction unit 200: configured to establish a propagation relation network of the processor system fault by using a Bayesian network model.

Prediction diagnosis unit 300: configured to cascade the Bayesian network model with an LSTM network to predict and diagnose faults of the processor system.

In particular, the fault injection unit 100 provides a complete structured fault sample data set for data-driven fault analysis. The model construction unit 200 uses large-scale data driving with the structured fault sample data set, which makes the training of the Bayesian network more objective and realizes bidirectional probabilistic reasoning of fault source-system failure behavior. The prediction diagnosis unit 300 predicts the time series values of the observation nodes by using the LSTM network and predicts the system failure type by cascading the Bayesian network.
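The division of work among the three units can be pictured as a simple pipeline. The sketch below is a structural illustration only: the class name, the callable signatures and the trivial stand-in functions are hypothetical and do not describe how units 100, 200 and 300 are actually realized.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class FaultDiagnosisDevice:
    """Wires the three units together in the order in which data flows through them."""
    fault_injection_unit: Callable[[List[dict]], List[dict]]  # raw samples -> structured set
    model_construction_unit: Callable[[List[dict]], Dict]     # structured set -> model
    prediction_diagnosis_unit: Callable[[Dict, dict], str]    # (model, observation) -> type

    def run(self, raw_samples, observation):
        dataset = self.fault_injection_unit(raw_samples)
        model = self.model_construction_unit(dataset)
        return self.prediction_diagnosis_unit(model, observation)

# Trivial stand-ins so that the pipeline can be exercised end to end.
def clean(raw):
    return [s for s in raw if s.get("label")]

def build(dataset):
    return {s["fault_source"]: s["label"] for s in dataset}

def predict(model, observation):
    return model.get(observation["fault_source"], "no_impact")

device = FaultDiagnosisDevice(clean, build, predict)
raw = [{"fault_source": "cpu", "label": "timeout"},
       {"fault_source": "memory", "label": None}]
result = device.run(raw, {"fault_source": "cpu"})
```

Keeping each unit behind a plain callable means the cleaning, model construction and prediction stages can be replaced independently.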

Therefore, the method and the device for predicting and diagnosing processor system faults solve the problems of locating multiple fault sources and of dynamic time series prediction for the processor system. The method is based on Bayesian network theory from probabilistic graphical models and uses large-scale data driving to make the training of the Bayesian network more objective, thereby realizing bidirectional probabilistic reasoning of fault source-system failure behavior. Meanwhile, the LSTM network is used to predict the time series values of the observation nodes, and the Bayesian network is cascaded to predict the system failure type.

In summary, the above embodiments describe in detail various configurations of the method and apparatus for predicting and diagnosing processor system faults. It is to be understood that the above description covers only the preferred embodiments of the present invention and is not intended to limit the scope of the present invention.

Claims

1. A method for predicting and diagnosing processor system faults is characterized by comprising the following steps:

2. The method for predicting and diagnosing faults of a processor system according to claim 1, further comprising determining fault injection points and observation nodes, injecting faults and collecting fault timing data to obtain an initial data set of fault samples before performing step S100.

3. The method according to claim 2, wherein the step S100 of constructing the structured fault sample data set by using the data cleansing method comprises,

4. The method as claimed in claim 3, wherein the preset standardization rule includes filtering of non-numerical values and constant values of the fault time series data, characteristic value interval contraction, and observation node abnormality judgment.

5. The method of claim 4, wherein the observation node abnormality judgment includes calculating the mean Euclidean distance between the fault time series data and reference time series data without fault injection as a similarity measure, so as to judge whether the observation node is abnormal.

6. The method of claim 3, wherein the system failure types used in the failure behavior labeling include no impact, error output, timeout, and CPU exception.

7. The method for predicting and diagnosing faults of a processor system according to claim 1, wherein the step S200 of establishing a propagation relationship network of faults of the processor system by using a Bayesian network model comprises,

8. The method of predicting and diagnosing processor system faults according to claim 7, wherein the method of predicting and diagnosing the trained Bayesian network model includes,

9. The method of predicting and diagnosing a processor system fault as recited in claim 1, wherein cascading the Bayesian network model to predict the fault of the processor system using the LSTM network comprises,

10. An apparatus for predicting and diagnosing processor system faults, comprising: