CN116755964A - Fault prediction and health management system for reinforcement server

Info

Publication number: CN116755964A
Application number: CN202310698766.0A
Authority: CN
Inventors: 程智鹏; 刘宗宝; 刘更; 郭申; 闵新宇; 甄志伟
Original assignee: Beijing Institute of Computer Technology and Applications
Current assignee: Beijing Institute of Computer Technology and Applications
Priority date: 2023-06-13
Filing date: 2023-06-13
Publication date: 2023-09-15

Abstract

The invention relates to a fault prediction and health management system for a reinforcement server and belongs to the technical field of computer health management. The invention combines multi-source data acquisition based on IPMI and operating system kernel analysis, quantitative fault diagnosis based on a fault mode tree, health assessment based on multivariate system characteristics, and data-driven fault prediction. Together, these techniques sense the state of the software and hardware system of the reinforcement server, monitor the running condition of the equipment, diagnose and locate the fault type behind an abnormal system state through data monitoring and analysis, evaluate the running health of the system, and predict the occurrence of faults. The system thereby achieves autonomous diagnosis and autonomous support, greatly improving operation and maintenance efficiency as well as the safety and reliability of the system.

Description

Fault prediction and health management system for reinforcement server

Technical Field

The invention belongs to the technical field of computer health management, and particularly relates to a fault prediction and health management system of a reinforcement server.

Background

The reinforcement server serves as a comprehensive data computing and information processing server. It features high information processing speed and high reliability requirements, and is widely used in systems such as command and control and information support. The reinforcement server is an integrated system in which software and hardware are tightly coupled, and it places very strict requirements on system stability. As the functions and performance of reinforcement server systems continue to improve, the probability of faults and functional failures rises and the variety of fault types grows. How to effectively reduce the fault rate of the software and hardware system of the reinforcement server is therefore an important problem that urgently needs to be solved.

At present, reinforcement servers are maintained mainly in two ways: repair after a fault and scheduled maintenance. Under repair after a fault, equipment that fails is usually shut down for repair, so all maintenance and support measures are passive remedies triggered by the fault. Under scheduled maintenance, routine maintenance and minor, medium and major repairs are carried out at fixed intervals according to a predetermined maintenance schedule and strategy, regardless of the actual working condition of the equipment. These approaches suffer from long maintenance time, high maintenance cost and poorly targeted maintenance.

To address these problems, fault prediction and health management technology can automatically identify faults in the software and hardware systems of the reinforcement server, characterize the running health of the system, and support maintenance planning and decision-making. This allows the maintenance strategy to shift from schedule-driven "planned maintenance" and fault-driven "post maintenance" to condition-based maintenance founded on state monitoring and health management, thereby reducing the use and maintenance cost of the reinforcement server and improving the safety, integrity and mission success of its software and hardware systems.

Disclosure of Invention

(I) Technical problem to be solved

The technical problem to be solved by the invention is how to provide a fault prediction and health management system for a reinforcement server that overcomes the long maintenance time, high maintenance cost and poorly targeted maintenance of the prior art.

(II) Technical solution

In order to solve the above technical problem, the present invention provides a fault prediction and health management system for a reinforcement server, the system comprising: a data acquisition module, a data storage module, a fault mode tree module, a real-time monitoring module, a fault diagnosis module, a health evaluation module, a fault prediction module and a human-computer interaction interface;

the data acquisition module acquires multi-source data of the reinforcement server system and provides the data source for the other modules of the system, the acquired data being divided into out-of-band information and in-band information of the system;

the data storage module stores the collected multi-source system data in a database as a backup of the system state and provides historical data of system parameter states to the fault prediction module;

the fault mode tree module defines the fault types of the software and hardware systems of the reinforcement server, the parameters related to each fault type, and the parameter fault thresholds, the fault mode tree being constructed by analyzing typical fault information of the reinforcement server;

the real-time monitoring module monitors the state of each system parameter in real time against the parameter state thresholds in the fault mode tree; when a parameter state exceeds its threshold limit, the system is regarded as being in an abnormal condition, and the real-time monitoring module passes the abnormal parameter type to the fault diagnosis module and activates it to perform autonomous fault diagnosis;

the fault diagnosis module analyzes the collected abnormal parameter state data and uses a two-step judgment, consisting of a probabilistic judgment and a continuity judgment, to diagnose whether the abnormal state is a transient abnormal alarm or a fault;

the health evaluation module takes the multivariate system state information as its data basis and constructs parameter-level, component-level and system-level health evaluation models that reflect the current health state of the system; the parameter-level health evaluation model evaluates the running state of a single parameter and constructs a nonlinear evaluation function based on how far the parameter state deviates from the fault critical state; the component-level health evaluation model evaluates the health degree of each component of the system and constructs a health evaluation function for each component by aggregating the health degrees of its parameters using a weighting method; likewise, the system-level health evaluation model constructs a health evaluation function for the system by aggregating the health degrees of all components using a weighting method;

the fault prediction module predicts the future state of each parameter from historical time series data of system parameter states using a data-driven time series prediction model, and performs fault diagnosis on the predicted future states according to the fault mode tree so as to complete fault prediction; the fault prediction module uses an ARIMA model to construct the time series prediction model and updates the model automatically by means of online model updating;

the human-computer interaction interface displays the real-time state information, fault and alarm information, health degree and fault prediction results of the system, and also provides a configuration function for the system fault mode tree and a query function for the historical state of the system.

(III) Beneficial effects

The invention provides a fault prediction and health management system and method for a reinforcement server. The system and method combine multi-source data acquisition based on IPMI and operating system kernel analysis, quantitative fault diagnosis based on a fault mode tree, health assessment based on multivariate system characteristics, and data-driven fault prediction. Together, these techniques sense the state of the software and hardware system of the reinforcement server, monitor the running condition of the equipment, diagnose and locate the fault type behind an abnormal system state through data monitoring and analysis, evaluate the running health of the system, and predict the occurrence of faults. The system thereby achieves autonomous diagnosis and autonomous support, greatly improving operation and maintenance efficiency as well as the safety and reliability of the system.

Drawings

FIG. 1 is a block diagram of a fault prediction and health management system of the present invention;

FIG. 2 is a block flow diagram of a fault diagnosis module of the present invention;

FIG. 3 is a block flow diagram of a health assessment module of the present invention;

FIG. 4 is a block flow diagram of a fault prediction module of the present invention.

Detailed Description

To make the objects, contents and advantages of the present invention more apparent, the following detailed description of the present invention will be given with reference to the accompanying drawings and examples.

The invention relates to the technical field of computer health management, and in particular to a system and method for data acquisition, state monitoring, fault diagnosis, health assessment and fault prediction for a reinforcement server.

The invention aims to provide a fault prediction and health management method for a reinforcement server. The problem addressed is how to design a fault prediction and health management system that meets the reinforcement server's operation and maintenance requirements for autonomous support and autonomous diagnosis, and that provides functions such as fault diagnosis, health assessment and fault prediction.

The invention provides a fault prediction and health management system and method for a reinforcement server. The system targets the software and hardware systems of the reinforcement server, performs data acquisition, real-time monitoring, fault diagnosis, health evaluation and fault prediction, reports the running condition of the server's software and hardware systems, and provides the technical means needed for operating and maintaining the server.

As shown in FIG. 1, the fault prediction and health management system and method for a reinforcement server provided by the invention are divided into a data layer, a monitoring layer, an application layer and an interaction layer, and comprise a data acquisition module, a data storage module, a fault mode tree module, a real-time monitoring module, a fault diagnosis module, a health evaluation module, a fault prediction module and a human-computer interaction interface. The data acquisition module and the data storage module are located on the data layer; the fault mode tree module and the real-time monitoring module are located on the monitoring layer; the fault diagnosis module, the health evaluation module and the fault prediction module are located on the application layer; and the human-computer interaction interface is located on the interaction layer.

The data acquisition module acquires multi-source data of the reinforcement server system and provides the data source for the other modules of the system. The acquired data is divided into out-of-band information and in-band information of the system. Out-of-band information is collected using the IPMI protocol, and the collected data includes at least system power supply information and hardware temperature information. In-band information is collected by reading operating system kernel files, and the collected data includes at least running state information, such as utilization and idle rate, of components such as the CPU, memory, disk and network.
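As an illustration of the in-band path only, the following minimal sketch parses text in the format of the Linux `/proc/meminfo` kernel file and derives a memory utilization figure. The source does not name the operating system or the files that are read, so the choice of `/proc/meminfo`, the function names and the sample values are all assumptions made for this example.

```python
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style text into {field name: integer value}."""
    fields: Dict[str, int] = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value [unit]"
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def memory_utilization(fields: Dict[str, int]) -> float:
    """Fraction of memory in use, from MemTotal and MemAvailable."""
    total = fields["MemTotal"]
    available = fields["MemAvailable"]
    return (total - available) / total


# Hypothetical sample; on a Linux host the text could instead come from
# open("/proc/meminfo").read().
SAMPLE = """MemTotal:       16000000 kB
MemFree:         2000000 kB
MemAvailable:    4000000 kB
"""

if __name__ == "__main__":
    print(memory_utilization(parse_meminfo(SAMPLE)))  # 0.75
```

Out-of-band values such as temperatures and voltages would come from the baseboard management controller over IPMI rather than from kernel files, and are not covered by this sketch.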

The data storage module is used for storing the collected system multi-source data into a database for system state backup, providing historical data of system parameter states for the fault prediction module, and simultaneously providing a data basis for subsequent manual investigation and algorithm improvement.

The fault mode tree module defines the fault types of the software and hardware systems of the reinforcement server, the parameters related to each fault type, and the parameter fault thresholds. The fault mode tree is constructed by analyzing typical fault information of the reinforcement server. As shown in Table 1, the component-level fault classification includes at least CPU faults, memory faults, disk faults, network faults and hardware faults, where hardware faults are mainly system component voltage faults and faults of limited-life parts (such as fans). The fault mode tree is configurable: as server components and applications are later added or modified and fault modes are updated, the tree can be changed accordingly, including setting parameter thresholds and modifying, adding or deleting fault modes.

Table 1. Fault mode tree

The real-time monitoring module monitors the state of each system parameter in real time against the parameter state thresholds in the fault mode tree. When a parameter state exceeds its threshold limit, the system is considered to be in an abnormal condition, and the real-time monitoring module passes the abnormal parameter type to the fault diagnosis module and activates it to perform autonomous fault diagnosis.
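The relationship between the fault mode tree and the real-time monitoring module can be sketched as a nested mapping from component to parameter to threshold rule, with a check that reports every parameter outside its limits. This is a minimal sketch only: the contents of Table 1 are not available in this text, so every component name, parameter name, fault mode label and threshold value below is hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class ParameterRule:
    """Threshold rule for one parameter (all values here are hypothetical)."""
    fault_mode: str
    lower: Optional[float] = None  # abnormal when value < lower
    upper: Optional[float] = None  # abnormal when value > upper

    def is_abnormal(self, value: float) -> bool:
        if self.lower is not None and value < self.lower:
            return True
        if self.upper is not None and value > self.upper:
            return True
        return False


# component -> parameter -> rule; editable, mirroring the configurable tree.
FAULT_MODE_TREE: Dict[str, Dict[str, ParameterRule]] = {
    "cpu": {
        "cpu_utilization_pct": ParameterRule("CPU overload", upper=95.0),
    },
    "hardware": {
        "fan_speed_rpm": ParameterRule("fan failure", lower=1000.0),
        "voltage_12v": ParameterRule("12 V rail out of range",
                                     lower=11.4, upper=12.6),
    },
}


def find_abnormal(sample: Dict[str, float]) -> List[Tuple[str, str, str]]:
    """Return (component, parameter, fault mode) for each abnormal reading."""
    hits = []
    for component, rules in FAULT_MODE_TREE.items():
        for parameter, rule in rules.items():
            if parameter in sample and rule.is_abnormal(sample[parameter]):
                hits.append((component, parameter, rule.fault_mode))
    return hits


if __name__ == "__main__":
    reading = {"cpu_utilization_pct": 50.0,
               "fan_speed_rpm": 800.0,
               "voltage_12v": 12.0}
    print(find_abnormal(reading))
```

In this sketch each returned tuple corresponds to the abnormal parameter type that the monitoring module would hand to the fault diagnosis module.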

The fault diagnosis module analyzes the collected abnormal parameter state data and uses a two-step judgment, consisting of a probabilistic judgment followed by a continuity judgment, to diagnose whether an abnormal state is a transient abnormal alarm or a fault. The two-step judgment is aimed at false fault alarms caused by transient abnormalities in system parameter states: an abnormal state is identified as a parameter fault only when both the probabilistic judgment and the continuity judgment indicate a fault, in which case fault information is output according to the fault mode tree; otherwise only alarm information is output. The probabilistic judgment determines whether a fault has occurred by checking whether, within a specified time, the percentage of time points at which the parameter is abnormal relative to the total time exceeds a probability threshold. The continuity judgment determines whether the abnormal state occurs continuously, and a fault is declared when it does. It is performed on the basis of the probabilistic judgment and checks whether, within the specified time, the longest run of consecutive abnormal time points, as a percentage of the total abnormal time, exceeds a continuity threshold. The thresholds for the probabilistic and continuity judgments are configurable and can be adjusted by personnel based on actual results.

Specifically, the workflow of the fault diagnosis module is shown in FIG. 2 and includes:

S21, acquiring parameter state data for N time points according to the abnormal parameter, and the fault mode to which it belongs, reported by the real-time monitoring module;

S22, analyzing the acquired parameter state data using the probabilistic judgment; when the result is negative, outputting only parameter abnormal state alarm information, and when the result is positive, triggering the continuity judgment;

S23, analyzing the acquired parameter state data using the continuity judgment; when the result is negative, outputting only parameter abnormal state alarm information, and when the result is positive, judging the parameter abnormal state to be a fault and outputting the fault mode information of the abnormal parameter.
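Steps S21 to S23 can be sketched as follows, assuming the N acquired time points have already been reduced to a list of booleans that mark whether the parameter was abnormal at each point. The function names and the threshold values in the usage example are illustrative assumptions, not values from the source.

```python
from typing import Sequence


def abnormal_ratio(flags: Sequence[bool]) -> float:
    """Probabilistic measure: abnormal time points / all time points."""
    return sum(flags) / len(flags)


def longest_abnormal_run(flags: Sequence[bool]) -> int:
    """Length of the longest run of consecutive abnormal time points."""
    best = current = 0
    for flag in flags:
        current = current + 1 if flag else 0
        best = max(best, current)
    return best


def diagnose(flags: Sequence[bool],
             prob_threshold: float,
             cont_threshold: float) -> str:
    """Return "fault" only if both judgments are positive, else "alarm"."""
    if not flags:
        raise ValueError("at least one time point is required")
    # S22: probabilistic judgment over the whole window.
    if abnormal_ratio(flags) <= prob_threshold:
        return "alarm"
    # S23: continuity judgment, longest run relative to all abnormal points.
    continuity = longest_abnormal_run(flags) / sum(flags)
    return "fault" if continuity > cont_threshold else "alarm"


if __name__ == "__main__":
    sustained = [0, 1, 1, 1, 1, 0, 1, 0, 0, 0]   # 5 abnormal, longest run 4
    scattered = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]   # 5 abnormal, longest run 1
    spike = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]       # 1 abnormal point
    for window in (sustained, scattered, spike):
        print(diagnose([bool(x) for x in window], 0.3, 0.6))
```

With the assumed thresholds of 0.3 and 0.6, the sustained window is reported as a fault, while the scattered window fails the continuity judgment and the single spike fails the probabilistic judgment, so both produce only an alarm.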

The health evaluation module takes the multivariate system state information as its data basis and constructs parameter-level, component-level and system-level health evaluation models that reflect the current health state of the system. Parameter-level health evaluation assesses the running state of a single parameter and constructs a nonlinear evaluation function based on how far the parameter state deviates from the fault critical state. Component-level health evaluation assesses the health degree of each component of the system, constructing a health evaluation function for each component by aggregating the health degrees of its parameters using a weighting method. Likewise, system-level health evaluation constructs a health evaluation function for the system by aggregating the health degrees of the components using a weighting method.

For the parameter-level health evaluation model, the health evaluation function is constructed from the deviation between the current running state of the parameter and its critical abnormal state threshold. As the parameter state approaches the critical abnormal state, the health state changes more rapidly, so the parameter-level health evaluation function is constructed in a nonlinear form. Parameters can be classified as constant type or percentage type according to the type of their state value, as single-sided threshold type or double-sided threshold type according to the interval of their state threshold, and as increment type or decrement type according to how the state value relates to the change in health degree. A corresponding health evaluation function is constructed for each type of parameter. The parameters of the computing system fall into five types: single-sided threshold percentage decrement, single-sided threshold percentage increment, single-sided threshold constant decrement, single-sided threshold constant increment, and double-sided threshold.

To describe the method more clearly, the software and hardware system, components and parameters of the reinforced computer are defined as follows. The system consists of $n$ components, each with $m$ related parameters, and the state of the $j$-th parameter of the $i$-th component is denoted $x_{ij}$, where $i \in [1, n]$ and $j \in [1, m]$. For the parameter-level health evaluation function, a different evaluation function must be designed for each parameter type.

1) Single-sided threshold percentage decrement parameter

For parameter $x_{ij}$, the health assessment function is as follows:

[formula not reproduced in the source text]

where $\theta_{ij}$ is the critical abnormal state threshold of parameter $x_{ij}$.

2) Single-sided threshold percentage increment parameter

3) Single-sided threshold constant decrement parameter

4) Single-sided threshold constant increment parameter

5) Double-sided threshold parameter

[formula not reproduced in the source text]

where the formula uses the standard value of the parameter, and $\theta_{ij}$ is the fault threshold for the deviation of the parameter state value from that standard value.

For the component health assessment model:

For the $i$-th component, the health assessment function is expressed as follows:

$z_i = C_i \cdot Y_i$

where $C_i = [c_{i1} \; \dots \; c_{im}]$ is the weight matrix of the parameters, representing the degree to which each parameter influences the health degree of the component. The weight of each parameter is set manually using expert knowledge, prior knowledge and similar methods, and can be reconfigured according to the results obtained in actual use. $Y_i = [y_{i1} \; \dots \; y_{im}]^T$ is the health degree matrix of the parameters related to the component.

For the system health assessment model:

The system health assessment function is expressed as follows:

$u = W \cdot Z$

where $W = [w_1 \; \dots \; w_n]$ is the weight matrix of the components, representing the degree to which each component influences the health degree of the system. The weight of each component is set manually using expert knowledge, prior knowledge and similar methods, and can be reconfigured according to the results obtained in actual use. $Z = [z_1 \; \dots \; z_n]^T$ is the health degree matrix of the components of the system.

Specifically, as shown in FIG. 3, the workflow of the health assessment module is:

S31, evaluating the health degree of each parameter using the parameter health evaluation function that corresponds to the type of the system parameter;

S32, aggregating the parameter health degrees by system component, and evaluating the health degree of each component using its component health evaluation function;

S33, aggregating the health degrees of the system components, and evaluating the overall health degree of the system using the system health evaluation function.
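Steps S32 and S33 amount to two weighted sums, $z_i = C_i \cdot Y_i$ followed by $u = W \cdot Z$. The sketch below shows that aggregation, taking the parameter health degrees from S31 as given inputs. It assumes, purely for illustration, that health degrees lie in [0, 1] and that each weight vector sums to 1; the source states neither, and all numeric values are hypothetical.

```python
from typing import List, Sequence, Tuple


def weighted_health(weights: Sequence[float],
                    healths: Sequence[float]) -> float:
    """Dot product of a weight row vector and a health column vector."""
    if len(weights) != len(healths):
        raise ValueError("weights and healths must have the same length")
    return sum(w * h for w, h in zip(weights, healths))


def system_health(param_weights: Sequence[Sequence[float]],
                  param_healths: Sequence[Sequence[float]],
                  component_weights: Sequence[float]
                  ) -> Tuple[List[float], float]:
    """Return (component health degrees z_i, system health degree u)."""
    # S32: z_i = C_i . Y_i for every component i.
    z = [weighted_health(c, y) for c, y in zip(param_weights, param_healths)]
    # S33: u = W . Z over all components.
    return z, weighted_health(component_weights, z)


if __name__ == "__main__":
    C = [[0.5, 0.5], [0.25, 0.75]]   # hypothetical parameter weights
    Y = [[1.0, 0.5], [1.0, 1.0]]     # parameter health degrees from S31
    W = [0.5, 0.5]                   # hypothetical component weights
    z, u = system_health(C, Y, W)
    print(z, u)  # [0.75, 1.0] 0.875
```

Because the weights are plain inputs, they can be changed at run time, which matches the statement that they are configurable according to results in actual use.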

The fault prediction module predicts the future state of each parameter from historical time series data of system parameter states using a data-driven time series prediction model, and then performs fault diagnosis on the predicted future states according to the fault mode tree, thereby completing fault prediction. In this embodiment, an ARIMA model is used to construct the time series prediction model. Because the software and hardware system of the reinforcement server is strongly affected by external factors, external inputs can change system parameter states substantially and reduce the prediction accuracy of the time series prediction model; the model is therefore updated automatically by means of online model updating. The specific workflow of the fault prediction module is shown in FIG. 4 and comprises the following steps:

S41, acquiring from the database the historical time series data of the system parameter states related to the fault mode tree;

S42, checking whether the time series data of each parameter is stationary; if the data is non-stationary, proceeding to step S43, and if it is stationary, proceeding to step S46;

S43, applying differencing to the non-stationary data;

S44, checking the stationarity of the differenced data; if the data is stationary, proceeding to step S45, and if it is non-stationary, returning to step S43;

S45, updating the parameter state prediction model according to the differenced time series data;

S46, predicting the parameter state over a future period using the parameter state prediction model, with the parameter time series data as the data basis;

S47, performing fault analysis and diagnosis on the future state of the system according to the predicted future parameter states and the fault mode tree, to obtain the future running condition of the system.
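The control flow of steps S42 to S47 can be sketched with the standard library alone. This sketch is deliberately simplified and is not the patented model: the stationarity check is a crude comparison of half-window means standing in for a proper statistical test, and the forecaster is the degenerate ARIMA(0, d, 0)-with-drift case (the mean of the differenced series is carried forward and integrated back), whereas a real deployment would fit full ARIMA orders with a statistics library. It also refits on every call, whereas S42 skips the update when the data is already stationary. All names, the tolerance and the threshold are assumptions.

```python
from statistics import mean, pstdev
from typing import List, Optional, Sequence


def difference(series: Sequence[float]) -> List[float]:
    """S43: first-order differencing."""
    return [b - a for a, b in zip(series, series[1:])]


def looks_stationary(series: Sequence[float], tol: float = 0.5) -> bool:
    """S42/S44 stand-in: do the two halves have similar means?"""
    if len(series) < 4:
        return True
    half = len(series) // 2
    gap = abs(mean(series[:half]) - mean(series[half:]))
    return gap <= tol * pstdev(series)


def forecast(series: Sequence[float], steps: int, max_d: int = 2) -> List[float]:
    """S42-S46: difference until stationary, refit, predict, integrate back."""
    levels = [list(series)]
    while len(levels) - 1 < max_d and not looks_stationary(levels[-1]):
        levels.append(difference(levels[-1]))
    drift = mean(levels[-1])                      # S45: "model" refit on call
    lasts = [level[-1] for level in levels[:-1]]  # last value at each level
    predictions = []
    for _ in range(steps):
        value = drift
        for k in range(len(lasts) - 1, -1, -1):   # undo each differencing
            lasts[k] += value
            value = lasts[k]
        predictions.append(value)
    return predictions


def first_violation(predictions: Sequence[float],
                    upper: float) -> Optional[int]:
    """S47 stand-in: index of the first predicted value above a threshold."""
    for index, value in enumerate(predictions):
        if value > upper:
            return index
    return None


if __name__ == "__main__":
    history = [float(i) for i in range(20)]       # steadily rising parameter
    predicted = forecast(history, steps=3)
    print(predicted, first_violation(predicted, upper=21.5))
```

For the rising series in the usage example, one round of differencing yields a constant series, the forecast continues the trend, and the third predicted point is the first to cross the assumed upper threshold.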

The man-machine interaction interface mainly displays related information such as real-time state information, fault and alarm information, health degree and fault prediction result of the system, and provides configuration function of a system fault mode tree, system history state query function and the like.

It can be seen that the invention mainly adopts the following technical means:

1) Multi-source data acquisition method: the method collects out-of-band information of the software and hardware systems of the reinforcement server through the IPMI protocol and collects in-band information by parsing operating system kernel runtime files, thereby achieving multi-source data acquisition for the server's software and hardware systems;

2) Fault diagnosis method based on a fault mode tree and quantitative analysis: the method constructs a fault mode tree of the server software and hardware system from typical server faults and builds a quantitative fault diagnosis method on it; a two-step judgment consisting of a probabilistic judgment and a continuity judgment is used to prevent false fault alarms caused by transient system abnormalities, thereby ensuring the accuracy of fault diagnosis;

3) Multi-level health assessment method based on multivariate characteristics: the method assesses the running health of the system from the multivariate characteristics of the software and hardware system of the reinforcement server, progressing from system parameter states to system component states and finally to the system state, and thereby constructs a parameter-level, component-level and system-level health assessment scheme that realizes multi-level system health assessment;

4) Fault prediction method based on time series prediction: the method uses a time series prediction model to predict the future state of the relevant parameters of the software and hardware systems of the reinforcement server, and performs fault analysis and diagnosis on the prediction results, thereby realizing system fault prediction. In addition, online model updating is used to counter the loss of prediction accuracy caused by large disturbances from external inputs, improving the accuracy of fault prediction.

The fault prediction and health management system and method for a reinforcement server take into account both the out-of-band information and the in-band information of the server's software and hardware systems, and monitor the running condition of those systems in real time. They provide automatic diagnosis of abnormal states, system health assessment and system fault prediction, supplying the technical means needed for health management of the reinforcement server and achieving autonomous diagnosis and autonomous support of its software and hardware systems. As a result, maintenance of reinforcement server equipment can shift from "post maintenance" to "condition-based maintenance" and from "regular maintenance" to "maintenance based on running state", moving the maintenance and support mode of reinforcement server equipment in a more effective direction.

The invention provides a fault prediction and health management system and method for a reinforcement server that combine multi-source data acquisition based on IPMI and operating system kernel analysis, quantitative fault diagnosis based on a fault mode tree, health assessment based on multivariate system characteristics, and data-driven fault prediction. Together, these techniques sense the state of the software and hardware system of the reinforcement server, monitor the running state of the equipment, diagnose and locate the fault type behind an abnormal system state through data monitoring and analysis, evaluate the running health of the system, and predict the occurrence of faults, thereby achieving autonomous diagnosis and autonomous support, greatly improving operation and maintenance efficiency, and improving the safety and reliability of the system.

The foregoing is merely a preferred embodiment of the present invention, and it should be noted that modifications and variations could be made by those skilled in the art without departing from the technical principles of the present invention, and such modifications and variations should also be regarded as being within the scope of the invention.

Claims

1. A fault prediction and health management system for a reinforcement server, the system comprising: a data acquisition module, a data storage module, a fault mode tree module, a real-time monitoring module, a fault diagnosis module, a health evaluation module, a fault prediction module and a human-computer interaction interface;

2. The system for predicting and managing faults of a reinforcement server according to claim 1, wherein the data acquisition module and the data storage module are located in a data layer, the fault mode tree module and the real-time monitoring module are located in a monitoring layer, the fault diagnosis module, the health evaluation module and the fault prediction module are located in an application layer, and the human-computer interaction interface is located in an interaction layer.

3. The system for predicting and managing faults of a reinforcement server according to claim 1, wherein the data acquisition module collects out-of-band information of the system using the IPMI protocol, the collected data including at least system power supply information and hardware temperature information; and the data acquisition module collects in-band information of the system by reading operating system kernel files, the collected data including at least running state information of the CPU, memory, disk and network components.

4. The system for predicting and managing faults of a reinforcement server according to claim 1, wherein the fault mode tree is classified according to the component in which a fault occurs, and the component-level fault classification includes CPU faults, memory faults, disk faults, network faults and hardware faults, wherein the hardware faults are system component voltage faults and limited-life part faults.

5. The system for predicting and managing faults of a reinforcement server according to claim 4, wherein the fault mode tree is modified according to subsequent additions or modifications of server components and applications and updates of fault modes, the modification including setting parameter thresholds and modifying, adding or deleting fault modes.

6. The system for predicting and managing faults of a reinforcement server according to any one of claims 1 to 5, wherein, to address false fault alarms caused by transient abnormalities in system parameter states, the abnormal parameter state data of the system is diagnosed using a two-step judgment consisting of a probabilistic judgment and a continuity judgment; an abnormal state is identified as a parameter fault only when both the probabilistic judgment and the continuity judgment indicate a fault, in which case fault information is output according to the fault mode tree, and otherwise only alarm information is output; the probabilistic judgment determines whether a fault has occurred by checking whether, within a specified time, the percentage of time points at which the parameter is abnormal relative to the total time exceeds a probability threshold; the continuity judgment determines whether the abnormal state occurs continuously, a fault being declared when it does, and is performed on the basis of the probabilistic judgment by checking whether, within the specified time, the longest run of consecutive abnormal time points, as a percentage of the total abnormal time, exceeds a continuity threshold.

7. The fault prediction and health management system for a reinforcement server according to claim 6, wherein the workflow of the fault diagnosis module comprises:

8. The fault prediction and health management system for a reinforcement server according to claim 6, wherein:

the parameter-level health assessment model builds a health evaluation function from the degree of deviation between the current running state of a parameter and its critical abnormal state threshold; because the health state changes more rapidly as the parameter state approaches the critical abnormal state, the parameter-level health evaluation function is constructed in a nonlinear form; parameters are divided into constant type and percentage type according to the type of the state value, into single-sided threshold type and double-sided threshold type according to the threshold interval of the state, and into increasing type and decreasing type according to the relation between changes in the state value and changes in health; a corresponding health evaluation function is constructed for each parameter type; because the parameters of the computing system fall into five types, namely decreasing single-threshold-interval percentage, increasing single-threshold-interval percentage, decreasing single-threshold-interval constant, increasing single-threshold-interval constant, and increasing double-threshold-interval, the health evaluation functions are designed as follows:

the software and hardware system, components, and parameters of the reinforced computer are defined as follows: the system consists of $n$ components, and each component has $m$ associated parameters, so that the state of the $j$-th parameter of the $i$-th component is denoted $x_{ij}$, where $i \in [1, n]$ and $j \in [1, m]$; for the parameter-level health evaluation function, a different evaluation function is designed for each parameter type;

1) Single-sided threshold percentage decrement parameter

where $\theta_{ij}$ is the critical abnormal state threshold of the parameter $x_{ij}$;

2) Single-sided threshold percentage increment parameter

3) Single-sided threshold constant decrement parameter

4) Single-sided threshold constant increment parameter

5) Double-sided threshold parameter

where the parameter has a standard value, and $\theta_{ij}$ is the fault critical value of the deviation between the parameter state value and that standard value;

for the component health assessment model: for the $i$-th component, the health evaluation function is as follows:

$z_i = C_i \cdot Y_i$

where $C_i = [c_{i1} \; \dots \; c_{im}]$ is the weight matrix of the parameters, characterizing the degree of influence of each parameter on the health of the component; the weight of each parameter is determined manually from expert knowledge and prior knowledge, and is configured and changed according to the actual use effect; $Y_i = [y_{i1} \; \dots \; y_{im}]^{T}$ is the health matrix of the parameters associated with the component;

for the system health assessment model: the system health evaluation function is as follows:

$u = W \cdot Z$

where $W = [w_1 \; \dots \; w_n]$ is the weight matrix of the components, characterizing the degree of influence of each component on the health of the system; the weight of each component is determined manually from expert knowledge and prior knowledge, and is configured and changed according to the actual use effect; $Z = [z_1 \; \dots \; z_n]^{T}$ is the health matrix of the components of the system.
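The two weighted sums above, $z_i = C_i \cdot Y_i$ and $u = W \cdot Z$, can be sketched as follows. The function names and the example numbers are hypothetical, and the sketch assumes that each weight vector sums to 1 and that health values lie in $[0, 1]$, which the claim does not state.

```python
from typing import List, Sequence, Tuple


def weighted_health(weights: Sequence[float], healths: Sequence[float]) -> float:
    """Dot product of a weight row vector and a health column vector."""
    if len(weights) != len(healths):
        raise ValueError("weights and healths must have the same length")
    return sum(w * h for w, h in zip(weights, healths))


def system_health(component_weights: Sequence[float],
                  parameter_weights: Sequence[Sequence[float]],
                  parameter_healths: Sequence[Sequence[float]]) -> Tuple[float, List[float]]:
    """Return (u, Z): system health and the list of component healths z_i."""
    z = [weighted_health(c_i, y_i)
         for c_i, y_i in zip(parameter_weights, parameter_healths)]
    return weighted_health(component_weights, z), z


# Hypothetical example with n = 2 components and m = 2 parameters each.
C = [[0.5, 0.5], [0.25, 0.75]]   # parameter weights C_i (assumed to sum to 1)
Y = [[1.0, 0.5], [1.0, 0.0]]     # parameter health values y_ij (assumed in [0, 1])
W = [0.5, 0.5]                   # component weights (assumed to sum to 1)
u, Z = system_health(W, C, Y)
print(Z, u)  # [0.75, 0.25] 0.5
```

Because the weights are plain configuration data in this sketch, they can be changed without touching the aggregation logic, in line with the claim's statement that they are configured according to actual use.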

9. The fault prediction and health management system for a reinforcement server according to claim 8, wherein the workflow of the health assessment module is:

10. The fault prediction and health management system for a reinforcement server according to claim 8, wherein the workflow of the fault prediction module comprises:

S43, performing differencing on the non-stationary data;