CN116755964A - Fault prediction and health management system for reinforcement server - Google Patents

Publication number
CN116755964A
Authority
CN
China
Prior art keywords
parameter
fault
health
state
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310698766.0A
Other languages
Chinese (zh)
Inventor
程智鹏
刘宗宝
刘更
郭申
闵新宇
甄志伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Computer Technology and Applications
Original Assignee
Beijing Institute of Computer Technology and Applications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Computer Technology and Applications
Priority to CN202310698766.0A
Publication of CN116755964A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3058 Monitoring arrangements for monitoring environmental properties or parameters of the computing system or of the computing system component, e.g. monitoring of power, currents, temperature, humidity, position, vibrations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1446 Point-in-time backing up or restoration of persistent data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3055 Monitoring arrangements for monitoring the status of the computing system or of the computing system component, e.g. monitoring if the computing system is on, off, available, not available
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3065 Monitoring arrangements determined by the means or processing involved in reporting the monitored data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Computational Mathematics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Test And Diagnosis Of Digital Computers (AREA)

Abstract

The invention relates to a fault prediction and health management system for a reinforcement server and belongs to the technical field of computer health management. The system combines multi-source data acquisition based on IPMI and operating system kernel analysis, quantitative fault diagnosis based on a fault mode tree, health assessment based on multivariate system characteristics, and data-driven fault prediction. With these techniques it senses the state of the software and hardware systems of the reinforcement server, monitors the running condition of the equipment, diagnoses and locates the fault type behind an abnormal system state through data monitoring and analysis, evaluates the running health of the system, and predicts the occurrence of faults. This realizes autonomous diagnosis and autonomous guarantee of the system, greatly improves operation and maintenance efficiency, and improves the safety and reliability of the system.

Description

Fault prediction and health management system for reinforcement server
Technical Field
The invention belongs to the technical field of computer health management, and particularly relates to a fault prediction and health management system of a reinforcement server.
Background
The reinforcement server is a comprehensive data computing and information processing server. It features high information processing speed and high reliability requirements, and is widely used in systems such as command control and information guarantee. The reinforcement server is an integrated system in which software and hardware are tightly coupled, and it places very strict requirements on system stability. As the functions and performance of reinforcement server systems keep improving, the probability of faults and functional failures increases and the types of faults multiply. How to effectively reduce the fault rate of the software and hardware systems of the reinforcement server is therefore an important problem that urgently needs to be solved.
At present, the reinforcement server is maintained mainly in two ways: fault repair and scheduled maintenance. With fault repair, once the equipment fails it is usually shut down for repair, so all maintenance and guarantee measures are passive remedies taken after the fault. With scheduled maintenance, regardless of the actual working condition of the equipment, routine maintenance and minor, medium and major repairs are carried out at fixed intervals according to a predetermined schedule and strategy. Both approaches suffer from long maintenance time, high maintenance cost and poor pertinence.
To address these problems, fault prediction and health management technology can be used to predict the occurrence of faults in the software and hardware systems of the reinforcement server, characterize the running health of the system, and automatically identify maintenance planning and guarantee decisions. This shifts the maintenance strategy from time-based "planned maintenance" and fault-driven "post maintenance" to condition-based maintenance founded on state monitoring and health management, which reduces the use and maintenance cost of the reinforcement server and improves the safety, integrity and mission success of its software and hardware systems.
Disclosure of Invention
(I) Technical problem to be solved
The invention aims to solve the technical problems of long maintenance time, high maintenance cost, poor pertinence and the like in the prior art by providing a fault prediction and health management system for a reinforcement server.
(II) Technical solution
In order to solve the above technical problems, the present invention provides a fault prediction and health management system for a reinforcement server, the system comprising: a data acquisition module, a data storage module, a fault mode tree module, a real-time monitoring module, a fault diagnosis module, a health evaluation module, a fault prediction module and a human-computer interaction interface;
the data acquisition module acquires multi-source data of the reinforcement server system and provides data sources for other modules of the system, and the acquired data is divided into out-of-band information and in-band information of the system;
the data storage module is used for storing the collected system multi-source data into a database for system state backup and providing historical data of system parameter states for the fault prediction module;
the fault mode tree module is used for defining fault types of software and hardware systems of the reinforcement server, relevant parameters of the fault types and parameter fault thresholds, and constructing a fault mode tree by analyzing typical fault information of the reinforcement server;
the real-time monitoring module monitors the state of each system parameter in real time against the parameter state thresholds in the fault mode tree; when a parameter state exceeds its threshold limit, the system is regarded as being in an abnormal condition, the real-time monitoring module passes the abnormal parameter type to the fault diagnosis module, and the fault diagnosis module is activated to perform autonomous fault diagnosis;
the fault diagnosis module analyzes the collected abnormal parameter state data and uses a two-step judging method, consisting of probabilistic judgment and continuity judgment, to diagnose whether the abnormal state is an instantaneous abnormal alarm or a fault;
the health evaluation module constructs parameter-level, component-level and system-level health evaluation models, using multivariate system state information as the data basis, to reflect the current health state of the system; the parameter-level health evaluation model evaluates the running state of a single parameter and constructs a nonlinear evaluation function based on the degree to which the parameter state deviates from the fault critical state; the component-level health evaluation model evaluates the health degree of each component of the system, and constructs a health evaluation function for each component by aggregating the health degrees of that component's parameters using a weighting method; likewise, the system-level health evaluation model constructs a health evaluation function for the system by aggregating the health degrees of all components using a weighting method;
the fault prediction module predicts the future state of the parameter by adopting a data-driven time sequence prediction model according to the historical time sequence data of the state of the system parameter, and performs fault diagnosis on the future state of the system parameter according to a fault mode tree so as to complete fault prediction; the fault prediction module adopts an ARIMA model to construct a time sequence prediction model, and adopts an online model updating mode to automatically update the model;
the man-machine interaction interface displays the real-time state information, fault and alarm information and information related to the health degree and fault prediction result of the system, and simultaneously provides the configuration function of a system fault mode tree and the inquiry function of the historical state of the system.
(III) Beneficial effects
The invention provides a fault prediction and health management system and method for a reinforcement server. The system and method combine multi-source data acquisition based on IPMI and operating system kernel analysis, quantitative fault diagnosis based on a fault mode tree, health assessment based on multivariate system characteristics, and data-driven fault prediction. With these techniques they sense the state of the software and hardware systems of the reinforcement server, monitor the running condition of the equipment, diagnose and locate the fault type behind an abnormal system state through data monitoring and analysis, evaluate the running health of the system, and predict the occurrence of faults. This realizes autonomous diagnosis and autonomous guarantee of the system, greatly improves operation and maintenance efficiency, and improves the safety and reliability of the system.
Drawings
FIG. 1 is a block diagram of a fault prediction and health management system of the present invention;
FIG. 2 is a block flow diagram of a fault diagnosis module of the present invention;
FIG. 3 is a block flow diagram of a health assessment module of the present invention;
FIG. 4 is a block flow diagram of a fault prediction module of the present invention.
Detailed Description
To make the objects, contents and advantages of the present invention more apparent, the following detailed description of the present invention will be given with reference to the accompanying drawings and examples.
The invention relates to the technical field of computer health management, in particular to a system and a method for data acquisition, state monitoring, fault diagnosis, health assessment and fault prediction of a reinforcement server.
The invention aims to provide a fault prediction and health management method for a reinforcement server, so as to solve the problem of how to design a fault prediction and health management system that meets the application requirements of autonomous guarantee and autonomous diagnosis in the operation and maintenance of the reinforcement server, and that provides functions such as fault diagnosis, health assessment and fault prediction.
The invention provides a fault prediction and health management system and method for a reinforcement server. Targeting the software and hardware systems of the reinforcement server, the system performs data acquisition, real-time monitoring, fault diagnosis, health evaluation, fault prediction and related operations, provides running condition information for the software and hardware systems of the server, and provides the necessary technical means for operating and maintaining the server.
As shown in FIG. 1, the fault prediction and health management system and method for a reinforcement server provided by the invention are divided into a data layer, a monitoring layer, an application layer and an interaction layer, and comprise a data acquisition module, a data storage module, a fault mode tree module, a real-time monitoring module, a fault diagnosis module, a health evaluation module, a fault prediction module and a man-machine interaction interface. The data acquisition module and the data storage module are located in the data layer; the fault mode tree module and the real-time monitoring module are located in the monitoring layer; the fault diagnosis module, the health evaluation module and the fault prediction module are located in the application layer; and the man-machine interaction interface is located in the interaction layer.
The data acquisition module acquires multi-source data of the reinforcement server system and provides data sources for the other modules of the system. The acquired data is divided into out-of-band information and in-band information of the system, specifically: out-of-band information is collected using the IPMI protocol, and the collected data includes at least system power supply information and hardware temperature information; in-band information is collected by reading operating system kernel files, and the collected data includes at least running state information, such as utilization and idle rate, of components such as the CPU, memory, disk and network.
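As an illustration of the in-band path only, the following minimal sketch parses kernel-exported memory statistics and derives a utilization figure. It assumes a Linux host whose kernel exposes the statistics as text in the /proc/meminfo layout; the patent does not name the operating system or the files it reads, and the function names parse_meminfo and memory_usage_percent are hypothetical. Out-of-band collection over IPMI is not shown.

```python
def parse_meminfo(text):
    """Parse '/proc/meminfo'-style text into {field name: value in kB}."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        # Keep only "Name: <integer> [unit]" lines; skip anything else.
        if sep and fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def memory_usage_percent(info):
    """Memory utilization derived from the MemTotal and MemAvailable fields."""
    total = info["MemTotal"]
    return 100.0 * (total - info["MemAvailable"]) / total


# Sample text in the /proc/meminfo layout (assumed format). On a live Linux
# host the text would come from open("/proc/meminfo").read() instead.
sample = "MemTotal:       16000000 kB\nMemAvailable:    4000000 kB\n"
usage = memory_usage_percent(parse_meminfo(sample))
```

Keeping the parser separate from the file read makes it possible to feed the same function with stored samples from the database when reviewing historical states.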
The data storage module is used for storing the collected system multi-source data into a database for system state backup, providing historical data of system parameter states for the fault prediction module, and simultaneously providing a data basis for subsequent manual investigation and algorithm improvement.
The fault mode tree module is used for defining the fault types of the software and hardware systems of the reinforcement server, the parameters related to each fault type, and the parameter fault thresholds. The fault mode tree is constructed by analyzing typical fault information of the reinforcement server. As shown in Table 1, the component-level fault classification includes at least CPU faults, memory faults, disk faults, network faults and hardware faults, where hardware faults are mainly system component voltage faults and faults of limited-life parts (such as fans). The fault mode tree is configurable: as server components and applications are added or modified and fault modes are updated, the tree can be changed accordingly, including setting parameter thresholds and modifying, adding or deleting fault modes.
Table 1 Fault mode tree
The real-time monitoring module monitors the state of each system parameter in real time against the parameter state thresholds in the fault mode tree. When a parameter state exceeds its threshold limit, the system is considered to be in an abnormal condition, the real-time monitoring module passes the abnormal parameter type to the fault diagnosis module, and the fault diagnosis module is activated to perform autonomous fault diagnosis.
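The relationship between the fault mode tree and the real-time threshold check can be sketched as a nested mapping from component to fault mode to parameter limits. The contents of Table 1 are not reproduced in this text, so every component entry, parameter name and threshold value below is a hypothetical placeholder, as is the function name find_abnormal.

```python
# Hypothetical fault mode tree: component -> fault mode -> parameter -> limits.
# All names and numbers are illustrative assumptions, not values from Table 1.
FAULT_MODE_TREE = {
    "CPU": {
        "CPU overload": {"cpu_usage_percent": {"max": 95.0}},
    },
    "Hardware": {
        "Fan fault": {"fan_rpm": {"min": 1000.0}},
        "Voltage fault": {"voltage_12v": {"min": 11.4, "max": 12.6}},
    },
}


def find_abnormal(sample, tree=FAULT_MODE_TREE):
    """Return (component, fault mode, parameter) for every out-of-limit value."""
    hits = []
    for component, modes in tree.items():
        for mode, params in modes.items():
            for name, limits in params.items():
                if name not in sample:
                    continue  # parameter not present in this sample
                value = sample[name]
                low = limits.get("min", float("-inf"))
                high = limits.get("max", float("inf"))
                if value < low or value > high:
                    hits.append((component, mode, name))
    return hits


abnormal = find_abnormal(
    {"cpu_usage_percent": 97.0, "fan_rpm": 3000.0, "voltage_12v": 12.0}
)
```

Because the tree is plain data, adding, deleting or re-thresholding a fault mode is a configuration change rather than a code change, which matches the configurability described above. Each hit would be handed to the fault diagnosis module rather than reported as a fault directly.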
The fault diagnosis module analyzes the collected abnormal parameter state data and uses a two-step judging method, consisting of probabilistic judgment and continuity judgment, to diagnose whether the abnormal state is an instantaneous abnormal alarm or a fault. The two-step method is aimed at false fault alarms caused by instantaneous abnormalities in system parameter states: the abnormal state is identified as a parameter fault only when both the probabilistic judgment and the continuity judgment indicate a fault, in which case fault information is output according to the fault mode tree; otherwise only alarm information is output. The probabilistic judgment decides whether a fault has occurred by checking whether, within a specified time, the time points at which the parameter is abnormal account for a percentage of the total time that exceeds a probability threshold. The continuity judgment decides whether the abnormal state occurs continuously, and a fault is judged to have occurred when it does. The continuity judgment builds on the probabilistic judgment: it checks whether, within the specified time, the longest run of consecutive abnormal time points accounts for a proportion of the total abnormal time of the parameter that exceeds a continuity threshold. The thresholds for both judgments are configurable and can be modified by personnel based on actual results.
Specifically, the workflow of the fault diagnosis module is shown in fig. 2, and includes:
S21, acquiring parameter state data of N time points according to the abnormal parameter fed back by the real-time monitoring module and the fault mode to which the parameter belongs;
S22, analyzing and diagnosing the acquired parameter state data using the probabilistic judgment; when the probabilistic judgment result is negative, only parameter abnormal state alarm information is output; when the result is positive, the continuity judgment is triggered;
S23, analyzing and diagnosing the acquired parameter state data using the continuity judgment; when the continuity judgment result is negative, only parameter abnormal state alarm information is output; when the result is positive, the parameter abnormal state is judged to be a fault and the fault mode information of the abnormal parameter is output.
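Steps S21 to S23 can be sketched as a single function over a window of N abnormal/normal flags. This is a minimal sketch under stated assumptions: the function name diagnose, the string results and the example threshold values are hypothetical, and strict "greater than" comparisons are assumed for "exceeds".

```python
def diagnose(flags, prob_threshold, cont_threshold):
    """Two-step judgment over N time points.

    flags: list of booleans, True where the parameter was abnormal.
    Returns "fault" only if both judgments are positive, else "alarm".
    """
    abnormal = sum(flags)
    # S22: probabilistic judgment, share of abnormal points in the window.
    if abnormal == 0 or abnormal / len(flags) <= prob_threshold:
        return "alarm"
    # S23: continuity judgment, longest consecutive abnormal run as a share
    # of all abnormal points.
    longest = run = 0
    for is_abnormal in flags:
        run = run + 1 if is_abnormal else 0
        longest = max(longest, run)
    return "fault" if longest / abnormal > cont_threshold else "alarm"


# Hypothetical temperature window with an assumed 90-degree limit.
temps = [70, 95, 96, 97, 98, 60, 99, 97, 96, 95]
flags = [t > 90 for t in temps]
result = diagnose(flags, prob_threshold=0.5, cont_threshold=0.4)
```

In this window 8 of 10 points are abnormal (0.8 exceeds 0.5) and the longest run is 4 of those 8 (0.5 exceeds 0.4), so the result is "fault"; a single spike in an otherwise normal window fails the probabilistic judgment and yields only "alarm".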
The health evaluation module constructs parameter-level, component-level and system-level health evaluation models, using multivariate system state information as the data basis, to reflect the current health state of the system. The parameter-level health evaluation mainly evaluates the running state of a single parameter and constructs a nonlinear evaluation function based on the degree to which the parameter state deviates from the fault critical state. The component-level health evaluation mainly evaluates the health degree of each component of the system, and constructs a health evaluation function for each component by aggregating the health degrees of that component's parameters using a weighting method. Likewise, the system-level health evaluation constructs a health evaluation function for the system by aggregating the health degrees of the components using a weighting method.
For the parameter-level health evaluation model, the health evaluation function is constructed from the degree to which the current running state of the parameter deviates from the critical abnormal state threshold. The closer the parameter state is to the critical abnormal state, the faster the health state changes, so a nonlinear form is used for the parameter-level health evaluation function. Parameters can be classified by state value type into constant type and percentage type, by state threshold interval into single-sided threshold type and double-sided threshold type, and by the relation between changes in state value and changes in health degree into increasing type and decreasing type. A corresponding health evaluation function is constructed for each type of parameter. The parameters of the computing system fall into five types: single-sided threshold percentage decreasing, single-sided threshold percentage increasing, single-sided threshold constant decreasing, single-sided threshold constant increasing, and double-sided threshold.
To describe the method more clearly, the software and hardware system, components and parameters of the reinforced computer are defined as follows: the system consists of $n$ components, each component has $m$ related parameters, and the state of the $j$-th parameter of the $i$-th component is denoted $x_{ij}$, where $i \in [1, n]$ and $j \in [1, m]$. For the parameter-level health assessment function, a different assessment function needs to be designed for each parameter type.
1) Single-sided threshold percentage decreasing parameter
For parameter $x_{ij}$, the health assessment function is as follows:
[formula not reproduced in the source text]
where $\theta_{ij}$ is the critical abnormal state threshold of parameter $x_{ij}$.
2) Single-sided threshold percentage increasing parameter
For parameter $x_{ij}$, the health assessment function is as follows:
[formula not reproduced in the source text]
where $\theta_{ij}$ is the critical abnormal state threshold of parameter $x_{ij}$.
3) Single-sided threshold constant decreasing parameter
For parameter $x_{ij}$, the health assessment function is as follows:
[formula not reproduced in the source text]
where $\theta_{ij}$ is the critical abnormal state threshold of parameter $x_{ij}$.
4) Single-sided threshold constant increasing parameter
For parameter $x_{ij}$, the health assessment function is as follows:
[formula not reproduced in the source text]
where $\theta_{ij}$ is the critical abnormal state threshold of parameter $x_{ij}$.
5) Double-sided threshold parameter
For parameter $x_{ij}$, the health assessment function is as follows:
[formula not reproduced in the source text]
where the standard value of the parameter is denoted by a symbol that is not reproduced in the source text, and $\theta_{ij}$ is the fault threshold for the deviation of the parameter state value from the parameter standard value.
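The patent's own formulas for the five parameter types are not available in this text, so they cannot be shown here. Purely to illustrate the stated property (health changes faster as the state approaches the critical threshold), the sketch below uses a quadratic form. The quadratic shape, the 0-to-1 health scale, the clamping and both function names are assumptions made for this example and should not be read as the patented functions.

```python
def one_sided_health(x, theta):
    """Assumed form for a parameter where larger x is worse.

    Health is 1.0 at x = 0 and falls to 0.0 at the threshold theta; the
    quadratic term makes the drop steeper as x approaches theta.
    """
    ratio = min(max(x / theta, 0.0), 1.0)  # clamp to [0, 1]
    return 1.0 - ratio ** 2


def two_sided_health(x, standard, theta):
    """Assumed form for a parameter with a standard value and a deviation
    threshold theta on either side of it."""
    ratio = min(abs(x - standard) / theta, 1.0)
    return 1.0 - ratio ** 2


halfway = one_sided_health(45.0, 90.0)          # half of the threshold
deviated = two_sided_health(13.0, 12.0, 2.0)    # half of the allowed deviation
```

With this assumed form, a state halfway to the threshold still scores 0.75, while the second half of the range accounts for the remaining 0.75 of health loss, which is the kind of nonlinearity the description calls for.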
For the component health assessment model:
for the $i$-th component, the health assessment function is:
$z_i = C_i \cdot Y_i$
where $C_i = [c_{i1} \dots c_{im}]$ is the weight matrix of the parameters, representing the degree of influence of each parameter on the health degree of the component; the weight of each parameter is set manually using methods such as expert knowledge and prior knowledge, and can be configured and changed according to the actual effect in use; $Y_i = [y_{i1} \dots y_{im}]^T$ is the matrix of the health degrees of the parameters related to the component.
For the system health assessment model:
the system health assessment function is:
$u = W \cdot Z$
where $W = [w_1 \dots w_n]$ is the weight matrix of the components, representing the degree of influence of each component on the system health degree; the weight of each component is set manually using methods such as expert knowledge and prior knowledge, and can be configured and changed according to the actual effect in use; $Z = [z_1 \dots z_n]^T$ is the matrix of the health degrees of the components of the system.
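The two weighted aggregations $z_i = C_i \cdot Y_i$ and $u = W \cdot Z$ are both dot products of a weight row with a health column, so one helper covers both levels. In the sketch below the weights and parameter health values are made-up numbers, the helper name weighted_health is hypothetical, and the weights in each row are assumed to sum to 1 so that results stay on the same scale as the inputs (the source does not state this).

```python
def weighted_health(weights, healths):
    """Dot product of a weight row and a health column of equal length."""
    if len(weights) != len(healths):
        raise ValueError("weights and healths must have the same length")
    return sum(w * h for w, h in zip(weights, healths))


# Hypothetical system with n = 2 components and m = 2 parameters each.
C = [[0.5, 0.5], [0.25, 0.75]]   # parameter weights C_i per component
Y = [[1.0, 0.5], [0.0, 1.0]]     # parameter health degrees Y_i per component
W = [0.5, 0.5]                   # component weights

Z = [weighted_health(c_i, y_i) for c_i, y_i in zip(C, Y)]  # component level
u = weighted_health(W, Z)                                  # system level
```

Here both components evaluate to 0.75 and so does the system; changing a weight row is enough to re-tune the influence of a parameter or component without touching the aggregation code.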
Specifically, as shown in fig. 3, the workflow of the health assessment module is:
S31, evaluating the health degree of each parameter using the parameter health evaluation function that corresponds to the type of the system parameter;
S32, aggregating the health degrees of the parameters by system component, and evaluating the health degree of each component using the health evaluation function of that component;
S33, aggregating the health degrees of the components of the system, and evaluating the overall health degree of the system using the system health evaluation function.
The fault prediction module predicts the future state of the parameters from the historical time series data of the system parameter states using a data-driven time series prediction model, and performs fault diagnosis on the predicted future states according to the fault mode tree, thereby completing fault prediction. In this embodiment, an ARIMA model is used to construct the time series prediction model. The software and hardware system of the reinforcement server is strongly influenced by external factors, and external inputs can change system parameter states considerably, which reduces the prediction accuracy of the time series prediction model; the model is therefore updated automatically by means of online model updating. The specific workflow of the fault prediction module is shown in FIG. 4, and the steps are as follows:
S41, acquiring from the database the historical time series data of the system parameter states related to the fault mode tree;
S42, checking whether the time series data of each parameter is stationary; if the result is non-stationary, going to step S43; if the result is stationary, going to step S46;
S43, applying differencing to the non-stationary data;
S44, checking the stationarity of the processed data; if the result is stationary, going to step S45; if the result is non-stationary, returning to step S43;
S45, updating the parameter state prediction model according to the differenced time series data;
S46, predicting the parameter state over a future period using the parameter state prediction model, with the parameter time series data as the data basis;
S47, performing fault analysis and diagnosis on the future state of the system according to the predicted future parameter states and the fault mode tree, to obtain the future running condition of the system.
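The control flow of steps S42 to S47 can be sketched with the standard library alone. This is a simplified stand-in, not the embodiment's ARIMA model: an AR(1) fit on the differenced series replaces ARIMA, and a crude comparison of half-window means replaces a proper stationarity test such as a unit-root test. All function names, the tolerance, the cap on differencing rounds and the example limit are assumptions, and the history is assumed to be long enough to split in half after differencing.

```python
from statistics import mean, pstdev


def looks_stationary(series, tol=0.5):
    """Crude S42/S44 stand-in: the two halves have similar means."""
    half = len(series) // 2
    drift = abs(mean(series[:half]) - mean(series[half:]))
    return drift <= tol * pstdev(series)


def difference(series):
    """S43: first-order differencing."""
    return [b - a for a, b in zip(series, series[1:])]


def fit_ar1(series):
    """S45: least-squares fit of x_t = c + phi * x_(t-1); returns (c, phi)."""
    x, y = series[:-1], series[1:]
    mx, my = mean(x), mean(y)
    var = sum((a - mx) ** 2 for a in x)
    if var == 0:
        return my, 0.0  # constant series: predict its mean
    phi = sum((a - mx) * (b - my) for a, b in zip(x, y)) / var
    return my - phi * mx, phi


def forecast(history, steps, max_d=2):
    """S42-S46: difference until stationary, refit, predict, integrate back."""
    levels = [list(history)]
    while len(levels) <= max_d and not looks_stationary(levels[-1]):
        levels.append(difference(levels[-1]))
    c, phi = fit_ar1(levels[-1])  # refit on every call (online update)
    preds, last = [], levels[-1][-1]
    for _ in range(steps):
        last = c + phi * last
        preds.append(last)
    for lower in reversed(levels[:-1]):  # undo each round of differencing
        base, restored = lower[-1], []
        for p in preds:
            base += p
            restored.append(base)
        preds = restored
    return preds


def first_breach(preds, upper_limit):
    """S47: 1-based step of the first forecast above the limit, else None."""
    return next((k for k, v in enumerate(preds, start=1) if v > upper_limit), None)


history = [10 + 2 * t for t in range(20)]   # steadily rising parameter
preds = forecast(history, steps=3)
breach_step = first_breach(preds, upper_limit=53.0)
```

Refitting inside forecast on each call is one simple way to realize online updating: whenever new samples are appended to the history, the next call uses a model estimated from the latest data. In practice the threshold check in first_breach would be replaced by the fault mode tree lookup and two-step diagnosis applied to the predicted states.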
The man-machine interaction interface mainly displays related information such as real-time state information, fault and alarm information, health degree and fault prediction result of the system, and provides configuration function of a system fault mode tree, system history state query function and the like.
It can be seen that the invention mainly adopts the following technical means:
1) Multi-source data acquisition method: the method collects out-of-band information of the software and hardware systems of the reinforcement server through the IPMI protocol and collects in-band information by parsing operating system kernel runtime files, realizing multi-source data acquisition for the software and hardware systems of the server;
2) Fault diagnosis method based on the fault mode tree and quantitative analysis: the method builds a fault mode tree of the server software and hardware system from typical server faults and constructs a quantitative fault diagnosis method on it; at the same time, a two-step method consisting of probabilistic judgment and continuity judgment is used to prevent false fault alarms caused by instantaneous system abnormalities, ensuring the accuracy of fault diagnosis;
3) Multi-level health assessment method based on multivariate characteristics: the method evaluates the running health of the system according to the multivariate characteristics of the software and hardware system of the reinforcement server, carrying the assessment from the system parameter state to the system component state and finally to the system state; a parameter-level, component-level and system-level health assessment framework is constructed, realizing multi-level system health assessment;
4) Fault prediction method based on time series prediction: the method uses a time series prediction model to predict the future state of the relevant parameters of the software and hardware systems of the reinforcement server, and performs fault analysis and diagnosis on the prediction results, realizing system fault prediction. In addition, online model updating is used to counter the loss of prediction accuracy caused by large disturbances from external inputs, improving the accuracy of fault prediction.
The fault prediction and health management system and method for a reinforcement server take into account both the out-of-band information and the in-band information of the software and hardware systems of the reinforcement server. They realize real-time monitoring of the running condition of the software and hardware systems, provide functions such as automatic diagnosis of abnormal states, system health assessment and system fault prediction, and provide the necessary technical means for the health management of the reinforcement server. They realize autonomous diagnosis and autonomous guarantee of the software and hardware systems of the reinforcement server, move the maintenance of reinforcement server equipment from "post maintenance" to condition-based maintenance and from "regular maintenance" to maintenance based on running state, and so steer the maintenance and guarantee mode of reinforcement server equipment in a more effective direction.
In summary, the invention combines multi-source data acquisition based on IPMI and operating system kernel analysis, quantitative fault diagnosis based on a fault mode tree, health assessment based on multivariate system characteristics, and data-driven fault prediction to sense the state of the software and hardware systems of the reinforcement server, monitor the running state of the equipment, diagnose and locate the fault type behind an abnormal system state, evaluate the running health of the system, and predict the occurrence of faults. This realizes autonomous diagnosis and autonomous guarantee of the system, greatly improves operation and maintenance efficiency, and improves the safety and reliability of the system.
The foregoing is merely a preferred embodiment of the present invention, and it should be noted that modifications and variations could be made by those skilled in the art without departing from the technical principles of the present invention, and such modifications and variations should also be regarded as being within the scope of the invention.

Claims (10)

1. A fault prediction and health management system for a reinforcement server, the system comprising: a data acquisition module, a data storage module, a fault mode tree module, a real-time monitoring module, a fault diagnosis module, a health evaluation module, a fault prediction module and a human-computer interaction interface;
the data acquisition module acquires multi-source data of the reinforcement server system and provides data sources for other modules of the system, and the acquired data is divided into out-of-band information and in-band information of the system;
the data storage module is used for storing the collected system multi-source data into a database for system state backup and providing historical data of system parameter states for the fault prediction module;
the fault mode tree module is used for defining fault types of software and hardware systems of the reinforcement server, relevant parameters of the fault types and parameter fault thresholds, and constructing a fault mode tree by analyzing typical fault information of the reinforcement server;
the real-time monitoring module monitors each parameter of the system in real time against the parameter state thresholds in the fault mode tree; when a parameter state exceeds its threshold, the system is regarded as abnormal, and the real-time monitoring module passes the abnormal parameter type to the fault diagnosis module, activating it to perform autonomous fault diagnosis;
the fault diagnosis module analyzes the collected abnormal parameter state data and uses a two-step judgment method, consisting of a probabilistic judgment and a continuity judgment, to diagnose whether the abnormal state is a transient abnormal alarm or a fault;
the health evaluation module constructs parameter-level, component-level and system-level health evaluation models, using the multi-element state information of the system as the data basis, to reflect the current health state of the system; the parameter-level health evaluation model evaluates the running state of a single parameter and constructs a nonlinear evaluation function based on how far the parameter state deviates from the fault critical state; the component-level health evaluation model evaluates the health degree of each component of the system, constructing a health evaluation function for each component by aggregating the health degrees of the component's parameters with a weighting method; likewise, the system-level health evaluation model constructs a health evaluation function of the system by aggregating the health degrees of all components with a weighting method;
the fault prediction module predicts the future state of each parameter with a data-driven time series prediction model, based on the historical time series data of the system parameter states, and performs fault diagnosis on the predicted future states according to the fault mode tree to complete fault prediction; the fault prediction module uses an ARIMA model to construct the time series prediction model and updates the model automatically through online model updating;
the human-computer interaction interface displays the real-time state information, the fault and alarm information, and the health degree and fault prediction results of the system, and also provides a configuration function for the system fault mode tree and a query function for the historical state of the system.
2. The system for predicting and managing faults of a reinforcement server according to claim 1, wherein the data acquisition module and the data storage module are located in a data layer, the fault mode tree module and the real-time monitoring module are located in a monitoring layer, the fault diagnosis module, the health evaluation module and the fault prediction module are located in an application layer, and the human-computer interaction interface is located in an interaction layer.
3. The system for predicting and managing faults of a reinforcement server according to claim 1, wherein the data acquisition module acquires out-of-band information of the system by using the IPMI protocol, the acquired data comprising at least system power supply information and hardware temperature information; and the data acquisition module acquires in-band information of the system by reading kernel files of the operating system, the acquired data comprising at least running state information of the CPU, memory, disk and network components.
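As an illustration of the in-band acquisition path in claim 3, the following minimal sketch parses memory statistics from kernel-exported text. The patent does not name the kernel files it reads, so the use of the Linux `/proc/meminfo` format is an assumption, and the function names `parse_meminfo` and `memory_usage_percent` are hypothetical. The parser takes the file content as a string so that it can be exercised without access to a real `/proc` filesystem.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field name: integer value}.

    Assumption: lines look like 'MemTotal:  16000000 kB' (Linux format).
    Lines that do not match this shape are skipped.
    """
    values: dict[str, int] = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # no colon: not a key/value line
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def memory_usage_percent(meminfo: dict[str, int]) -> float:
    """Percentage of memory in use, derived from MemTotal and MemAvailable."""
    total = meminfo["MemTotal"]
    available = meminfo["MemAvailable"]
    return 100.0 * (total - available) / total


# Hypothetical sample content; on a Linux host the text could instead be
# read with open("/proc/meminfo").read().
SAMPLE = """MemTotal:       16000000 kB
MemFree:         2000000 kB
MemAvailable:    4000000 kB
"""

usage = memory_usage_percent(parse_meminfo(SAMPLE))
print(usage)  # 75.0
```

The out-of-band path (power supply and temperature readings over IPMI) is not shown, since it depends on the baseboard management controller and tooling available on the target server.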
4. The system for predicting and managing faults of a reinforcement server according to claim 1, wherein the fault mode tree classifies faults according to the component in which the fault occurs, and the component-level fault classes include CPU faults, memory faults, disk faults, network faults and hardware faults, wherein the hardware faults comprise system component voltage faults and lifetime component faults.
5. The fault prediction and health management system for a reinforcement server according to claim 4, wherein the fault mode tree is modified in response to subsequent additions or modifications of server components and applications and to fault mode updates, the modifications including setting parameter thresholds and modifying, adding or deleting fault modes.
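One possible way to represent the fault mode tree of claims 4 and 5, together with the threshold check performed by the real-time monitoring module, is sketched below. The nesting (component, then fault mode, then parameter rules) follows the claims, but every name and threshold value here (`ParameterRule`, `cpu_overload`, `cpu_usage_percent`, 90.0 and so on) is a hypothetical placeholder rather than a value taken from the patent.

```python
from dataclasses import dataclass


@dataclass
class ParameterRule:
    """One monitored parameter and its single-sided fault threshold."""
    name: str
    threshold: float
    increasing: bool = True  # True: abnormal above threshold; False: below

    def is_abnormal(self, value: float) -> bool:
        if self.increasing:
            return value > self.threshold
        return value < self.threshold


# component -> fault mode -> rules (all entries are illustrative assumptions)
FAULT_MODE_TREE = {
    "CPU": {
        "cpu_overload": [ParameterRule("cpu_usage_percent", 90.0)],
    },
    "memory": {
        "memory_exhaustion": [
            ParameterRule("mem_available_percent", 10.0, increasing=False),
        ],
    },
}


def find_abnormal(tree, readings):
    """Return (component, fault mode, parameter) for each reading past its threshold."""
    hits = []
    for component, modes in tree.items():
        for mode, rules in modes.items():
            for rule in rules:
                if rule.name in readings and rule.is_abnormal(readings[rule.name]):
                    hits.append((component, mode, rule.name))
    return hits


def set_threshold(tree, component, mode, parameter, threshold):
    """Reconfigure one parameter threshold, as allowed by claim 5."""
    for rule in tree[component][mode]:
        if rule.name == parameter:
            rule.threshold = threshold
            return
    raise KeyError(parameter)


readings = {"cpu_usage_percent": 95.0, "mem_available_percent": 40.0}
print(find_abnormal(FAULT_MODE_TREE, readings))
```

In this sketch the tuples returned by `find_abnormal` play the role of the abnormal parameter type that the monitoring module hands to the fault diagnosis module.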
6. The system for predicting and managing faults of a reinforcement server according to any one of claims 1 to 5, wherein, to handle false fault alarms caused by transient abnormalities in system parameter states, the abnormal parameter state data are diagnosed with a two-step judgment method consisting of a probabilistic judgment and a continuity judgment; the abnormal state is identified as a parameter fault only when both the probabilistic judgment and the continuity judgment indicate a fault, in which case fault information is output according to the fault mode tree, and otherwise only alarm information is output; the probabilistic judgment checks whether, within a set time, the percentage of time points at which the parameter is abnormal, relative to the total time, exceeds a probability threshold; the continuity judgment, which is carried out on the basis of the probabilistic judgment, determines whether the parameter abnormal state occurs continuously, a fault being judged to occur when it does, by checking whether, within the specified time, the percentage of the longest run of consecutive abnormal time points, relative to the total abnormal time of the parameter, exceeds a continuity threshold.
7. The fault prediction and health management system for a reinforcement server according to claim 6, wherein the workflow of the fault diagnosis module comprises:
S21, acquiring parameter state data for N time points according to the abnormal parameter, and the fault mode to which it belongs, fed back by the real-time monitoring module;
S22, analyzing and diagnosing the acquired parameter state data with the probabilistic judgment; when the result is negative, outputting only parameter abnormal state alarm information, and when the result is positive, triggering the continuity judgment;
S23, analyzing and diagnosing the acquired parameter state data with the continuity judgment; when the result is negative, outputting only parameter abnormal state alarm information, and when the result is positive, judging the parameter abnormal state to be a fault and outputting the fault mode information of the abnormal parameter.
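The two-step judgment of claims 6 and 7 can be sketched as follows, under the assumption that the N sampled time points have already been reduced to a list of booleans (True meaning the parameter was past its threshold at that point). The function names and the threshold values used in the example calls are hypothetical; the patent does not give concrete thresholds.

```python
def longest_run(flags):
    """Length of the longest run of consecutive True values."""
    best = current = 0
    for flag in flags:
        current = current + 1 if flag else 0
        best = max(best, current)
    return best


def diagnose(flags, prob_threshold, cont_threshold):
    """Return 'fault' only if both judgments are positive, else 'alarm'.

    Probabilistic judgment: abnormal points / all points > prob_threshold.
    Continuity judgment: longest abnormal run / abnormal points > cont_threshold.
    """
    if not flags:
        raise ValueError("at least one time point is required")
    abnormal = sum(flags)
    # Step S22: probabilistic judgment
    if abnormal == 0 or abnormal / len(flags) <= prob_threshold:
        return "alarm"
    # Step S23: continuity judgment, only reached when S22 is positive
    if longest_run(flags) / abnormal <= cont_threshold:
        return "alarm"
    return "fault"


sustained = [False, True, True, True, True, False, True, False, False, False]
scattered = [True, False, True, False, True, False, True, False, True, False]
spike = [False] * 9 + [True]

print(diagnose(sustained, 0.4, 0.6))  # fault
print(diagnose(scattered, 0.4, 0.6))  # alarm
print(diagnose(spike, 0.4, 0.6))      # alarm
```

The `scattered` case shows why the second step matters: half of the points are abnormal, so the probabilistic judgment is positive, but no two abnormal points are adjacent, so the state is reported as an alarm rather than a fault.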
8. The fault prediction and health management system for a reinforcement server according to claim 6, wherein
the parameter-level health evaluation model builds a health evaluation function from the degree of deviation between the current running state of the parameter and its critical abnormal state threshold; because the health state changes more rapidly as the parameter state approaches the critical abnormal state, the parameter-level health evaluation function is constructed in a nonlinear form; parameters are divided into a constant type and a percentage type according to the type of state value, into a single-sided threshold type and a double-sided threshold type according to the interval over which the state threshold is taken, and into an increasing type and a decreasing type according to the relation between the state value and the change in health degree; a corresponding health evaluation function is constructed for each type of parameter; because the parameters of the computing system fall into five types, namely decreasing single-threshold-interval percentage, increasing single-threshold-interval percentage, decreasing single-threshold-interval constant, increasing single-threshold-interval constant and increasing double-threshold interval, the health evaluation functions are designed as follows:
the software and hardware system, components and parameters of the reinforced computer are defined as follows: the system consists of n components, and each component has m related parameters, so that the state of the j-th parameter of the i-th component is expressed as $x_{ij}$, where $i \in [1, n]$ and $j \in [1, m]$; for the parameter-level health evaluation functions, a different evaluation function needs to be designed for each parameter type;
1) Single-sided threshold percentage decreasing parameter
For parameter $x_{ij}$, the health evaluation function is as follows:
[formula not reproduced]
wherein $\theta_{ij}$ is the critical abnormal state threshold of parameter $x_{ij}$;
2) Single-sided threshold percentage increasing parameter
For parameter $x_{ij}$, the health evaluation function is as follows:
[formula not reproduced]
wherein $\theta_{ij}$ is the critical abnormal state threshold of parameter $x_{ij}$;
3) Single-sided threshold constant decreasing parameter
For parameter $x_{ij}$, the health evaluation function is as follows:
[formula not reproduced]
wherein $\theta_{ij}$ is the critical abnormal state threshold of parameter $x_{ij}$;
4) Single-sided threshold constant increasing parameter
For parameter $x_{ij}$, the health evaluation function is as follows:
[formula not reproduced]
wherein $\theta_{ij}$ is the critical abnormal state threshold of parameter $x_{ij}$;
5) Double-sided threshold parameter
For parameter $x_{ij}$, the health evaluation function is as follows:
[formula not reproduced]
wherein [symbol not reproduced] is the standard value of the parameter, and $\theta_{ij}$ is the fault critical value of the deviation between the parameter state value and the parameter standard value;
for the component health evaluation model: for the i-th component, the health evaluation function is expressed as:
$z_i = C_i \cdot Y_i$
wherein $C_i = [c_{i1} \dots c_{im}]$ is the weight matrix of the parameters, which represents the degree of influence of each parameter on the health degree of the component; the weight of each parameter is determined manually from expert knowledge and prior knowledge, and is configured and adjusted according to the actual use effect; $Y_i = [y_{i1} \dots y_{im}]^T$ is the health degree matrix of the parameters related to the component;
for the system health evaluation model: the system health evaluation function is expressed as:
$u = W \cdot Z$
wherein $W = [w_1 \dots w_n]$ is the weight matrix of the components, which represents the degree of influence of each component on the health degree of the system; the weight of each component is determined manually from expert knowledge and prior knowledge, and is configured and adjusted according to the actual use effect; $Z = [z_1 \dots z_n]^T$ is the health degree matrix of the components of the system.
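The component-level and system-level aggregation in claim 8 is a pair of weighted sums, which the following sketch computes for a hypothetical system with n = 2 components and m = 2 parameters each. The weights and parameter health degrees are made-up values. The sketch also assumes that health degrees lie in [0, 1] and that each weight vector sums to 1, which the claim does not state explicitly.

```python
def weighted_health(weights, healths):
    """Row vector times column vector: sum of weight * health."""
    if len(weights) != len(healths):
        raise ValueError("weights and healths must have the same length")
    return sum(w * h for w, h in zip(weights, healths))


# C_i: per-component parameter weights (hypothetical values)
component_weights = [[0.5, 0.5], [0.25, 0.75]]
# Y_i: per-component parameter health degrees (hypothetical values)
parameter_health = [[1.0, 0.5], [0.5, 1.0]]

# z_i = C_i . Y_i for each component
Z = [weighted_health(c, y) for c, y in zip(component_weights, parameter_health)]

# u = W . Z for the whole system
system_weights = [0.5, 0.5]
u = weighted_health(system_weights, Z)

print(Z)  # [0.75, 0.875]
print(u)  # 0.8125
```

Because both levels use the same operation, a single helper covers the parameter-to-component step and the component-to-system step; only the weight vectors differ.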
9. The fault prediction and health management system for a reinforcement server according to claim 8, wherein the workflow of the health evaluation module is:
S31, evaluating the health degree of each parameter with the parameter health evaluation function corresponding to the type of the system parameter;
S32, aggregating the health degrees of the parameters by system component, and evaluating the health degree of each component with the component health evaluation function;
S33, aggregating the health degrees of the components of the system, and evaluating the overall health degree of the system with the system health evaluation function.
10. The fault prediction and health management system for a reinforcement server according to claim 8, wherein the workflow of the fault prediction module comprises:
S41, acquiring from the database the historical time series data of the system parameter states related to the fault mode tree;
S42, checking whether the time series data of each parameter are stationary; if the result is non-stationary, proceeding to step S43, and if the result is stationary, proceeding to step S46;
S43, applying differencing to the non-stationary data;
S44, checking the stationarity of the processed data; if the result is stationary, proceeding to step S45, and if the result is non-stationary, returning to step S43;
S45, updating the parameter state prediction model according to the differenced time series data;
S46, predicting the parameter state at future times with the parameter state prediction model, using the parameter time series data as the data basis;
S47, performing fault analysis and diagnosis on the future state of the system according to the predicted future parameter states and the fault mode tree, to obtain the future operating condition of the system.
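The control flow of steps S42 to S47 (difference until stationary, predict, then check the prediction against a threshold) can be illustrated with the deliberately simplified sketch below. It is not an ARIMA implementation: the stationarity check is a crude heuristic that compares the means of the two halves of the series, not a formal statistical test, and the "model" is just the mean of the differenced series, recomputed from the latest data on every call as a stand-in for the online model update. All function names, the `tolerance` and `max_diff` parameters, and the example threshold are assumptions.

```python
from statistics import mean, pstdev


def looks_stationary(series, tolerance=0.5):
    """Heuristic only: the two half-means differ by at most tolerance * spread."""
    spread = pstdev(series)
    if spread == 0:
        return True  # a constant series has no trend
    half = len(series) // 2
    return abs(mean(series[:half]) - mean(series[half:])) <= tolerance * spread


def difference(series):
    """First-order differencing (step S43)."""
    return [b - a for a, b in zip(series, series[1:])]


def forecast(series, steps, max_diff=2):
    """Difference until the heuristic passes (S42-S44), predict, integrate back (S46)."""
    levels = [list(series)]
    while (not looks_stationary(levels[-1])
           and len(levels) <= max_diff
           and len(levels[-1]) > 2):
        levels.append(difference(levels[-1]))
    # Placeholder model (S45): predict the mean of the most-differenced series.
    preds = [mean(levels[-1])] * steps
    # Undo each differencing pass by cumulative summation from the last value.
    for lower in reversed(levels[:-1]):
        last = lower[-1]
        restored = []
        for p in preds:
            last += p
            restored.append(last)
        preds = restored
    return preds


def predicted_violations(preds, threshold):
    """Step S47 stand-in: 1-based future steps whose prediction exceeds the threshold."""
    return [step for step, value in enumerate(preds, 1) if value > threshold]


history = [10, 12, 14, 16, 18, 20]      # hypothetical parameter history
future = forecast(history, steps=3)
print(future)                            # [22, 24, 26]
print(predicted_violations(future, 25))  # [3]
```

A production version would replace `looks_stationary` with a proper unit-root test and the mean-based placeholder with a fitted ARIMA model, while keeping the same loop structure.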
CN202310698766.0A 2023-06-13 2023-06-13 Fault prediction and health management system for reinforcement server Pending CN116755964A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310698766.0A CN116755964A (en) 2023-06-13 2023-06-13 Fault prediction and health management system for reinforcement server

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310698766.0A CN116755964A (en) 2023-06-13 2023-06-13 Fault prediction and health management system for reinforcement server

Publications (1)

Publication Number Publication Date
CN116755964A true CN116755964A (en) 2023-09-15

Family

ID=87954701

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310698766.0A Pending CN116755964A (en) 2023-06-13 2023-06-13 Fault prediction and health management system for reinforcement server

Country Status (1)

Country Link
CN (1) CN116755964A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117196591A (en) * 2023-11-07 2023-12-08 成都理工大学 Equipment failure mode prediction and residual life prediction coupling system and method
CN117196591B (en) * 2023-11-07 2024-02-09 成都理工大学 Equipment failure mode prediction and residual life prediction coupling system and method
CN117482443A (en) * 2023-12-31 2024-02-02 常州博瑞电力自动化设备有限公司 Fire-fighting equipment health monitoring method and system
CN117482443B (en) * 2023-12-31 2024-03-29 常州博瑞电力自动化设备有限公司 Fire-fighting equipment health monitoring method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination