CN112445632A - HPC reliability evaluation method based on fault data modeling - Google Patents

Info

Publication number
CN112445632A
Authority
CN
China
Prior art keywords
fault data
time
fault
reliability
time interval
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201910831168.XA
Other languages
Chinese (zh)
Inventor
刘睿涛
钱宇
龚道永
李伟东
宋长明
张宏宇
刘沙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuxi Jiangnan Computing Technology Institute
Original Assignee
Wuxi Jiangnan Computing Technology Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuxi Jiangnan Computing Technology Institute
Priority to CN201910831168.XA
Publication of CN112445632A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/008 Reliability or availability analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Test And Diagnosis Of Digital Computers (AREA)

Abstract

The invention discloses an HPC reliability evaluation method based on fault data modeling, which comprises the following steps: acquiring fault data of all fault units of a target system; classifying the collected fault data by fault severity level into serious fault data and non-serious fault data; filtering out the non-serious fault data, which are not related to failure; selecting a time interval, taking all serious fault data in the time interval as the sampling sample, and calculating the Weibull distribution parameters by maximum likelihood estimation to obtain a failure time distribution model of the target system in the time interval; calculating the mean time to failure (MTTF) of the target system in the time interval, namely the expected value of the Weibull distribution; and evaluating the reliability change characteristics of the target system according to the change of its MTTF across different time intervals. The invention can truly reflect the reliability index of a system in operation, can analyze the reliability level of the system in different time periods online, and can guide the fault tolerance and daily operation and maintenance of the system.

Description

HPC reliability evaluation method based on fault data modeling
Technical Field
The invention relates to an HPC reliability evaluation method based on fault data modeling, and belongs to the technical field of computers.
Background
Online analysis of HPC reliability is an important guide for evaluating the operation and maintenance level of a system, monitoring whether the system is running well, and evaluating the reliability and availability of the system. At present, reliability evaluation of supercomputers usually adopts an exponential distribution model, but data analysis shows that the exponential distribution model does not suit the increasingly complex fault characteristics of supercomputers.
Disclosure of Invention
The invention aims to provide an HPC reliability evaluation method based on fault data modeling, which can truly reflect the reliability index of a system in operation, can analyze the reliability level of the system in different time periods on line and guide the fault tolerance and daily operation and maintenance of the system.
To achieve this purpose, the invention adopts the following technical scheme: an HPC reliability evaluation method based on fault data modeling, comprising the following steps:
S1, acquiring fault data of all fault units of the target system;
S2, classifying the fault data collected in S1 by fault severity level into serious fault data and non-serious fault data;
S3, filtering out the non-serious fault data obtained in S2, which are not related to failure, and keeping the serious fault data, wherein each element of the serious fault data is a 2-dimensional vector $(F_i, T_i)$, where $F_i$ indicates the type of the serious fault and $T_i$ indicates the time at which the serious fault occurred;
S4, taking the serious fault data obtained in S3 as the total sample, selecting a time interval, taking all the serious fault data in the time interval as the sampling sample, arranging the elements of the sampling sample by the serious fault occurrence time component $T_i$ from smallest to largest, taking the absolute value of the difference between the time components of adjacent sample elements as a serious fault interval time sample, and, with the serious fault interval time samples as input data, calculating the Weibull distribution parameters by maximum likelihood estimation to obtain a failure time distribution model of the target system in the time interval, wherein the density function of the Weibull distribution is:
$$f(t) = \frac{m}{\eta}\left(\frac{t}{\eta}\right)^{m-1} \exp\left[-\left(\frac{t}{\eta}\right)^{m}\right], \quad t \ge 0$$
wherein m is the shape parameter and η is the characteristic lifetime;
S5, calculating the mean time to failure (MTTF) of the target system over the time interval from the Weibull distribution parameters obtained in S4, that is, the expected value of the Weibull distribution:
$$\mathrm{MTTF} = E(T) = \eta\,\Gamma\left(1 + \frac{1}{m}\right)$$
wherein Γ(x) is the Gamma function;
and S6, evaluating the reliability change characteristics of the target system according to the change of its mean time to failure (MTTF) across different time intervals, wherein the MTTF of a time interval represents the reliability of the HPC system in that interval: the higher the MTTF, the higher the reliability, and the lower the MTTF, the lower the reliability.
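The maximum likelihood step in S4 can be made concrete as follows. This is the standard textbook result for the two-parameter Weibull distribution with shape m and characteristic lifetime η, given as a sketch and not reproduced from the original figures; the symbols $t_1, \dots, t_n$ (the serious fault interval time samples) and $\ell$ (the log-likelihood) are notation introduced here.

```latex
\ell(m,\eta) = n\ln m - n\,m\ln\eta + (m-1)\sum_{j=1}^{n}\ln t_j
               - \sum_{j=1}^{n}\left(\frac{t_j}{\eta}\right)^{m}

% Setting the partial derivative with respect to \eta to zero:
\hat{\eta}^{\,\hat{m}} = \frac{1}{n}\sum_{j=1}^{n} t_j^{\hat{m}}

% Substituting into the partial derivative with respect to m gives one equation in m:
\frac{\sum_{j=1}^{n} t_j^{\hat{m}}\ln t_j}{\sum_{j=1}^{n} t_j^{\hat{m}}}
  - \frac{1}{\hat{m}} - \frac{1}{n}\sum_{j=1}^{n}\ln t_j = 0
```

The last equation has no closed-form solution and is normally solved numerically for the shape estimate, after which the characteristic lifetime follows from the second line.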
Further improvements to the above technical scheme are as follows:
1. In the above scheme, in S4, the time period over which the reliability of the system needs to be measured is taken as the selected time interval.
2. In the above scheme, in S4, a time period with typical characteristics, within the period over which the reliability of the system needs to be measured, is taken as the selected time interval.
3. In the above scheme, the typical time period is a time interval in which a parallel application runs, a system operation and maintenance cycle, a whole month, a whole quarter or a whole year.
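As an illustration of the calendar-based choices (whole month, whole quarter, whole year), the sketch below groups fault timestamps by period so that each group can be treated as one selected time interval. The function names `period_key` and `group_by_period` and the label formats are assumptions made for this example, not part of the patent.

```python
from collections import defaultdict
from datetime import datetime

def period_key(ts, granularity):
    """Return a label for the calendar period that contains timestamp ts."""
    if granularity == "year":
        return f"{ts.year}"
    if granularity == "quarter":
        return f"{ts.year}-Q{(ts.month - 1) // 3 + 1}"
    if granularity == "month":
        return f"{ts.year}-{ts.month:02d}"
    raise ValueError(f"unknown granularity: {granularity}")

def group_by_period(timestamps, granularity):
    """Group fault occurrence times into whole months, quarters or years."""
    groups = defaultdict(list)
    for ts in timestamps:
        groups[period_key(ts, granularity)].append(ts)
    return dict(groups)

# Hypothetical fault occurrence times, for illustration only
faults = [datetime(2019, 9, 4), datetime(2019, 9, 30), datetime(2019, 10, 1)]
by_month = group_by_period(faults, "month")
```

Intervals tied to a parallel application run or a maintenance cycle would need explicit start and end times instead of calendar labels.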
Owing to the above technical scheme, the invention has the following advantages over the prior art:
The HPC reliability evaluation method based on fault data modeling works from real HPC fault data. Faults that affect reliability are selected from the large volume of fault data through fault classification, which optimizes the data analysis. The method can truly reflect the reliability index of the system in operation, can analyze the reliability level of the system in different periods online, and can guide system fault tolerance and daily operation and maintenance. Reliability changes of the system across periods are evaluated by comparing the expected values of the lifetime distribution in different periods.
Drawings
FIG. 1 is a schematic diagram of a fitting analysis of a plurality of life distributions with actual fault data;
FIG. 2 is a flow chart of the HPC reliability assessment method based on fault data modeling according to the present invention.
Detailed Description
Example: an HPC reliability evaluation method based on fault data modeling, comprising the following steps:
S1, acquiring fault data of all fault units of the target system;
S2, classifying the fault data collected in S1 by fault severity level into serious fault data and non-serious fault data;
S3, filtering out the non-serious fault data obtained in S2, which are not related to failure, and keeping the serious fault data, wherein each element of the serious fault data is a 2-dimensional vector $(F_i, T_i)$, where $F_i$ indicates the type of the serious fault and $T_i$ indicates the time at which the serious fault occurred;
S4, taking the serious fault data obtained in S3 as the total sample, selecting a time interval, taking all the serious fault data in the time interval as the sampling sample, arranging the elements of the sampling sample by the serious fault occurrence time component $T_i$ from smallest to largest, taking the absolute value of the difference between the time components of adjacent sample elements as a serious fault interval time sample, and, with the serious fault interval time samples as input data, calculating the Weibull distribution parameters by maximum likelihood estimation to obtain a failure time distribution model of the target system in the time interval, wherein the density function of the Weibull distribution is:
$$f(t) = \frac{m}{\eta}\left(\frac{t}{\eta}\right)^{m-1} \exp\left[-\left(\frac{t}{\eta}\right)^{m}\right], \quad t \ge 0$$
wherein m is the shape parameter and η is the characteristic lifetime;
S5, calculating the mean time to failure (MTTF) of the target system over the time interval from the Weibull distribution parameters obtained in S4, that is, the expected value of the Weibull distribution:
$$\mathrm{MTTF} = E(T) = \eta\,\Gamma\left(1 + \frac{1}{m}\right)$$
wherein Γ(x) is the Gamma function;
and S6, evaluating the reliability change characteristics of the target system according to the change of its mean time to failure (MTTF) across different time intervals, wherein the MTTF of a time interval represents the reliability of the HPC system in that interval: the higher the MTTF, the higher the reliability, and the lower the MTTF, the lower the reliability.
Higher reliability reflects a higher level of system operation and maintenance and higher reliability of the system components that affect it. The MTTF values of different time intervals can be plotted as a reliability change curve (abscissa: time interval; ordinate: corresponding MTTF), which displays the reliability change characteristics of the system visually, reveals the intervals in which reliability is a bottleneck, and provides clues for deeper analysis of the sources of faults.
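A minimal sketch of how such a curve might be tabulated and its bottleneck located, assuming the MTTF of each interval has already been computed. The function name `reliability_curve`, the interval labels and the MTTF values are all hypothetical and serve only as illustration.

```python
def reliability_curve(mttf_by_interval):
    """Return curve points in interval order, the bottleneck, and step changes."""
    # Abscissa: interval label; ordinate: MTTF of that interval
    points = sorted(mttf_by_interval.items())
    # The bottleneck interval is the one with the lowest MTTF
    bottleneck = min(points, key=lambda p: p[1])
    # Change in MTTF from each interval to the next one
    changes = [(b[0], b[1] - a[1]) for a, b in zip(points, points[1:])]
    return points, bottleneck, changes

# Hypothetical monthly MTTF values in hours, for illustration only
mttf_hours = {"2019-01": 120.0, "2019-02": 95.5, "2019-03": 140.0}
points, bottleneck, changes = reliability_curve(mttf_hours)
```

Sorting by label works here because the labels are zero-padded year-month strings; other labelling schemes would need an explicit ordering key.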
In S4, the time period over which the reliability of the system needs to be measured is taken as the selected time interval.
In S4, a time period with typical characteristics, within the period over which the reliability of the system needs to be measured, is taken as the selected time interval.
The typical time period is a time interval in which a parallel application runs.
The examples are further explained below:
In general, typical lifetime distribution models include the exponential distribution ($T \sim E(\lambda)$), the lognormal distribution ($T \sim LN(\mu, \sigma^2)$), the Weibull distribution ($T \sim W(m, \eta)$) and the gamma distribution ($T \sim \Gamma(\alpha, \lambda)$).
These mathematical models are selected, different time intervals are chosen as needed, and the failure time characteristics of each fault entity are compared across multiple spatial dimensions. The parameters of each distribution are fitted by maximum likelihood estimation so that the fitted distribution approximates the actual cumulative failure distribution data as closely as possible. To test how well each distribution fits the actual data, the models are evaluated with the Kolmogorov-Smirnov test, which yields a P value as the goodness-of-fit measure. The lower the P value, the worse the fit; the higher the P value, the better the fit. Typically, the P value must exceed the threshold of 0.05 for the distribution to be considered consistent with the actual data.
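The goodness-of-fit check can be sketched with the standard library alone. The code below computes the one-sample Kolmogorov-Smirnov statistic against a candidate cumulative distribution function and an approximate P value from the asymptotic Kolmogorov series; the small-sample correction factor is a commonly used approximation and an assumption here, as are the function names and the sample values. When the candidate's parameters were fitted on the same sample, the P value obtained this way tends to be optimistic.

```python
import math

def weibull_cdf(m, eta):
    """Return the Weibull CDF with shape m and characteristic lifetime eta."""
    return lambda t: 1.0 - math.exp(-((t / eta) ** m))

def ks_statistic(samples, cdf):
    """Largest gap between the empirical CDF and the candidate CDF."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i - 1) / n to i / n at x
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

def ks_p_value(d, n, terms=100):
    """Approximate P value from the asymptotic Kolmogorov distribution."""
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.2:
        return 1.0  # the tail probability equals 1 to machine precision here
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))

# Hypothetical fault interval samples in hours, for illustration only
intervals = [12.0, 30.5, 7.2, 55.0, 21.3, 40.8, 16.4, 9.9]
d = ks_statistic(intervals, weibull_cdf(1.3, 26.0))
p = ks_p_value(d, len(intervals))
accepted = p > 0.05  # threshold stated in the text
```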
According to the actual failure data of the supercomputer, the fault interval time distributions of the CPU node, the operation plug-in board and the complete machine can all be described by the Weibull distribution.
The Weibull distribution can therefore be chosen as the distribution of the failure times of the system. The failure distribution of the whole system is analyzed by taking the operation plug-in board, the main component of the host system, as the basic unit, so that the reliability of the host system can be analyzed. The time between failures of the system follows the Weibull distribution $W(m, \eta)$. The expected value of the Weibull distribution is:
$$E(T) = \eta\,\Gamma\left(1 + \frac{1}{m}\right)$$
wherein Γ(x) is the Gamma function and E(T) is the mean time between failures of the system. From this, the (hardware) reliability (MTTF) of the whole system can be calculated; the reliability of the system differs from stage to stage.
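The expected value can be derived in a few lines. The following is a standard derivation, included as a sketch, for the two-parameter Weibull density with shape m and characteristic lifetime η; the substitution variable u is notation introduced here.

```latex
E(T) = \int_{0}^{\infty} t \cdot \frac{m}{\eta}\left(\frac{t}{\eta}\right)^{m-1}
       \exp\left[-\left(\frac{t}{\eta}\right)^{m}\right] dt

% Substitute u = (t/\eta)^m, so that t = \eta\,u^{1/m} and
% du = \frac{m}{\eta}\left(\frac{t}{\eta}\right)^{m-1} dt:
E(T) = \eta \int_{0}^{\infty} u^{1/m} e^{-u}\, du
     = \eta\,\Gamma\left(1 + \frac{1}{m}\right)

% Special case m = 1 (exponential distribution): E(T) = \eta\,\Gamma(2) = \eta
```

The special case shows how the exponential model mentioned in the Background is recovered when the shape parameter equals 1.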
The fault classification method divides each fault into a non-serious fault or a serious fault according to its severity and the way it is handled. A non-serious fault is an abnormal state that cannot cause system failure or an abnormal state that can be corrected by hardware. A serious fault is an abnormal state that immediately results in system failure or an abnormal state that requires fault-tolerance intervention by the software system. For reliability analysis, non-serious faults are filtered out and only the serious faults that lead to system failure are retained.
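The classification rule can be sketched as a simple predicate. The record layout below, including the field names `causes_system_failure` and `needs_software_fault_tolerance`, is a hypothetical encoding chosen for this example; the text does not specify how fault records are stored. An abnormal state corrected by hardware is assumed to have both flags set to False.

```python
from dataclasses import dataclass

@dataclass
class FaultRecord:
    fault_type: str                        # F_i, the fault type
    time: float                            # T_i, the occurrence time
    causes_system_failure: bool            # immediately results in system failure
    needs_software_fault_tolerance: bool   # software must intervene

def is_serious(record):
    """Serious if it causes system failure or needs software fault tolerance."""
    return record.causes_system_failure or record.needs_software_fault_tolerance

def keep_serious(records):
    """Filter out non-serious faults and return (F_i, T_i) pairs."""
    return [(r.fault_type, r.time) for r in records if is_serious(r)]

# Hypothetical fault records, for illustration only
records = [
    FaultRecord("ecc_corrected", 3.0, False, False),
    FaultRecord("node_down", 10.0, True, False),
    FaultRecord("link_retry_exhausted", 25.0, False, True),
]
serious = keep_serious(records)
```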
The main flow of the reliability evaluation method is as follows: collect fault data of the target system; classify the faults by severity level into serious faults and non-serious faults; filter out the non-serious faults, which are not related to failure; select a time interval, take all serious faults in the time interval as the sample, and arrange the sample elements by fault occurrence time from smallest to largest; take the absolute value of the difference between the time components of adjacent sample elements as a serious fault interval time sample; and, with the fault interval time samples as input data, calculate the Weibull distribution parameters by maximum likelihood estimation.
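The sampling step of this flow (select the interval, sort by occurrence time, difference adjacent times) can be sketched as follows. The function name `fault_interval_samples` and the half-open interval convention `[start, end)` are assumptions made for this example.

```python
def fault_interval_samples(serious_faults, start, end):
    """Return serious fault interval time samples within [start, end).

    serious_faults is an iterable of (fault_type, time) pairs.
    """
    # Keep only faults inside the selected interval, ordered by occurrence time
    times = sorted(t for _, t in serious_faults if start <= t < end)
    # Absolute difference between the time components of adjacent elements
    return [abs(b - a) for a, b in zip(times, times[1:])]

# Hypothetical serious faults as (type, time in hours), for illustration only
faults = [("b", 5.0), ("a", 1.0), ("c", 12.0), ("d", 30.0)]
samples = fault_interval_samples(faults, 0.0, 20.0)
```

A window containing k serious faults yields k - 1 interval samples, so very short windows may not provide enough data for a stable fit.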
The method collects faults comprehensively from all fault units of the supercomputer and applies preprocessing measures, such as fault classification and fault grading, to select the required fault data. Comprehensive failure analysis is then performed over several different time intervals of system operation, and the failure time distribution model of the supercomputer in each time interval is found by fitting the real failure data of the system (goodness-of-fit testing against candidate failure models). On the basis of the failure distribution model, the mean time to failure (MTTF) of the system in each time period, i.e., the mathematical expectation of the model, is calculated. The reliability level of the HPC system at different stages can thus be evaluated, which can further guide system management and fault tolerance optimization.
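The fitting and MTTF steps might look like the sketch below, which uses only the standard library. It solves the Weibull likelihood equation for the shape m by bisection, then derives the characteristic lifetime η and the MTTF. The function names, the search bracket for m and the sample values are assumptions for illustration; a production analysis would more likely rely on an established statistics library.

```python
import math

def fit_weibull_mle(intervals, lo=0.05, hi=100.0, iters=200):
    """Estimate (m, eta) of a two-parameter Weibull by maximum likelihood."""
    ts = [float(t) for t in intervals if t > 0]  # zero gaps have no logarithm
    n = len(ts)
    if n < 2 or max(ts) == min(ts):
        raise ValueError("need at least two distinct positive interval samples")
    logs = [math.log(t) for t in ts]
    mean_log = sum(logs) / n
    t_max = max(ts)

    def weights(m):
        # (t / t_max) ** m stays in (0, 1], which avoids overflow
        return [(t / t_max) ** m for t in ts]

    def score(m):
        # Likelihood equation in m alone; its root is the shape estimate
        w = weights(m)
        return sum(wi * li for wi, li in zip(w, logs)) / sum(w) - 1.0 / m - mean_log

    if score(lo) > 0 or score(hi) < 0:
        raise ValueError("shape estimate lies outside the search bracket")
    for _ in range(iters):  # bisection works because score(m) increases with m
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            hi = mid
        else:
            lo = mid
    m = 0.5 * (lo + hi)
    eta = t_max * (sum(weights(m)) / n) ** (1.0 / m)
    return m, eta

def mttf(m, eta):
    """Expected value of the Weibull distribution: eta * Gamma(1 + 1/m)."""
    return eta * math.gamma(1.0 + 1.0 / m)

# Hypothetical fault interval samples in hours, for illustration only
intervals = [12.0, 30.5, 7.2, 55.0, 21.3, 40.8, 16.4]
m_hat, eta_hat = fit_weibull_mle(intervals)
mttf_hat = mttf(m_hat, eta_hat)
```

Repeating the fit for each selected time interval gives one MTTF per interval, which is the quantity compared across intervals in the evaluation step.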
The HPC reliability evaluation method based on fault data modeling works from real HPC fault data. Faults that affect reliability are selected from the large volume of fault data through fault classification, which optimizes the data analysis. The method can truly reflect the reliability index of the system in operation, can analyze the reliability level of the system in different periods online, and can guide the fault tolerance and daily operation and maintenance of the system. Reliability changes of the system across periods are evaluated by comparing the expected values of the lifetime distribution in different periods.
The above embodiments are merely illustrative of the technical ideas and features of the present invention, and the purpose thereof is to enable those skilled in the art to understand the contents of the present invention and implement the present invention, and not to limit the protection scope of the present invention. All equivalent changes and modifications made according to the spirit of the present invention should be covered within the protection scope of the present invention.

Claims (4)

1. An HPC reliability evaluation method based on fault data modeling, characterized in that the method comprises the following steps:
S1, acquiring fault data of all fault units of the target system;
S2, classifying the fault data collected in S1 by fault severity level into serious fault data and non-serious fault data;
S3, filtering out the non-serious fault data obtained in S2, which are not related to failure, and keeping the serious fault data, wherein each element of the serious fault data is a 2-dimensional vector $(F_i, T_i)$, where $F_i$ indicates the type of the serious fault and $T_i$ indicates the time at which the serious fault occurred;
S4, taking the serious fault data obtained in S3 as the total sample, selecting a time interval, taking all the serious fault data in the time interval as the sampling sample, arranging the elements of the sampling sample by the serious fault occurrence time component $T_i$ from smallest to largest, taking the absolute value of the difference between the time components of adjacent sample elements as a serious fault interval time sample, and, with the serious fault interval time samples as input data, calculating the Weibull distribution parameters by maximum likelihood estimation to obtain a failure time distribution model of the target system in the time interval, wherein the density function of the Weibull distribution is:
$$f(t) = \frac{m}{\eta}\left(\frac{t}{\eta}\right)^{m-1} \exp\left[-\left(\frac{t}{\eta}\right)^{m}\right], \quad t \ge 0$$
wherein m is the shape parameter and η is the characteristic lifetime;
S5, calculating the mean time to failure (MTTF) of the target system over the time interval from the Weibull distribution parameters obtained in S4, that is, the expected value of the Weibull distribution:
$$\mathrm{MTTF} = E(T) = \eta\,\Gamma\left(1 + \frac{1}{m}\right)$$
wherein Γ(x) is the Gamma function;
and S6, evaluating the reliability change characteristics of the target system according to the change of its mean time to failure (MTTF) across different time intervals, wherein the MTTF of a time interval represents the reliability of the HPC system in that interval: the higher the MTTF, the higher the reliability, and the lower the MTTF, the lower the reliability.
2. The HPC reliability evaluation method based on fault data modeling according to claim 1, characterized in that: in S4, the time period over which the reliability of the system needs to be measured is taken as the selected time interval.
3. The HPC reliability evaluation method based on fault data modeling according to claim 1, characterized in that: in S4, a time period with typical characteristics, within the period over which the reliability of the system needs to be measured, is taken as the selected time interval.
4. The HPC reliability evaluation method based on fault data modeling according to claim 3, characterized in that: the typical time period is a time interval in which a parallel application runs, a system operation and maintenance cycle, a whole month, a whole quarter or a whole year.
CN201910831168.XA 2019-09-04 2019-09-04 HPC reliability evaluation method based on fault data modeling Withdrawn CN112445632A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910831168.XA CN112445632A (en) 2019-09-04 2019-09-04 HPC reliability evaluation method based on fault data modeling

Publications (1)

Publication Number Publication Date
CN112445632A true CN112445632A (en) 2021-03-05

Family

ID=74734358

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910831168.XA Withdrawn CN112445632A (en) 2019-09-04 2019-09-04 HPC reliability evaluation method based on fault data modeling

Country Status (1)

Country Link
CN (1) CN112445632A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20210305)