CN112445632A

CN112445632A - HPC reliability evaluation method based on fault data modeling

Info

Publication number: CN112445632A
Application number: CN201910831168.XA
Authority: CN
Inventors: 刘睿涛; 钱宇; 龚道永; 李伟东; 宋长明; 张宏宇; 刘沙
Original assignee: Wuxi Jiangnan Computing Technology Institute
Current assignee: Wuxi Jiangnan Computing Technology Institute
Priority date: 2019-09-04
Filing date: 2019-09-04
Publication date: 2021-03-05

Abstract

The invention discloses an HPC reliability evaluation method based on fault data modeling, which comprises the following steps: acquiring fault data of all fault units of a target system; classifying the collected fault data by fault severity level into serious fault data and non-serious fault data; screening out the non-serious fault data, which are not related to failure; selecting a time interval, taking all serious fault data in the time interval as the sampling sample, and calculating the Weibull distribution parameters by maximum likelihood estimation to obtain a failure time distribution model of the target system in the time interval; calculating the mean time to failure (MTTF) of the target system in the time interval, namely the expected value of the Weibull distribution; and evaluating how the reliability of the target system changes according to the change in its MTTF across different time intervals. The invention truly reflects the reliability of the system in operation, allows the reliability level of the system in different time periods to be analyzed online, and guides the fault tolerance and daily operation and maintenance of the system.

Description

HPC reliability evaluation method based on fault data modeling

Technical Field

The invention relates to an HPC reliability evaluation method based on fault data modeling, and belongs to the technical field of computers.

Background

Online analysis of HPC reliability is an important guide for evaluating the operation and maintenance level of a system, monitoring whether the system is running well, and evaluating its reliability and availability. Currently, reliability evaluation of supercomputers usually adopts an exponential distribution model, but data analysis shows that the exponential model does not suit the increasingly complex fault characteristics of supercomputers.

Disclosure of Invention

The invention aims to provide an HPC reliability evaluation method based on fault data modeling, which can truly reflect the reliability index of a system in operation, can analyze the reliability level of the system in different time periods on line and guide the fault tolerance and daily operation and maintenance of the system.

In order to achieve the purpose, the invention adopts the following technical scheme: an HPC reliability evaluation method based on fault data modeling, comprising the following steps:

S1, acquiring fault data of all fault units of the target system;

S2, classifying the fault data collected in S1 by fault severity level into serious fault data and non-serious fault data;

S3, screening out the non-serious fault data obtained in S2, which are not related to failure, and keeping the serious fault data, wherein each element of the serious fault data is a 2-dimensional vector (F_i, T_i), where F_i indicates the type of the serious fault and T_i indicates the time at which the serious fault occurred;

S4, taking the serious fault data obtained in S3 as the total sample, selecting a time interval, taking all the serious fault data in the time interval as the sampling sample, arranging the elements of the sampling sample in ascending order of the serious fault occurrence time component (T_i), taking the absolute value of the difference between the time components of adjacent sample elements as a serious fault interval time sample, and, with the serious fault interval time samples as input data, calculating the Weibull distribution parameters by maximum likelihood estimation to obtain a failure time distribution model of the target system in the time interval, wherein the density function of the Weibull distribution is as follows:

$$f(t) = \frac{m}{\eta}\left(\frac{t}{\eta}\right)^{m-1}\exp\left[-\left(\frac{t}{\eta}\right)^{m}\right], \quad t \ge 0$$

wherein m is the shape parameter and η is the characteristic lifetime;

S5, calculating the mean time to failure (MTTF) of the target system in the time interval from the Weibull distribution parameters obtained in S4, that is, the expected value of the Weibull distribution:

$$E(T) = \eta\,\Gamma\left(1 + \frac{1}{m}\right)$$

wherein Γ(x) is the Gamma function;

S6, evaluating how the reliability of the target system changes according to the change in its mean time to failure (MTTF) across different time intervals, wherein the MTTF in a given time interval represents the reliability of the HPC system in that interval: the higher the MTTF, the higher the reliability, and vice versa.
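Steps S4 and S5 can be sketched in a few lines of Python. This is a minimal illustration rather than the patented implementation: the function names are hypothetical, timestamps are assumed to be numbers in a common unit (for example hours), zero-length gaps are dropped because the logarithm is undefined there, and the shape parameter is found by bisection on the standard Weibull likelihood equation.

```python
import math

def interarrival_times(fault_times):
    """Sort serious-fault timestamps and return the gaps between neighbours."""
    ordered = sorted(fault_times)
    return [abs(b - a) for a, b in zip(ordered, ordered[1:])]

def fit_weibull_mle(samples, lo=1e-3, hi=50.0, iters=200):
    """Maximum likelihood estimate of (m, eta) from positive samples.

    Solves  sum(t^m ln t) / sum(t^m) - 1/m - mean(ln t) = 0  for m by
    bisection (the left side increases with m), then uses
    eta = (mean(t^m))^(1/m).  Weights are scaled by the largest sample
    so that large timestamps cannot overflow.
    """
    samples = [t for t in samples if t > 0]  # assumption: ignore zero gaps
    if len(samples) < 2:
        raise ValueError("need at least two positive interval samples")
    logs = [math.log(t) for t in samples]
    mean_log = sum(logs) / len(logs)
    max_log = max(logs)

    def weights(m):
        return [math.exp(m * (l - max_log)) for l in logs]

    def g(m):
        w = weights(m)
        weighted = sum(wi * li for wi, li in zip(w, logs)) / sum(w)
        return weighted - 1.0 / m - mean_log

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            hi = mid
        else:
            lo = mid
    m = 0.5 * (lo + hi)
    w = weights(m)
    eta = math.exp(max_log) * (sum(w) / len(w)) ** (1.0 / m)
    return m, eta

def mttf(m, eta):
    """Expected value of Weibull(m, eta): eta * Gamma(1 + 1/m)."""
    return eta * math.gamma(1.0 + 1.0 / m)
```

Given a list of serious-fault times for one interval, `mttf(*fit_weibull_mle(interarrival_times(times)))` yields the MTTF for that interval in the same unit as the timestamps.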

Further improvements to the above technical scheme are as follows:

1. In the above scheme, in S4, the time period over which the reliability of the system needs to be measured is taken as the selected time interval.

2. In the above scheme, in S4, a time period with typical characteristics within the period over which the reliability of the system needs to be measured is taken as the selected time interval.

3. In the above scheme, the typical time period is a time interval in which a parallel application runs, a system operation and maintenance cycle, a whole month, a whole quarter, or a whole year.

Due to the application of the technical scheme, compared with the prior art, the invention has the following advantages:

The HPC reliability evaluation method based on fault data modeling is built on real HPC fault data. Faults that affect reliability are screened out of the large volume of fault data through fault classification, which optimizes the data analysis and truly reflects the reliability of the system in operation. The reliability level of the system in different periods can be analyzed online to guide system fault tolerance and daily operation and maintenance, and changes in system reliability across periods are evaluated by comparing the expected values of the lifetime distribution in those periods.

Drawings

FIG. 1 is a schematic diagram of a fitting analysis of a plurality of life distributions with actual fault data;

FIG. 2 is a flow chart of the HPC reliability assessment method based on fault data modeling according to the present invention.

Detailed Description

Example: an HPC reliability evaluation method based on fault data modeling comprises the following steps:

S1, acquiring fault data of all fault units of the target system;

S4, taking the serious fault data obtained in S3 as the total sample, selecting a time interval, taking all the serious fault data in the time interval as the sampling sample, arranging the elements of the sampling sample in ascending order of the serious fault occurrence time component (T_i), taking the absolute value of the difference between the time components of adjacent sample elements as a serious fault interval time sample, and, with the serious fault interval time samples as input data, calculating the Weibull distribution parameters by maximum likelihood estimation to obtain a failure time distribution model of the target system in the time interval, wherein the density function of the Weibull distribution is as follows:

$$f(t) = \frac{m}{\eta}\left(\frac{t}{\eta}\right)^{m-1}\exp\left[-\left(\frac{t}{\eta}\right)^{m}\right], \quad t \ge 0$$

wherein m is the shape parameter and η is the characteristic lifetime;

$$E(T) = \eta\,\Gamma\left(1 + \frac{1}{m}\right)$$

wherein Γ(x) is the Gamma function;

Higher reliability indicates a higher level of system operation and maintenance and higher reliability of the system components that affect it. The MTTF values of different time intervals can be plotted as a reliability change curve (the abscissa is the time interval and the ordinate is the corresponding MTTF), which displays the reliability change characteristics of the system visually, reveals the interval that is the reliability bottleneck, and provides clues for deeper analysis of the sources of faults.
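The reliability change curve can be illustrated with a short sketch. The interval labels and the fitted (m, η) pairs below are made-up placeholders, not measurements from any real system, and the function names are hypothetical; the point is only to show how the per-interval MTTF values are compared and the lowest one singled out.

```python
import math

def mttf(m, eta):
    """Expected value of Weibull(m, eta): eta * Gamma(1 + 1/m)."""
    return eta * math.gamma(1.0 + 1.0 / m)

def reliability_curve(fitted):
    """Map each interval label to its MTTF, keeping the input order."""
    return {label: mttf(m, eta) for label, (m, eta) in fitted.items()}

def bottleneck_interval(curve):
    """Return the label of the interval with the lowest MTTF."""
    return min(curve, key=curve.get)

# Hypothetical fitted parameters per quarter (eta in hours).
fitted = {
    "2019-Q1": (1.0, 40.0),
    "2019-Q2": (0.5, 10.0),
    "2019-Q3": (1.0, 60.0),
}
curve = reliability_curve(fitted)
worst = bottleneck_interval(curve)
```

Plotting the labels of `curve` on the abscissa and its values on the ordinate gives the curve described above; `worst` names the interval that deserves a closer look.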

In S4, the time period over which the reliability of the system needs to be measured is taken as the selected time interval.

In S4, a time period with typical characteristics within the period over which the reliability of the system needs to be measured is taken as the selected time interval.

The above-mentioned typical time periods are time intervals in which parallel applications are run.

The examples are further explained below:

In general, typical lifetime distribution models include the exponential distribution (T ~ E(λ)), the lognormal distribution (T ~ LN(μ, σ²)), the Weibull distribution (T ~ W(m, η)), and the gamma distribution (T ~ Γ(α, λ)).

These mathematical models are used as candidates; different time intervals are selected as needed, and the failure time characteristics of each fault entity are compared across multiple spatial dimensions. The parameters of each distribution are fitted by maximum likelihood estimation so that the fitted distribution approaches the actual cumulative failure distribution data as closely as possible. To test how well each distribution fits the actual data, the models are evaluated with the Kolmogorov-Smirnov test, which yields a P value as the goodness-of-fit measure. The lower the P value, the worse the fit; the higher, the better. Typically, the P value must exceed the threshold of 0.05 for the distribution to be considered consistent with the actual data.
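The goodness-of-fit check can be sketched as follows for the Weibull candidate. This is an illustrative approximation, not the procedure used in the patent: the function names are hypothetical, and the P value comes from the asymptotic Kolmogorov series with a commonly used small-sample correction. Note also that when the parameters were estimated from the same samples, the resulting P value tends to be optimistic.

```python
import math

def weibull_cdf(t, m, eta):
    """Cumulative distribution F(t) = 1 - exp(-(t/eta)^m) for t > 0."""
    return 1.0 - math.exp(-((t / eta) ** m)) if t > 0 else 0.0

def ks_statistic(samples, cdf):
    """Largest gap between the empirical CDF and the model CDF."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

def ks_p_value(d, n, terms=100):
    """Approximate P value: 2 * sum_{k>=1} (-1)^(k-1) exp(-2 k^2 lam^2)."""
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.2:
        return 1.0  # the series is numerically indistinguishable from 1 here
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))

def fits_weibull(samples, m, eta, threshold=0.05):
    """True when the P value exceeds the threshold (0.05 in the text)."""
    d = ks_statistic(samples, lambda t: weibull_cdf(t, m, eta))
    return ks_p_value(d, len(samples)) > threshold
```

The same `ks_statistic` and `ks_p_value` helpers work for the exponential, lognormal, or gamma candidates once the corresponding CDF is supplied.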

According to actual supercomputer failure data, the fault interval time distributions of the CPU nodes, the operation plug-in boards, and the complete machine can all be described by the Weibull distribution.

The Weibull distribution can therefore be chosen as the distribution of the system's failure times. Taking the operation plug-in board, the main component of the host system, as the basic unit, failure distribution analysis of the whole system can be carried out, and with it the reliability analysis of the host system. The time between failures of the system follows the Weibull(m, η) distribution, whose expected value is

$$E(T) = \eta\,\Gamma\left(1 + \frac{1}{m}\right)$$

where Γ(x) is the Gamma function and E(T) is the mean time between failures of the system. Accordingly, the (hardware) reliability (MTTF) of the whole system can be calculated; the reliability of the system differs from stage to stage.
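A few closed-form cases help in reading the expected value, assuming the standard two-parameter Weibull form with shape m and characteristic lifetime η, for which E(T) = η Γ(1 + 1/m). They are worked illustrations only and do not come from the measured data.

```latex
\begin{aligned}
m = 1:&\quad E(T) = \eta\,\Gamma(2) = \eta
  &&\text{(the exponential model is the special case } m = 1\text{)}\\
m = \tfrac{1}{2}:&\quad E(T) = \eta\,\Gamma(3) = 2\eta\\
m = 2:&\quad E(T) = \eta\,\Gamma\!\left(\tfrac{3}{2}\right)
  = \tfrac{\sqrt{\pi}}{2}\,\eta \approx 0.886\,\eta
\end{aligned}
```

The first case shows why the Weibull model generalizes the exponential model mentioned in the Background: with m = 1 the two coincide, and a fitted m different from 1 indicates a failure rate that changes over time.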

Faults are classified as non-serious or serious according to their severity and how they are handled. A non-serious fault is an abnormal state that does not cause system failure or that can be corrected by hardware. A serious fault is an abnormal state that immediately results in system failure or that requires fault-tolerance intervention by the software system. For reliability analysis, non-serious faults are filtered out and only the serious faults that lead to system failure are retained.
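The classification rule can be expressed as a small filter. The record layout and field names below are assumptions made for illustration; the source only specifies that each retained serious fault is a pair (F_i, T_i) of fault type and occurrence time.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FaultRecord:
    fault_type: str                 # F_i, the fault type
    time: float                     # T_i, occurrence time (e.g. hours)
    causes_system_failure: bool     # immediately results in system failure
    needs_software_tolerance: bool  # software fault tolerance must intervene

def is_serious(record):
    """Serious: causes system failure or needs software fault tolerance."""
    return record.causes_system_failure or record.needs_software_tolerance

def serious_faults(records):
    """Drop non-serious faults and return the (F_i, T_i) pairs that remain."""
    return [(r.fault_type, r.time) for r in records if is_serious(r)]

# Hypothetical log entries for illustration only.
records = [
    FaultRecord("node_down", 12.5, True, False),
    FaultRecord("ecc_corrected", 13.0, False, False),  # corrected by hardware
    FaultRecord("link_error", 20.0, False, True),
]
kept = serious_faults(records)
```

Only the first and third records survive the filter; the hardware-corrected error is treated as non-serious and discarded.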

The main flow of the reliability evaluation method is as follows: collecting fault data of the target system; classifying the faults by fault severity level into serious faults and non-serious faults; screening out the non-serious faults, which are not related to failure; selecting a time interval, taking all serious faults in the time interval as the sample, and arranging the sample elements in ascending order of fault occurrence time; taking the absolute value of the difference between the time components of adjacent sample elements as a serious fault interval time sample; and calculating the Weibull distribution parameters by maximum likelihood estimation with the fault interval time samples as input data.
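For reference, the maximum likelihood step can be written out explicitly. The derivation below is the standard one for the two-parameter Weibull density with shape m and characteristic lifetime η; the symbols t_1, ..., t_n for the fault interval time samples are notation introduced here.

```latex
\begin{aligned}
\ln L(m,\eta) &= n\ln m - n m\ln\eta + (m-1)\sum_{i=1}^{n}\ln t_i
                 - \sum_{i=1}^{n}\left(\frac{t_i}{\eta}\right)^{m},\\[4pt]
\frac{\partial \ln L}{\partial \eta} = 0
  &\;\Longrightarrow\; \eta^{m} = \frac{1}{n}\sum_{i=1}^{n} t_i^{m},\\[4pt]
\frac{\partial \ln L}{\partial m} = 0
  &\;\Longrightarrow\;
  \frac{\sum_{i=1}^{n} t_i^{m}\ln t_i}{\sum_{i=1}^{n} t_i^{m}}
  - \frac{1}{m} - \frac{1}{n}\sum_{i=1}^{n}\ln t_i = 0 .
\end{aligned}
```

The last equation contains only m and has no closed-form solution, so it is solved numerically (for example by bisection or Newton iteration); η then follows from the second line.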

The method comprehensively collects faults from all fault units of the supercomputer and applies preprocessing such as fault classification and fault grading to screen out the required fault data. Comprehensive failure analysis is then performed over several different time intervals of system operation, and the failure time distribution model of the supercomputer in each interval is found by fitting analysis of the system's real failure data (goodness-of-fit testing against the candidate failure models). On the basis of the failure distribution model, the mean time to failure (MTTF) of the system in each period, i.e., the mathematical expectation of the model, is calculated. The reliability level of the HPC system at different stages can thus be evaluated, which in turn guides system management and fault-tolerance optimization.

When the HPC reliability evaluation method based on fault data modeling is adopted, faults that affect reliability are screened out of the large volume of real HPC fault data through fault classification, which optimizes the data analysis and truly reflects the reliability of the system in operation. The reliability level of the system in different periods can be analyzed online to guide system fault tolerance and daily operation and maintenance, and changes in system reliability across periods are evaluated by comparing the expected values of the lifetime distribution in those periods.

The above embodiments are merely illustrative of the technical ideas and features of the present invention, and the purpose thereof is to enable those skilled in the art to understand the contents of the present invention and implement the present invention, and not to limit the protection scope of the present invention. All equivalent changes and modifications made according to the spirit of the present invention should be covered within the protection scope of the present invention.

Claims

1. An HPC reliability evaluation method based on fault data modeling, characterized in that the method comprises the following steps:

S1, acquiring fault data of all fault units of the target system;

S4, taking the serious fault data obtained in S3 as the total sample, selecting a time interval, taking all the serious fault data in the time interval as the sampling sample, arranging the elements of the sampling sample in ascending order of the serious fault occurrence time component (T_i), taking the absolute value of the difference between the time components of adjacent sample elements as a serious fault interval time sample, and, with the serious fault interval time samples as input data, calculating the Weibull distribution parameters by maximum likelihood estimation to obtain a failure time distribution model of the target system in the time interval, wherein the density function of the Weibull distribution is as follows:

$$f(t) = \frac{m}{\eta}\left(\frac{t}{\eta}\right)^{m-1}\exp\left[-\left(\frac{t}{\eta}\right)^{m}\right], \quad t \ge 0$$

wherein m is the shape parameter and η is the characteristic lifetime;

$$E(T) = \eta\,\Gamma\left(1 + \frac{1}{m}\right)$$

wherein Γ(x) is the Gamma function;

2. The HPC reliability evaluation method based on fault data modeling according to claim 1, characterized in that: in S4, the time period over which the reliability of the system needs to be measured is taken as the selected time interval.

3. The HPC reliability evaluation method based on fault data modeling according to claim 1, characterized in that: in S4, a time period with typical characteristics within the period over which the reliability of the system needs to be measured is taken as the selected time interval.

4. The HPC reliability evaluation method based on fault data modeling according to claim 3, characterized in that: the typical time period is a time interval in which a parallel application runs, a system operation and maintenance cycle, a whole month, a whole quarter, or a whole year.