CN109460356B

CN109460356B - Data fusion method for software fault prediction

Info

Publication number: CN109460356B
Application number: CN201811218891.2A
Authority: CN
Inventors: 王丽; 方建勇; 张贝贝; 杨召; 李吟; 姜婷婷; 张峻玮; 陈善浩; 王凯; 陈丹阳
Original assignee: 716th Research Institute of CSIC
Current assignee: 716th Research Institute of CSIC
Priority date: 2018-10-19
Filing date: 2018-10-19
Publication date: 2021-12-28
Anticipated expiration: 2038-10-19
Also published as: CN109460356A

Abstract

The invention discloses a data fusion method for software fault prediction. The method first judges the fusion degree of the two software fault prediction data sets to be fused according to their inherent data characteristics: it extracts the features of the fault data and judges the consistency of the data sets using D-S evidence theory. It then determines, by means of a set threshold, whether the data fusion condition is met, thereby realizing fault prediction data fusion. The invention makes it possible to fuse different kinds of software fault prediction data for fault prediction, such as data from system joint debugging, third-party software testing, system operation, and the historical data of similar software. The method can enlarge the sample size of the software fault prediction data, shorten its acquisition time, and improve the precision of software fault prediction.

Description

Data fusion method for software fault prediction

Technical Field

The invention relates to a data fusion technology, in particular to a data fusion method for software fault prediction.

Background

To ensure the accuracy of software failure prediction, a sufficiently long failure data acquisition time and a sufficiently large quantity of prediction sample data are required.

A large amount of fault data can be collected over the whole life cycle of the software. Such data generally carries no time labels, but it shares data characteristics, such as defect type, severity level, root cause, occurrence position and trigger condition, with the fault prediction data collected while the system is running. If some method could judge whether the two types of data have the same failure cause, so that data with the same failure cause is fused and de-duplicated while data with different failure causes is merged into the existing fault prediction data, the aims of enlarging the sample size and shortening the data acquisition time could be achieved. No method in the prior art solves this problem.

Disclosure of Invention

The invention aims to provide a data fusion method for software failure prediction.

The technical scheme for realizing the purpose of the invention is as follows: a data fusion method for software failure prediction comprises the following steps:

step 1, defining an identification framework and an identification feature set of software failure prediction data, and introducing a D-S evidence theory into a judgment process of consistency of the software failure prediction data set;

step 2, based on the distribution rule of the software fault prediction data set on the identification features, converting the problem of determining the credibility function on each identification feature into an F test problem of sample variance in mathematical statistics, and performing evidence synthesis calculation according to the synthesis rule of the D-S evidence theory;

step 3, setting a corresponding threshold according to the field of the software system, and fusing the data set to be fused into the software failure prediction data set when the fusion degree reaches or exceeds the threshold, thereby realizing the software failure prediction data fusion.
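The threshold decision of step 3 can be sketched as follows. This is a minimal illustration only: the domain names and the 0.99 value are hypothetical, and only the 0.95 threshold appears in the embodiment described later.

```python
# Hypothetical per-domain thresholds. Only 0.95 is taken from the embodiment;
# the domain names and the 0.99 value are illustrative assumptions.
DOMAIN_THRESHOLDS = {
    "default": 0.95,
    "safety_critical": 0.99,
}


def meets_fusion_condition(fusion_degree, domain="default"):
    """Return True when the fusion degree reaches or exceeds the domain threshold."""
    if not 0.0 <= fusion_degree <= 1.0:
        raise ValueError("fusion degree must lie in [0, 1]")
    threshold = DOMAIN_THRESHOLDS.get(domain, DOMAIN_THRESHOLDS["default"])
    return fusion_degree >= threshold
```

A caller would compute the fusion degree with the evidence synthesis of step 2 and pass it to `meets_fusion_condition` together with the domain of the software system.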

Compared with the prior art, the invention has the following remarkable advantages:

(1) the invention can enrich the sample size of the software failure prediction data on the premise of not prolonging the acquisition time of the failure prediction data when the software runs;

(2) the invention can add time labels to the data that is collected in the full life cycle stages of the software and fused into the software failure prediction data set, thereby effectively supporting software failure prediction;

(3) the invention can shorten the data acquisition time of software fault prediction and improve the efficiency of software fault prediction on the premise of equal prediction precision.

The present invention is described in further detail below with reference to the accompanying drawings.

Drawings

FIG. 1 is a flow chart of the data fusion method for software failure prediction according to the present invention.

FIG. 2 is a flow chart of an implementation of the software failure prediction data fusion method based on the D-S evidence theory.

Detailed Description

On the basis of a study of the principles of D-S evidence theory and of the feature distribution rules of software fault prediction data, the invention defines the features of software fault prediction data as the identification features of D-S evidence theory, and defines the significance level of the sample variance test as the credibility function on a single identification feature, so that D-S evidence theory can be applied to judge whether data sets can be fused. Then, by slicing the software fault prediction data and extracting and comparing the features of the test cases, time labels are added to the fault prediction data that comes from the full life cycle stages of the software and is fused into the software fault prediction data set.

As shown in fig. 1, a data fusion method for software failure prediction includes the following steps:

step 1, constructing an identification framework and an identification feature set of software fault prediction data, and performing consistency judgment on the software fault prediction data set by using a D-S evidence theory; the specific method comprises the following steps:

step 1-1, defining the identification framework Θ = {fusible, non-fusible}, with focal elements A = {fusible} and B = {non-fusible};

step 1-2, defining the identification feature set, whose elements are the data features M_i;

step 1-3, letting the basic credibility function derived from data feature M_i be m_i(·), where the fusible probability under m_i is m_i(A) = a_i, the non-fusible probability under m_i is m_i(B) = b_i, and the probability of the other cases is c_i = 1 - a_i - b_i.
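The basic credibility function of step 1-3 can be represented as a small value type. The sketch below is an assumption about one convenient representation, and the class name `BasicCredibility` is hypothetical; it stores a_i and b_i and derives c_i = 1 - a_i - b_i.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BasicCredibility:
    """Basic credibility function m_i on the frame {fusible, non-fusible}."""

    a: float  # m_i(A): probability assigned to "fusible"
    b: float  # m_i(B): probability assigned to "non-fusible"

    def __post_init__(self):
        # The two masses must be non-negative and must not exceed 1 in total.
        if self.a < 0 or self.b < 0 or self.a + self.b > 1 + 1e-12:
            raise ValueError("a and b must be non-negative with a + b <= 1")

    @property
    def c(self):
        """m_i(Theta): probability left for the other cases."""
        return 1.0 - self.a - self.b
```

For example, `BasicCredibility(0.7, 0.1)` leaves a remaining mass `c` of approximately 0.2.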

Step 2, based on the distribution rule of the software fault prediction data set on the identification features, converting the reliability function determination problem of each identification feature into an F test problem of sample variance in mathematical statistics, and performing evidence synthesis calculation according to a synthesis rule of a D-S evidence theory to obtain the fusion degree of the software fault prediction data set; the method specifically comprises the following steps:

step 2-1, for the i-th data feature M_i, assume that the data X_i = {X_i1, X_i2, ..., X_im} and Y_i = {Y_i1, Y_i2, ..., Y_in} are two independent samples from the same population and that the population follows a normal distribution. If this assumption holds, the two groups of data should have the same variance, which is used to determine the credibility function on a single feature. From mathematical statistics, when the samples X and Y both follow normal distributions, their sample variances $S_X^2$ and $S_Y^2$ satisfy the F distribution, i.e.

$$\frac{S_X^2 / \sigma_X^2}{S_Y^2 / \sigma_Y^2} \sim F(m-1,\ n-1)$$

where m and n are the sample sizes of X and Y, respectively. The ratio $S_X^2 / S_Y^2$ of the sample variances is used to determine the probability r_i that the population variances $\sigma_X^2$ and $\sigma_Y^2$ of X and Y are equal, i.e. the significance level of the F test. The fusible credibility function of the samples X and Y on feature M_i is then: m_i(A) = a_i = r_i;

step 2-2, constructing the synthesis rule for the case in which the evidences do not conflict: in this case the D-S combination formula is applied directly, pairwise, to obtain:

k＝m_i(A)m_j(B)+m_i(B)m_j(A)

m_ij(A)＝[m_i(A)m_j(Θ)+m_i(Θ)m_j(A)+m_i(A)m_j(A)]/(1-k)

m_ij(B)＝[m_i(B)m_j(Θ)+m_i(Θ)m_j(B)+m_i(B)m_j(B)]/(1-k)

m_ij(Θ)＝[m_i(Θ)m_j(Θ)]/(1-k)

in the formulas, m_i(Θ) = c_i; m_ij(A), m_ij(B) and m_ij(Θ) are the probabilities with which the identification features m_i and m_j jointly support A, B and the other cases, respectively;

substituting a_i, b_i and c_i and simplifying gives:

k = a_i b_j + b_i a_j

a_ij = (a_i c_j + c_i a_j + a_i a_j)/(1-k)

b_ij = (b_i c_j + c_i b_j + b_i b_j)/(1-k)

c_ij = c_i c_j/(1-k)

wherein i ≥ 1 and k ≥ j;

this gives the calculation formula of the joint basic credibility function m_ij:

m_ij(A) = a_ij; m_ij(B) = b_ij; m_ij(Θ) = c_ij;

the basic credibility function m₀ under the action of all data features is obtained by applying this pairwise calculation to the m_i in sequence, and

a₀ = m₀(A)

gives the fusibility degree a₀ of X and Y inferred from all the data features;
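The pairwise synthesis of step 2-2 can be sketched as follows. Each evidence is written as a tuple (a_i, b_i, c_i); the function names are hypothetical, and the sequential combination uses a left fold, which is one reading of "two-by-two calculation in sequence".

```python
from functools import reduce


def combine(mi, mj):
    """Combine two evidences (a, b, c) with the D-S formulas of step 2-2."""
    ai, bi, ci = mi
    aj, bj, cj = mj
    k = ai * bj + bi * aj  # conflict coefficient
    if k >= 1.0:
        raise ValueError("evidences are in total conflict")
    norm = 1.0 - k
    a = (ai * cj + ci * aj + ai * aj) / norm
    b = (bi * cj + ci * bj + bi * bj) / norm
    c = (ci * cj) / norm
    return (a, b, c)


def fusion_degree(evidences):
    """Fold all feature-level evidences pairwise and return a0 = m0(A)."""
    return reduce(combine, evidences)[0]
```

For the two evidences (0.6, 0.1, 0.3) and (0.7, 0.1, 0.2), the conflict coefficient is k = 0.13 and the combined fusible probability is 0.75/0.87, about 0.862.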

step 2-3, constructing the synthesis rule for the case of evidence conflict: when the evidences conflict, they are processed with a weighted average evidence combination method, as follows:

① calculating the distance d between evidences: each evidence m_i is regarded as a vector, i.e.

m_i = [m_i(A), m_i(B), m_i(Θ)]

and the distance d_ij between evidences m_i and m_j is defined from these vectors, where <m_i, m_j> denotes the inner product of the vectors and ||m_i||² = <m_i, m_i>. The pairwise distances between the evidences form the distance matrix DM;

② calculating the pairwise similarity of the evidences: the similarity between evidences is defined as S_ij = 1 - d_ij, which gives the similarity matrix SM;

③ calculating the credibility of the evidences: the degree of support of the other evidences for m_i is determined, and the credibility C(m_i) of evidence m_i is expressed in terms of it;

④ calculating the weighted average of the evidences: taking the credibility C(m_i) as the weight of m_i, the weighted average evidence m is obtained, and the degree of fusibility of X and Y is

a₀ = m(A).
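The weighted average combination of step 2-3 can be sketched as follows. The formulas for the distance, the support degree and the credibility did not survive in the text, so the forms used here are assumptions taken from the commonly used weighted-average evidence method: d_ij = sqrt((||m_i||² + ||m_j||² - 2<m_i, m_j>)/2), support as the sum of similarities to the other evidences, and credibility as normalized support. The function names are hypothetical.

```python
import math


def evidence_distance(mi, mj):
    """Assumed distance: sqrt(0.5 * (|mi|^2 + |mj|^2 - 2 <mi, mj>))."""
    dot = sum(p * q for p, q in zip(mi, mj))
    norm_i = sum(p * p for p in mi)
    norm_j = sum(q * q for q in mj)
    return math.sqrt(max(0.0, 0.5 * (norm_i + norm_j - 2.0 * dot)))


def weighted_average_evidence(evidences):
    """Return (averaged evidence, weights) for a list of (a, b, c) tuples."""
    n = len(evidences)
    # Similarity matrix SM with S_ij = 1 - d_ij.
    sim = [[1.0 - evidence_distance(evidences[i], evidences[j])
            for j in range(n)] for i in range(n)]
    # Assumed support degree: sum of similarities to all other evidences.
    support = [sum(sim[i][j] for j in range(n) if j != i) for i in range(n)]
    total = sum(support)
    # Assumed credibility C(m_i): normalized support (uniform if all are zero).
    weights = [s / total for s in support] if total > 0 else [1.0 / n] * n
    averaged = tuple(
        sum(w * e[t] for w, e in zip(weights, evidences)) for t in range(3)
    )
    return averaged, weights
```

With the evidences (0.9, 0.05, 0.05), (0.85, 0.1, 0.05) and (0.1, 0.85, 0.05), the third evidence conflicts with the other two and receives the smallest weight, so the averaged fusible probability lies above the plain arithmetic mean of the three values. The sketch stops at a₀ = m(A) of the weighted average, as the text does.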

The present invention will be described in detail with reference to examples.

Examples

With reference to fig. 2, the invention discloses a software failure prediction data fusion method based on a D-S evidence theory, which comprises the following steps:

the first step is as follows: defining an identification framework and an identification feature set of software fault prediction data, and introducing a D-S evidence theory into a judgment process of consistency of the software fault prediction data set;

In engineering practice, software failure prediction data is generally described by attributes such as defect type and root cause. By analyzing the attributes shared by general software failure prediction data and reliability test software failure prediction data, 4 typical attributes are identified: M1 (defect type), M2 (root cause), M3 (severity level) and M4 (occurrence location). If two software failure prediction data items are consistent on these 4 attribute descriptions, they are considered to be the same software failure prediction data and to be fusible. Therefore, M1 (defect type), M2 (root cause), M3 (severity level) and M4 (occurrence location) are defined as the identification features of the software failure prediction data, and the basic credibility function derived from data feature M_i is defined as m_i, where m_i(A) = a_i, m_i(B) = b_i and c_i = 1 - a_i - b_i.

The second step: based on the distribution rule of the software fault prediction data set on the identification features, converting the problem of determining the credibility function on each identification feature into an F test problem of sample variance in mathematical statistics, and performing evidence synthesis calculation according to the synthesis rule of the D-S evidence theory;

According to statistical analysis of a large number of software failure prediction data samples, the distribution of software failure prediction data on a specific identification feature approximately follows a normal distribution. Both the general software failure prediction data set X and the reliability software failure prediction data set Y are samples from the population Z. If the sample sizes of X and Y are large enough, their variances must be equal or approximately equal. In general, however, the sample sizes are not large, so unless the data in the two samples are completely or approximately consistent, their variances will differ. To ensure that enough software failure prediction data inconsistent with the data in Y can be selected from the general software failure prediction data set X, the threshold of the fusibility degree is set to 0.95.

First, the credibility values of the samples X and Y on each identification feature are calculated with the credibility function described above, giving m₁(A) = r₁, m₂(A) = r₂, m₃(A) = r₃ and m₄(A) = r₄;

Secondly, a₀ is obtained either from the D-S synthesis calculation formula or by the weighted average evidence combination method;

Thirdly, a threshold is set according to the field of the software system. When the degree of fusion a₀ reaches or exceeds the threshold, the data in the software failure prediction data set X and in the software failure prediction data set Y are considered approximately consistent, and the software failure prediction data in X do not help to expand the sample size; otherwise, the data in X can be added to Y, which facilitates subsequent reliability testing and evaluation.
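As an illustration of the synthesis in the second sub-step, consider a special case that is assumed here and not stated in the text: every feature assigns no mass to B, so that b_i = 0 and c_i = 1 - r_i. The conflict coefficient k then vanishes and the pairwise synthesis formulas of step 2-2 reduce to a product form:

```latex
\begin{aligned}
k &= a_i b_j + b_i a_j = 0, \\
a_{ij} &= a_i c_j + c_i a_j + a_i a_j = 1 - (1 - a_i)(1 - a_j), \\
a_0 &= 1 - \prod_{i=1}^{4} (1 - r_i).
\end{aligned}
```

With the purely illustrative values r₁ = r₂ = r₃ = r₄ = 0.6, this gives a₀ = 1 - 0.4⁴ = 0.9744, which exceeds the 0.95 threshold of the embodiment.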

The third step: the data set to be fused is fused into the software failure prediction data set, so that the software failure prediction data fusion is realized, and the purposes of expanding the sample capacity of the software failure prediction data, shortening the acquisition time of the software failure prediction data and improving the software failure prediction precision are achieved.

Claims

1. A data fusion method for software failure prediction is characterized by comprising the following steps:

step 1, constructing an identification framework and an identification feature set of software fault prediction data, and performing consistency judgment on the software fault prediction data set by using a D-S evidence theory;

step 2, based on the distribution rule of the software fault prediction data set on the identification features, converting the reliability function determination problem of each identification feature into an F test problem of sample variance in mathematical statistics, and performing evidence synthesis calculation according to a synthesis rule of a D-S evidence theory to obtain the fusion degree of the software fault prediction data set;

2. The data fusion method for software failure prediction according to claim 1, wherein step 1 specifically comprises:

step 1-2, defining the identification feature set, whose elements are the data features M_i;

step 1-3, letting the basic credibility function derived from the data features be m_i(·), where the fusible probability under m_i(·) is m_i(A) = a_i, the non-fusible probability under m_i(·) is m_i(B) = b_i, and the probability of the other cases is c_i = 1 - a_i - b_i, with i denoting the ordinal index.

3. The data fusion method for software failure prediction according to claim 2, wherein the step 2 is to convert the reliability function determination problem of each recognition feature into an F-test problem of sample variance in mathematical statistics based on the distribution rule of the software failure prediction data set on the recognition features, and perform evidence synthesis calculation according to the synthesis rule of the D-S evidence theory, specifically:

step 2-1, for the i-th basic credibility function m_i(·), assume that the data X_i = {X_i1, X_i2, ..., X_ip} and Y_i = {Y_i1, Y_i2, ..., Y_iq} are two independent samples from the same population and that the population follows a normal distribution; if the assumption holds, the two groups of data should have the same variance, which is used to determine the credibility function on a single feature; as is known from mathematical statistics, when the samples X_i and Y_i both follow normal distributions, the ratio of their sample variances follows the F distribution with (p - 1, q - 1) degrees of freedom, where p and q are the sample sizes of X_i and Y_i, respectively; then the sample variance

And