CN105653450A

CN105653450A - Software defect data feature selection method based on combination of modified genetic algorithm and Adaboost

Info

Publication number: CN105653450A
Application number: CN201511003717.2A
Authority: CN
Inventors: 李克文; 邹晶杰
Original assignee: China University of Petroleum East China
Current assignee: China University of Petroleum East China
Priority date: 2015-12-28
Filing date: 2015-12-28
Publication date: 2016-06-08

Abstract

The present invention is mainly applied in the field of software engineering. To address the randomness problem of existing software defect data feature selection methods, it provides a software defect data feature selection method based on a combination of a modified genetic algorithm and Adaboost. The main steps are as follows: first, software module data are acquired from a software data set and labeled; then the feature space is divided and a feature selection classifier based on Adaboost is constructed, in which each feature subspace corresponds to one weak classifier; Adaboost is optimized with a genetic algorithm based on the idea of frequency, and the optimal features are screened according to the frequency with which each feature is selected; finally, the data set is tested with the resulting optimal feature subset and the method is compared with other feature selection methods to verify its stability and accuracy, and a software defect prediction model is established. The method largely overcomes the randomness problem in the software defect data feature selection process and has relatively good stability and relatively high accuracy.

Description

Software defect data feature selection method based on a combination of an improved genetic algorithm and Adaboost

Technical field

The invention belongs to the field of software engineering, and specifically relates to a software defect data feature selection method based on a combination of an improved genetic algorithm and Adaboost.

Background technology

With the rapid development of information technology and the mass storage of data, every industry urgently needs to convert data into knowledge. Discovering valuable information in data has become a focus of current theoretical and applied research. Data mining (DM) merges statistics, databases, machine learning, and visualization techniques; by analyzing historical data it discovers unknown and novel knowledge, providing an effective way to address the information-age problem of "data explosion, information poverty". Feature selection and classification are the most common tasks in data mining: feature selection reduces the dimensionality of high-dimensional data, and classification builds a prediction model that can make accurate predictions on unknown problems.

As software systems grow in scale and logical complexity, defects in software inevitably threaten its reliability and affect its quality. Because every industry depends heavily on software systems, a software fault can have serious consequences, and for high-risk systems it can even be fatal. Software defect prediction is an important means of guiding and assessing software testing, and accurately predicting the distribution of software defects is of great significance for guiding testing work. For a software system, sound defect prediction can estimate the number and distribution of defects that have not yet been found but still exist. This not only guides developers to invest limited effort and resources in the modules that are most error-prone, saving substantial labor cost and resources, but also allows test results to be evaluated objectively, which is of great significance for controlling software quality, development cost, and development cycle.

The key to software defect prediction is finding the abnormal modules. This can be regarded as a binary classification problem in which software modules are divided into two classes, "normal" and "abnormal". Classification presupposes feature selection: modules are classified according to the selected optimal feature subset. In practice, however, the feature selection process for software defect data faces the following problems:

(1) A large number of irrelevant and redundant software features exist

Within each class of software features, only the basic features are extracted directly from source code; the other features are all computed from these basic feature values. Consequently, features of the same class are strongly correlated and many are redundant. Including a large number of redundant or irrelevant features in the computation brings the curse of dimensionality, so the feature set must be reduced in dimension and an optimal feature subset selected.

(2) Selection results are random

Existing feature selection techniques classify with a single classifier and use the classification result to judge the feature selection result. Because the quality of a feature is judged only from a single classifier's result, the optimal feature subset obtained in a run is subject to randomness.

In summary, given the rapid development of Internet software products, proposing an efficient software defect prediction model that correctly detects abnormal software is a problem in urgent need of a solution. Its prerequisite is selecting an optimal feature subset, which is the basis of and key to accurate abnormality detection.

Summary of the invention

The object of the present invention is to solve the randomness problem of traditional software defect data feature selection methods by providing a software defect data feature selection method based on a combination of an improved genetic algorithm and Adaboost, so as to improve the stability of feature selection.

To achieve the above object, the technical solution of the present invention mainly comprises the following three steps:

A. Build the feature selection classifier based on Adaboost

(1) Acquire software module data from a software data set, and divide it into a training set and a test set.

(2) Divide the feature set into T feature subspaces of size N, each subspace corresponding to one individual. Binary-encode the features as a 0-1 string, where 0 means the feature is not selected and 1 means it is selected. Each string is one individual, and the T individuals form one population.

(3) Initialize the sample weights:

D₁(i)=1/m

Formula (1)

where m is the number of samples and D₁(i) is the weight of sample i in the 1st round of iteration.

(4) Loop for t=1,2,...,T:

A. Using the sample distribution of round t, train the weak classifier h_t. The quality of h_t is measured by its misclassification rate ϵ_t, which is the sum of the weights of all misclassified samples:

ϵ_{t} = Σ D_{t} (i) I [h_{t} (x_{i}) \neq y_{i}]

Formula (2)

where D_t(i) is the weight of sample (x_i,y_i) in round t, and the indicator I[h_t(x_i) ≠ y_i] means that only misclassified samples contribute to the misclassification rate ϵ_t.

B. The weight α_t of each weak classifier is then computed and used to measure the importance of the weak classifier:

α_{t} = \frac{1}{2} \ln (\frac{1 - ϵ_{t}}{ϵ_{t}})

Formula (3)

C. Update the sample weights:

D_{t + 1} (i) = \frac{D_{t} (i)}{Z_{t}} \times \begin{cases} e^{- α_{t}}, & h_{t} (x_{i}) = y_{i} \\ e^{α_{t}}, & h_{t} (x_{i}) \neq y_{i} \end{cases}

Formula (4)

where Z_t is the normalization factor, namely

Z_{t} = Σ D_{t} (i) e^{(- α_{t} y_{i}) h_{t} (x_{i})}

Formula (5)

D. The final strong classifier is

H (x) = sign [Σ_{t = 1}^{T} α_{t} h_{t} (x)]

Formula (6)
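As a non-limiting illustration, formulas (1) to (6) can be sketched in Python as follows. The sketch assumes that each weak classifier is a decision stump restricted to one feature subspace; the stump, the helper names (`stump_predict`, `train_stump`, `adaboost_fit`, `adaboost_predict`), and the toy data are assumptions made for this example and are not specified by the invention.

```python
import numpy as np

def stump_predict(X, j, thr, polarity):
    """Predict +1 where polarity * (X[:, j] - thr) >= 0, else -1."""
    return np.where(polarity * (X[:, j] - thr) >= 0, 1, -1)

def train_stump(X, y, D, features):
    """Return (error, feature, threshold, polarity) of the stump with the
    lowest weighted error, searching only the given feature subspace."""
    best = None
    for j in features:
        for thr in np.unique(X[:, j]):
            for polarity in (1, -1):
                pred = stump_predict(X, j, thr, polarity)
                err = float(np.sum(D[pred != y]))       # formula (2)
                if best is None or err < best[0]:
                    best = (err, j, thr, polarity)
    return best

def adaboost_fit(X, y, subspaces):
    m = X.shape[0]
    D = np.full(m, 1.0 / m)                             # formula (1)
    ensemble = []
    for features in subspaces:                          # rounds t = 1..T
        err, j, thr, polarity = train_stump(X, y, D, features)
        err = min(max(err, 1e-10), 1 - 1e-10)           # avoid division by zero and log(0)
        alpha = 0.5 * np.log((1 - err) / err)           # formula (3)
        pred = stump_predict(X, j, thr, polarity)
        D = D * np.exp(-alpha * y * pred)               # numerator of formula (4)
        D = D / D.sum()                                 # divide by Z_t, formula (5)
        ensemble.append((alpha, j, thr, polarity))
    return ensemble

def adaboost_predict(X, ensemble):
    score = sum(a * stump_predict(X, j, thr, p) for a, j, thr, p in ensemble)
    return np.where(score >= 0, 1, -1)                  # formula (6); ties go to +1

# Toy data (hypothetical): 6 modules, 2 features, +1 = defective, -1 = defect-free.
X = np.array([[1.0, 5.0], [2.0, 3.0], [3.0, 8.0],
              [4.0, 1.0], [5.0, 7.0], [6.0, 2.0]])
y = np.array([-1, -1, -1, 1, 1, 1])
model = adaboost_fit(X, y, subspaces=[[0], [1], [0, 1]])
pred = adaboost_predict(X, model)
```

Because y_i and h_t(x_i) both take values in {+1,-1}, the single expression `np.exp(-alpha * y * pred)` reproduces both branches of formula (4).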

B. Optimize with the genetic algorithm

A genetic algorithm based on the idea of frequency is used to optimize Adaboost: the optimal features are screened according to the frequency with which each feature is selected.

(1) Selection

Roulette-wheel selection with an elite retention strategy is used: the best individual in the population is copied directly into the next generation, and roulette-wheel selection is then performed. During this process, individuals are ranked by a fitness function. In this optimization method, the fitness function is defined from the recall rate (Recall) and the false alarm rate (pf) of the Adaboost ensemble. The fitness function is:

f (x) = \sqrt{Recall} \times \sqrt{pf} = \sqrt{\frac{A}{A + B}} \times \sqrt{\frac{C}{C + D}}

Formula (7)

where the recall rate (Recall) is defined as the ratio of the number of modules correctly predicted as defective to the number of truly defective modules, expressed as

Recall = \frac{A}{A + B}

Formula (8)

The false alarm rate (pf), also called the false positive rate, is defined as the ratio of the number of modules wrongly predicted as defective to the number of actually defect-free modules, expressed as

pf = \frac{C}{C + D}

Formula (9)

A, B, C, and D are defined in the following table.

Table 1. Classification prediction results

	Predicted defective	Predicted defect-free
Truly defective	A	B
Truly defect-free	C	D
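As a non-limiting illustration, the counts of Table 1 and formulas (7) to (9) can be computed as sketched below. The function names and the sample labels are hypothetical; the labels use +1 for defective and -1 for defect-free modules, and the fitness follows formula (7) literally as written above.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (A, B, C, D) as laid out in Table 1."""
    pairs = list(zip(y_true, y_pred))
    A = sum(1 for t, p in pairs if t == 1 and p == 1)    # defective, predicted defective
    B = sum(1 for t, p in pairs if t == 1 and p == -1)   # defective, predicted defect-free
    C = sum(1 for t, p in pairs if t == -1 and p == 1)   # defect-free, predicted defective
    D = sum(1 for t, p in pairs if t == -1 and p == -1)  # defect-free, predicted defect-free
    return A, B, C, D

def recall(A, B):
    return A / (A + B) if (A + B) else 0.0               # formula (8)

def false_alarm_rate(C, D):
    return C / (C + D) if (C + D) else 0.0               # formula (9)

def fitness(A, B, C, D):
    # Formula (7), taken literally from the text.
    return math.sqrt(recall(A, B)) * math.sqrt(false_alarm_rate(C, D))

# Hypothetical predictions for 8 modules.
y_true = [1, 1, 1, 1, -1, -1, -1, -1]
y_pred = [1, 1, 1, -1, 1, -1, -1, -1]
counts = confusion_counts(y_true, y_pred)
score = fitness(*counts)
```

For these hypothetical labels the counts are A=3, B=1, C=1, D=3, so Recall is 0.75 and pf is 0.25.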

(2) Crossover and mutation

A single-point crossover operator and a single-point mutation operator are used for the crossover and mutation operations of the genetic algorithm.
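As a non-limiting illustration, the 0-1 string encoding and the two operators can be sketched as follows. The sketch assumes that each individual carries one bit per candidate feature; the function names, the mutation rate, and the population dimensions are hypothetical choices for this example.

```python
import random

def init_population(pop_size, n_features, rng):
    """Random 0-1 strings: bit j is 1 if feature j is selected, 0 otherwise."""
    return [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)]

def single_point_crossover(p1, p2, rng):
    """Swap the tails of two parents after one cut point inside the string.
    Requires strings of length 2 or more."""
    point = rng.randint(1, len(p1) - 1)
    return p1[:point] + p2[point:], p2[:point] + p1[point:]

def single_point_mutation(individual, rng, rate=0.1):
    """With probability `rate`, flip exactly one randomly chosen bit."""
    child = list(individual)
    if rng.random() < rate:
        pos = rng.randrange(len(child))
        child[pos] = 1 - child[pos]
    return child

rng = random.Random(0)
population = init_population(pop_size=4, n_features=8, rng=rng)
c1, c2 = single_point_crossover(population[0], population[1], rng)
mutant = single_point_mutation(population[2], rng, rate=1.0)
```

With `rate=1.0` the mutation always fires, which makes the single flipped bit easy to observe when experimenting with the sketch.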

(3) Frequency screening

The optimal feature combination is obtained by recombining features according to how frequently each feature occurs in the optimal solutions produced by individual runs of the genetic algorithm optimizing Adaboost. Let F=(f₁,f₂,...,f_p) denote the set of optimal solutions obtained over repeated runs of the genetic algorithm, where f_i is the i-th optimal solution and p is the number of runs.

Formula (10)

Then the frequency q_j with which the j-th feature is selected can be computed by formula (11):

q_{j} = \frac{Σ_{i = 1}^{p} f_{i} (j)}{p}

Formula (11)

When the selection frequency of feature j is below a certain threshold, the feature is not included in the final feature subset; otherwise it is added to the final feature subset. The threshold is determined experimentally.
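As a non-limiting illustration, the frequency screening of formula (11) can be sketched as follows. The sketch assumes that f_i(j) is the j-th bit of the i-th optimal 0-1 string; the four solutions and the threshold of 0.5 are hypothetical values, since the threshold is to be determined experimentally.

```python
def selection_frequency(solutions):
    """q_j = (sum over i of f_i(j)) / p, where p is the number of runs."""
    p = len(solutions)
    n_features = len(solutions[0])
    return [sum(f[j] for f in solutions) / p for j in range(n_features)]

def screen_features(solutions, threshold):
    """Keep feature j unless its selection frequency is below the threshold."""
    q = selection_frequency(solutions)
    return [j for j, qj in enumerate(q) if qj >= threshold]

# Hypothetical optimal solutions from p = 4 runs over 4 features.
solutions = [
    [1, 0, 1, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
]
q = selection_frequency(solutions)
selected = screen_features(solutions, threshold=0.5)
```

Here the frequencies are 1.0, 0.25, 0.5, and 0.75, so features 0, 2, and 3 are retained and feature 1 is dropped.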

C. Building and testing the feature selection model

Through the above process, a final screening is performed according to the feature selection frequencies to obtain the optimal feature combination. The resulting optimal feature subset is tested on the software module data set and compared with other feature selection methods to verify its stability and accuracy in software defect prediction, thereby establishing the software defect prediction model.

Brief description of the drawings

Fig. 1 is a flow chart of the software defect data feature selection method based on a combination of an improved genetic algorithm and Adaboost.

Detailed description of the invention

The present invention is described in further detail below with reference to Fig. 1.

Step 1: Build the feature selection classifier based on Adaboost

(1) Acquire software module data from a software data set, including the software feature set and the software module data, and divide the software module data into a training set and a test set for training and testing. Label the data: the software module data set is {X, Y}, with X={x₁,x₂,...,x_m} and Y={y₁,y₂}={+1,-1}. If software module x_i is defect-free, then (x_i,y_i)=(x_i,-1); otherwise (x_i,y_i)=(x_i,+1).
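As a non-limiting illustration, the labeling and the division into training and test sets can be sketched as follows. The per-module defect counts, the 30% test ratio, and the function names are hypothetical; the embodiment does not specify how the split is made.

```python
import random

def label_modules(defect_counts):
    """Map each module to +1 if it has at least one defect, else -1."""
    return [1 if n > 0 else -1 for n in defect_counts]

def split_train_test(X, y, test_ratio, seed):
    """Shuffle the module indices and hold out a fraction for testing."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    n_test = int(round(len(X) * test_ratio))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    train = ([X[k] for k in train_idx], [y[k] for k in train_idx])
    test = ([X[k] for k in test_idx], [y[k] for k in test_idx])
    return train, test

# Hypothetical data: 10 modules, 2 features each, with recorded defect counts.
X = [[float(k), float(k % 3)] for k in range(10)]
defect_counts = [0, 2, 0, 0, 1, 0, 3, 0, 0, 1]
y = label_modules(defect_counts)
(train_X, train_y), (test_X, test_y) = split_train_test(X, y, test_ratio=0.3, seed=0)
```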

(3) Initialize the sample weights:

D₁(i)=1/m

where D₁(i) is the weight of sample (x_i,y_i) in the 1st round of iteration.

(4) Loop for t=1,2,...,T:

ϵ_{t} = Σ D_{t} (i) I [h_{t} (x_{i}) \neq y_{i}]

α_{t} = \frac{1}{2} \ln (\frac{1 - ϵ_{t}}{ϵ_{t}})

C. Update the sample weights:

D_{t + 1} (i) = \frac{D_{t} (i)}{Z_{t}} \times \begin{cases} e^{- α_{t}}, & h_{t} (x_{i}) = y_{i} \\ e^{α_{t}}, & h_{t} (x_{i}) \neq y_{i} \end{cases}

where Z_t is the normalization factor, namely

Z_{t} = Σ D_{t} (i) e^{(- α_{t} y_{i}) h_{t} (x_{i})}

D. The final strong classifier is:

H (x) = sign [Σ_{t = 1}^{T} α_{t} h_{t} (x)]
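As a worked illustration of one round of the loop (the numbers are hypothetical and not taken from the embodiment), suppose m=5 samples and the weak classifier of round 1 misclassifies exactly one of them:

```latex
\begin{aligned}
D_{1}(i) &= \tfrac{1}{5} = 0.2 \\
\epsilon_{1} &= 0.2 \\
\alpha_{1} &= \tfrac{1}{2} \ln \frac{1 - 0.2}{0.2} = \tfrac{1}{2} \ln 4 = \ln 2 \approx 0.693 \\
Z_{1} &= 4 \times 0.2 \times e^{-\alpha_{1}} + 0.2 \times e^{\alpha_{1}} = 0.4 + 0.4 = 0.8 \\
D_{2}(i) &= \frac{0.2 \times 0.5}{0.8} = 0.125 \quad \text{(correctly classified samples)} \\
D_{2}(i) &= \frac{0.2 \times 2}{0.8} = 0.5 \quad \text{(the misclassified sample)}
\end{aligned}
```

The updated weights sum to 4 × 0.125 + 0.5 = 1, and the misclassified sample now carries half of the total weight, so the next weak classifier is pushed toward classifying it correctly.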

Step 2: Optimize with the genetic algorithm

In the second step, a genetic algorithm based on the idea of frequency is used to optimize the result obtained by the Adaboost strong classifier: the optimal features are screened according to the frequency with which each feature is selected.

(1) Selection

Roulette-wheel selection with an elite retention strategy is used: the best individual in the population is copied directly into the next generation, and roulette-wheel selection is then performed. During this process, individuals are ranked by a fitness function. In this optimization method, the fitness function is defined from the recall rate (Recall) and the false alarm rate (pf) of the Adaboost ensemble. The fitness function is:

f (x) = \sqrt{Recall} \times \sqrt{pf} = \sqrt{\frac{A}{A + B}} \times \sqrt{\frac{C}{C + D}}

Recall = \frac{A}{A + B}

pf = \frac{C}{C + D}

A, B, C, and D are defined in the following table.

Table 1. Classification prediction results

	Predicted defective	Predicted defect-free
Truly defective	A	B
Truly defect-free	C	D
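As a non-limiting illustration, roulette-wheel selection with elite retention can be sketched as follows. The function name, the toy population, and the fitness values are hypothetical; falling back to a uniform draw when every fitness is zero is an assumption of this sketch, not part of the embodiment.

```python
import random

def select_next_generation(population, fitnesses, rng):
    """Copy the fittest individual unchanged, then fill the remaining slots
    by fitness-proportional (roulette-wheel) sampling with replacement."""
    elite = max(range(len(population)), key=lambda k: fitnesses[k])
    next_gen = [list(population[elite])]                 # elite retention
    n_rest = len(population) - 1
    if sum(fitnesses) > 0:
        picked = rng.choices(population, weights=fitnesses, k=n_rest)
    else:
        picked = rng.choices(population, k=n_rest)       # assumed fallback: uniform draw
    next_gen.extend(list(ind) for ind in picked)
    return next_gen

rng = random.Random(0)
population = [[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 1]]
fitnesses = [0.10, 0.40, 0.30, 0.20]
next_gen = select_next_generation(population, fitnesses, rng)
```

The first slot always holds the best individual, so the best fitness in the population cannot decrease at the selection step from one generation to the next.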

(2) Crossover and mutation

(3) Frequency screening

The optimal feature combination is obtained by recombining features according to how frequently each feature occurs in the optimal solutions produced by individual runs of the genetic algorithm optimizing Adaboost. Let F=(f₁,f₂,...,f_p) denote the set of optimal solutions obtained over repeated runs of the genetic algorithm, where f_i is the i-th optimal solution and p is the number of runs. The function is defined as follows:

q_{j} = \frac{Σ_{i = 1}^{p} f_{i} (j)}{p}

When the selection frequency of feature j is below a certain threshold, the feature is not included in the final feature subset; otherwise it is added to the final feature subset. The threshold is determined experimentally.

Step 3: Through the above process, a final screening is performed according to the feature selection frequencies to obtain the optimal feature combination. The resulting optimal feature subset is tested on the software module data set and compared with other feature selection methods to verify its stability and accuracy in software defect prediction.

The present invention can reduce the dimensionality of high-dimensional software features and thereby guide software defect prediction, with good stability and high accuracy.

The present invention provides a software defect data feature selection method based on a combination of an improved genetic algorithm and Adaboost. It should be understood that those skilled in the art can make several improvements without departing from the principles of the invention, and such improvements shall also be regarded as falling within the protection scope of the invention. Any component not specified in this embodiment can be implemented with existing technology.

Claims

1. A software defect data feature selection method based on a combination of an improved genetic algorithm and Adaboost, characterized in that it mainly comprises the following three steps:

A. Build the feature selection classifier based on Adaboost

(1) Acquire software module data from a software data set, including the software feature set and the software module data, and divide the software module data into a training set and a test set for training and testing; label the data: the software module data set is {X, Y}, with X={x₁, x₂, ..., x_m} and Y={y₁, y₂}={+1, -1}; if software module x_i is defect-free, then (x_i, y_i)=(x_i, -1); otherwise (x_i, y_i)=(x_i, +1);

(2) Divide the feature set into T feature subspaces of size N, each subspace corresponding to one individual; binary-encode the features as a 0-1 string, where 0 means the feature is not selected and 1 means it is selected; each string is one individual, and the T individuals form one population;

(3) Initialize the sample weights:

D₁(i)=1/m

where m is the number of samples and D₁(i) is the weight of sample (x_i, y_i) in the 1st round of iteration;

(4) Loop for t=1,2,...,T:

ϵ_{t} = Σ D_{t} (i) I [h_{t} (x_{i}) \neq y_{i}]

where D_t(i) is the weight of sample (x_i, y_i) in round t, and the indicator I[h_t(x_i) ≠ y_i] means that only misclassified samples contribute to the misclassification rate ϵ_t;

B. The weight α_t of each weak classifier is then computed and used to measure the importance of the weak classifier;

α_{t} = \frac{1}{2} \ln (\frac{1 - ϵ_{t}}{ϵ_{t}})

C. Update the sample weights:

D_{t + 1} (i) = \frac{D_{t} (i)}{Z_{t}} \times \begin{cases} e^{- α_{t}}, & h_{t} (x_{i}) = y_{i} \\ e^{α_{t}}, & h_{t} (x_{i}) \neq y_{i} \end{cases}

where Z_t is the normalization factor, namely

Z_{t} = Σ D_{t} (i) e^{(- α_{t} y_{i}) h_{t} (x_{i})}

D. The final strong classifier is

H (x) = sign [Σ_{t = 1}^{T} α_{t} h_{t} (x)]

B. Optimize with the genetic algorithm

A genetic algorithm based on the idea of frequency is used to optimize Adaboost: the optimal features are screened according to the frequency with which each feature is selected;

(1) Selection

Roulette-wheel selection with an elite retention strategy is used: the best individual in the population is copied directly into the next generation, and roulette-wheel selection is then performed; during this process, individuals are ranked by a fitness function; in this optimization method, the fitness function is defined from the recall rate (Recall) and the false alarm rate (pf) of the Adaboost ensemble, and the fitness function is:

f (x) = \sqrt{Recall} \times \sqrt{pf} = \sqrt{\frac{A}{A + B}} \times \sqrt{\frac{C}{C + D}}

Recall = \frac{A}{A + B}

The false alarm rate (pf), also called the false positive rate, is defined as the ratio of the number of modules wrongly predicted as defective to the number of actually defect-free modules, expressed as

pf = \frac{C}{C + D}

A, B, C, and D are defined in the following table.

Table 1. Classification prediction results

	Predicted defective	Predicted defect-free
Truly defective	A	B
Truly defect-free	C	D

(2) Crossover and mutation

A single-point crossover operator and a single-point mutation operator are used for the crossover and mutation operations of the genetic algorithm;

(3) Frequency screening

The optimal feature combination is obtained by recombining features according to how frequently each feature occurs in the optimal solutions produced by individual runs of the genetic algorithm optimizing Adaboost; let F=(f₁, f₂, ..., f_p) denote the set of optimal solutions obtained over repeated runs of the genetic algorithm, where f_i is the i-th optimal solution and p is the number of runs, and the function is defined as follows

Then the frequency q_j with which the j-th feature is selected is computed by the formula q_{j} = \frac{Σ_{i = 1}^{p} f_{i} (j)}{p}; when the selection frequency of feature j is below a certain threshold, the feature is not included in the final feature subset; otherwise it is added to the final feature subset; the threshold is determined experimentally;

C. Building and testing the feature selection model

Through the above process, a final screening is performed according to the feature selection frequencies to obtain the optimal feature combination; the resulting optimal feature subset is tested on the software module data set and compared with other feature selection methods to verify its stability and accuracy in software defect prediction, thereby establishing the software defect prediction model.